1. Hugging Face datasets: about 4 GB
    1. https://huggingface.co/datasets/miracl/miracl-corpus
    2. https://huggingface.co/datasets/wangrui6/Zhihu-KOL
  2. OSCAR dataset: about 30 GB
    1. https://huggingface.co/datasets/oscar-corpus/OSCAR-2201
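  3. WuDao dataset: about 1.9 GB
    1. Beijing Academy of Artificial Intelligence (baai.ac.cn)
  1. Datasets from Zhang Huaping's lab at Beijing Institute of Technology: about 3.5 GB in total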
    1. http://www.nlpir.org/wordpress/2017/07/13/nlpir新闻语料库-2400万字/
    2. http://www.nlpir.org/wordpress/2017/10/02/文本分类语料库(复旦)测试语料/
    3. http://www.nlpir.org/wordpress/2021/10/11/中国外交部例行记者会语料库/
    4. http://www.nlpir.org/wordpress/2018/01/26/500万微博语料/
  2. Chinese court judgment documents dataset: about 144 GB in total: https://wenshu.court.gov.cn/
  3. Alibaba Cloud Tianchi datasets: 11 GB in total
    1. https://tianchi.aliyun.com/dataset/94521
    2. https://tianchi.aliyun.com/dataset/9717
    3. https://tianchi.aliyun.com/dataset/92110
  4. Baidu PaddlePaddle AI Studio: about 14 GB in total
    1. https://aistudio.baidu.com/aistudio/datasetdetail/96333
    2. https://aistudio.baidu.com/aistudio/datasetdetail/127041
    3. https://aistudio.baidu.com/aistudio/datasetdetail/106736
    4. https://aistudio.baidu.com/aistudio/datasetdetail/107866
    5. https://aistudio.baidu.com/aistudio/datasetdetail/107226
    6. https://aistudio.baidu.com/aistudio/datasetdetail/109273
    7. https://aistudio.baidu.com/aistudio/datasetdetail/109265
    8. https://aistudio.baidu.com/aistudio/datasetdetail/107317
    9. https://aistudio.baidu.com/aistudio/datasetdetail/108662
    10. https://aistudio.baidu.com/aistudio/datasetdetail/106266
    11. https://aistudio.baidu.com/aistudio/datasetdetail/107440
    12. https://aistudio.baidu.com/aistudio/datasetdetail/106733
    13. https://aistudio.baidu.com/aistudio/datasetdetail/107381
    14. https://aistudio.baidu.com/aistudio/datasetdetail/107229
    15. https://aistudio.baidu.com/aistudio/datasetdetail/109290
    16. https://aistudio.baidu.com/aistudio/datasetdetail/107274
    17. https://aistudio.baidu.com/aistudio/datasetdetail/107219
    18. https://aistudio.baidu.com/aistudio/datasetdetail/107225
    19. https://aistudio.baidu.com/aistudio/datasetdetail/109008
    20. https://aistudio.baidu.com/aistudio/datasetdetail/107438
    21. https://aistudio.baidu.com/aistudio/datasetdetail/107212
    22. https://aistudio.baidu.com/aistudio/datasetdetail/180720
  5. E-book website: about 17 GB in total (the site can no longer be opened): http://cn.epubee.com/books
  6. Self-built datasets: about 12 GB in total
    1. https://github.com/ydli-ai/CSL
    2. https://github.com/baidu/DuReader/tree/master/DuReader-vis
    3. https://github.com/brightmart/nlp_chinese_corpus
    4. https://github.com/GeneralZh/Chinese_Corpus
    5. https://github.com/baidu/DuReader/tree/master/DuReader-2.0
    6. https://github.com/ymcui/Chinese-Cloze-RC
    7. https://github.com/JiangYanting/
    8. https://github.com/txtcn/data
    9. https://github.com/codemayq/chinese_chatbot_corpus
    10. https://github.com/fangj/rmrb
    11. https://github.com/wb14123/couplet-dataset
    12. https://github.com/SophonPlus/ChineseNlpCorpus
    13. https://github.com/wonderfulsuccess/chinese_abstractive_corpus
    14. https://github.com/nonamestreet/weixin_public_corpus
    15. https://github.com/guhhhhaa/4675-scifi
    16. https://github.com/fighting41love/funNLP
  7. Taobao merchant dataset: about 23 GB in total: https://item.taobao.com/item.htm?spm=a230r.1.14.8.7293393cSUL7i2&id=641561612393&ns=1&abbucket=7#detail
  8. Baidu Netdisk datasets: about 10 GB in total
    1. https://pan.baidu.com/s/1OntQS9Y6Mf5oysJwxtRBsg
    2. https://pan.baidu.com/s/1hL8DPnFx7jZOLFeh3b19TQ?pwd=99r9
    3. https://pan.baidu.com/s/1mUknfwy1nhSM7XzH8xi7gQ

Summary of data source folders

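This table covers all of the files above, both those that have a links.txt and those that do not. The organized symlink folders are stored under /data_turbo/datasets/mnbvc_links and are listed in the table below.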

| Symlink folder | Total source bytes | Total source files | Total source size (GB) |
| --- | --- | --- | --- |
| aliyun | 665843909107 | 102722 | 620.1 |
| zlibrary | 182653137601 | 52004 | 170.1 |
| baidu | 240295462931 | 167090 | 223.8 |
| github | 2610996066750 | 19399 | 2431.7 |
| epubee | 35211280438 | 87 | 32.8 |
| huggingface | 907539966249 | 577 | 845.2 |
| wenshu | 155036380462 | 370 | 144.4 |
| wangyou | 8122498447 | 4604 | 7.6 |
| wudao | 215481918181 | 382 | 200.7 |
| zlibaray | 14477964833 | 14786 | 13.5 |
| txtsk | 36727271581 | 43090 | 34.2 |
| financezhidao | 257401588 | 1 | 0.24 |
| thunlp | 46716481 | 16 | 0.044 |
| wikipedia | 47198279033 | 93 | 43.96 |
| wikihow | 2030412726 | 6 | 1.89 |
| zhihu | 2403329518 | 6 | 2.2 |
| mfa | 13021229 | 2 | 0.012 |
| nlpir | 3499067831 | 11 | 3.26 |
| unite | 2466041612 | 95949 | 2.3 |
| taobao | 25944299673 | 54 | 24.2 |
| ali | 12316187972 | 21 | 11.5 |
| afqmc | 5425519 | 3 | 0.005 |
| duzhe | 33358277 | 613 | 0.03 |
| riddle | 7417088 | 1 | 0.007 |
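
The three numeric columns can be recomputed for any symlink folder. The sketch below is not the script that produced the table; the function names `summarize_folder` and `bytes_to_gb` are hypothetical. It assumes the folder is flat (symlinks directly inside it) and that the GB column is bytes divided by 1024**3, which matches the listed values (for example 665843909107 bytes gives 620.1).

```python
from pathlib import Path

# Assumption: the table's GB column equals bytes / 1024**3.
BYTES_PER_GB = 1024 ** 3


def bytes_to_gb(total_bytes: int, digits: int = 1) -> float:
    """Convert a byte count to GB, rounded to `digits` decimal places."""
    return round(total_bytes / BYTES_PER_GB, digits)


def summarize_folder(folder: str) -> tuple[int, int]:
    """Return (total bytes, file count) for the files the entries in `folder` point to."""
    total_bytes = 0
    file_count = 0
    for entry in Path(folder).iterdir():
        # is_file() and stat() both follow symlinks, so the size counted
        # is that of the source file; broken links are skipped.
        if entry.is_file():
            total_bytes += entry.stat().st_size
            file_count += 1
    return total_bytes, file_count
```

For instance, `summarize_folder("/data_turbo/datasets/mnbvc_links/wenshu")` would return the byte and file counts for the wenshu row, and `bytes_to_gb` turns the first value into the size column.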

Symlinks are named as source subdirectory name.source file name, for example zlibaray.20230114.3.杂书.11582833.txt
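
One way to read this naming rule is to join the path components of a source file, relative to its source root, with dots. The following is a minimal sketch under that assumption; `link_name`, `make_link`, and the directory layout in the usage note are hypothetical and are not taken from the actual tooling.

```python
from pathlib import Path


def link_name(source_root: str, source_file: str) -> str:
    """Build a symlink name by joining the relative path parts with dots."""
    relative = Path(source_file).relative_to(source_root)
    return ".".join(relative.parts)


def make_link(source_root: str, source_file: str, links_dir: str) -> Path:
    """Create the symlink for source_file inside links_dir and return its path."""
    link = Path(links_dir) / link_name(source_root, source_file)
    link.parent.mkdir(parents=True, exist_ok=True)
    if not link.is_symlink():
        # Point at the absolute path so the link works from any directory.
        link.symlink_to(Path(source_file).resolve())
    return link
```

Under this reading, a file at the assumed location `zlibaray/20230114/3.杂书.11582833.txt` below the source root would be linked as `zlibaray.20230114.3.杂书.11582833.txt`.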

Preprocessed datasets

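| Dataset name | Processing status | Storage location |
| --- | --- | --- |
| wenshu | Formatting into the text and meta format is complete | /mnt/cos/cos_shanghai_1/raw_datasets/mnbvc_wenshu |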