MNBVC Dataset Source Summary
This part of the summary covers the Hugging Face, WuDao, and OSCAR datasets found in the folders under the MNBVC data files that contain a links.txt file.
- Hugging Face datasets: approx. 4 GB
- OSCAR dataset: approx. 30 GB
- WuDao dataset: approx. 1.9 GB
This part of the summary covers the folders under the MNBVC data files that contain a links.txt file, excluding the Hugging Face, WuDao, and OSCAR datasets.
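- Beijing Institute of Technology, Zhang Huaping Lab datasets: approx. 3.5 GB
- Chinese court judgment documents dataset: approx. 144 GB, https://wenshu.court.gov.cn/
- Alibaba Cloud Tianchi datasets: 11 GB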
- Baidu PaddlePaddle AI Studio: approx. 14 GB
  - https://aistudio.baidu.com/aistudio/datasetdetail/96333
  - https://aistudio.baidu.com/aistudio/datasetdetail/127041
  - https://aistudio.baidu.com/aistudio/datasetdetail/106736
  - https://aistudio.baidu.com/aistudio/datasetdetail/107866
  - https://aistudio.baidu.com/aistudio/datasetdetail/107226
  - https://aistudio.baidu.com/aistudio/datasetdetail/109273
  - https://aistudio.baidu.com/aistudio/datasetdetail/109265
  - https://aistudio.baidu.com/aistudio/datasetdetail/107317
  - https://aistudio.baidu.com/aistudio/datasetdetail/108662
  - https://aistudio.baidu.com/aistudio/datasetdetail/106266
  - https://aistudio.baidu.com/aistudio/datasetdetail/107440
  - https://aistudio.baidu.com/aistudio/datasetdetail/106733
  - https://aistudio.baidu.com/aistudio/datasetdetail/107381
  - https://aistudio.baidu.com/aistudio/datasetdetail/107229
  - https://aistudio.baidu.com/aistudio/datasetdetail/109290
  - https://aistudio.baidu.com/aistudio/datasetdetail/107274
  - https://aistudio.baidu.com/aistudio/datasetdetail/107219
  - https://aistudio.baidu.com/aistudio/datasetdetail/107225
  - https://aistudio.baidu.com/aistudio/datasetdetail/109008
  - https://aistudio.baidu.com/aistudio/datasetdetail/107438
  - https://aistudio.baidu.com/aistudio/datasetdetail/107212
  - https://aistudio.baidu.com/aistudio/datasetdetail/180720
- E-book website: approx. 17 GB (the site can no longer be opened): http://cn.epubee.com/books
- Self-built datasets: approx. 12 GB
  - https://github.com/ydli-ai/CSL
  - https://github.com/baidu/DuReader/tree/master/DuReader-vis
  - https://github.com/brightmart/nlp_chinese_corpus
  - https://github.com/GeneralZh/Chinese_Corpus
  - https://github.com/baidu/DuReader/tree/master/DuReader-2.0
  - https://github.com/ymcui/Chinese-Cloze-RC
  - https://github.com/JiangYanting/
  - https://github.com/txtcn/data
  - https://github.com/codemayq/chinese_chatbot_corpus
  - https://github.com/fangj/rmrb
  - https://github.com/wb14123/couplet-dataset
  - https://github.com/SophonPlus/ChineseNlpCorpus
  - https://github.com/wonderfulsuccess/chinese_abstractive_corpus
  - https://github.com/nonamestreet/weixin_public_corpus
  - https://github.com/guhhhhaa/4675-scifi
  - https://github.com/fighting41love/funNLP
- Taobao merchant dataset: approx. 23 GB: https://item.taobao.com/item.htm?spm=a230r.1.14.8.7293393cSUL7i2&id=641561612393&ns=1&abbucket=7#detail
- Baidu Cloud Drive datasets: approx. 10 GB
Summary of data source folders
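This table covers all files, both those listed above that have a links.txt and those that do not. The organized symlink folders are stored at /data_turbo/datasets/mnbvc_links, as listed in the table below.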
Symlink folder name | Total source bytes | Total source file count | Total source size (GB) |
---|---|---|---|
aliyun | 665843909107 | 102722 | 620.1 |
zlibrary | 182653137601 | 52004 | 170.1 |
baidu | 240295462931 | 167090 | 223.8 |
github | 2610996066750 | 19399 | 2431.7 |
epubee | 35211280438 | 87 | 32.8 |
huggingface | 907539966249 | 577 | 845.2 |
wenshu | 155036380462 | 370 | 144.4 |
wangyou | 8122498447 | 4604 | 7.6 |
wudao | 215481918181 | 382 | 200.7 |
zlibaray | 14477964833 | 14786 | 13.5 |
txtsk | 36727271581 | 43090 | 34.2 |
financezhidao | 257401588 | 1 | 0.24 |
thunlp | 46716481 | 16 | 0.044 |
wikipedia | 47198279033 | 93 | 43.96 |
wikihow | 2030412726 | 6 | 1.89 |
zhihu | 2403329518 | 6 | 2.2 |
mfa | 13021229 | 2 | 0.012 |
nlpir | 3499067831 | 11 | 3.26 |
unite | 2466041612 | 95949 | 2.3 |
taobao | 25944299673 | 54 | 24.2 |
ali | 12316187972 | 21 | 11.5 |
afqmc | 5425519 | 3 | 0.005 |
duzhe | 33358277 | 613 | 0.03 |
riddle | 7417088 | 1 | 0.007 |
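The size column appears to be derived from the byte column by dividing by 1024³ (binary gigabytes); this is inferred from the figures themselves, not stated in the source. A minimal sketch of that conversion, where the helper name `bytes_to_gb` is hypothetical:

```python
def bytes_to_gb(n_bytes: int) -> float:
    """Convert a byte count to gigabytes, assuming 1 GB = 1024**3 bytes."""
    return n_bytes / 1024 ** 3


# Byte totals copied from the table above.
totals = {
    "aliyun": 665843909107,
    "wenshu": 155036380462,
    "nlpir": 3499067831,
}

for folder, n_bytes in totals.items():
    print(f"{folder}: {bytes_to_gb(n_bytes):.2f} GB")
```

Rounded to the precision used in the table, these come out to 620.1, 144.4, and 3.26, matching the listed values.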
Symlinks are named in the form source-folder subdirectory name.source file name, for example zlibaray.20230114.3.杂书.11582833.txt
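One way to read this naming rule is that each path component below the source root is joined with a dot. The sketch below works under that assumption; the function name `link_name`, the root `/data/mnbvc`, and the nested directory layout are all hypothetical and not taken from the source.

```python
from pathlib import PurePosixPath


def link_name(source_root: str, source_file: str) -> str:
    """Build a flat symlink name by joining the path components
    below source_root with dots (assumed convention)."""
    relative = PurePosixPath(source_file).relative_to(source_root)
    return ".".join(relative.parts)


# Hypothetical source layout; only the resulting name appears in the document.
name = link_name(
    "/data/mnbvc",
    "/data/mnbvc/zlibaray/20230114/3/杂书/11582833.txt",
)
print(name)
```

Under these assumptions the call produces the example name shown above. Flattening paths this way keeps every link in a single folder while preserving where each file came from.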
Preprocessed datasets
Dataset name | Processing status | Storage location |
---|---|---|
wenshu | Formatting into text and meta fields completed | /mnt/cos/cos_shanghai_1/raw_datasets/mnbvc_wenshu |
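The source does not specify what the text and meta format looks like on disk. As a rough illustration only, the sketch below assumes one JSON object per line with a `text` key and a `meta` key; the helper name `make_record` and the `source` meta field are hypothetical.

```python
import json


def make_record(text: str, meta: dict) -> str:
    """Serialize one document as a single JSON line with `text` and `meta` keys
    (assumed layout, not confirmed by the source)."""
    return json.dumps({"text": text, "meta": meta}, ensure_ascii=False)


# Placeholder content; real records would carry the document body and its metadata.
line = make_record("example document body", {"source": "wenshu"})
record = json.loads(line)
print(record["meta"]["source"])
```

Setting `ensure_ascii=False` keeps Chinese text readable in the output file instead of escaping it to `\uXXXX` sequences.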