ymcui / cmrc2018

A Span-Extraction Dataset for Chinese Machine Reading Comprehension (CMRC 2018)

Home Page: https://ymcui.github.io/cmrc2018/

License: Creative Commons Attribution Share Alike 4.0 International

Python 100.00%
reading-comprehension question-answering natural-language-processing bert

cmrc2018's Introduction


This repository contains the data for The Second Evaluation Workshop on Chinese Machine Reading Comprehension (CMRC 2018). Our paper was presented at EMNLP-IJCNLP 2019.

Title: A Span-Extraction Dataset for Chinese Machine Reading Comprehension
Authors: Yiming Cui, Ting Liu, Wanxiang Che, Li Xiao, Zhipeng Chen, Wentao Ma, Shijin Wang, Guoping Hu
Link: https://www.aclweb.org/anthology/D19-1600/
Venue: EMNLP-IJCNLP 2019

Open Challenge Leaderboard (New!)

Keep track of the latest state-of-the-art systems on the CMRC 2018 dataset.
https://ymcui.github.io/cmrc2018/

CMRC 2018 Public Datasets

Please download CMRC 2018 public datasets via the following CodaLab Worksheet.
https://worksheets.codalab.org/worksheets/0x92a80d2fab4b4f79a2b4064f7ddca9ce

Submission Guidelines

If you would like to test your model on the hidden test and challenge sets, please follow the instructions on how to submit your model via the CodaLab worksheet.
https://worksheets.codalab.org/worksheets/0x96f61ee5e9914aee8b54bd11e66ec647/

**Note that the test set on CLUE is NOT the complete test set. If you wish to evaluate your model OFFICIALLY on CMRC 2018, you should follow the guidelines here.**

Quick Load Through 🤗datasets

You can also access this dataset as part of the HuggingFace datasets library as follows:

!pip install datasets
from datasets import load_dataset
dataset = load_dataset('cmrc2018')

More details on the options and usage of this library can be found in the nlp repository at https://github.com/huggingface/nlp

Reference

If you wish to use our data in your research, please cite:

@inproceedings{cui-emnlp2019-cmrc2018,
    title = "A Span-Extraction Dataset for {C}hinese Machine Reading Comprehension",
    author = "Cui, Yiming  and
      Liu, Ting  and
      Che, Wanxiang  and
      Xiao, Li  and
      Chen, Zhipeng  and
      Ma, Wentao  and
      Wang, Shijin  and
      Hu, Guoping",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1600",
    doi = "10.18653/v1/D19-1600",
    pages = "5886--5891",
}

International Standard Language Resource Number (ISLRN)

ISLRN: 013-662-947-043-2

http://www.islrn.org/resources/resources_info/7952/

Official HFL WeChat Account

Follow Joint Laboratory of HIT and iFLYTEK Research (HFL) on WeChat.


Contact us

Please submit an issue.

cmrc2018's People

Contributors

mozarturing, ymcui

cmrc2018's Issues

CMRC 2018 public datasets: development set and test set

Which files correspond to the CMRC 2018 public datasets (training and development sets)?
Benchmarks on the web report metrics for a development set and a test set, while the released files are cmrc2018_dev.json and cmrc2018_trial.json. What is the correspondence between them? Please advise.

cmrc2018_train.json is clearly the training set.

Dataset format

Hello. After clicking download on the download page, the resulting page shows the JSON file as seemingly garbled text: the keys display normally, but the values are not decoded, as below:

[{"context_id": "DEV_0", "context_text": "\u300a\u6218\u56fd\u65e0\u53cc3\u300b\uff08\uff09\u662f\u7531\u5149\u8363\u548c\u03c9-force\u5f00\u53d1\u7684\u6218\u56fd\u65e0\u53cc\u7cfb\u5217\u7684\u6b63\u7edf\u7b2c\u4e09\u7eed\u4f5c\u3002\u672c\u4f5c\u4ee5\u4e09\u5927\u6545\u4e8b\u4e3a\u4e3b\u8f74\uff0c\u5206\u522b\u662f\u4ee5\u6b66\u7530\u4fe1\u7384\u7b49\u4eba\u4e3a\u4e3b\u7684\u300a\u5173\u4e1c\u4e09\u56fd\u5fd7\u300b\uff0c\u7ec7\u7530\u4fe1\u957f\u7b49\u4eba\u4e3a\u4e3b\u7684\u300a\u6218\u56fd\u4e09\u6770\u300b\uff0c\u77f3\u7530\u4e09\u6210\u7b49\u4eba\u4e3a\u4e3b\u7684\u300a\u5173\u539f\u7684\u5e74\u8f7b\u6b66\u8005\u300b\uff0c\u4e30\u5bcc\u6e38\u620f\u5185\u7684\u5267\u60c5\u3002\u6b64\u90e8\u4efd\u4e13\u95e8\u4ecb\u7ecd\u89d2\u8272\uff0c\u6b32\u77e5\u6b66\u5668\u60c5\u62a5\u3001\u5965\u4e49\u5b57\u6216\u64c5\u957f\u653b\u51fb\u7c7b\u578b\u7b49\uff0c\u8bf7\u81f3\u6218\u56fd\u65e0\u53cc\u7cfb\u52171.\u7531\u4e8e\u4e61\u91cc\u5927\u8f85\u5148\u751f\u56e0\u6545\u53bb\u4e16\uff0c\u4e0d\u5f97\u4e0d\u5bfb\u627e\u5176\u4ed6\u58f0\u4f18\u63a5\u624b\u3002\u4ece\u731b\u5c06\u4f20 and Z\u5f00\u59cb\u30022.\u6218\u56fd\u65e0\u53cc \u7f16\u5e74\u53f2\u7684\u539f\u521b\u7537\u5973\u4e3b\u89d2\u4ea6\u6709\u4e13\u5c5e\u58f0\u4f18\u3002\u6b64\u6a21\u5f0f\u662f\u4efb\u5929\u5802\u6e38\u620f\u8c1c\u4e4b\u6751\u96e8\u57ce\u6539\u7f16\u7684\u65b0\u589e\u6a21\u5f0f\u3002\u672c\u4f5c\u4e2d\u5171\u670920\u5f20\u6218\u573a\u5730\u56fe\uff08\u4e0d\u542b\u6751\u96e8\u57ce\uff09\uff0c\u540e\u6765\u53d1\u884c\u7684\u731b\u5c06\u4f20\u518d\u65b0\u589e3\u5f20\u6218\u573a\u5730\u56fe\u3002\u4f46\u6e38\u620f\u5185\u6218\u5f79\u6570\u91cf\u7e41\u591a\uff0c\u90e8\u5206\u5730\u56fe\u4f1a\u6709\u517c\u7528\u7684\u72b6\u51b5\uff0c\u6218\u5f79\u865a\u5b9e\u5219\u662f\u4ee5\u5149\u8363\u53d1\u884c\u76842\u672c\u300c\u6218\u56fd\u65e0\u53cc3 
\u4eba\u7269\u771f\u4e66\u300d\u5185\u5bb9\u4e3a\u4e3b\uff0c\u4ee5\u4e0b\u662f\u76f8\u5173\u4ecb\u7ecd\u3002\uff08\u6ce8\uff1a\u524d\u65b9\u52a0\u2606\u8005\u4e3a\u731b\u5c06\u4f20\u65b0\u589e\u5173\u5361\u53ca\u5730\u56fe\u3002\uff09\u5408\u5e76\u672c\u7bc7\u548c\u731b\u5c06\u4f20\u7684\u5185\u5bb9\uff0c\u6751\u96e8\u57ce\u6a21\u5f0f\u5254\u9664\uff0c\u6218\u56fd\u53f2\u6a21\u5f0f\u53ef\u76f4\u63a5\u6e38\u73a9\u3002\u4e3b\u6253\u4e24\u5927\u6a21\u5f0f\u300c\u6218\u53f2\u6f14\u6b66\u300d&\u300c\u4e89\u9738\u6f14\u6b66\u300d\u3002\u7cfb\u5217\u4f5c\u54c1\u5916\u4f20\u4f5c\u54c1", "qas": [{"query_text": "\u300a\u6218\u56fd\u65e0\u53cc3\u300b\u662f\u7531\u54ea\u4e24\u4e2a\u516c\u53f8\u5408\u4f5c\u5f00\u53d1\u7684\uff1f", "query_id": "DEV_0_QUERY_0", "answers": ["\u5149\u8363\u548c\u03c9-force", "\u5149\u8363\u548c\u03c9-force", "\u5149\u8363\u548c\u03c9-force"]}, 

How should I correctly download the file in JSON format?
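The pasted content is actually valid JSON: non-ASCII characters are merely stored as \uXXXX escapes, which any JSON parser decodes back to Chinese. A minimal sketch (the sample string below is a fragment of the snippet above):

```python
import json

# The file content is valid JSON; Chinese characters are stored as
# \uXXXX escapes, and json.loads/json.load decodes them transparently.
raw = '{"context_id": "DEV_0", "answers": ["\\u5149\\u8363\\u548c\\u03c9-force"]}'
data = json.loads(raw)
print(data["answers"][0])  # 光荣和ω-force

# To re-save the file with readable Chinese instead of escapes:
readable = json.dumps(data, ensure_ascii=False, indent=2)
```

So no special download step is needed; loading the file with `json.load` (or re-dumping it with `ensure_ascii=False`) yields readable Chinese text.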

Error reading bert_config

Hello, with TensorFlow 1.1 and Python 3.7, what causes this error?
tensorflow.python.framework.errors_impl.NotFoundError: NewRandomAccessFile failed to Create/Open: ./BERT_BASE_DIR/chinese_L-12_H-768_A-12/bert_config.json : [garbled GBK-encoded Windows message, likely "The system cannot find the path specified"]
; No such process

Hello, why does the calc_f1_score function return max(f1_scores)? Could you explain this part? If it returns the maximum, wouldn't the result always just be 1? I don't follow this.

def calc_f1_score(answers, prediction):
    f1_scores = []
    for ans in answers:
        ans_segs = mixed_segmentation(ans, rm_punc=True)
        prediction_segs = mixed_segmentation(prediction, rm_punc=True)
        lcs, lcs_len = find_lcs(ans_segs, prediction_segs)
        if ans == "" and prediction == "":
            f1_scores.append(1)
        else:
            if lcs_len == 0:
                f1_scores.append(0)
                continue
            precision = 1.0 * lcs_len / len(prediction_segs)
            recall = 1.0 * lcs_len / len(ans_segs)
            f1 = (2 * precision * recall) / (precision + recall)
            f1_scores.append(f1)
    return max(f1_scores)
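To see why max() does not trivially return 1: the answers list holds several independent human annotations of the same question, and the prediction is scored against whichever annotation it matches best. A simplified, self-contained sketch of this logic (character-level longest common subsequence instead of the script's mixed segmentation; not the evaluation script's exact code):

```python
def find_lcs(s1, s2):
    # Length of the longest common subsequence, via dynamic programming.
    m = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i, c1 in enumerate(s1):
        for j, c2 in enumerate(s2):
            m[i + 1][j + 1] = m[i][j] + 1 if c1 == c2 else max(m[i + 1][j], m[i][j + 1])
    return m[len(s1)][len(s2)]

def calc_f1(answers, prediction):
    scores = []
    for ans in answers:
        lcs_len = find_lcs(ans, prediction)
        if lcs_len == 0:
            scores.append(0.0)
            continue
        precision = lcs_len / len(prediction)
        recall = lcs_len / len(ans)
        scores.append(2 * precision * recall / (precision + recall))
    # max(): score against the best-matching gold annotation. With several
    # human references, the prediction only needs to match one of them well;
    # the result is 1 only when it matches some reference exactly.
    return max(scores)
```

If the annotations happen to be identical, max() simply equals that single score; it is not automatically 1, because the prediction can still differ from all of them.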

Hello. The paper says train and dev contain 10,321 and 3,351 questions, but the files on GitHub contain 10,142 and 3,219. Also, a small part of the SQuAD-style data has incorrect answer_start values.

The paper reports 10,321 train questions and 3,351 dev questions,
but the files on GitHub contain 10,142 and 3,219 (the Hugging Face copy also has 10,142 and 3,219). May I ask why?

In addition, in the SQuAD-style data such as ./squad-style-data/cmrc2018_train.json, a small number of examples have an answer_start that does not match the answer text.
For example, for the question TRAIN_3678_QUERY_4, the span at answer_start in the context is "总统袁世凯将", while the annotated text is "大总统袁世凯".

Thanks in advance.
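Mismatches like the one above can be detected mechanically: in SQuAD-style data, `context[answer_start : answer_start + len(text)]` should reproduce the annotated answer text. A hedged sketch assuming the standard SQuAD field layout (data → paragraphs → qas → answers); the sample context below is made up to mirror the reported case:

```python
def find_misaligned(squad_data):
    # Return ids of QA pairs whose answer_start does not point at the answer text.
    bad = []
    for article in squad_data["data"]:
        for para in article["paragraphs"]:
            context = para["context"]
            for qa in para["qas"]:
                for ans in qa["answers"]:
                    start, text = ans["answer_start"], ans["text"]
                    if context[start:start + len(text)] != text:
                        bad.append(qa["id"])
    return bad

sample = {"data": [{"paragraphs": [{
    "context": "大总统袁世凯将于十月十日就职。",
    "qas": [
        {"id": "Q_OK", "answers": [{"answer_start": 0, "text": "大总统袁世凯"}]},
        # Off-by-one start: the span read from the context is "总统袁世凯将".
        {"id": "Q_BAD", "answers": [{"answer_start": 1, "text": "大总统袁世凯"}]},
    ]}]}]}
print(find_misaligned(sample))  # ['Q_BAD']
```

Running such a check over ./squad-style-data/cmrc2018_train.json would enumerate every misaligned example, not just the one found by hand.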

Z-Reader

What model is the Z-Reader in the leaderboard table? Could you share a link to its paper?

ValueError: Unknown hparam output_dir

When running the program, I entered

python run_finetuning.py --data-dir=/home/pc/work/Chinese-ELECTRA-master/data-dir --model-name ELECTRA-180g-small, Chinese --hparams params_cmrc2018.json

File "/home/pc/work/Chinese-ELECTRA-master/configure_finetuning.py", line 175, in update
raise ValueError("Unknown hparam " + k)
ValueError: Unknown hparam output_dir

The params_cmrc2018.json is as follows:

{
"model_name_or_path": "ELECTRA-180g-small, Chinese ",
"output_dir": "./output",
"train_file": "cmrc2018_train.json",
"predict_file": "cmrc2018_dev.json",
"max_seq_length": 512,
"doc_stride": 128,
"max_query_length": 64,
"per_gpu_train_batch_size": 8,
"per_gpu_eval_batch_size": 8,
"learning_rate": 2e-5,
"num_train_epochs": 3,
"logging_steps": 100,
"save_steps": 1000,
"warmup_steps": 1000,
"weight_decay": 0.01,
"adam_epsilon": 1e-6,
"max_grad_norm": 1.0,
"gradient_accumulation_steps": 1,
"n_best_size": 20,
"max_answer_length": 30,
"do_train": true,
"do_eval": true,
"evaluate_during_training": true,
"overwrite_output_dir": true,
"seed": 42
}

I wonder if anyone can help me solve the problem of ValueError, thanks a lot!
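The error suggests mixed configuration styles: keys such as output_dir and per_gpu_train_batch_size follow the HF Transformers run_squad convention, while ELECTRA's configure_finetuning only accepts hyperparameters already defined on its config object and raises ValueError on anything else. A minimal stand-in showing the failure mode and one workaround (FinetuningConfig here is a hypothetical mock, not ELECTRA's real class):

```python
import json

class FinetuningConfig:
    # Hypothetical mock of ELECTRA's configure_finetuning.FinetuningConfig:
    # update() rejects any key that is not already an attribute.
    def __init__(self):
        self.learning_rate = 5e-5
        self.max_seq_length = 128

    def update(self, kwargs):
        for k, v in kwargs.items():
            if k not in self.__dict__:
                raise ValueError("Unknown hparam " + k)
            self.__dict__[k] = v

hparams = json.loads('{"learning_rate": 2e-5, "output_dir": "./output"}')
config = FinetuningConfig()

# config.update(hparams) would raise ValueError: Unknown hparam output_dir.
# Keeping only keys the config already defines avoids the error:
known = {k: v for k, v in hparams.items() if k in config.__dict__}
config.update(known)
```

The cleaner fix is to restrict params_cmrc2018.json to the hyperparameter names ELECTRA's config actually defines, rather than reusing a Transformers-style config file.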

Not an issue with this code: my own implementation gets very poor results on Chinese datasets

Hello. I have recently been using transformers for Chinese QA and get essentially no results: both EM and F1 come out in the single digits. I already changed read_examples to use Chinese preprocessing, and I have seen other Chinese implementations on GitHub with the same problem. What could be the cause, and are there other key parts that need adjusting?
Comparing your code with the original BERT code, besides the data processing the other difference is input_span_mask, but that alone should not have such a large effect.

Evaluate error (issues from old repository)

@chiangyulun0914
https://github.com/ymcui/CMRC2018-DRCD-BERT/issues/1#issue-483718582

When I use the cmrc2018_evaluate.py to get EM/F1 for the DRCD_dev.json, I got this:
[screenshot of the error message]

Is there any solution for that?

Here is my evaluate.sh:

#!/bin/bash

#### local path
DRCD_DIR=raw_data/
EVALUATE_DIR=BERT/bert/
PREDICT_RESULT=BERT/experiment/chinese_L-12_H-768_A-12_S-512_B-2/model_ckpt

 
python $EVALUATE_DIR/cmrc2018_evaluate.py $DRCD_DIR/DRCD_dev.json $PREDICT_RESULT/predictions.json

__new__() missing 2 required positional arguments: 'start_index' and 'end_index'

I modified the code to run on the SQuAD 2.0 dataset and got this error:
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from /tf/NOC-QA/output_ch/model.ckpt-1166
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Processing example: 0
INFO:tensorflow:Processing example: 1000
INFO:tensorflow:Processing example: 2000
INFO:tensorflow:Processing example: 3000
INFO:tensorflow:Processing example: 4000
INFO:tensorflow:Processing example: 5000
INFO:tensorflow:prediction_loop marked as finished
INFO:tensorflow:prediction_loop marked as finished
INFO:tensorflow:Writing predictions to: /tf/NOC-QA/output_ch/dev_predictions.json
INFO:tensorflow:Writing nbest to: /tf/NOC-QA/output_ch/dev_nbest_predictions.json
Traceback (most recent call last):
File "/tf/NOC-QA/baseline/run_cmrc2018_drcd_baseline.py", line 1448, in
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/tf/NOC-QA/baseline/run_cmrc2018_drcd_baseline.py", line 1377, in main
output_nbest_file, output_null_log_odds_file)
File "/tf/NOC-QA/baseline/run_cmrc2018_drcd_baseline.py", line 962, in write_predictions
end_logit=null_end_logit)) # In very rare edge cases we could have no valid predictions. So we
TypeError: __new__() missing 2 required positional arguments: 'start_index' and 'end_index'
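The traceback points at a namedtuple: in the SQuAD 2.0 variant, the nbest entry type gains start_index and end_index fields, so the fallback null ("no answer") prediction must supply them as well. A hedged sketch (field names mirror the error message; the tuple definition is illustrative, not the baseline's exact code):

```python
import collections

# Illustrative nbest entry; the SQuAD 2.0 branch adds start_index/end_index
# to the fields that the span predictions already carry.
_NbestPrediction = collections.namedtuple(
    "NbestPrediction",
    ["text", "start_logit", "end_logit", "start_index", "end_index"])

# The null prediction must now fill every field, e.g. -1 for "no span",
# instead of passing only text/start_logit/end_logit as the original code does:
null_entry = _NbestPrediction(
    text="", start_logit=0.0, end_logit=0.0, start_index=-1, end_index=-1)
```

Concretely, the write_predictions call that appends the null prediction needs start_index and end_index arguments added to match the widened namedtuple.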
