ymcui / cmrc2018

A Span-Extraction Dataset for Chinese Machine Reading Comprehension (CMRC 2018)

Home Page: https://ymcui.github.io/cmrc2018/

License: Creative Commons Attribution Share Alike 4.0 International

Python 100.00%
reading-comprehension question-answering natural-language-processing bert

cmrc2018's Introduction


This repository contains the data for The Second Evaluation Workshop on Chinese Machine Reading Comprehension (CMRC 2018). Our paper was presented at EMNLP-IJCNLP 2019.

Title: A Span-Extraction Dataset for Chinese Machine Reading Comprehension
Authors: Yiming Cui, Ting Liu, Wanxiang Che, Li Xiao, Zhipeng Chen, Wentao Ma, Shijin Wang, Guoping Hu
Link: https://www.aclweb.org/anthology/D19-1600/
Venue: EMNLP-IJCNLP 2019

Open Challenge Leaderboard (New!)

Keep track of the latest state-of-the-art systems on the CMRC 2018 dataset.
https://ymcui.github.io/cmrc2018/

CMRC 2018 Public Datasets

Please download CMRC 2018 public datasets via the following CodaLab Worksheet.
https://worksheets.codalab.org/worksheets/0x92a80d2fab4b4f79a2b4064f7ddca9ce

Submission Guidelines

If you would like to test your model on the hidden test and challenge sets, please follow the instructions on how to submit your model via the CodaLab worksheet.
https://worksheets.codalab.org/worksheets/0x96f61ee5e9914aee8b54bd11e66ec647/

**Note that the test set on CLUE is NOT the complete test set. If you wish to evaluate your model OFFICIALLY on CMRC 2018, you should follow the guidelines here.**

Quick Load Through 🤗datasets

You can also access this dataset as part of the HuggingFace datasets library as follows:

!pip install datasets
from datasets import load_dataset
dataset = load_dataset('cmrc2018')

More details on the options and usage of this library can be found in the nlp repository at https://github.com/huggingface/nlp

Reference

If you wish to use our data in your research, please cite:

@inproceedings{cui-emnlp2019-cmrc2018,
    title = "A Span-Extraction Dataset for {C}hinese Machine Reading Comprehension",
    author = "Cui, Yiming  and
      Liu, Ting  and
      Che, Wanxiang  and
      Xiao, Li  and
      Chen, Zhipeng  and
      Ma, Wentao  and
      Wang, Shijin  and
      Hu, Guoping",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1600",
    doi = "10.18653/v1/D19-1600",
    pages = "5886--5891",
}

International Standard Language Resource Number (ISLRN)

ISLRN: 013-662-947-043-2

http://www.islrn.org/resources/resources_info/7952/

Official HFL WeChat Account

Follow Joint Laboratory of HIT and iFLYTEK Research (HFL) on WeChat.


Contact us

Please submit an issue.

cmrc2018's People

Contributors

mozarturing, ymcui

cmrc2018's Issues

CMRC 2018 public datasets: development set and test set

Which files correspond to the CMRC 2018 public datasets (training and development sets)?
Benchmarks on the web report metrics for a development set and a test set, while the released files are cmrc2018_dev.json and cmrc2018_trial.json. What is the correspondence between them? Please advise.

cmrc2018_train.json is clearly the training set.

Dataset format

Hello. After clicking download on the download page, the resulting page shows the JSON file as seemingly garbled text: the keys display normally, but the values are not decoded, as below:

[{"context_id": "DEV_0", "context_text": "\u300a\u6218\u56fd\u65e0\u53cc3\u300b\uff08\uff09\u662f\u7531\u5149\u8363\u548c\u03c9-force\u5f00\u53d1\u7684\u6218\u56fd\u65e0\u53cc\u7cfb\u5217\u7684\u6b63\u7edf\u7b2c\u4e09\u7eed\u4f5c\u3002\u672c\u4f5c\u4ee5\u4e09\u5927\u6545\u4e8b\u4e3a\u4e3b\u8f74\uff0c\u5206\u522b\u662f\u4ee5\u6b66\u7530\u4fe1\u7384\u7b49\u4eba\u4e3a\u4e3b\u7684\u300a\u5173\u4e1c\u4e09\u56fd\u5fd7\u300b\uff0c\u7ec7\u7530\u4fe1\u957f\u7b49\u4eba\u4e3a\u4e3b\u7684\u300a\u6218\u56fd\u4e09\u6770\u300b\uff0c\u77f3\u7530\u4e09\u6210\u7b49\u4eba\u4e3a\u4e3b\u7684\u300a\u5173\u539f\u7684\u5e74\u8f7b\u6b66\u8005\u300b\uff0c\u4e30\u5bcc\u6e38\u620f\u5185\u7684\u5267\u60c5\u3002\u6b64\u90e8\u4efd\u4e13\u95e8\u4ecb\u7ecd\u89d2\u8272\uff0c\u6b32\u77e5\u6b66\u5668\u60c5\u62a5\u3001\u5965\u4e49\u5b57\u6216\u64c5\u957f\u653b\u51fb\u7c7b\u578b\u7b49\uff0c\u8bf7\u81f3\u6218\u56fd\u65e0\u53cc\u7cfb\u52171.\u7531\u4e8e\u4e61\u91cc\u5927\u8f85\u5148\u751f\u56e0\u6545\u53bb\u4e16\uff0c\u4e0d\u5f97\u4e0d\u5bfb\u627e\u5176\u4ed6\u58f0\u4f18\u63a5\u624b\u3002\u4ece\u731b\u5c06\u4f20 and Z\u5f00\u59cb\u30022.\u6218\u56fd\u65e0\u53cc \u7f16\u5e74\u53f2\u7684\u539f\u521b\u7537\u5973\u4e3b\u89d2\u4ea6\u6709\u4e13\u5c5e\u58f0\u4f18\u3002\u6b64\u6a21\u5f0f\u662f\u4efb\u5929\u5802\u6e38\u620f\u8c1c\u4e4b\u6751\u96e8\u57ce\u6539\u7f16\u7684\u65b0\u589e\u6a21\u5f0f\u3002\u672c\u4f5c\u4e2d\u5171\u670920\u5f20\u6218\u573a\u5730\u56fe\uff08\u4e0d\u542b\u6751\u96e8\u57ce\uff09\uff0c\u540e\u6765\u53d1\u884c\u7684\u731b\u5c06\u4f20\u518d\u65b0\u589e3\u5f20\u6218\u573a\u5730\u56fe\u3002\u4f46\u6e38\u620f\u5185\u6218\u5f79\u6570\u91cf\u7e41\u591a\uff0c\u90e8\u5206\u5730\u56fe\u4f1a\u6709\u517c\u7528\u7684\u72b6\u51b5\uff0c\u6218\u5f79\u865a\u5b9e\u5219\u662f\u4ee5\u5149\u8363\u53d1\u884c\u76842\u672c\u300c\u6218\u56fd\u65e0\u53cc3 
\u4eba\u7269\u771f\u4e66\u300d\u5185\u5bb9\u4e3a\u4e3b\uff0c\u4ee5\u4e0b\u662f\u76f8\u5173\u4ecb\u7ecd\u3002\uff08\u6ce8\uff1a\u524d\u65b9\u52a0\u2606\u8005\u4e3a\u731b\u5c06\u4f20\u65b0\u589e\u5173\u5361\u53ca\u5730\u56fe\u3002\uff09\u5408\u5e76\u672c\u7bc7\u548c\u731b\u5c06\u4f20\u7684\u5185\u5bb9\uff0c\u6751\u96e8\u57ce\u6a21\u5f0f\u5254\u9664\uff0c\u6218\u56fd\u53f2\u6a21\u5f0f\u53ef\u76f4\u63a5\u6e38\u73a9\u3002\u4e3b\u6253\u4e24\u5927\u6a21\u5f0f\u300c\u6218\u53f2\u6f14\u6b66\u300d&\u300c\u4e89\u9738\u6f14\u6b66\u300d\u3002\u7cfb\u5217\u4f5c\u54c1\u5916\u4f20\u4f5c\u54c1", "qas": [{"query_text": "\u300a\u6218\u56fd\u65e0\u53cc3\u300b\u662f\u7531\u54ea\u4e24\u4e2a\u516c\u53f8\u5408\u4f5c\u5f00\u53d1\u7684\uff1f", "query_id": "DEV_0_QUERY_0", "answers": ["\u5149\u8363\u548c\u03c9-force", "\u5149\u8363\u548c\u03c9-force", "\u5149\u8363\u548c\u03c9-force"]}, 

How should I correctly download the file in JSON format?
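The pasted content is actually valid JSON: non-ASCII characters are merely stored as \uXXXX escapes, which any JSON parser decodes back to Chinese. A minimal sketch (the sample string below is a fragment of the snippet above):

```python
import json

# The file content is valid JSON; Chinese characters are stored as
# \uXXXX escapes, and json.loads/json.load decodes them transparently.
raw = '{"context_id": "DEV_0", "answers": ["\\u5149\\u8363\\u548c\\u03c9-force"]}'
data = json.loads(raw)
print(data["answers"][0])  # 光荣和ω-force

# To re-save the file with readable Chinese instead of escapes:
readable = json.dumps(data, ensure_ascii=False, indent=2)
```

So no special download step is needed; loading the file with `json.load` (or re-dumping it with `ensure_ascii=False`) yields readable Chinese text.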

Error reading bert_config

Hello, with TensorFlow 1.1 and Python 3.7, what causes this error?
tensorflow.python.framework.errors_impl.NotFoundError: NewRandomAccessFile failed to Create/Open: ./BERT_BASE_DIR/chinese_L-12_H-768_A-12/bert_config.json : [garbled GBK-encoded Windows message, likely "The system cannot find the path specified"]
; No such process

Hello, why does the calc_f1_score function return max(f1_scores)? Could you explain this part? If it returns the maximum, wouldn't the result always just be 1? I don't follow this.

def calc_f1_score(answers, prediction):
    f1_scores = []
    for ans in answers:
        ans_segs = mixed_segmentation(ans, rm_punc=True)
        prediction_segs = mixed_segmentation(prediction, rm_punc=True)
        lcs, lcs_len = find_lcs(ans_segs, prediction_segs)
        if ans == "" and prediction == "":
            f1_scores.append(1)
        else:
            if lcs_len == 0:
                f1_scores.append(0)
                continue
            precision = 1.0 * lcs_len / len(prediction_segs)
            recall = 1.0 * lcs_len / len(ans_segs)
            f1 = (2 * precision * recall) / (precision + recall)
            f1_scores.append(f1)
    return max(f1_scores)
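To see why max() does not trivially return 1: the answers list holds several independent human annotations of the same question, and the prediction is scored against whichever annotation it matches best. A simplified, self-contained sketch of this logic (character-level longest common subsequence instead of the script's mixed segmentation; not the evaluation script's exact code):

```python
def find_lcs(s1, s2):
    # Length of the longest common subsequence, via dynamic programming.
    m = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i, c1 in enumerate(s1):
        for j, c2 in enumerate(s2):
            m[i + 1][j + 1] = m[i][j] + 1 if c1 == c2 else max(m[i + 1][j], m[i][j + 1])
    return m[len(s1)][len(s2)]

def calc_f1(answers, prediction):
    scores = []
    for ans in answers:
        lcs_len = find_lcs(ans, prediction)
        if lcs_len == 0:
            scores.append(0.0)
            continue
        precision = lcs_len / len(prediction)
        recall = lcs_len / len(ans)
        scores.append(2 * precision * recall / (precision + recall))
    # max(): score against the best-matching gold annotation. With several
    # human references, the prediction only needs to match one of them well;
    # the result is 1 only when it matches some reference exactly.
    return max(scores)
```

If the annotations happen to be identical, max() simply equals that single score; it is not automatically 1, because the prediction can still differ from all of them.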

Hello. The paper says train and dev contain 10,321 and 3,351 questions, but the files on GitHub contain 10,142 and 3,219. Also, a small part of the SQuAD-style data has incorrect answer_start values.

The paper reports 10,321 train questions and 3,351 dev questions,
but the files on GitHub contain 10,142 and 3,219 (the Hugging Face copy also has 10,142 and 3,219). May I ask why?

In addition, in the SQuAD-style data such as ./squad-style-data/cmrc2018_train.json, a small number of examples have an answer_start that does not match the answer text.
For example, for the question TRAIN_3678_QUERY_4, the span at answer_start in the context is "总统袁世凯将", while the annotated text is "大总统袁世凯".

Thanks in advance.
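Mismatches like the one above can be detected mechanically: in SQuAD-style data, `context[answer_start : answer_start + len(text)]` should reproduce the annotated answer text. A hedged sketch assuming the standard SQuAD field layout (data → paragraphs → qas → answers); the sample context below is made up to mirror the reported case:

```python
def find_misaligned(squad_data):
    # Return ids of QA pairs whose answer_start does not point at the answer text.
    bad = []
    for article in squad_data["data"]:
        for para in article["paragraphs"]:
            context = para["context"]
            for qa in para["qas"]:
                for ans in qa["answers"]:
                    start, text = ans["answer_start"], ans["text"]
                    if context[start:start + len(text)] != text:
                        bad.append(qa["id"])
    return bad

sample = {"data": [{"paragraphs": [{
    "context": "大总统袁世凯将于十月十日就职。",
    "qas": [
        {"id": "Q_OK", "answers": [{"answer_start": 0, "text": "大总统袁世凯"}]},
        # Off-by-one start: the span read from the context is "总统袁世凯将".
        {"id": "Q_BAD", "answers": [{"answer_start": 1, "text": "大总统袁世凯"}]},
    ]}]}]}
print(find_misaligned(sample))  # ['Q_BAD']
```

Running such a check over ./squad-style-data/cmrc2018_train.json would enumerate every misaligned example, not just the one found by hand.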

Z-Reader

What model is the Z-Reader in the leaderboard table? Could you share a link to its paper?

ValueError: Unknown hparam output_dir

When running the program, I entered

python run_finetuning.py --data-dir=/home/pc/work/Chinese-ELECTRA-master/data-dir --model-name ELECTRA-180g-small, Chinese --hparams params_cmrc2018.json

File "/home/pc/work/Chinese-ELECTRA-master/configure_finetuning.py", line 175, in update
raise ValueError("Unknown hparam " + k)
ValueError: Unknown hparam output_dir

The params_cmrc2018.json is as follows:

{
"model_name_or_path": "ELECTRA-180g-small, Chinese ",
"output_dir": "./output",
"train_file": "cmrc2018_train.json",
"predict_file": "cmrc2018_dev.json",
"max_seq_length": 512,
"doc_stride": 128,
"max_query_length": 64,
"per_gpu_train_batch_size": 8,
"per_gpu_eval_batch_size": 8,
"learning_rate": 2e-5,
"num_train_epochs": 3,
"logging_steps": 100,
"save_steps": 1000,
"warmup_steps": 1000,
"weight_decay": 0.01,
"adam_epsilon": 1e-6,
"max_grad_norm": 1.0,
"gradient_accumulation_steps": 1,
"n_best_size": 20,
"max_answer_length": 30,
"do_train": true,
"do_eval": true,
"evaluate_during_training": true,
"overwrite_output_dir": true,
"seed": 42
}

I wonder if anyone can help me solve the problem of ValueError, thanks a lot!
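The error suggests mixed configuration styles: keys such as output_dir and per_gpu_train_batch_size follow the HF Transformers run_squad convention, while ELECTRA's configure_finetuning only accepts hyperparameters already defined on its config object and raises ValueError on anything else. A minimal stand-in showing the failure mode and one workaround (FinetuningConfig here is a hypothetical mock, not ELECTRA's real class):

```python
import json

class FinetuningConfig:
    # Hypothetical mock of ELECTRA's configure_finetuning.FinetuningConfig:
    # update() rejects any key that is not already an attribute.
    def __init__(self):
        self.learning_rate = 5e-5
        self.max_seq_length = 128

    def update(self, kwargs):
        for k, v in kwargs.items():
            if k not in self.__dict__:
                raise ValueError("Unknown hparam " + k)
            self.__dict__[k] = v

hparams = json.loads('{"learning_rate": 2e-5, "output_dir": "./output"}')
config = FinetuningConfig()

# config.update(hparams) would raise ValueError: Unknown hparam output_dir.
# Keeping only keys the config already defines avoids the error:
known = {k: v for k, v in hparams.items() if k in config.__dict__}
config.update(known)
```

The cleaner fix is to restrict params_cmrc2018.json to the hyperparameter names ELECTRA's config actually defines, rather than reusing a Transformers-style config file.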

Not an issue with this code: my own implementation gets very poor results on Chinese datasets

Hello. I have recently been using transformers for Chinese QA and get essentially no results: both EM and F1 come out in the single digits. I already changed read_examples to use Chinese preprocessing, and I have seen other Chinese implementations on GitHub with the same problem. What could be the cause, and are there other key parts that need adjusting?
Comparing your code with the original BERT code, besides the data processing the other difference is input_span_mask, but that alone should not have such a large effect.

Evaluate error (issues from old repository)

@chiangyulun0914
https://github.com/ymcui/CMRC2018-DRCD-BERT/issues/1#issue-483718582

When I use the cmrc2018_evaluate.py to get EM/F1 for the DRCD_dev.json, I got this:
[screenshot of the error message]

Is there any solution for that?

Here is my evaluate.sh:

#!/bin/bash

#### local path
DRCD_DIR=raw_data/
EVALUATE_DIR=BERT/bert/
PREDICT_RESULT=BERT/experiment/chinese_L-12_H-768_A-12_S-512_B-2/model_ckpt

 
python $EVALUATE_DIR/cmrc2018_evaluate.py $DRCD_DIR/DRCD_dev.json $PREDICT_RESULT/predictions.json

__new__() missing 2 required positional arguments: 'start_index' and 'end_index'

I modified the code to run on the SQuAD 2.0 dataset and got this error:
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from /tf/NOC-QA/output_ch/model.ckpt-1166
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Processing example: 0
INFO:tensorflow:Processing example: 1000
INFO:tensorflow:Processing example: 2000
INFO:tensorflow:Processing example: 3000
INFO:tensorflow:Processing example: 4000
INFO:tensorflow:Processing example: 5000
INFO:tensorflow:prediction_loop marked as finished
INFO:tensorflow:prediction_loop marked as finished
INFO:tensorflow:Writing predictions to: /tf/NOC-QA/output_ch/dev_predictions.json
INFO:tensorflow:Writing nbest to: /tf/NOC-QA/output_ch/dev_nbest_predictions.json
Traceback (most recent call last):
File "/tf/NOC-QA/baseline/run_cmrc2018_drcd_baseline.py", line 1448, in
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/tf/NOC-QA/baseline/run_cmrc2018_drcd_baseline.py", line 1377, in main
output_nbest_file, output_null_log_odds_file)
File "/tf/NOC-QA/baseline/run_cmrc2018_drcd_baseline.py", line 962, in write_predictions
end_logit=null_end_logit)) # In very rare edge cases we could have no valid predictions. So we
TypeError: __new__() missing 2 required positional arguments: 'start_index' and 'end_index'
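The traceback points at a namedtuple: in the SQuAD 2.0 variant, the nbest entry type gains start_index and end_index fields, so the fallback null ("no answer") prediction must supply them as well. A hedged sketch (field names mirror the error message; the tuple definition is illustrative, not the baseline's exact code):

```python
import collections

# Illustrative nbest entry; the SQuAD 2.0 branch adds start_index/end_index
# to the fields that the span predictions already carry.
_NbestPrediction = collections.namedtuple(
    "NbestPrediction",
    ["text", "start_logit", "end_logit", "start_index", "end_index"])

# The null prediction must now fill every field, e.g. -1 for "no span",
# instead of passing only text/start_logit/end_logit as the original code does:
null_entry = _NbestPrediction(
    text="", start_logit=0.0, end_logit=0.0, start_index=-1, end_index=-1)
```

Concretely, the write_predictions call that appends the null prediction needs start_index and end_index arguments added to match the widened namedtuple.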
