
dureader's Introduction

DuReader

DuReader focuses on benchmarks and models for machine reading comprehension (MRC) and question answering.

Datasets:

DuReader-vis: The first Chinese Open-domain Document Visual Question Answering (Open-Domain DocVQA) dataset. [Paper]

DuReader Retrieval: A large-scale Chinese dataset for passage retrieval. [Paper] [Code] [Leaderboard]

DuQM: Linguistically Perturbed Natural Questions for Evaluating the Robustness of Question Matching Models. [Paper] [Code] [Leaderboard]

DuReader Checklist: A dataset challenging model understanding capabilities in vocabulary, phrase, semantic role, and reasoning. [Code] [Leaderboard]

DuReader Yes/No: A dataset challenging models in opinion polarity judgment. [Code] [Leaderboard]

DuReader Robust: A dataset challenging models in (1) over-sensitivity, (2) over-stability and (3) generalization. [Paper] [Code] [Leaderboard]

DuReader 2.0: A large-scale, real-world, human-sourced MRC dataset. [Paper] [Code] [Leaderboard]

DuReader Robust, DuReader Yes/No, DuReader Checklist and DuQM can be downloaded from the qianyan official website. DuReader-vis can be downloaded by following the instructions in DuReader-vis/README.md in this repository. DuReader 2.0 can be downloaded by following the instructions in DuReader-2.0/README.md in this repository.

Models:

KT-NET: A machine reading comprehension (MRC) model which integrates knowledge from knowledge bases (KBs) into pre-trained contextualized representations. [Paper] [Code] [Leaderboard]

D-NET: A simple pre-training and fine-tuning framework focused on the generalization of machine reading comprehension (MRC) models. [Paper] [Code] [Leaderboard]

News

  • May 2022, DuReader-vis was accepted by ACL 2022 Findings.
  • March 2022, DuReader Retrieval was released, holding the Passage retrieval challenge.
  • September 2021, we released DuQM, a Chinese dataset of linguistically perturbed natural questions for evaluating the robustness of question matching models; it was also included in qianyan.
  • June 2021, DuReader Robust, DuReader Yes/No and DuReader Checklist were included in qianyan.
  • May 2021, DuReader Robust was accepted by ACL 2021.
  • March 2021, DuReader Checklist was released, holding the DuReader Checklist challenge.
  • March 2020, DuReader Robust was released, holding the DuReader Robust challenge.
  • December 2019, DuReader Yes/No was released, holding the DuReader Yes/No challenge. After that, DuReader Yes/No Individual Challenge and Team Challenge were held.
  • August 2019, D-NET was released and ranked first in the MRQA-2019 shared task.
  • July 2019, KT-NET was accepted by ACL 2019.
  • March 2019, the second MRC challenge was held based on DuReader 2.0, including hard samples in the test set.
  • April 2018, DuReader 2.0 was accepted by ACL 2018 at the Workshop on Machine Reading for Question Answering.
  • March 2018, the first MRC challenge was held based on DuReader 2.0.

Detailed Description

DuReader contains five datasets: DuReader 2.0, DuReader Robust, DuReader Yes/No, DuReader Checklist and DuReader-vis. The main features of these datasets include:

  • Real question, Real article, Real answer, Real application scenario;
  • Rich question types, including entity, number, opinion, etc;
  • Various task types, including span-based tasks and classification tasks;
  • Rich task challenges, including model retrieval capability, model robustness, model checklist etc.

DuReader 2.0: Real question, Real article, Real answer

[Paper] [Code] [Leaderboard]

DuReader 2.0 is a large-scale, real-world, human-sourced Chinese MRC dataset. It focuses on real-world open-domain question answering. Its advantages over existing datasets are as follows: Real question, Real article, Real answer, Real application scenario and Rich annotation.

KT-NET: Integrate knowledge into pre-trained LMs.

[Paper] [Code] [Leaderboard]

KT-NET (Knowledge and Text fusion NET) is a machine reading comprehension (MRC) model which integrates knowledge from knowledge bases (KBs) into pre-trained contextualized representations. The model was proposed in the ACL 2019 paper Enhancing Pre-Trained Language Representations with Rich Knowledge for Machine Reading Comprehension.
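As a rough, hypothetical illustration of this idea (not the released KT-NET code), the sketch below fuses a pre-trained contextual token vector with retrieved KB concept embeddings through an attention-weighted sum; all names and dimensions are made up for the example:

import numpy as np

def fuse_with_kb(token_repr, concept_embs):
    """Fuse one contextual token vector with candidate KB concept embeddings.

    token_repr:   shape (dim,), contextual representation of a token
    concept_embs: shape (num_concepts, dim), embeddings of retrieved KB concepts
    """
    # Attention scores between the token and each candidate concept
    # (this toy example assumes the two spaces share the same dimension).
    scores = concept_embs @ token_repr                # (num_concepts,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # softmax over concepts
    attended = weights @ concept_embs                 # (dim,) attended knowledge vector
    # Knowledge-enriched representation: concatenate the text and KB views.
    return np.concatenate([token_repr, attended])

rng = np.random.default_rng(0)
fused = fuse_with_kb(rng.normal(size=128), rng.normal(size=(5, 128)))
print(fused.shape)  # (256,)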

D-NET: Model generalization

[Paper] [Code] [Leaderboard]

D-NET is the system Baidu submitted to the MRQA (Machine Reading for Question Answering) 2019 Shared Task, which focused on the generalization of machine reading comprehension (MRC) models. The system is built on a pre-training and fine-tuning framework, exploring pre-trained language models and multi-task learning to improve the generalization of MRC models. D-NET ranked first among all participants in terms of averaged F1 score.

DuReader Robust: Model Robustness

[Paper] [Code] [Leaderboard]

DuReader Robust is designed to challenge MRC models on the following aspects: (1) over-sensitivity, (2) over-stability and (3) generalization. In addition, DuReader Robust has another advantage over previous datasets: its questions and documents come from Baidu Search, so it exposes the robustness issues that arise when MRC models are applied in real-world scenarios.

DuReader Yes/No: Opinion Yes/No Questions

[Code] [Leaderboard]

Span-based MRC tasks adopt F1 and EM metrics to measure the difference between predicted answers and labeled answers. However, opinion polarity cannot be measured well by these metrics. DuReader Yes/No is proposed to challenge MRC models on opinion polarity, complementing existing MRC tasks and evaluating the effectiveness of existing models more reasonably.
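For reference, a minimal sketch of span-level EM and F1 in their standard token-overlap form (illustrative only, not the official evaluation scripts of these tasks):

from collections import Counter

def exact_match(pred_tokens, ref_tokens):
    # 1 if the predicted answer tokens equal the reference tokens exactly, else 0.
    return int(pred_tokens == ref_tokens)

def f1_score(pred_tokens, ref_tokens):
    # Token-level overlap between the predicted and the reference answer.
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match(['北京', '大学'], ['北京', '大学']))            # 1
print(round(f1_score(['北京', '很', '好'], ['北京', '好']), 2))   # 0.8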

DuReader Checklist: Natural Language Understanding Capabilities

[Code] [Leaderboard]

DuReader Checklist is a high-quality Chinese machine reading comprehension dataset for real application scenarios. It is designed to challenge natural language understanding capabilities from multiple aspects via systematic evaluation (i.e., a checklist), including understanding of vocabulary, phrases, semantic roles, reasoning and so on.

DuQM: Linguistically Perturbed Natural Questions for Evaluating the Robustness of Question Matching Models

[Paper] [Code] [Leaderboard]

DuQM is a Chinese question matching robustness dataset containing natural questions with linguistic perturbations, built to evaluate the robustness of question matching models. DuQM is designed to be fine-grained, diverse and natural, and it contains 3 categories and 13 subcategories with 32 types of linguistic perturbations.

DuReader Retrieval: A large-scale Chinese dataset for passage retrieval from a web search engine

[Paper] [Code] [Leaderboard]

DuReader Retrieval is a large-scale Chinese dataset for passage retrieval from a web search engine. The dataset contains more than 90K queries and over 8M unique passages from realistic data sources.
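As a generic illustration of how such a passage retrieval benchmark is typically scored (a sketch, not the official DuReader Retrieval evaluation tool), recall@k checks whether any gold passage appears in the top k retrieved results for each query:

def recall_at_k(run, qrels, k=50):
    """run:   {query_id: [passage_id, ...]} ranked passage ids per query
    qrels: {query_id: {relevant_passage_id, ...}} gold passages per query"""
    hits = 0
    for qid, ranked in run.items():
        gold = qrels.get(qid, set())
        if gold and gold & set(ranked[:k]):
            hits += 1
    return hits / len(run)

run = {'q1': ['p3', 'p7', 'p1'], 'q2': ['p9', 'p2']}
qrels = {'q1': {'p1'}, 'q2': {'p4'}}
print(recall_at_k(run, qrels, k=3))  # 0.5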

DuReader-vis: A Chinese Dataset for Open-domain Document Visual Question Answering

[Paper]

DuReader-vis is the first Chinese open-domain DocVQA dataset, built from a web search engine. The dataset contains more than 15K labeled question-document pairs and over 158K unique documents from realistic data sources.

Dataset and Evaluation Tools

We have released a dataset loading and evaluation tool named qianyan. You can use this package easily by following the instructions in the qianyan repo.

Copyright and License

Copyright 2017 Baidu.com, Inc. All Rights Reserved

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Contact Information

For help or issues using DuReader, including datasets and baselines, please submit a Github issue.

For other communication or cooperation, please contact Jing Liu ([email protected]) or Hongyu Li ([email protected]).


dureader's Issues

Test loss does not decrease in the training process

When I ran the BiDAF model (original code, without any modification) in the TensorFlow version, the training loss decreased to about 2. However, when running a dev pass after each epoch, the dev loss did not vary (always about 15), and the BLEU-4 score was only about 20-25 on the dev set.
I have also tried adding dropout and L2 regularization during training, and replacing the embedding matrix with a pre-trained embedding model, but none of these had much effect on the dev results; the dev loss still does not change after each epoch.
In addition, the dataset is the training and dev set of the full baidu_search dataset.
Is there anything special we can do to train the TensorFlow version and get a better result?

Only support for single GPU?

I used the argument --gpu 0,1,3,4, but it seems the code only uses one GPU. Does it support multiple GPUs?

Thanks.

Error when running the paddle training step with data from dureader_preprocessed.zip

I downloaded dureader_preprocessed.zip with data/download.sh and extracted the testset, trainset and devset JSON files, then truncated each JSON to its first 3000 lines with head -n 3000. The earlier steps ran without errors, but the train step printed a pile of SKIP messages.

After that, infer did not report any errors either, but when it finished, the infer directory under models was empty.

I previously tried running on the full, untruncated data; the train step also printed a pile of SKIP messages, but my machine could not keep up and crashed before training finished, so I never got to the subsequent infer step.

The answer indicated by answer_span does not match fake_answer in the data generated by DuReader/paddle/paragraph_extraction.py

After running the run.sh script under paddle, which calls paragraph_extraction to generate the new files, I found that the answer indicated by answer_span is not always identical to the one in fake_answer. In principle the two should be the same, right?

The following 3 examples are taken from the new file generated from demo/devset/search.dev.json. In most cases the two match, but in some cases they do not.

question id 181623 fake_answer 大众牛逼关键在于是个德国车企,**人普遍崇拜德国。
question id 181623 span_answer <splitter>我就想知道路上那么多大众怎么破,这不是事实么

question id 181625 fake_answer 结构不固定,形式多变,或正或反,或分或合,笔画或多或少,相当灵活,具有很大的随意性
question id 181625 span_answer 

question id 181611 fake_answer 1、将干海参用自来水直接冲洗1分钟,洗掉表面少许微尘。2、置于1-10度凉纯净水中24小时左右,中间换水2次直至将海参泡软。3、将泡软的海参从腹部纵向剖开,去掉海参前端牙状物和体内白筋。4、添纯净水上无油锅加盖煮沸,用中火煮15-25分钟。5、换新的凉纯净水,泡24小时左右,中间换水2次直至发泡到2倍左右长度。6、泡好后,即可食用。可把多余的单独零度以下冷冻,建议2周内用完。7、如有个别海参没有发大,属于正常现象,可重复4、5步骤
question id 181611 span_answer 凉水泡24小时直至海参变软。第二步清洁剪掉海参的沙嘴,切断筋,清洗干净。第三步;将海参放入无油的,装入凉水的干净锅内,大火煮开改用小火煮50至60分钟左右,将海参捞出,用海参掐海参侧壁肉,能掐透或者稍变软即可,如没有则继续煮。第四步;水发,将煮好的海参捞出来,自然凉透之后

The results above were printed with the following code.

import io
import json

dev_path = 'search.dev.json'  # path to the file generated by paragraph_extraction (placeholder)

with io.open(dev_path, 'r', encoding='utf-8') as fin:
    data_set = []
    for lidx, line in enumerate(fin):
        sample = json.loads(line.strip())
        if len(sample['answer_spans']) == 0:
            continue
        if len(sample['answer_docs']) == 0:
            continue
        if sample['answer_docs'][0] >= len(sample['documents']):
            continue

        print('fake answer', sample['fake_answers'][0])

        answer_doc_idx = sample['answer_passages'][0]
        start = sample['answer_spans'][0][0]
        end = sample['answer_spans'][0][1]
        print('span answer',
              ''.join(sample['documents'][answer_doc_idx]['segmented_paragraphs'][0][start: end + 1]))

About the code in match_layer.py

In the class MatchLSTMAttnCell, in the __call__ function, there is:

tf.expand_dims(tc.layers.fully_connected(ref_vector,
                                         num_outputs=self._num_units,
                                         activation_fn=None), 1)

Why does the dimension need to be expanded at axis 1?
Thanks
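For reference, one likely reason (my reading, not an official answer): the projected ref_vector has shape [batch, units], and expanding at axis 1 gives [batch, 1, units] so that it can broadcast against a per-position tensor of shape [batch, seq_len, units]. A NumPy sketch of the same broadcasting, with made-up shapes:

import numpy as np

batch, seq_len, units = 2, 4, 3
per_position = np.ones((batch, seq_len, units))   # e.g. a projection of every encoder position
ref = np.ones((batch, units))                     # projected reference vector

expanded = ref[:, None, :]                        # like tf.expand_dims(..., 1) -> (2, 1, 3)
print((per_position + expanded).shape)            # (2, 4, 3): ref broadcasts over seq_len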

paddle infer mode KeyError: 'answer_docs'

bash run_demo.sh test_bidaf_demo bidaf infer --testset ../data/preprocessed/testset/search.test.json
The TensorFlow version has no problem and runs through. For the PaddlePaddle version, the training and dev sets have ground truth, so "answer_docs" exists; the test set has no ground truth, so "answer_docs" does not exist. However, dataset.py in the paddle version does parse the JSON and requires the "answer_docs" field. Do I need to further preprocess the data, or should I modify this part of the code at test time?

Performance decline when using pretrained embedding.

I used a pre-trained embedding (fastText zh-300-vec) instead of the randomly initialized embedding.
The loss was lower during evaluation, but performance declined at inference time.
Is this an OOV issue or some other problem?

The implementation of BiDAF is not exactly same as the one in the original paper

For the calculation of the similarity matrix,

sim_matrix = tf.matmul(passage_encodes, question_encodes, transpose_b=True)

This is a simple dot product. But the original paper uses a trainable similarity function,

α(h, u) = w^T [h; u; h ∘ u]

where h is the passage representation and u is the question representation.
This may not be a big problem, but it may still cause some difference.
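For comparison, a small NumPy sketch of the trainable similarity from the BiDAF paper, S[t, j] = w^T [h_t; u_j; h_t * u_j], next to the plain dot product used in the repo (shapes and names here are illustrative):

import numpy as np

rng = np.random.default_rng(0)
p_len, q_len, d = 4, 3, 6
H = rng.normal(size=(p_len, d))     # passage representations h_t
U = rng.normal(size=(q_len, d))     # question representations u_j
w = rng.normal(size=3 * d)          # trainable weight vector of the paper's similarity

# Dot-product similarity, as in the repo's implementation.
S_dot = H @ U.T                                                   # (p_len, q_len)

# BiDAF paper similarity: S[t, j] = w^T [h_t; u_j; h_t * u_j].
S_paper = np.empty((p_len, q_len))
for t in range(p_len):
    for j in range(q_len):
        S_paper[t, j] = w @ np.concatenate([H[t], U[j], H[t] * U[j]])

print(S_dot.shape, S_paper.shape)   # (4, 3) (4, 3)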

NoneType error when running `sh run.sh --para_extraction`

When I run the command sh run.sh --para_extraction, the following error occurs:

Start paragraph extraction, this may take a few hours
Source dir: ../data/preprocessed
Target dir: ../data/extracted
Processing trainset
Processing devset
Processing testset
Traceback (most recent call last):
  File "paragraph_extraction.py", line 197, in <module>
    paragraph_selection(sample, mode)
  File "paragraph_extraction.py", line 111, in paragraph_selection
    status = dup_remove(doc)
  File "paragraph_extraction.py", line 66, in dup_remove
    if p_idx < para_id:
TypeError: '<' not supported between instances of 'int' and 'NoneType'

So I opened pull request #45 to fix this bug.

Error for preprocessing data

In the README, you said:

To preprocess the raw data, you should first segment 'question', 'title', 'paragraphs' and then store the segemented result into 'segmented_question', 'segmented_title', 'segmented_paragraphs' like the downloaded preprocessed data

Actually, the 'answers' should be segmented into 'segmented_answers' too; please note that.

Also, how long should preprocessing the data take? I've been running the Python file for over 12 hours...

About Match-LSTM code

Hi,
When I read the code in ./tensorflow/nn_layers/match_layer.py, class MatchLSTMAttnCell,
I found that you concatenate H^p, H^qα, H^p - H^qα and H^p * H^qα as the input of the Match-LSTM, but the model proposed by Wang only concatenates H^p and H^qα as the new input of the Match-LSTM.
Could you tell me why you made this change?

Unresolved reference

Unresolved references brc_eval and find_answer in utils/baseline_eval.py, line 27.

about tf.reduce_max function

DuReader/tensorflow/layers/match_layer.py, line 94:
b = tf.nn.softmax(tf.expand_dims(tf.reduce_max(sim_matrix, 2), 1), -1)
I think tf.reduce_max(t, 2) is not valid here; the axis should be 0 or 1.
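For reference, if sim_matrix here is a batched 3-D tensor of shape [batch, p_len, q_len], then axis 2 is a valid reduction over the question dimension; a NumPy sketch under that assumption:

import numpy as np

sim_matrix = np.arange(2 * 4 * 3).reshape(2, 4, 3)   # assumed shape: [batch, p_len, q_len]
max_over_q = sim_matrix.max(axis=2)                  # analogous to tf.reduce_max(sim_matrix, 2)
print(max_over_q.shape)                              # (2, 4): one score per passage position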

Question about batch_size

Question about batch_size:
The run.py file has a batch_size argument, and I set it to 64.
In the _train_epoch() method, I inserted the code below to print the runtime shapes:

def _train_epoch(self, train_batches, dropout_keep_prob):
    # some code.......
    for bitx, batch in enumerate(train_batches, 1):
        feed_dict = {self.p: batch['passage_token_ids'],
                     self.q: batch['question_token_ids'],
                     self.p_length: batch['passage_length'],
                     self.q_length: batch['question_length'],
                     self.start_label: batch['start_id'],
                     self.end_label: batch['end_id'],
                     self.dropout_keep_prob: dropout_keep_prob}
        # inserted code
        p_shape, q_shape, sl_shape = self.sess.run([self.p, self.q, self.start_label], feed_dict)
        self.logger.info('p_shape {}\nq_shape {}\nsl_shape {}'.format(p_shape.shape, q_shape.shape, sl_shape.shape))

The output is:

p_shape (320, 500)
q_shape (320, 12)
sl_shape (64,)

The first dimensions of p, q and start_label differ, and the first dimension of p and q is a fixed 320 regardless of the batch_size set in run.py.
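One plausible reading for the run shown (a guess based on the default max_p_num=5, not a confirmed explanation): each question contributes max_p_num passages and the passages are flattened into the batch, so the passage tensors get a first dimension of batch_size * max_p_num while start_label stays per question:

batch_size = 64
max_p_num = 5                     # default in the baseline arguments (assumed here)
print(batch_size * max_p_num)     # 320, matching p_shape and q_shape above
print(batch_size)                 # 64, matching sl_shape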

Found an error

For the question with id 181573, the paragraphs at bs_rank_pos=2 and bs_rank_pos=40 are almost identical, yet one has select=true and the other select=false.

Of course, I am not sure how common this kind of error is; just reporting it.

About implementation details

In the paper, S is computed with the trainable similarity function α(h, u) = w^T [h; u; h ∘ u], but this implementation just multiplies p and q directly. Why is that? Also, the Modeling Layer in the paper is a two-layer bidirectional LSTM; why does this implementation use only one layer?

What are the baseline performance numbers for the TensorFlow version?

The PaddlePaddle baseline reports Dev ROUGE-L and Test ROUGE-L on the DuReader 2.0 dataset, but the TensorFlow baseline does not. Could you please provide the TensorFlow baseline's results on the dev and test sets? Also, can those numbers be reached with the default parameters of the TensorFlow version, or are there other hyperparameters that have not been provided? Could you publish the corresponding parameters? Thanks!

Five questions arose when I ran "sh run.sh --train --pass_num 5 --use_gpu=False".

1) ParallelExecutor is deprecated. Please use CompiledProgram and Executor. CompiledProgram is a central place for optimization and Executor is the unified executor. An example can be found in compiler.py.

2) [3548 graph.h:204] WARN: After a series of passes, the current graph can be quite different from OriginProgram. So, please avoid using the 'OriginProgram()' method!

3) You can try our memory optimize feature to save your memory usage: ...

4) The number of graphs should be only one, but the current graph has 8 sub_graphs. If you want to see the nodes of the sub_graphs, you should use 'FLAGS_print_sub_graph_dir' to specify the output dir. NOTES: if you are not doing training, please don't pass loss_var_name.

5) Traceback (most recent call last):
  File "run.py", line 645, in <module>
    train(logger, args)
  File "run.py", line 464, in train
    args)
  File "run.py", line 308, in validation
    ave_loss = 1.0 * total_loss / count
ZeroDivisionError: float division by zero

About the entity answers

The test results of the baseline model all output an empty list for entity_answers. So, for the evaluation of entity questions, which answer is used: entity_answers or answers?

Problems running paddle with the demo data

Running the paddle infer step requires the data under preprocessed/testset.

However, the GitHub repository does not include the preprocessed data corresponding to the demo. When I followed the commands in the 'Preprocess the Data' section to generate the preprocessed testset data, it failed (trainset and devset succeeded; only testset failed; on inspection, search.test.json indeed has no segmented_answers key).

As a result, the paddle infer step cannot be run with the demo data, and after it finishes the infer directory under models is empty.

Is a CNN used?

Does this TensorFlow version use a CNN for character embeddings?

KeyError: 'segmented_paragraphs'

mldl@mldlUB1604:~/ub16_prj/DuReader$ cat data/raw/trainset/search.train.json | python3 utils/preprocess.py > data/preprocessed/trainset/search.train.json
Traceback (most recent call last):
  File "utils/preprocess.py", line 217, in <module>
    find_fake_answer(sample)
  File "utils/preprocess.py", line 158, in find_fake_answer
    for p_idx, para_tokens in enumerate(doc['segmented_paragraphs']):
KeyError: 'segmented_paragraphs'
mldl@mldlUB1604:~/ub16_prj/DuReader$

About the preprocessed data

I see that in the preprocessed data there are several answers in the "answers" field, but only one element in "answer_docs" and "answer_spans". Which answer should I choose? I also want to know whether the element in "answer_docs" means the index of the selected passage. Thank you.

what should I do if I want to use my data?

I want to know what the various keys of the JSON dataset represent, for example 'is_selected', 'answer_spans', and 'match_scores'. And I see that these keys are not present in the raw data.

Incorrect file path in line 126 of README.md

Seems a small change should be made in README.md:

126: python run.py --predict --algo BIDAF --test_files ../data/demo/search.dev.json 

=>

126: python run.py --predict --algo BIDAF --test_files ../data/demo/devset/search.dev.json 

Questions about the dataset format and its usage

Looking at the first few lines of the downloaded raw and preprocessed data, I roughly summarized the following fields:
RAW
{
question,question_type?,fact_or_option?,question_id,
documents[{title,bs_rank_pos?,is_selected?,paragraphs[]}],
entity_answers[],
answers[]
}

PREPROCESSED
{
question,question_type,fact_or_option,question_id,
documents[{title,bs_rank_pos,is_selected,paragraphs[],+most_related_para?,+segmented_title[title tokenized],+segmented_paragraphs[paragraphs tokenized]}],
answers[],
+answer_spans[],?
+fake_answers[],?
+segmented_answers[answers tokenized],
+answer_docs[],?
+segmented_question[question tokenized],
+match_scores[]?,
+yesno_type?
}

I have a few questions:

  1. What do the fields marked with ? mean? Are they REQUIRED or OPTIONAL? What effect do different values, or omitting them, roughly have on the results? Fields generated by preprocess.py can be skipped without explanation.
  2. I have looked through the closed issues; bs_rank_pos in the raw data is the search ranking. Does a larger value mean more preferred, or a smaller value?
  3. Both answers and entity_answers in the raw data seem to be answers. What is the difference between them?
  4. I plan to use MRC to build a medical question answering system: the machine reads text from different editions of textbooks as the articles, and training uses course exercises with their answers, or Q&A extracted from popular medical-science magazines, rather than data collected from a search engine. In this case, where should the reading articles be placed? In documents.paragraphs? And how should search-engine-related fields such as bs_rank_pos and is_selected be configured?

Any guidance would be greatly appreciated. Thanks!

About filter_tokens_by_cnt

The purpose of this function is to rebuild the token x id map, but the token x id map produced by the implementation below should be identical to the original one:

# rebuild the token x id map
self.token2id = {}
self.id2token = {}
for token in self.initial_tokens:
    self.add(token, cnt=0)
for token in filtered_tokens:
    self.add(token, cnt=0)

Shouldn't it be changed to:

self.initial_tokens = filtered_tokens
self.initial_tokens.extend([self.pad_token, self.unk_token])
# rebuild the token x id map
self.token2id = {}
self.id2token = {}
for token in self.initial_tokens:
    self.add(token, cnt=0)

About using Baidu AI Studio

Has anyone managed to run this on Baidu AI Studio? It is my first time using it and I cannot even find where the error logs are. Could anyone help me out...

I do not understand the meaning of this line and cannot locate which module output this stdout.

2019-02-22 13:16:30,550 - brc - INFO - Training the model for epoch 10
2019-02-22 13:16:36,357 - brc - INFO - Average train loss for epoch 10 is 8.694370711477179
2019-02-22 13:16:36,358 - brc - INFO - Evaluating the model after epoch 10
{'testlen': 5920, 'reflen': 9147, 'correct': [2409, 1195, 762, 577], 'guess': [5920, 5821, 5722, 5623]}
ratio: 0.6472067344483823
2019-02-22 13:16:42,349 - brc - INFO - Dev eval loss 14.789813613891601
2019-02-22 13:16:42,351 - brc - INFO - Dev eval result: {'Bleu-3': 0.12942842413184924, 'Bleu-1': 0.2359285965766089, 'Bleu-4': 0.10657137097095022, 'Bleu-2': 0.1675745997143726, 'Rouge-L': 0.24175041960869123}
2019-02-22 13:16:42,911 - brc - INFO - Model saved in ../data/models/, with prefix BIDAF.

Code implementation different with paper in match layer

In the file match_layer.py, class MatchLSTMAttnCell, function __call__:

The variable new_inputs is computed as follows:

new_inputs = tf.concat([inputs, attended_context,
             inputs - attended_context, inputs * attended_context],
            -1)

However, I noticed that in the match-LSTM paper it should be

z_i = [h_i^p ; H^q * α_i^T]

where z_i corresponds to the variable new_inputs, h_i^p corresponds to the variable inputs, and H^q * α_i^T corresponds to the variable attended_context.

According to the algorithm in paper, the computation should be:

new_inputs = tf.concat([inputs, attended_context], -1)

So, I am wondering why inputs - attended_context and inputs * attended_context were involved.

@EastonWang

tensorflow error

[root@hd-master tensorflow]# python run.py --train --algo BIDAF --epochs 10
/opt/linuxsir/anaconda2/lib/python2.7/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
WARNING:tensorflow:From /opt/linuxsir/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
2018-04-27 13:52:09,567 - brc - INFO - Running with args : Namespace(algo='BIDAF', batch_size=32, brc_dir='../data/baidu', dev_files=['../data/demo/devset/search.dev.json'], dropout_keep_prob=1, embed_size=300, epochs=10, evaluate=False, gpu='0', hidden_size=150, learning_rate=0.001, log_path=None, max_a_len=200, max_p_len=500, max_p_num=5, max_q_len=60, model_dir='../data/models/', optim='adam', predict=False, prepare=False, result_dir='../data/results/', summary_dir='../data/summary/', test_files=['../data/demo/testset/search.test.json'], train=True, train_files=['../data/demo/trainset/search.train.json'], vocab_dir='../data/vocab/', weight_decay=0)
2018-04-27 13:52:09,567 - brc - INFO - Load data_set and vocab...
2018-04-27 13:52:10,653 - brc - INFO - Train set size: 95 questions.
2018-04-27 13:52:11,078 - brc - INFO - Dev set size: 100 questions.
2018-04-27 13:52:11,078 - brc - INFO - Converting text into ids...
2018-04-27 13:52:11,211 - brc - INFO - Initialize the model...
2018-04-27 13:52:18,377 - brc - INFO - Time to build graph: 7.16313290596 s
2018-04-27 13:52:26,644 - brc - INFO - There are 4995603 parameters in the model
2018-04-27 13:52:28,206 - brc - INFO - Training the model...
2018-04-27 13:52:28,206 - brc - INFO - Training the model for epoch 1
已杀死 (Killed)

problem of fake answer

in util/preprocess.py

find_fake_answer(sample)

'Fake answer' sounds like a bad answer in English, but I find it is selected as the golden answer in your code. What does 'fake answer' actually mean?
