wuhurestaurant / xf_event_extraction2020top1

First-place solution to the iFLYTEK 2020 Event Extraction Challenge, plus a complete event extraction system

Shell 1.67% Python 98.33%
pytorch events-extraction competition

xf_event_extraction2020top1's Issues

train

train.sh cannot find the model

OSError: Can't load tokenizer for '../bert/torch_roberta_wwm'

File "D:\Event_Extraction\event_extraction2020\src_final\preprocess\processor.py", line 653, in convert_examples_to_features
tokenizer = BertTokenizer.from_pretrained(bert_dir)
File "D:\soft\Anaconda3\lib\site-packages\transformers-4.2.2-py3.8.egg\transformers\tokenization_utils_base.py", line 1760, in from_pretrained
OSError: Can't load tokenizer for '../bert/torch_roberta_wwm'
Make sure that:

  • '../bert/torch_roberta_wwm' is a correct model identifier listed on 'https://huggingface.co/models'
  • or '../bert/torch_roberta_wwm' is the correct path to a directory containing relevant tokenizer files

The BERT model has already been downloaded, hasn't it? Setting bert_dir = './bert/' directly raises the same error.
However, the documentation of
from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], *init_inputs, **kwargs)
describes the argument as:
pretrained_model_name_or_path - A path to a directory containing vocabulary files required by the tokenizer, for instance saved
using the :meth:`~transformers.tokenization_utils_base.PreTrainedTokenizerBase.save_pretrained`
method, e.g., ./my_model_directory/.
Setting bert_dir = 'roberta-large' (a model identifier listed on 'https://huggingface.co/models')
raises a different error:
File "D:\Event_Extraction\event_extraction2020\src_final\preprocess\processor.py", line 653, in convert_examples_to_features
tokenizer = BertTokenizer.from_pretrained(bert_dir)
File "D:\soft\Anaconda3\lib\site-packages\transformers-4.2.2-py3.8.egg\transformers\tokenization_utils_base.py", line 1769, in from_pretrained
File "D:\soft\Anaconda3\lib\site-packages\transformers-4.2.2-py3.8.egg\transformers\tokenization_utils_base.py", line 1841, in _from_pretrained
File "D:\soft\Anaconda3\lib\site-packages\transformers-4.2.2-py3.8.egg\transformers\models\bert\tokenization_bert.py", line 193, in __init__
File "D:\soft\Anaconda3\lib\genericpath.py", line 30, in isfile
st = os.stat(path)
TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType

Could someone explain what causes this?
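Both errors are consistent with the tokenizer not finding a vocab.txt. A relative path such as '../bert/torch_roberta_wwm' is resolved against the current working directory, and the 'roberta-large' checkpoint on the Hub ships vocab.json and merges.txt rather than vocab.txt, which would leave BertTokenizer with a vocabulary path of None. A minimal sketch for checking a local directory is below. The helper name missing_bert_files and the list of expected file names are assumptions for illustration; the repository may expect different names.

```python
import os

# Assumed file names; the repository may expect different ones.
EXPECTED_FILES = ("vocab.txt", "config.json", "pytorch_model.bin")


def missing_bert_files(bert_dir, expected=EXPECTED_FILES):
    """Return (absolute_path, missing_names) for a local model directory."""
    # A relative path is resolved against the current working directory.
    abs_dir = os.path.abspath(bert_dir)
    if not os.path.isdir(abs_dir):
        return abs_dir, list(expected)
    present = set(os.listdir(abs_dir))
    return abs_dir, [name for name in expected if name not in present]


if __name__ == "__main__":
    path, missing = missing_bert_files("../bert/torch_roberta_wwm")
    print(path, "missing:", missing or "nothing")
```

Running this from the same directory in which train.sh is launched shows where the relative path actually points.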

AssertionError: pretrained bert file does not exist

Am I missing a file that should have been downloaded?
Traceback (most recent call last):
File "train.py", line 225, in <module>
training(args)
File "train.py", line 153, in training
train_base(opt, info_dict, train_examples, dev_info)
File "train.py", line 43, in train_base
model = build_model(opt.task_type, opt.bert_dir, **model_para)
File "/code/src_final/utils/model_utils.py", line 450, in build_model
model = Role1Extractor(bert_dir=bert_dir,
File "/code/src_final/utils/model_utils.py", line 188, in __init__
super(Role1Extractor, self).__init__(bert_dir=bert_dir,
File "/code/src_final/utils/model_utils.py", line 70, in __init__
assert os.path.exists(bert_dir) and os.path.exists(config_path),
AssertionError: pretrained bert file does not exist

score: test results

How are the scores for submit_v1 and submit_v1_ensemble obtained, and where exactly can they be seen?

How is the factuality of an event determined?

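The competition requires judging the polarity of the event attributes, that is, whether the event actually happened. How do you determine this? Is it based only on the literal meaning of the text, or do you also use information such as a knowledge base? Thank you.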

What data do several of the files in the data directory contain?

preliminary_data_pred_trigger, preliminary_data_pred_triggle_and_role and prelimiary_pred_triggers_pred_roles appear to be generated from the raw data. What exactly is each of these files used for?

out folder

Hello, is the out folder, where the trained models are stored, missing from the repository?

Question about the BERT model version and changing the unused tokens in vocab.txt, thanks

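train.sh contains {export BERT_TYPE="roberta_wwm"  # roberta_wwm / ernie_1  / uer_large}. The download page offers RoBERTa-wwm-ext, RoBERTa-wwm-ext-large, BERT-wwm-ext, BERT-wwm and other versions. Which one exactly should be used?
In addition, the README says {change two of the unused tokens in vocab.txt to [INV] and [BLANK] (see fine_grade_tokenize in the processor code)}. Every version of vocab.txt contains [unused1]...[unused99], and after reading the processor code I still do not understand how to make the change. Please advise, thanks!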
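Because a token's id is its line index in vocab.txt, overwriting two [unusedN] lines in place keeps the vocabulary size and every other id unchanged. The sketch below shows one way to do that. Which two slots the author actually replaced is not stated in this issue, so taking the first two is an assumption, and replace_unused_tokens is a hypothetical helper name.

```python
import re


def replace_unused_tokens(vocab_lines, new_tokens=("[INV]", "[BLANK]")):
    """Overwrite the first len(new_tokens) '[unusedN]' entries, keeping line positions."""
    pending = list(new_tokens)
    result = []
    for line in vocab_lines:
        token = line.rstrip("\n")
        if pending and re.fullmatch(r"\[unused\d+\]", token):
            # Reuse this slot: the line index (token id) stays the same.
            result.append(pending.pop(0))
        else:
            result.append(token)
    if pending:
        raise ValueError(f"not enough [unused] slots for: {pending}")
    return result


if __name__ == "__main__":
    vocab = ["[PAD]", "[unused1]", "[unused2]", "[unused3]", "[UNK]"]
    print(replace_unused_tokens(vocab))
    # ['[PAD]', '[INV]', '[BLANK]', '[unused3]', '[UNK]']
```

To apply it to a real file, read vocab.txt with UTF-8 encoding, pass the lines through the function, and write them back one token per line.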

Purpose of the clean_data function in preprocessing

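Mr. Ma, what is the purpose of the clean_data function in convert_raw_data.py under the preprocess folder? Is it that some samples contain several triggers, so each one has to be extracted separately, with a window of 40 characters before and after the trigger?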
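The behaviour guessed at in the question, keeping only the text within 40 characters of a trigger, can be sketched as follows. This illustrates the asker's hypothesis only; it is not confirmed to be what clean_data does, and window_around_trigger is a hypothetical name.

```python
def window_around_trigger(text, start, end, margin=40):
    """Return the part of text within `margin` characters of the span [start, end).

    Also returns the trigger offsets relative to the returned window.
    """
    left = max(0, start - margin)
    right = min(len(text), end + margin)
    return text[left:right], start - left, end - left


if __name__ == "__main__":
    sentence = "a" * 100 + "TRIG" + "b" * 100
    window, s, e = window_around_trigger(sentence, 100, 104)
    print(len(window), window[s:e])
    # 84 TRIG
```

A sample with several triggers would then yield one window per trigger.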

Error when running train.sh

While running, it keeps reporting that the bert arguments are not recognized, and I do not know why. Could you advise?

Bert_dir

Hello, which files exactly need to be downloaded into the Bert folder? I have been debugging for a long time and still have problems, for example:
1: transformers.tokenization_utils - Model name './bert' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, bert-base-finnish-cased-v1, bert-base-finnish-uncased-v1, bert-base-dutch-cased). Assuming './bert' is a path, a model identifier, or url to a directory containing tokenizer files.

2:
'pretrained bert file does not exist'
AssertionError: pretrained bert file does not exist

Any help would be appreciated.

test

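Sorry to bother you again. Testing fails on my side when use_distant_trigger is disabled: the model input appears to be a list rather than a tensor. Also, at test time the call is tmp_trigger_pred = trigger_model(**trigger_inputs)[0][0], indexed with [0][0], whereas training uses loss = model(**batch_data)[0]. Is this a problem in the code itself or in my setup?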

cuda error

03/16/2021 16:09:53 - INFO - transformers.modeling_utils - loading weights file /home/lab/Desktop/xf_event_extraction2020Top1-master/bert/torch_roberta_wwm/pytorch_model.bin
03/16/2021 16:16:01 - INFO - src_final.utils.functions_utils - Use single gpu in: ['1']
03/16/2021 16:16:01 - INFO - src_final.utils.trainer - ***** Running training *****
03/16/2021 16:16:01 - INFO - src_final.utils.trainer - Num Examples = 7416
03/16/2021 16:16:01 - INFO - src_final.utils.trainer - Num Epochs = 6
03/16/2021 16:16:01 - INFO - src_final.utils.trainer - Total training batch size = 16
03/16/2021 16:16:01 - INFO - src_final.utils.trainer - Total optimization steps = 2784
03/16/2021 16:16:01 - INFO - src_final.utils.trainer - Save model in 464 steps; Eval model in 464 steps
Traceback (most recent call last):
File "train.py", line 223, in <module>
training(args)
File "train.py", line 152, in training
train_base(opt, info_dict, train_examples, dev_info)
File "train.py", line 44, in train_base
train(opt, model, train_dataset)
File "/home/lab/Desktop/xf_event_extraction2020Top1-master/src_final/utils/trainer.py", line 136, in train
loss = model(**batch_data)[0]
File "/home/lab/anaconda3/envs/event/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/lab/Desktop/xf_event_extraction2020Top1-master/src_final/utils/model_utils.py", line 239, in forward
token_type_ids=token_type_ids
File "/home/lab/anaconda3/envs/event/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/lab/anaconda3/envs/event/lib/python3.7/site-packages/transformers/modeling_bert.py", line 734, in forward
encoder_attention_mask=encoder_extended_attention_mask,
File "/home/lab/anaconda3/envs/event/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/lab/anaconda3/envs/event/lib/python3.7/site-packages/transformers/modeling_bert.py", line 407, in forward
hidden_states, attention_mask, head_mask[i], encoder_hidden_states, encoder_attention_mask
File "/home/lab/anaconda3/envs/event/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/lab/anaconda3/envs/event/lib/python3.7/site-packages/transformers/modeling_bert.py", line 368, in forward
self_attention_outputs = self.attention(hidden_states, attention_mask, head_mask)
File "/home/lab/anaconda3/envs/event/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/lab/anaconda3/envs/event/lib/python3.7/site-packages/transformers/modeling_bert.py", line 314, in forward
hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask
File "/home/lab/anaconda3/envs/event/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/lab/anaconda3/envs/event/lib/python3.7/site-packages/transformers/modeling_bert.py", line 216, in forward
mixed_query_layer = self.query(hidden_states)
File "/home/lab/anaconda3/envs/event/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/lab/anaconda3/envs/event/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
return F.linear(input, self.weight, self.bias)
File "/home/lab/anaconda3/envs/event/lib/python3.7/site-packages/torch/nn/functional.py", line 1612, in linear
output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)
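Hello, I keep getting this error when running train.sh. My CUDA installation looks fine, I have adjusted the batch size, and I have searched for related material, but the error persists and I do not know why. Could you help?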
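A cuBLAS failure inside a linear layer is sometimes a delayed symptom of an earlier out-of-range index, for example a token id that is not smaller than the embedding table size, because CUDA kernels run asynchronously. Rerunning on CPU, or with the environment variable CUDA_LAUNCH_BLOCKING=1, usually gives a more precise traceback. The sketch below is one such sanity check on plain Python lists. find_out_of_range_ids is a hypothetical helper, the vocabulary size is an example value, and nothing in the log establishes that this is the cause here.

```python
def find_out_of_range_ids(batch_token_ids, vocab_size):
    """Return (row, column, token_id) for every id outside [0, vocab_size)."""
    problems = []
    for row, ids in enumerate(batch_token_ids):
        for col, token_id in enumerate(ids):
            if not 0 <= token_id < vocab_size:
                problems.append((row, col, token_id))
    return problems


if __name__ == "__main__":
    batch = [[101, 5, 102], [101, 21128, 102]]
    # 21128 is only an example vocabulary size; use the one in the model config.
    print(find_out_of_range_ids(batch, vocab_size=21128))
    # [(1, 1, 21128)]
```

The same kind of check applies to any other integer input that indexes an embedding table.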

test.json file

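Hello, was the distant_trigger field in this test file annotated by hand, and on what basis? I would like to try my own short-text data, but the volume is fairly large.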

score

How are the scores for submit_v1 and submit_v1_ensemble obtained, and where exactly can they be seen?

Origin of the test.json file used for prediction

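As I understand it, test.py runs prediction on sentences.json, and to predict on other data I replace the contents of sentences.json with the new dataset. Beyond that, does test.json also need to be processed? And how is test.json used during prediction?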

Can the object of an extracted event be empty?

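I want to build my own training set. The event types to extract include "opening", "cancellation", "price cut", "price increase" and so on, none of which needs an object.
In that case, when building the training set, is it acceptable to set the text of every role labelled object to an empty string?
Many thanks!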

The trained models do not include the checkpoint names specified in test.sh

The test.sh published by the author specifies these model names:
--trigger_ckpt_dir='./out/final/trigger/roberta_wwm_distant_trigger_pgd_enhanced/checkpoint-100000'
--role1_ckpt_dir='./out/final/role1/roberta_wwm_distance_pgd_enhanced/checkpoint-884'
--role2_ckpt_dir='./out/final/role2/roberta_wwm_pgd_enhanced/checkpoint-1056'
--attribution_ckpt_dir='./out/final/attribution/roberta_wwm_pgd/checkpoint-100000'
--trigger_start_threshold=0.5 \

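However, the folders produced by my local training contain no checkpoint-884 for the role1 model and no checkpoint-1056 for the role2 model (a screenshot was attached to the original issue). How should the checkpoints be specified in this case?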
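Checkpoint step numbers depend on the number of training examples, the batch size and the save interval, so a local run will generally not reproduce the exact names used in test.sh. A minimal sketch for listing what a run actually produced is below; list_checkpoints is a hypothetical helper, and which checkpoint to pass to the --role1_ckpt_dir or --role2_ckpt_dir flags is a judgment that should rest on dev-set results rather than on the step number alone.

```python
import os
import re


def list_checkpoints(task_dir):
    """Return (step, path) pairs for 'checkpoint-<step>' sub-directories, sorted by step."""
    found = []
    for name in os.listdir(task_dir):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        path = os.path.join(task_dir, name)
        if match and os.path.isdir(path):
            found.append((int(match.group(1)), path))
    return sorted(found)


if __name__ == "__main__":
    task_dir = "./out/final/role1/roberta_wwm_distance_pgd_enhanced"
    if os.path.isdir(task_dir):
        for step, path in list_checkpoints(task_dir):
            print(step, path)
```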

Models

Hello, could you provide the models stored under the out folder?

Which files exactly belong in the bert directory?

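Could you share the exact files in the bert directory? I downloaded the model, placed it in the directory, and edited vocab.txt, but the BERT model still raises errors.
Below is my bert directory. Could you check whether anything is wrong?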
xf_event_extraction2020Top1-master
  bert
    torch_roberta_wwm
      bert_config.json
      pytorch_model.bin
      vocab.txt

Obtaining the BERT model

The PyTorch version of RoBERTa-wwm-ext is no longer available, and I do not know where to download it. What should I do?

For the role extractor, is Trigger Distance learnable or preset?

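Regarding the role extractor:
in this article, Trigger Distance is a relative distance plus a 0/1 indicator vector (shown in a figure in the original issue).

In the code, however, Trigger Distance is a learnable Embedding vector, as follows: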

self.trigger_distance_embedding = nn.Embedding(num_embeddings=512, embedding_dim=embedding_dim)

Why do the two differ, and has the author compared the two approaches?
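The two descriptions need not conflict: relative distances can be computed as integer ids during preprocessing and then looked up in a learnable table such as the nn.Embedding(num_embeddings=512, ...) quoted above. How the repository actually computes those ids is not shown in this issue, so the sketch below is an assumption. It only illustrates that each id must fall in the range 0..511 to index a 512-row table; trigger_distance_ids is a hypothetical name.

```python
def trigger_distance_ids(seq_len, trigger_start, trigger_end, max_distance=511):
    """Token distance from each position to the inclusive span [trigger_start, trigger_end]."""
    ids = []
    for pos in range(seq_len):
        if pos < trigger_start:
            dist = trigger_start - pos
        elif pos > trigger_end:
            dist = pos - trigger_end
        else:
            dist = 0  # position lies inside the trigger
        # Clip so the id is always a valid row of a 512-row embedding table.
        ids.append(min(dist, max_distance))
    return ids


if __name__ == "__main__":
    print(trigger_distance_ids(7, 2, 3))
    # [2, 1, 0, 0, 1, 2, 3]
```

With this layout the 0/1 indicator from the article corresponds to whether the id is 0, while the embedding lets the model learn a separate vector for every distance.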

Problems encountered while porting this code to TensorFlow

As the title says.

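The problem is as follows. During the port I kept the methods as close to the original as possible and trained with the same parameters, but in the TensorFlow version the loss drops quickly at the very start and then stops decreasing. This is not fast convergence, because the evaluation results are poor, and the loss stays large compared with the PyTorch version. I observed this when training the trigger model. I would appreciate it if you could get in touch and give some pointers.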

Mr. Ma, please teach me

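Mr. Ma, I read the TriggerExtractor code and compared it with the explanatory slides, but I still do not quite understand the purpose of distant_trigger. My current understanding is that sentences containing triggers learned from the training set are added to a knowledge base, which then influences later trigger extraction. But the earlier triggers are identified via trigger_dict, and adding trigger-bearing sentences to the knowledge base does not update the set of trigger types in trigger_dict, so what does this achieve? I am a bit confused.
Also, which .py file implements the step described as "split the training set into K parts and use all labels from the first K-1 parts as the knowledge base for the remaining part to construct the distant triggers of the training data"?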
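The quoted K-fold construction can be sketched as follows, under one reading of the sentence: for each fold, the gold triggers of the other K-1 folds form the knowledge base, and any of them that occurs in a sentence becomes a distant trigger for it, so a sentence never sees its own labels. The field names text, triggers and distant_triggers and the function build_distant_triggers are assumptions for illustration, not the repository's implementation.

```python
def build_distant_triggers(examples, k=5):
    """examples: list of dicts with 'text' and 'triggers' (list of gold trigger strings)."""
    folds = [examples[i::k] for i in range(k)]
    output = []
    for i, fold in enumerate(folds):
        # Knowledge base: gold triggers from every fold except the current one.
        knowledge = {
            trigger
            for j, other in enumerate(folds) if j != i
            for ex in other
            for trigger in ex["triggers"]
        }
        for ex in fold:
            distant = sorted(t for t in knowledge if t in ex["text"])
            output.append({**ex, "distant_triggers": distant})
    return output


if __name__ == "__main__":
    data = [
        {"text": "A acquires B", "triggers": ["acquires"]},
        {"text": "C acquires D after merger talks", "triggers": ["merger"]},
    ]
    for row in build_distant_triggers(data, k=2):
        print(row["text"], "->", row["distant_triggers"])
    # A acquires B -> []
    # C acquires D after merger talks -> ['acquires']
```

At test time the whole training set can serve as the knowledge base, since no test labels are involved.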

The author is great and very responsive!

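No real question, I just want to say two things. First, the code the author open-sourced is great! I have been reading it for three days.
Second, being able to communicate in Chinese is wonderful!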

Error during training

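Hello, thank you for your code. It is very useful to me.
After training for a few epochs, this error appears: RuntimeError: stack expects each tensor to be equal size, but got [256] at entry 0 and [345] at entry 3.
I really cannot find the cause. Do you know what leads to this problem?
I look forward to your reply.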
@WuHuRestaurant
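This message is raised when a batch is assembled from tensors of different lengths, here 256 and 345, which suggests that at least one example was not truncated to the maximum sequence length. A minimal sketch of forcing a fixed length is below. pad_or_truncate is a hypothetical helper and pad_id=0 is an assumption; note that naive truncation also drops a trailing special token such as [SEP], which a real pipeline would need to handle.

```python
def pad_or_truncate(token_ids, max_seq_len, pad_id=0):
    """Return a list of exactly max_seq_len ids."""
    clipped = token_ids[:max_seq_len]
    return clipped + [pad_id] * (max_seq_len - len(clipped))


if __name__ == "__main__":
    print(len(pad_or_truncate(list(range(345)), 256)))
    # 256
    print(pad_or_truncate([101, 7, 102], 5))
    # [101, 7, 102, 0, 0]
```

Attention masks, token type ids and label sequences need the same treatment so that every field of an example has the same length.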

Can GPU usage be reduced?

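Hello, my graphics card has only about 10 GB of memory, while the configuration notes say a 32 GB GPU is required. Where can I set parameters to reduce GPU memory usage?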
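Memory use generally falls with a smaller per-step batch size or a shorter maximum sequence length, and gradient accumulation can keep the effective batch size unchanged. Whether train.py exposes an accumulation option is not stated on this page, so the sketch below only works through the arithmetic, using the figures from the training log quoted in the "cuda error" issue (7416 examples, batch size 16, 6 epochs, 2784 optimization steps). accumulation_plan is a hypothetical helper.

```python
import math


def accumulation_plan(num_examples, target_batch_size, per_step_batch_size, epochs):
    """Return (accumulation_steps, optimizer_steps) for a smaller per-step batch."""
    accumulation_steps = math.ceil(target_batch_size / per_step_batch_size)
    forward_passes_per_epoch = math.ceil(num_examples / per_step_batch_size)
    optimizer_steps = math.ceil(forward_passes_per_epoch / accumulation_steps) * epochs
    return accumulation_steps, optimizer_steps


if __name__ == "__main__":
    # Per-step batch of 4 accumulated 4 times gives the same effective batch of 16.
    print(accumulation_plan(7416, 16, 4, 6))
    # (4, 2784)
```

The number of optimizer updates stays at 2784, while each forward and backward pass handles only 4 examples.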

Confusion about predicting on sentences.json

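Hello, how can I run prediction on unlabelled data?
1. Your replies to others all say to predict on test.json, yet that file already has the trigger information annotated in its distant_triggers field. This is confusing: shouldn't the distant_triggers field be predicted by the trigger model?
2. If the distant_triggers field is not predicted by the trigger model, how is that information obtained? I read the code and the README and could not find it.
3. If the distant_triggers field is predicted by the trigger model, where is the prediction code for that part? I really could not find it (sorry).
4. As I understand it, the trigger model predicts triggers, the role1 model predicts "sub & obj", and the role2 model predicts "time & loc". But if prediction runs on test.json, which already has distant_triggers (that is, the triggers) annotated, why does test.py still load the trigger model? This logic confuses me.
5. How is sentences.json converted into test.json? I could not find the relevant code. Please point me to it.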

I hope you can clear up these doubts. Many thanks.

Hello, I have three questions

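1. In the final-round data each sample contains only one event, whereas in the preliminary-round data a sample can contain several triggers. Role1Extractor outputs a class for each token. In the multi-trigger case, how is the correspondence between triggers and arguments resolved, that is, predicting not only whether a span is an argument but also which trigger it belongs to?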

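My own idea, which I would like your opinion on: after the trigger model and the argument model have run, all triggers and all arguments are available. Iterate over every trigger and argument, concatenate the two vectors, and train a binary classifier that decides whether the trigger and the argument belong together.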
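The data side of this proposal, enumerating every trigger and argument combination and labelling it, can be sketched as below. The function name build_pair_examples and the use of plain strings are assumptions for illustration; in the proposal the classifier input would be the concatenated vectors of each pair rather than the strings themselves.

```python
from itertools import product


def build_pair_examples(triggers, arguments, gold_pairs):
    """Label every (trigger, argument) combination: 1 if it is a gold pair, else 0."""
    gold = set(gold_pairs)
    return [(t, a, int((t, a) in gold)) for t, a in product(triggers, arguments)]


if __name__ == "__main__":
    examples = build_pair_examples(
        triggers=["acquire", "resign"],
        arguments=["CompanyA", "CEO"],
        gold_pairs=[("acquire", "CompanyA"), ("resign", "CEO")],
    )
    for row in examples:
        print(row)
    # ('acquire', 'CompanyA', 1)
    # ('acquire', 'CEO', 0)
    # ('resign', 'CompanyA', 0)
    # ('resign', 'CEO', 1)
```

The number of candidates grows as the product of the two counts, so negatives will usually outnumber positives.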

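2. Because the final round has only one event per sample, AttributionClassifier outputs one class per sentence. When a sentence contains several events, how should the event attributes be classified?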

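My own idea, which I would like your opinion on: when a sentence contains several events, generate as many samples as there are events, so that each sample has exactly one event. The sentence is the same but the event differs, and each is fed to the model as a separate training sample.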
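The expansion described here, one single-event sample per event with the sentence repeated, can be sketched as follows. The field names sentence, events and event and the function split_by_event are assumptions; the repository's own data format is not shown in this issue.

```python
def split_by_event(sample):
    """Turn one multi-event sample into one single-event sample per event."""
    return [
        {"sentence": sample["sentence"], "event": event}
        for event in sample["events"]
    ]


if __name__ == "__main__":
    sample = {
        "sentence": "The CEO resigned after the acquisition was cancelled.",
        "events": [{"trigger": "resigned"}, {"trigger": "cancelled"}],
    }
    for single in split_by_event(sample):
        print(single["event"]["trigger"])
    # resigned
    # cancelled
```

For the classifier to tell the copies apart, each one needs some marker of its own event, for example the trigger position.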

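3. I have trained the models for all four tasks with train.py. When I run test.py, it asks for a model.pt file in the ckpt_dir path, but train.py only produced checkpoint files in the out folder. How do I obtain the model.pt file that test.py needs?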
