weizhepei / casrel

A Novel Cascade Binary Tagging Framework for Relational Triple Extraction. Accepted by ACL 2020.

Home Page: https://arxiv.org/abs/1909.03227

License: MIT License

Python 100.00%

bert information-extraction keras knowledge-graph relation-extraction relational-triple-extraction

casrel's Issues

Hello, a question about the keras-bert tokenizer

The code uses the tokenizer from keras-bert, but its tokenization behaves somewhat unusually. For example:

    {
        "text":"三国中的谋士很多，但是谋士也要分不同的类别，有的善于统筹全局，有的善于战术规划，有的善于外交连横，不过说实话，其中大部分知名谋士的结局都不太好，如：荀彧被曹操逼死，陆逊被孙权气死，就连大家最敬仰的诸葛亮也是被军国大事给累死，但是有一个谋士不但得到了善终，而且还位高权重，关键就在于他在生涯中的五次站队都成功了，我们来看看吧",
        "triple_list":[
            [
                "陆逊",
                "朝代",
                "三国"
            ]
        ]
    }

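For this record, the text is tokenized into ['[CLS]', '##三', '##国', ... , '[unused1]', '[SEP]'], and the corresponding subject is tokenized into ['##陆', '##逊', '[unused1]']. Is this deliberate, or just a bug? With this behavior the positions of the subject and object cannot be found in the original input sequence, so the corresponding labels cannot be generated. (I am using bert-base-chinese, and the vocab is the vocab.txt that ships with BERT.)

Yet the code in the data_generator stage seems to deliberately avoid the trailing unused1 token?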

                for triple in line['triple_list']:
                    # The -1 below drops the trailing token. Is this a special tokenization mechanism or just a bug???
                    triple = (self.tokenizer.tokenize(triple[0])[1:-1], triple[1], self.tokenizer.tokenize(triple[2])[1:-1])
                    sub_head_idx = find_head_idx(tokens, triple[0])
                    obj_head_idx = find_head_idx(tokens, triple[2])
                    if sub_head_idx != -1 and obj_head_idx != -1:
                        sub = (sub_head_idx, sub_head_idx + len(triple[0]) - 1)
                        if sub not in s2ro_map:
                            s2ro_map[sub] = []
                        s2ro_map[sub].append((obj_head_idx,
                                           obj_head_idx + len(triple[2]) - 1,
                                           self.rel2id[triple[1]]))

Also, keras-bert 0.80.0 does not seem to work.
My environment is as follows:

keras == 2.4.3
keras-bert == 0.81.1
tensorflow-gpu == 1.13.1

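In summary, my questions are:

Is the tokenization mechanism in the code a deliberate special approach, or just a bug?
Compared with the regular tokenizers in packages such as transformers, does this tokenization approach have any advantage?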
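
The mismatch described in this issue can be reproduced with a toy stand-in for the tokenizer. The sketch below is not keras-bert: `toy_tokenize` is a hypothetical helper that emits one piece per whitespace-separated word (or one piece per character for non-ASCII words) and appends '[unused1]' after every word, and the body of `find_head_idx` is an assumed implementation of a plain sublist search. The toy adds no [CLS]/[SEP], so the [1:-1] slice from the repository code is not needed here.

```python
def toy_tokenize(text):
    """Hypothetical stand-in for the repository tokenizer (not keras-bert)."""
    tokens = []
    for word in text.strip().split():
        # ASCII words stay whole; other words are split into single characters.
        pieces = [word] if word.isascii() else list(word)
        tokens += pieces
        tokens.append('[unused1]')  # marker appended after every word
    return tokens


def find_head_idx(source, target):
    """Assumed behaviour: index of the first occurrence of target in source, else -1."""
    n = len(target)
    for i in range(len(source) - n + 1):
        if source[i:i + n] == target:
            return i
    return -1


english = toy_tokenize("new york is big")
chinese = toy_tokenize("陆逊被孙权气死")

# Space-separated text: every word carries its own marker, so the entity is found.
english_idx = find_head_idx(english, toy_tokenize("new york"))
# Unsegmented text: the only marker sits at the end of the sentence, so an
# entity followed by its own marker does not occur inside the sentence.
chinese_idx = find_head_idx(chinese, toy_tokenize("陆逊"))
print(english_idx, chinese_idx)
```

Under these assumptions the search succeeds for the space-separated sentence and returns -1 for the unsegmented one, which matches the symptom reported above. Whether the real tokenizer behaves the same way on a given input should be checked against keras-bert itself.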

Using Colab and tensorflow-gpu 1.13.1

gives:
No module named 'tensorflow.python.framework'

Solution:
tensorflow-gpu 1.15

A script to preprocess the NYT dataset for joint entity and relation extraction

I wrote a script to preprocess the NYT dataset for joint entity and relation extraction. I aligned the dataset (CopyRE) with the original NYT distant supervision dataset. No third-party tool is needed because all the sentences can be found in the original NYT dataset.
Here is my script.

Compared with HBT, what did the CasRel model change that led to the further F1 improvement?

CopyRL model implementation

Hello @weizhepei, in your paper you mention that you implemented CopyRE (with Reinforcement Learning). Could you make it public? Thanks a lot.

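A possible problem in the decoding process

How does the model handle the following problem? I would really appreciate an answer, because I could not find it in the code:

Entities whose heads overlap, for example "北京市政府" (Beijing Municipal Government) and "北京大学" (Peking University). For every subject/object head, the code takes the nearest tail, so does that mean this case cannot be handled?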

Maybe some mistakes on WebNLG reported scores

The paper reports 89.4, 92.2, and 94.7 on WebNLG-Normal, WebNLG-EPO, and WebNLG-SEO. In my reproduction, when the F1 score on the full WebNLG test set reaches the 91.8 reported in the paper, it is hard to exceed 92.5 on WebNLG-SEO but easy to reach 94.0 on WebNLG-EPO. It seems the paper may have swapped the WebNLG-EPO score with the WebNLG-SEO score. I counted the data: WebNLG-Normal, WebNLG-EPO, and WebNLG-SEO have 246, 26, and 457 samples and 246, 98, and 1345 relation triplets respectively. WebNLG-SEO is clearly the main part, with about 13 times as many triplets as WebNLG-EPO and about 5 times as many as WebNLG-Normal. Since WebNLG-Normal and WebNLG-EPO contribute so few triplets, they are unlikely to pull the overall score down from 94.7 to 91.8. So it seems unlikely that CasRel achieves 91.8 on the entire WebNLG set but 94.7 on WebNLG-SEO.
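
The argument above can be checked with rough arithmetic. The sketch below weights each subset score by the triplet counts given in this issue. It is only an approximation: a triplet-weighted mean is not the same as micro-F1, and the three subsets may overlap. The function name `weighted_f1` is introduced here for illustration.

```python
# Triplet counts reported in this issue for the three WebNLG test subsets.
triples = {"Normal": 246, "EPO": 98, "SEO": 1345}


def weighted_f1(f1_by_subset):
    """Triplet-weighted mean of per-subset F1 (a rough proxy for overall F1)."""
    total = sum(triples.values())
    return sum(triples[name] * f1_by_subset[name] for name in triples) / total


# Scores as reported in the paper, and with the EPO and SEO scores swapped.
as_reported = weighted_f1({"Normal": 89.4, "EPO": 92.2, "SEO": 94.7})
swapped = weighted_f1({"Normal": 89.4, "EPO": 94.7, "SEO": 92.2})
print(round(as_reported, 2), round(swapped, 2))  # 93.78 91.94
```

Under this approximation the swapped assignment lands much closer to the overall 91.8 than the reported one, which is consistent with the suspicion raised above but does not prove it.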

Please check it again, Thanks~

Wiki-KBP

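The Wiki-KBP training set in the paper has more than 50,000 sentences, but the one I downloaded from the link has only a little over 20,000. Is the data correct?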

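A question about extract_items

Is there a problem with the part of extract_items that obtains the text of the subject? I could not work it out and would appreciate an answer. Thanks!

In the code below, sub_heads and sub_tails are the candidates for all possible head and tail positions.

sub_head and sub_tail should both be endpoints of a closed span, so why is the slice over tokens not taken up to sub_tail+1?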

    sub_heads_logits, sub_tails_logits = subject_model.predict([token_ids, segment_ids])
    sub_heads, sub_tails = np.where(sub_heads_logits[0] > h_bar)[0], np.where(sub_tails_logits[0] > t_bar)[0]
    subjects = []
    for sub_head in sub_heads:
        sub_tail = sub_tails[sub_tails >= sub_head]
        if len(sub_tail) > 0:
            sub_tail = sub_tail[0] 
            subject = tokens[sub_head: sub_tail]
            subjects.append((subject, sub_head, sub_tail))

Judging from the code in data_loader, when the gold labels are built, the subject and object text spans are both closed intervals (the tail is sub_head_idx + len(triple[0]) - 1):

                for triple in line['triple_list']:
                    triple = (self.tokenizer.tokenize(triple[0])[1:-1], triple[1], self.tokenizer.tokenize(triple[2])[1:-1])
                    sub_head_idx = find_head_idx(tokens, triple[0])
                    obj_head_idx = find_head_idx(tokens, triple[2])
                    if sub_head_idx != -1 and obj_head_idx != -1:
                        sub = (sub_head_idx, sub_head_idx + len(triple[0]) - 1)
                        if sub not in s2ro_map:
                            s2ro_map[sub] = []
                        s2ro_map[sub].append((obj_head_idx,
                                           obj_head_idx + len(triple[2]) - 1,
                                           self.rel2id[triple[1]]))
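
One way to read the two snippets together is to trace the indices by hand. The sketch below assumes that every word in the token sequence is followed by an '[unused1]' marker, as the HBTokenizer code quoted elsewhere on this page suggests; the token list and the subject are made-up examples. Whether this is the author's intended design is not confirmed here.

```python
# Made-up token sequence in which each word is followed by '[unused1]'.
tokens = ['[CLS]', 'new', '[unused1]', 'york', '[unused1]',
          'is', '[unused1]', 'big', '[unused1]', '[SEP]']

# What tokenize(subject)[1:-1] would give for "new york" under that assumption.
subject_tokens = ['new', '[unused1]', 'york', '[unused1]']

sub_head = 1  # position of 'new' in tokens
# Closed-interval tail, computed as in the data loader.
sub_tail = sub_head + len(subject_tokens) - 1

half_open = tokens[sub_head:sub_tail]   # slice used in extract_items
closed = tokens[sub_head:sub_tail + 1]  # slice proposed in this issue

print(tokens[sub_tail])  # the tail index lands on the trailing marker
print(half_open)         # ends at 'york'
print(closed)            # also includes the trailing '[unused1]'
```

Under this assumption the gold tail points at the marker after the last word, so the half-open slice drops only the marker and keeps the whole entity. If the tokenizer did not append a marker, the half-open slice would lose the last token, as the question suspects.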

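Why does the paper say WebNLG has 246 relation types when the downloaded dataset only shows 170?

Hello,
Why does the paper say WebNLG has 246 relation types, while I only see 170 in the downloaded dataset?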

Is there a bug in build_data.py, line 18, or did I do something wrong?

It says "list indices must be integers or slices, not str".
The statement is:
if not a['relationMentions']
and I do not see relationMentions used as an int variable.

I look forward to your reply.

I want to know which keras-bert version you used

os.environ["CUDA_VISIBLE_DEVICES"] = "1" kernel version seems to match DSO: 410.129.0

os.environ["CUDA_VISIBLE_DEVICES"] = "1"
should be changed to
###os.environ["CUDA_VISIBLE_DEVICES"] = "1"
or
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
when this error occurs, or set it to match your own device.

get_encoders() got an unexpected keyword argument 'use_adapter'

Is this a version conflict?

Training Time

How long does one epoch take? I have a 16 GB GPU.

The loss function does not seem to include any relation information?

model.py
line 52-65
gold_sub_heads = K.expand_dims(gold_sub_heads, 2)
gold_sub_tails = K.expand_dims(gold_sub_tails, 2)

sub_heads_loss = K.binary_crossentropy(gold_sub_heads, pred_sub_heads)
sub_heads_loss = K.sum(sub_heads_loss * mask) / K.sum(mask)
sub_tails_loss = K.binary_crossentropy(gold_sub_tails, pred_sub_tails)
sub_tails_loss = K.sum(sub_tails_loss * mask) / K.sum(mask)

obj_heads_loss = K.sum(K.binary_crossentropy(gold_obj_heads, pred_obj_heads), 2, keepdims=True)
obj_heads_loss = K.sum(obj_heads_loss * mask) / K.sum(mask)
obj_tails_loss = K.sum(K.binary_crossentropy(gold_obj_tails, pred_obj_tails), 2, keepdims=True)
obj_tails_loss = K.sum(obj_tails_loss * mask) / K.sum(mask)

loss = (sub_heads_loss + sub_tails_loss) + (obj_heads_loss + obj_tails_loss)

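I only see losses for the head and tail entities. Is the relation loss computed together with the tail entity, or how is it computed? I also do not see any relation information in the input, apart from the number of relations.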
line 43-43
pred_obj_heads = Dense(num_rels, activation='sigmoid')(tokens_feature)
pred_obj_tails = Dense(num_rels, activation='sigmoid')(tokens_feature)

I did not quite understand this part. Could you explain it? Thanks.
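
One reading of the quoted code is that the relation is carried by the last axis of the object tensors: Dense(num_rels, activation='sigmoid') yields one probability per token and per relation. The pure-Python sketch below illustrates that reading under stated assumptions. The gold objects, the stand-in predictions, and the helpers `bce`, `heads_loss`, and `prediction` are all hypothetical, and nothing here is taken from model.py beyond the shape of the loss.

```python
import math

seq_len, num_rels = 6, 3

# Hypothetical gold objects for one sampled subject, shaped like the
# (obj_head, obj_tail, rel_id) entries stored in s2ro_map.
gold_objects = [(2, 3, 0), (5, 5, 2)]

# One column per relation: the relation id selects which column is marked.
gold_obj_heads = [[0.0] * num_rels for _ in range(seq_len)]
gold_obj_tails = [[0.0] * num_rels for _ in range(seq_len)]
for obj_head, obj_tail, rel_id in gold_objects:
    gold_obj_heads[obj_head][rel_id] = 1.0
    gold_obj_tails[obj_tail][rel_id] = 1.0


def bce(gold, pred):
    """Binary cross-entropy for a single probability."""
    return -(gold * math.log(pred) + (1.0 - gold) * math.log(1.0 - pred))


def heads_loss(pred_obj_heads, mask):
    """Sum the loss over relations per token, then take the masked mean over tokens."""
    per_token = [
        sum(bce(g, p) for g, p in zip(gold_row, pred_row))
        for gold_row, pred_row in zip(gold_obj_heads, pred_obj_heads)
    ]
    return sum(l * m for l, m in zip(per_token, mask)) / sum(mask)


def prediction(position, rel_id):
    """Stand-in sigmoid output: 0.9 in one (token, relation) cell, 0.1 elsewhere."""
    pred = [[0.1] * num_rels for _ in range(seq_len)]
    pred[position][rel_id] = 0.9
    return pred


mask = [1.0] * seq_len
right_relation = heads_loss(prediction(2, 0), mask)  # right token, right relation
wrong_relation = heads_loss(prediction(2, 1), mask)  # right token, wrong relation
print(right_relation < wrong_relation)
```

In this sketch a prediction at the correct token but in the wrong relation column is penalized more than one in the correct column, so the relation does enter the loss through the column index even though no separate relation term appears.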

Why use choice() when constructing relational triples?

I noticed that in the data_generator class you keep only one relational triple by using random.choice(). What is the motivation for this?

Thanks
Feng

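Out of memory

After many training epochs, the model takes up a large amount of memory and eventually runs out. I suspect evaluate is the cause. Has anyone else run into this?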

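A problem with testing

After training, only the hbt model is saved. At test time, how can we load the subject and object models?

Data parsing problem

Hello, doesn't the i on this line conflict with the i of the outer loop? Running the program directly raises a list index out of range error. Please take a look at the code.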

CasRel/data/NYT/raw_NYT/generate.py

Line 53 in e04924c

tokens = [word_dict[i] for i in sents[i]]
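
For context, in Python 2 the loop variable of a list comprehension leaks into the enclosing scope, so reusing `i` there overwrites the outer index; in Python 3 a comprehension has its own scope. Whether this explains the reported error depends on the interpreter and on the rest of the loop body, which is not shown here. A minimal sketch that sidesteps the question by using distinct names, with a made-up `word_dict` and `sents`:

```python
# Hypothetical data: an id-to-word mapping and sentences stored as id lists.
word_dict = {0: "New", 1: "York", 2: "is", 3: "big"}
sents = [[0, 1], [2, 3]]

all_tokens = []
for sent_idx in range(len(sents)):
    # A distinct name for the comprehension variable keeps the outer index
    # intact on every Python version.
    tokens = [word_dict[word_id] for word_id in sents[sent_idx]]
    all_tokens.append(tokens)

print(all_tokens)
```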

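About the input of the Relation-specific Object Taggers

During training, is the input to the Relation-specific Object Taggers the prediction of the Subject Tagger or the gold subject boundaries? If it is the prediction, how is the loss constructed when the Subject Tagger predicts incorrectly?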

TypeError: list indices must be integers or slices, not str

Traceback (most recent call last):
  File "build_data.py", line 18, in <module>
    if not a['relationMentions']:
TypeError: list indices must be integers or slices, not str

How to solve this problem?
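
The message means that `a` is a list rather than a dict at that point. One possible cause, which is an assumption and not confirmed anywhere on this page, is that the input file has a different layout from the one build_data.py expects, for example lines that each hold a list instead of a single record. The sketch below shows a defensive reader; `iter_records`, the sample lines, and the "sentText" field are hypothetical.

```python
import json


def iter_records(lines):
    """Yield dict records from JSON lines, flattening lines that hold a list of records."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        item = json.loads(line)
        if isinstance(item, dict):
            yield item
        elif isinstance(item, list):
            # Assumption: a list line holds several dict records; anything
            # else (such as a list of numbers) is skipped.
            for record in item:
                if isinstance(record, dict):
                    yield record


sample_lines = [
    '{"sentText": "a", "relationMentions": []}',
    '[{"sentText": "b", "relationMentions": [{"relationName": "r"}]}]',
    '[1, 2, 3]',
]
records = list(iter_records(sample_lines))
kept = [a for a in records if a['relationMentions']]
print(len(records), len(kept))
```

If the downloaded file turns out to contain only numbers, no amount of defensive parsing will recover the records, and the file itself is the thing to check.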

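About the subject

sub_head, sub_tail = choice(list(s2ro_map.keys())). Why is only one of the subjects in a sentence sampled at random? What is the purpose of this? Thanks for your answer.

The dataset files need fixing

CasRel/data/NYT/raw_NYT/generate.py

In the files this code produces, every entity is only a single word. For example, New York becomes York.

WebNLG has the same problem.

However, the samples shown in the project's README look normal, which suggests the problem lies in the shared files. Could you please fix the files shared on Google Drive? Thanks!

Results in Table 2?

Hello, in Table 2, are the results on the NYT and WebNLG corpora for single-token subjects/objects? Or did you take only the head part of each subject/object during processing?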

HBTokenizer

    import unicodedata

    from keras_bert import Tokenizer


    class HBTokenizer(Tokenizer):
        def _tokenize(self, text):
            if not self._cased:
                # Strip accents and lowercase for uncased models.
                text = unicodedata.normalize('NFD', text)
                text = ''.join([ch for ch in text if unicodedata.category(ch) != 'Mn'])
                text = text.lower()
            spaced = ''
            for ch in text:
                # Drop null, replacement, and control characters.
                if ord(ch) == 0 or ord(ch) == 0xfffd or self._is_control(ch):
                    continue
                else:
                    spaced += ch
            tokens = []
            for word in spaced.strip().split():
                tokens += self._word_piece_tokenize(word)
                # A marker token is appended after each whitespace-separated word.
                tokens.append('[unused1]')
            return tokens

Why is '[unused1]' appended?

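Does the author plan to share a PyTorch version of the code?

Hello, and many thanks for sharing your code!

My own skills are limited and I am not very familiar with TensorFlow and Keras, so I have recently been implementing a PyTorch version based on your code. However, my reimplementation performs much worse.

Does the author plan to release a PyTorch version of the code?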

IndexError: Read less bytes than requested

I get the following error while running the run.py file

!python run.py --train=True --dataset=NYT

Using TensorFlow backend.
2020-07-03 14:29:11.236282: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-03 14:29:11.241179: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2200000000 Hz
2020-07-03 14:29:11.241354: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1e66a00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-03 14:29:11.241383: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-07-03 14:29:11.243340: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-03 14:29:11.248062: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-07-03 14:29:11.248099: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: e54567bf18c0
2020-07-03 14:29:11.248114: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: e54567bf18c0
2020-07-03 14:29:11.248169: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 418.67.0
2020-07-03 14:29:11.248200: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 418.67.0
2020-07-03 14:29:11.248212: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 418.67.0
train_data len: 56195
dev_data len: 4999
test_data len: 1297
Traceback (most recent call last):
  File "run.py", line 40, in <module>
    subject_model, object_model, hbt_model = E2EModel(bert_config_path, bert_checkpoint_path, LR, num_rels)
  File "/content/CasRel/model.py", line 15, in E2EModel
    bert_model = load_trained_model_from_checkpoint(bert_config_path, bert_checkpoint_path, seq_len=None)
  File "/usr/local/lib/python3.6/dist-packages/keras_bert/loader.py", line 170, in load_trained_model_from_checkpoint
    load_model_weights_from_checkpoint(model, config, checkpoint_file, training=training)
  File "/usr/local/lib/python3.6/dist-packages/keras_bert/loader.py", line 114, in load_model_weights_from_checkpoint
    loader('bert/encoder/layer_%d/output/dense/kernel' % i),
  File "/usr/local/lib/python3.6/dist-packages/keras_bert/loader.py", line 18, in _loader
    return tf.train.load_variable(checkpoint_file, name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/checkpoint_utils.py", line 85, in load_variable
    return reader.get_tensor(name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/py_checkpoint_reader.py", line 70, in get_tensor
    self, compat.as_bytes(tensor_str))
IndexError: Read less bytes than requested

Please do let me know if there is any solution to this problem. Thanks in advance! :D

Keras tensorflow version issue:

Hi,

While setting up, I hit the issue below when running ! python run.py ---train=True --dataset=NYT
This looks like a versioning issue. Could you please share a requirements.txt or the list of packages used, with their specific versions?

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/keras/__init__.py", line 3, in <module>
    from tensorflow.keras.layers.experimental.preprocessing import RandomRotation
ModuleNotFoundError: No module named 'tensorflow.keras'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run.py", line 3, in <module>
    from model import E2EModel, Evaluate
  File "/content/CasRel/model.py", line 2, in <module>
    from keras.layers import *
  File "/usr/local/lib/python3.6/dist-packages/keras/__init__.py", line 6, in <module>
    'Keras requires TensorFlow 2.2 or higher. '
ImportError: Keras requires TensorFlow 2.2 or higher. Install TensorFlow via pip install tensorflow

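run.py: hbt_model.fit_generator raises "python interpreter state is not initialized. the process may be terminated"

GPU usage problem

I would like to ask the author: why does the program run, but on the CPU rather than the GPU?

About the ExactMatch problem

Hello,
I selected ExactMatch when testing, but the resulting entities still contain only the head word. I looked at the training and test data, and all entities appear to be head words only. To try ExactMatch, do I need to reprocess the data myself to produce training data with full entity boundaries? Thanks.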

What are the hyperparameters of the LSTM-based model?

Hello, I would like to know the hyperparameters of the LSTM-based model. They are not mentioned in your paper, and there is no code for this model in this GitHub project.
Could you share them, please?

Confused??

Hello, what is the difference between your paper and this blog post by Su Jianlin? (blog)

AttributeError: 'tuple' object has no attribute 'get_shape'

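The versions all match, but training raises this error. Any help would be appreciated, thanks.

Using the CopyR NYT dataset

I am using the NYT dataset from CopyR. When it processes entities, it keeps only the last word.
For example: Suffolk County -> County.

CasRel also uses this dataset, converting the numericalized dataset back to text and saving the relational triples.
According to the CasRel paper, wouldn't the subject start and end then both point to County? Is this a problem?

Why do many entities in the triplet.json file generated for WebNLG contain only their last word?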

{
"text": "Alan Bean ( of the United States ) was a crew member of NASA 's Apollo 12 under the commander David Scott .",
"triple_list": [
[
"Bean",
"was a crew member of",
"12"
],
[
"Bean",
"nationality",
"States"
],
[
"12",
"operator",
"NASA"
],
[
"12",
"commander",
"Scott"
]
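Take this one, for example: shouldn't the entities be alan bean and apollo 12? Results measured on data like this are not convincing, are they?

Could you share the best parameters?

Could you share the best model parameters?

About the same entity appearing multiple times

When extracting relations, did the author consider that the same entity may appear several times in one sentence? For example, the first occurrence takes part in a relational triple while the second does not.

A beginner's question about NYT data processing: what are the new_valid and similar files produced by raw_NYT/generate.py?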

Problems with Wiki-KBP dataset

We can only get a training dataset with 23k sentences from your link, but your paper mentions one with 79k sentences. Is there a problem with the link? Please check again. Alternatively, could you send the 79k version of the training dataset to our email ([email protected])? We only want a fair comparison. Thank you!

Data

Why are the head and tail entities in the triples all single words?

Training fails with TypeError: get_encoders() got an unexpected keyword argument 'use_adapter'. Why?

Hello, while reproducing your code,
training fails with: TypeError: get_encoders() got an unexpected keyword argument 'use_adapter'
Do you know why this happens? (PS: I searched online for a long time without finding a solution.)

A question about the training process

How long does the training process take?

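Is there a problem with decoding during inference?

Hello,
I have read your paper and code all the way through. The one part I cannot work out is inference.
For the subject, I see
np.where(sub_heads_logits[0] > h_bar)[0], np.where(sub_tails_logits[0] > t_bar)[0]
which extracts all heads and tails, for example
heads = [0, 1]
tails = [4, 5]
The resulting combinations would be
[0,4], [0,5], [1,4], [1,5]
How are the wrong subjects excluded?
Also, when predicting objects,
sub_heads, sub_tails = np.array([sub[1:] for sub in subjects]).T.reshape((2, -1, 1))
passes several head and tail positions into object_net, so seq_gather would presumably have trouble. Following your loop over subjects,
obj_heads, obj_tails = np.where(obj_heads_logits[i] > h_bar), np.where(obj_tails_logits[i] > t_bar)
uses i to pick out the head or tail results for the current subject, but obj_tails_logits does not seem to have that dimension?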
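
The extract_items loop quoted in an earlier issue on this page (`sub_tail = sub_tails[sub_tails >= sub_head]`, then `sub_tail[0]`) does not enumerate every head/tail combination: each head is paired only with the nearest tail at or after it. A pure-Python sketch of that pairing rule, assuming the tail positions are sorted in ascending order as np.where returns them; `pair_spans` is a name introduced here for illustration.

```python
def pair_spans(heads, tails):
    """Pair each head with the nearest tail at or after it; heads with no such tail are dropped."""
    spans = []
    for head in heads:
        candidates = [tail for tail in tails if tail >= head]
        if candidates:
            spans.append((head, candidates[0]))
    return spans


spans = pair_spans([0, 1], [4, 5])
shared_head = pair_spans([0], [2, 4])
print(spans)        # [(0, 4), (1, 4)]
print(shared_head)  # [(0, 2)]
```

With heads [0, 1] and tails [4, 5] this gives two spans rather than four, and tail 5 is never used. The second call shows the cost of the rule: when two entities share a head, only the shorter span is produced.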

Why add '[unused]' token after every meaningful token?

Does it support Chinese, and how can I develop the model?

Can you give me some suggestions and tips?
Thanks!
Looking forward to your reply

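Entity boundary prediction in extract_items

The tail position predicted by subject_model and object_model is the index of the entity's last character. But after the model predicts the positions, extract_items selects the entity with subject = tokens[sub_head: sub_tail] and obj = tokens[obj_head: obj_tail]. Doesn't that leave the entity's last character out of the result? Shouldn't it be subject = tokens[sub_head: sub_tail+1] and obj = tokens[obj_head: obj_tail+1]?

The contents of train.json in nyt.7z look wrong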

https://drive.google.com/file/d/10f24s9gM7NdyO3z5OqQxJgYud4NnCJg3/view

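train.json contains only a single line holding a list of numbers.

Beginner question: reproducing on the NYT dataset gives results about 10% lower, and I cannot tell where the problem is

I followed the README steps exactly.
My reproduced results on the NYT dataset are 0.8030150753768869 (precision), 0.7881627620221975 (recall), 0.7955196017423797 (F1),
far from the 89.7 (precision), 89.5 (recall), 89.6 (F1) in the paper, and I do not know where the problem might be.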
