
weizhepei / casrel


A Novel Cascade Binary Tagging Framework for Relational Triple Extraction. Accepted by ACL 2020.

Home Page: https://arxiv.org/abs/1909.03227

License: MIT License

Python 100.00%
knowledge-graph relation-extraction relational-triple-extraction information-extraction keras bert

casrel's Introduction

A Novel Cascade Binary Tagging Framework for Relational Triple Extraction

This repository contains the source code and dataset for the paper: A Novel Cascade Binary Tagging Framework for Relational Triple Extraction. Zhepei Wei, Jianlin Su, Yue Wang, Yuan Tian and Yi Chang. ACL 2020. [pdf]

Overview

At the core of the proposed CasRel framework is the fresh perspective that instead of treating relations as discrete labels on entity pairs, we actually model the relations as functions that map subjects to objects. More precisely, instead of learning relation classifiers f(s,o) -> r, we learn relation-specific taggers f_{r}(s) -> o, each of which recognizes the possible object(s) of a given subject under a specific relation. Under this framework, relational triple extraction is a two-step process: first we identify all possible subjects in a sentence; then for each subject, we apply relation-specific taggers to simultaneously identify all possible relations and the corresponding objects.
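
The two-step process described above can be illustrated with a toy sketch. The "taggers" here are hand-written lookup tables standing in for the learned subject tagger and the relation-specific object taggers; the names `tag_subjects`, `OBJECT_TAGGERS`, and `extract_triples` are hypothetical and do not come from the repository.

```python
# Toy illustration of the cascade: find subjects first, then apply one
# object tagger per relation to each subject. Everything here is an
# assumption for illustration; the real taggers are BERT-based models.

def tag_subjects(sentence):
    """Stand-in for the subject tagger: return every candidate subject."""
    known_subjects = ["Bananaman"]
    return [s for s in known_subjects if s in sentence]

# One function per relation, f_r(s) -> objects. An empty list means the
# relation does not hold for that subject in this sentence.
OBJECT_TAGGERS = {
    "starring": lambda sentence, subj: ["Tim Brooke-Taylor"] if subj == "Bananaman" else [],
    "creator": lambda sentence, subj: ["Steve Bright"] if subj == "Bananaman" else [],
    "birthplace": lambda sentence, subj: [],
}

def extract_triples(sentence):
    triples = []
    for subj in tag_subjects(sentence):                      # step 1: subjects
        for relation, tagger in OBJECT_TAGGERS.items():      # step 2: per-relation taggers
            for obj in tagger(sentence, subj):
                triples.append((subj, relation, obj))
    return triples

sentence = ("Tim Brooke-Taylor was the star of Bananaman , an STV series "
            "first aired on 10/03/1983 and created by Steve Bright .")
triples = extract_triples(sentence)
```

A relation whose tagger returns nothing (here `birthplace`) simply contributes no triple, which is how relations and objects are identified simultaneously in this framing.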


Requirements

This repo was tested on Python 3.7 and Keras 2.2.4. The main requirements are:

  • tqdm
  • codecs
  • keras-bert = 0.80.0
  • tensorflow-gpu = 1.13.1

Datasets

Usage

  1. Get pre-trained BERT model for Keras

    Download Google's pre-trained BERT model (BERT-Base, Cased). Then decompress it under pretrained_bert_models/. More pre-trained models are available here.

  2. Build dataset in the form of triples

    Take the NYT dataset for example:

    a) Switch to the corresponding directory and download the dataset

    cd CasRel/data/NYT/raw_NYT

    b) Follow the instructions in the same directory, then run

    python generate.py

    c) Finally, build dataset in the form of triples

    cd CasRel/data/NYT
    python build_data.py

    This converts the raw numerical dataset into the format our model expects and generates train.json, test.json and val.json. If val.json is not provided in the raw dataset, 5% or 10% of the data is randomly sampled from train.json or test.json to create it, in line with previous work. Then split the test dataset by type and num for in-depth analysis of different scenarios of overlapping triples.

  3. Specify the experimental settings

    By default, we use the following settings in run.py:

    {
        "bert_model": "cased_L-12_H-768_A-12",
        "max_len": 100,
        "learning_rate": 1e-5,
        "batch_size": 6,
        "epoch_num": 100,
    }
  4. Train and select the model

    Specify the running mode and dataset at the command line

    python run.py --train=True --dataset=NYT

    The model weights that lead to the best performance on the validation set will be stored in saved_weights/DATASET/.

  5. Evaluate on the test set

    Specify the test dataset at the command line

    python run.py --dataset=NYT

    The extracted result will be saved in results/DATASET/ with the following format:

    {
        "text": "Tim Brooke-Taylor was the star of Bananaman , an STV series first aired on 10/03/1983 and created by Steve Bright .",
        "triple_list_gold": [
            {
                "subject": "Bananaman",
                "relation": "starring",
                "object": "Tim Brooke-Taylor"
            },
            {
                "subject": "Bananaman",
                "relation": "creator",
                "object": "Steve Bright"
            }
        ],
        "triple_list_pred": [
            {
                "subject": "Bananaman",
                "relation": "starring",
                "object": "Tim Brooke-Taylor"
            },
            {
                "subject": "Bananaman",
                "relation": "creator",
                "object": "Steve Bright"
            }
        ],
        "new": [],
        "lack": []
    }
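
As a sketch of how records in this format could be scored, the snippet below computes micro-averaged precision, recall, and F1 from the `triple_list_gold` and `triple_list_pred` fields. The exact-match rule on (subject, relation, object) and the helper names `triple_set` and `score` are assumptions for illustration; the repository's own evaluation code may differ.

```python
# Sketch: micro-averaged precision/recall/F1 over result records.
# Assumes a predicted triple counts as correct only if subject, relation,
# and object all match a gold triple exactly.

def triple_set(triples):
    return {(t["subject"], t["relation"], t["object"]) for t in triples}

def score(records):
    correct = predicted = gold = 0
    for record in records:
        gold_set = triple_set(record["triple_list_gold"])
        pred_set = triple_set(record["triple_list_pred"])
        correct += len(gold_set & pred_set)
        predicted += len(pred_set)
        gold += len(gold_set)
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

starring = {"subject": "Bananaman", "relation": "starring", "object": "Tim Brooke-Taylor"}
creator = {"subject": "Bananaman", "relation": "creator", "object": "Steve Bright"}
records = [
    # Every gold triple is predicted.
    {"triple_list_gold": [starring, creator], "triple_list_pred": [starring, creator]},
    # One gold triple is missed.
    {"triple_list_gold": [starring, creator], "triple_list_pred": [starring]},
]
precision, recall, f1 = score(records)
```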

Citation

@inproceedings{wei2020CasRel,
  title={A Novel Cascade Binary Tagging Framework for Relational Triple Extraction},
  author={Wei, Zhepei and Su, Jianlin and Wang, Yue and Tian, Yuan and Chang, Yi},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  pages={1476--1488},
  year={2020}
}


casrel's Issues

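On the input to the Relation-specific Object Taggers

During training, is the input to the Relation-specific Object Taggers the output predicted by the Subject Tagger, or the gold subject boundaries? If it is the predicted output, how is the loss constructed when the Subject Tagger predicts incorrectly?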

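About the ExactMatch issue

Hello:
I selected ExactMatch at test time, but the resulting entities still contain only the head word. I looked at the training and test data, and it seems that all entities are head words only. If I want to try ExactMatch, do I need to reprocess the data myself to produce training data with entity boundaries? Thanks.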

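Out of memory

After many training epochs, the model takes up a large amount of memory and eventually runs out. I suspect the evaluate step is the cause. Has anyone else run into this?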

GPU usage

A question for the author: why does the program run, but on the CPU rather than the GPU?

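Entity boundary prediction in extract_items

The tail position predicted by subject_model and object_model is the index of the entity's last token. However, in extract_items, after the model predicts the positions, the entity is selected with subject = tokens[sub_head: sub_tail] and obj = tokens[obj_head: obj_tail]. Doesn't this leave the entity's last token out of the result? Shouldn't it be subject = tokens[sub_head: sub_tail+1] and obj = tokens[obj_head: obj_tail+1]?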
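
The slicing question can be checked in isolation. The snippet below is a minimal sketch with a made-up token list; it assumes, as the issue does, that the tail index is inclusive (the data loader quoted elsewhere on this page builds gold spans as `head + len(entity) - 1`).

```python
# Python slices exclude the stop index, so an inclusive tail needs +1.
tokens = ["[CLS]", "New", "York", "City", "is", "large", "[SEP]"]
sub_head, sub_tail = 1, 3  # inclusive span covering "New York City"

without_plus_one = tokens[sub_head:sub_tail]      # stops before index 3
with_plus_one = tokens[sub_head:sub_tail + 1]     # includes index 3
```

For a single-token entity, where head and tail are equal, the slice without `+ 1` is empty.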

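Wiki-KBP

The paper says the Wiki-KBP training set has more than 50,000 sentences, but the one I downloaded from the link has only a little over 20,000. Is the data correct?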

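Is there a problem with decoding during inference?

hello~
I have read through your paper and code. The part I can't figure out is inference.
For the subject,
I see that np.where(sub_heads_logits[0] > h_bar)[0], np.where(sub_tails_logits[0] > t_bar)[0]
extracts all the heads and tails:
heads = [0,1]
tails = [4, 5]
The final combinations are
[0,4] [0,5], [1,4], [1,5]
How are the incorrect subjects excluded?
Also, when predicting objects,
sub_heads, sub_tails = np.array([sub[1:] for sub in subjects]).T.reshape((2, -1, 1))
multiple head and tail positions are fed into object_net, so seq_gather should run into problems, right? Following your loop over subjects,
obj_heads, obj_tails = np.where(obj_heads_logits[i] > h_bar), np.where(obj_tails_logits[i] > t_bar)
the index i is used here to take the head or tail results for the current subject, but obj_tails_logits does not seem to have this dimension?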
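
For reference, the `extract_items` loop quoted in a later issue on this page pairs each head with the nearest tail at or after it rather than forming every head/tail combination. The sketch below reproduces that pairing rule in plain Python with the indices from this issue; `pair_nearest_tail` is a hypothetical helper name, not a function from the repository.

```python
# Sketch of nearest-tail pairing: each head is matched with the first
# tail position that is >= the head, so no Cartesian product is formed.

def pair_nearest_tail(heads, tails):
    spans = []
    for head in heads:
        candidates = [t for t in tails if t >= head]
        if candidates:
            spans.append((head, min(candidates)))
    return spans

heads = [0, 1]
tails = [4, 5]
spans = pair_nearest_tail(heads, tails)
```

With these inputs the rule yields two spans instead of four. It does not by itself decide whether either span is a correct subject.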

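Hello, a question about the keras-bert tokenizer

The code uses the tokenizer from keras-bert, but this tokenizer seems to behave somewhat unusually. For example:

{
    "text":"三国中的谋士很多,但是谋士也要分不同的类别,有的善于统筹全局,有的善于战术规划,有的善于外交连横,不过说实话,其中大部分知名谋士的结局都不太好,如:荀彧被曹操逼死,陆逊被孙权气死,就连大家最敬仰的诸葛亮也是被军国大事给累死,但是有一个谋士不但得到了善终,而且还位高权重,关键就在于他在生涯中的五次站队都成功了,我们来看看吧",
    "triple_list":[
        [
            "陆逊",
            "朝代",
            "三国"
        ]
    ]
}

For this example, text is tokenized into ['[CLS]', '##三', '##国', ... , '[unused1]', '[SEP]'], and the corresponding subject is tokenized into ['##陆', '##逊', '[unused1]']. Is this intentional, or just a bug? If it works this way, the positions of the subject and object cannot be found in the original input sequence, so the corresponding labels cannot be generated. (bert-base-chinese; the vocab is the vocab.txt that ships with BERT.)

Yet the code in the data_generator stage seems to deliberately avoid the trailing unused1 token?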

                for triple in line['triple_list']:
                    # The -1 below drops the final token. Is this some special tokenization mechanism, or just a bug???
                    triple = (self.tokenizer.tokenize(triple[0])[1:-1], triple[1], self.tokenizer.tokenize(triple[2])[1:-1])
                    sub_head_idx = find_head_idx(tokens, triple[0])
                    obj_head_idx = find_head_idx(tokens, triple[2])
                    if sub_head_idx != -1 and obj_head_idx != -1:
                        sub = (sub_head_idx, sub_head_idx + len(triple[0]) - 1)
                        if sub not in s2ro_map:
                            s2ro_map[sub] = []
                        s2ro_map[sub].append((obj_head_idx,
                                           obj_head_idx + len(triple[2]) - 1,
                                           self.rel2id[triple[1]]))

In addition, keras-bert 0.80.0 does not seem to be usable.
My environment is as follows:

keras == 2.4.3
keras-bert == 0.81.1
tensorflow-gpu == 1.13.1

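In summary, my questions are:

  1. Is the tokenization mechanism in the code some special approach, or just a bug?
  2. Compared with the regular tokenizers in packages such as transformers, does this tokenization approach have any advantage?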
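
The mismatch described in this issue can be reproduced with a small sketch. `find_head_idx` is named in the quoted code but its body is not shown on this page, so the version below is an assumed implementation (a plain contiguous-sublist search), and the token lists are shortened, made-up stand-ins for the real tokenizer output.

```python
# Assumed implementation: first index at which `target` occurs as a
# contiguous sublist of `source`, or -1 if it does not occur.
def find_head_idx(source, target):
    n = len(target)
    for i in range(len(source) - n + 1):
        if source[i:i + n] == target:
            return i
    return -1

# Shortened stand-in for a tokenized Chinese sentence: with no spaces in
# the text, the '[unused1]' marker appears only once, at the very end.
text_tokens = ["[CLS]", "##陆", "##逊", "##被", "##孙", "##权", "[unused1]", "[SEP]"]

subject_with_marker = ["##陆", "##逊", "[unused1]"]
subject_without_marker = ["##陆", "##逊"]

idx_with_marker = find_head_idx(text_tokens, subject_with_marker)
idx_without_marker = find_head_idx(text_tokens, subject_without_marker)
```

Under these assumptions the subject that carries a trailing marker is not found, because in the sentence it is followed by another character rather than by the marker.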

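About the same entity appearing multiple times

When doing relation extraction, did the author consider that the same entity may appear several times in one sentence? For example, the first occurrence takes part in a relational triple while the second occurrence does not.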

Data

Why are the head and tail entities in the triples all single words?

The loss function does not seem to include relation information?

model.py
line 52-65
gold_sub_heads = K.expand_dims(gold_sub_heads, 2)
gold_sub_tails = K.expand_dims(gold_sub_tails, 2)

sub_heads_loss = K.binary_crossentropy(gold_sub_heads, pred_sub_heads)
sub_heads_loss = K.sum(sub_heads_loss * mask) / K.sum(mask)
sub_tails_loss = K.binary_crossentropy(gold_sub_tails, pred_sub_tails)
sub_tails_loss = K.sum(sub_tails_loss * mask) / K.sum(mask)

obj_heads_loss = K.sum(K.binary_crossentropy(gold_obj_heads, pred_obj_heads), 2, keepdims=True)
obj_heads_loss = K.sum(obj_heads_loss * mask) / K.sum(mask)
obj_tails_loss = K.sum(K.binary_crossentropy(gold_obj_tails, pred_obj_tails), 2, keepdims=True)
obj_tails_loss = K.sum(obj_tails_loss * mask) / K.sum(mask)

loss = (sub_heads_loss + sub_tails_loss) + (obj_heads_loss + obj_tails_loss)

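I only see losses for the head and tail entities. Is the relation loss computed together with the tail entity, or in some other way? I also do not see any relation information in the input, apart from the number of relations.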
line 43-43
pred_obj_heads = Dense(num_rels, activation='sigmoid')(tokens_feature)
pred_obj_tails = Dense(num_rels, activation='sigmoid')(tokens_feature)

I don't quite understand this part and would appreciate an explanation. Thanks.
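
One reading of the quoted code, offered as an assumption rather than a confirmed answer from the author: since the object heads and tails come from `Dense(num_rels, activation='sigmoid')`, each token gets one score per relation, so the relation is carried by the last axis of the object label tensors. The sketch below builds such labels in plain Python from (obj_head, obj_tail, rel_id) tuples like those stored in `s2ro_map`; `build_object_labels` is a hypothetical helper.

```python
# Sketch: object labels with shape (seq_len, num_rels). A 1 at
# [position][rel_id] marks an object head (or tail) for that relation,
# so the relation is encoded by which column is set.

def build_object_labels(seq_len, num_rels, objects):
    obj_heads = [[0] * num_rels for _ in range(seq_len)]
    obj_tails = [[0] * num_rels for _ in range(seq_len)]
    for obj_head, obj_tail, rel_id in objects:
        obj_heads[obj_head][rel_id] = 1
        obj_tails[obj_tail][rel_id] = 1
    return obj_heads, obj_tails

rel2id = {"starring": 0, "creator": 1}
# Two objects of one subject: tokens 0..1 under "starring", tokens 4..5 under "creator".
objects = [(0, 1, rel2id["starring"]), (4, 5, rel2id["creator"])]
obj_heads, obj_tails = build_object_labels(seq_len=6, num_rels=len(rel2id), objects=objects)
```

Under this reading, a binary cross-entropy over these tensors penalizes a wrong relation in the same term that penalizes a wrong object position.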

IndexError: Read less bytes than requested

I get the following error while running the run.py file

!python run.py --train=True --dataset=NYT

Using TensorFlow backend.
2020-07-03 14:29:11.236282: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-03 14:29:11.241179: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2200000000 Hz
2020-07-03 14:29:11.241354: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1e66a00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-03 14:29:11.241383: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-07-03 14:29:11.243340: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-03 14:29:11.248062: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-07-03 14:29:11.248099: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: e54567bf18c0
2020-07-03 14:29:11.248114: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: e54567bf18c0
2020-07-03 14:29:11.248169: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 418.67.0
2020-07-03 14:29:11.248200: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 418.67.0
2020-07-03 14:29:11.248212: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 418.67.0
train_data len: 56195
dev_data len: 4999
test_data len: 1297
Traceback (most recent call last):
  File "run.py", line 40, in <module>
    subject_model, object_model, hbt_model = E2EModel(bert_config_path, bert_checkpoint_path, LR, num_rels)
  File "/content/CasRel/model.py", line 15, in E2EModel
    bert_model = load_trained_model_from_checkpoint(bert_config_path, bert_checkpoint_path, seq_len=None)
  File "/usr/local/lib/python3.6/dist-packages/keras_bert/loader.py", line 170, in load_trained_model_from_checkpoint
    load_model_weights_from_checkpoint(model, config, checkpoint_file, training=training)
  File "/usr/local/lib/python3.6/dist-packages/keras_bert/loader.py", line 114, in load_model_weights_from_checkpoint
    loader('bert/encoder/layer_%d/output/dense/kernel' % i),
  File "/usr/local/lib/python3.6/dist-packages/keras_bert/loader.py", line 18, in _loader
    return tf.train.load_variable(checkpoint_file, name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/checkpoint_utils.py", line 85, in load_variable
    return reader.get_tensor(name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/py_checkpoint_reader.py", line 70, in get_tensor
    self, compat.as_bytes(tensor_str))
IndexError: Read less bytes than requested

Please do let me know if there is any solution to this problem. Thanks in advance! :D

Problems with Wiki-KBP dataset

We can only get a training dataset with 23k sentences from your link, but the paper mentions one with 79k sentences. Is there a problem with the link? Please check again. Alternatively, could you please send the 79k version of the training dataset to our email ([email protected])? This is just for a fair comparison. Thank you!

A problem about testing

After training, only the hbt model is saved. At test time, how can we load the subject and object models?

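Using CopyR's NYT dataset

I am using CopyR's NYT dataset. When it processes entities, it keeps only the last word.
For example: Suffolk County -> County.

CasRel also uses this dataset, converting the numericalized dataset back to text and saving the relational triples.
According to the CasRel paper, wouldn't the subject start and end then both point to County? Is this a problem?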

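Dataset files need to be fixed

CasRel/data/NYT/raw_NYT/generate.py

In the files produced by this code, every entity consists of only one word. For example, New York becomes York.

The same problem exists for WebNLG.

However, the sample shown in the project README is normal, which suggests the problem lies in the shared files. Could you please fix the files shared on Google Drive? Thanks!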

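Results in Table 2?

Hello, are the results in Table 2 for both the NYT and WebNLG corpora obtained with single-token subjects/objects? Or did you only take the head part of the subject/object during processing?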

Keras tensorflow version issue:

Hi,

While setting up, I hit the error below when running ! python run.py ---train=True --dataset=NYT
It looks like a versioning issue. Could you please share a requirements.txt, or the packages used along with their specific versions?

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/keras/__init__.py", line 3, in <module>
    from tensorflow.keras.layers.experimental.preprocessing import RandomRotation
ModuleNotFoundError: No module named 'tensorflow.keras'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run.py", line 3, in <module>
    from model import E2EModel, Evaluate
  File "/content/CasRel/model.py", line 2, in <module>
    from keras.layers import *
  File "/usr/local/lib/python3.6/dist-packages/keras/__init__.py", line 6, in <module>
    'Keras requires TensorFlow 2.2 or higher. '
ImportError: Keras requires TensorFlow 2.2 or higher. Install TensorFlow via pip install tensorflow

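A problem with the decoding process

How does the model handle the following problem? I would sincerely appreciate an answer, as I could not work it out from the code:

  1. Entities with overlapping heads, such as "北京市政府" (Beijing Municipal Government) and "北京大学" (Peking University): for the head of each subject/object, the code takes the nearest tail, so can this case not be handled?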

HBTokenizer

import unicodedata

from keras_bert import Tokenizer


class HBTokenizer(Tokenizer):
    def _tokenize(self, text):
        if not self._cased:
            # Strip accents and lowercase for uncased models.
            text = unicodedata.normalize('NFD', text)
            text = ''.join([ch for ch in text if unicodedata.category(ch) != 'Mn'])
            text = text.lower()
        # Drop null, replacement, and control characters.
        spaced = ''
        for ch in text:
            if ord(ch) == 0 or ord(ch) == 0xfffd or self._is_control(ch):
                continue
            else:
                spaced += ch
        tokens = []
        for word in spaced.strip().split():
            tokens += self._word_piece_tokenize(word)
            # Append a marker token after each whitespace-separated word.
            tokens.append('[unused1]')
        return tokens

Why is '[unused1]' appended?

About subject

sub_head, sub_tail = choice(list(s2ro_map.keys())). Why is one of the subjects in the sentence selected at random when sampling? What is the purpose of this? Thanks.
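
To make the behaviour of that line concrete, the sketch below draws a subject repeatedly from a made-up `s2ro_map`. It only illustrates what uniform random choice does over many draws; whether this is the author's motivation is not stated on this page, and the map contents and the seeded `random.Random` instance are assumptions for illustration.

```python
import random
from collections import Counter

# Hypothetical mapping: subject span -> list of (obj_head, obj_tail, rel_id).
s2ro_map = {
    (1, 1): [(5, 6, 0)],
    (8, 9): [(12, 12, 1), (14, 15, 0)],
}

rng = random.Random(0)  # seeded only to make the sketch repeatable

# One draw per "visit" of this sentence, e.g. once per epoch.
draws = Counter(rng.choice(list(s2ro_map.keys())) for _ in range(1000))
```

Each individual draw uses a single subject, but across many visits of the same sentence every subject in the map gets selected.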

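Does the author plan to share a PyTorch version of the code?

Hello, and many thanks for sharing your code!

As my own skills are limited and I am not very familiar with TensorFlow and Keras, I have recently been implementing a PyTorch version based on your code. However, the reproduced performance is much worse.

Do you plan to release a PyTorch version of the code?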

Why do many entities in the triplet.json file generated for WebNLG contain only the tail part?

{
    "text": "Alan Bean ( of the United States ) was a crew member of NASA 's Apollo 12 under the commander David Scott .",
    "triple_list": [
        [
            "Bean",
            "was a crew member of",
            "12"
        ],
        [
            "Bean",
            "nationality",
            "States"
        ],
        [
            "12",
            "operator",
            "NASA"
        ],
        [
            "12",
            "commander",
            "Scott"
        ]
    ]
}
Take this example: shouldn't the entities be Alan Bean and Apollo 12? Results measured on data like this are hardly convincing, are they?

Training Time

How much time does it take for an epoch? I got a 16 GB GPU.

Maybe some mistakes on WebNLG reported scores

The paper reports 89.4, 92.2, and 94.7 on WebNLG-Normal, WebNLG-EPO, and WebNLG-SEO. In my reproduction, however, when the F1 score on WebNLG reaches the 91.8 reported in the paper, it is hard to exceed 92.5 on WebNLG-SEO but easy to reach 94.0 on WebNLG-EPO. It seems the paper may have swapped the WebNLG-EPO and WebNLG-SEO scores. By my count, WebNLG-Normal, WebNLG-EPO, and WebNLG-SEO have 246, 26, and 457 samples and 246, 98, and 1345 relational triplets, respectively. WebNLG-SEO is clearly the dominant part, with about 13 times as many triplets as WebNLG-EPO and 5 times as many as WebNLG-Normal. Because WebNLG-Normal and WebNLG-EPO contribute so few triplets, they are unlikely to pull the overall score down from 94.7 to 91.8. So it seems unlikely that CasRel achieves 91.8 on the entire WebNLG but 94.7 on WebNLG-SEO.

Please check it again, Thanks~
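
The argument above can be turned into a rough back-of-the-envelope check. The sketch below takes a triplet-weighted average of the three subset scores, once as reported and once with the EPO and SEO scores swapped. This is only an approximation under stated assumptions: F1 does not combine linearly across subsets, and the subsets are treated as disjoint, which the issue does not confirm.

```python
# Rough check: triplet-weighted average of subset F1 scores.
# Counts and scores are the ones quoted in the issue; the weighting
# scheme itself is an assumption, not the paper's evaluation method.

triplets = {"Normal": 246, "EPO": 98, "SEO": 1345}

def weighted_average(scores, weights):
    total = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total

as_reported = weighted_average({"Normal": 89.4, "EPO": 92.2, "SEO": 94.7}, triplets)
swapped = weighted_average({"Normal": 89.4, "EPO": 94.7, "SEO": 92.2}, triplets)

seo_to_epo = triplets["SEO"] / triplets["EPO"]
seo_to_normal = triplets["SEO"] / triplets["Normal"]
```

Under this approximation the swapped assignment lands closer to the overall 91.8 than the reported one does, which is the pattern the issue describes.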

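A question about extract_items

Is there a problem with the part of extract_items that retrieves the text of the subject? I can't work it out and would appreciate an answer.

In the code below, sub_heads and sub_tails are the candidate positions for all possible heads and tails.

  • sub_head and sub_tail should both be endpoints of a closed span, so why isn't sub_tail+1 used when slicing tokens?
    sub_heads_logits, sub_tails_logits = subject_model.predict([token_ids, segment_ids])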
    sub_heads, sub_tails = np.where(sub_heads_logits[0] > h_bar)[0], np.where(sub_tails_logits[0] > t_bar)[0]
    subjects = []
    for sub_head in sub_heads:
        sub_tail = sub_tails[sub_tails >= sub_head]
        if len(sub_tail) > 0:
            sub_tail = sub_tail[0] 
            subject = tokens[sub_head: sub_tail]
            subjects.append((subject, sub_head, sub_tail)) 

Judging from the code in data_loader, when the gold labels are built, the subject and object text spans are both closed intervals (note sub_head_idx + len(triple[0]) - 1):

                for triple in line['triple_list']:
                    triple = (self.tokenizer.tokenize(triple[0])[1:-1], triple[1], self.tokenizer.tokenize(triple[2])[1:-1])
                    sub_head_idx = find_head_idx(tokens, triple[0])
                    obj_head_idx = find_head_idx(tokens, triple[2])
                    if sub_head_idx != -1 and obj_head_idx != -1:
                        sub = (sub_head_idx, sub_head_idx + len(triple[0]) - 1)
                        if sub not in s2ro_map:
                            s2ro_map[sub] = []
                        s2ro_map[sub].append((obj_head_idx,
                                           obj_head_idx + len(triple[2]) - 1,
                                           self.rel2id[triple[1]]))

A question

Hello, what is the difference between this paper and this blog post by Jianlin Su? Blog
