sakuranew / bert-attributeextraction

Using BERT for attribute extraction in a knowledge graph, by fine-tuning and by feature extraction. The task is attribute extraction for person entries from Baidu Encyclopedia (百度百科).


bert deeplearning ai knowledge-graph relation-extraction nlp feature-extraction fine-tuning attribute-extraction

bert-attributeextraction's Introduction

BERT-Attribute-Extraction

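BERT-based attribute extraction for knowledge graphs

Using BERT for attribute extraction in a knowledge graph with two methods: fine-tuning and feature extraction.

The experiments extract attributes from Baidu Encyclopedia person entries, using BERT-based fine-tuning and BERT-based feature extraction.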

Prerequisites

TensorFlow >= 1.10
scikit-learn

Pre-trained models

BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

Installing

None

Dataset

The dataset is built from person entries in Baidu Encyclopedia. Sentences that do not contain both an entity and an attribute are filtered out.

Entities and attributes are obtained by named entity recognition.

Labels are derived from the Baidu Encyclopedia infobox, and most of them were assigned manually, so some labels are noisy.
For example:

黄维#1904年#1#黄维（1904年-1989年），字悟我，出生于江西贵溪一农户家庭。        
陈昂#山东省滕州市#1#邀请担任诗词嘉宾。1992年1月26日，陈昂出生于山东省滕州市一个普通的知识分子家庭，其祖父、父亲都
陈伟庆#肇庆市鼎湖区#0#长。任免信息2016年10月21日下午，肇庆市鼎湖区八届人大一次会议胜利闭幕。陈伟庆当选区人民政府副区长。
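The sample lines above appear to follow an `entity#attribute value#label#sentence` layout. The sketch below parses one such line under that assumption. The reading of the label (1 = the sentence states the attribute for the entity, 0 = it does not) is inferred from the three samples, and the names `Example` and `parse_line` are hypothetical, not part of the repository.

```python
from typing import NamedTuple


class Example(NamedTuple):
    entity: str    # person name, e.g. "黄维"
    value: str     # candidate attribute value, e.g. "1904年"
    label: int     # assumed: 1 = sentence states the attribute, 0 = it does not
    sentence: str  # source sentence from the encyclopedia entry


def parse_line(line: str) -> Example:
    # Split on the first three '#' only, so a '#' inside the sentence survives.
    entity, value, label, sentence = line.rstrip("\n").split("#", 3)
    return Example(entity, value, int(label), sentence.strip())


sample = "黄维#1904年#1#黄维（1904年-1989年），字悟我，出生于江西贵溪一农户家庭。"
example = parse_line(sample)
print(example.entity, example.value, example.label)
```

Limiting the split to three separators keeps the sentence intact even if it happens to contain a `#` character.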

Getting Started

Run strip.py to produce the stripped data.
Run data_process.py to convert the data into the NumPy files used as model input.
The parameters file holds the parameters needed to run the model.

Running the tests

For example, with the birthplace dataset:

fine-tuning

Run run_classifier.py to get the predicted probabilities:

python run_classifier.py \
        --task_name=my \
        --do_train=true \
        --do_predict=true \
        --data_dir=a \
        --vocab_file=/home/tiny/zhaomeng/bertmodel/vocab.txt \
        --bert_config_file=/home/tiny/zhaomeng/bertmodel/bert_config.json \
        --init_checkpoint=/home/tiny/zhaomeng/bertmodel/bert_model.ckpt \
        --max_seq_length=80 \
        --train_batch_size=32 \
        --learning_rate=2e-5 \
        --num_train_epochs=1.0 \
        --output_dir=./output

Then run proba2metrics.py to get the final metrics together with the misclassified examples.
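As a rough illustration of this step, the sketch below turns per-example class probabilities into per-class precision, recall, and F1, and lists the misclassified indices. It is not the code of proba2metrics.py. It assumes the prediction step yields one row of class probabilities per test example, and the function names `per_class_metrics` and `misclassified` are hypothetical.

```python
from typing import Dict, List, Sequence


def predict(row: Sequence[float]) -> int:
    # Predicted class = index of the highest probability in the row.
    return max(range(len(row)), key=row.__getitem__)


def per_class_metrics(probs: Sequence[Sequence[float]],
                      gold: Sequence[int]) -> Dict[int, Dict[str, float]]:
    pred = [predict(row) for row in probs]
    report = {}
    for c in sorted(set(gold) | set(pred)):
        tp = sum(1 for p, g in zip(pred, gold) if p == c and g == c)
        fp = sum(1 for p, g in zip(pred, gold) if p == c and g != c)
        fn = sum(1 for p, g in zip(pred, gold) if p != c and g == c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[c] = {"precision": precision, "recall": recall,
                     "f1": f1, "support": tp + fn}
    return report


def misclassified(probs: Sequence[Sequence[float]],
                  gold: Sequence[int]) -> List[int]:
    # Indices of examples whose predicted class differs from the gold label.
    return [i for i, (row, g) in enumerate(zip(probs, gold))
            if predict(row) != g]


# Toy data: four examples, two classes.
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
gold = [0, 1, 1, 1]
report = per_class_metrics(probs, gold)
wrong = misclassified(probs, gold)
```

The misclassified indices can then be mapped back to the original sentences for error analysis.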

feature-extraction

Run extract_features.py to get the vector representations of the train and test data in JSON format:

python extract_features.py \
        --input_file=../data/birth_place_train.txt \
        --output_file=../data/birth_place_train.jsonl \
        --vocab_file=/home/tiny/zhaomeng/bertmodel/vocab.txt \
        --bert_config_file=/home/tiny/zhaomeng/bertmodel/bert_config.json \
        --init_checkpoint=/home/tiny/zhaomeng/bertmodel/bert_model.ckpt \
        --layers=-1 \
        --max_seq_length=80 \
        --batch_size=16

Then run json2vector.py to convert the JSON file into vector representations.
Finally, run run_classifier.py to classify the vectors with machine learning methods; an MLP usually performs best.
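To make the conversion step concrete, here is a minimal sketch that reads one line of extract_features.py output and returns a single vector. It relies on the JSONL layout written by the original BERT extract_features.py (a `features` list with one entry per token, each holding `layers` with `index` and `values`). Using the first token ([CLS]) as the sentence vector is an assumption, since json2vector.py may pool differently, and `cls_vector` is a hypothetical name.

```python
import json
from typing import List


def cls_vector(jsonl_line: str, layer_index: int = -1) -> List[float]:
    record = json.loads(jsonl_line)
    first_token = record["features"][0]  # "[CLS]" in BERT's output
    for layer in first_token["layers"]:
        if layer["index"] == layer_index:
            return layer["values"]
    raise KeyError(f"layer {layer_index} not found in record")


# Tiny synthetic record with 3-dimensional vectors instead of 768.
demo_line = json.dumps({
    "linex_index": 0,
    "features": [
        {"token": "[CLS]", "layers": [{"index": -1, "values": [0.1, 0.2, 0.3]}]},
        {"token": "黄", "layers": [{"index": -1, "values": [0.4, 0.5, 0.6]}]},
    ],
})
vector = cls_vector(demo_line)
```

With `--layers=-1` as in the command above, each token carries only the last layer, so the default `layer_index` matches.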

Result

The predicted results and the misclassified examples are saved in the result directory.

For example, on the birthplace dataset with the fine-tuning method, the result is:

            precision    recall  f1-score   support

     0      0.963     0.967     0.965       573
     1      0.951     0.946     0.948       389
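As a quick consistency check on the table, the F1 column should equal the harmonic mean of precision and recall. The short sketch below recomputes it from the reported values; `f1_score` here is a local helper, not a library call.

```python
def f1_score(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per class, copied from the table above.
reported = {0: (0.963, 0.967), 1: (0.951, 0.946)}
recomputed = {label: round(f1_score(p, r), 3)
              for label, (p, r) in reported.items()}
print(recomputed)
```

Both recomputed values agree with the reported F1 scores to three decimals.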

Authors

zhao meng

License

This project is licensed under the MIT License

Acknowledgments


bert-attributeextraction's Issues

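Multi-class vs. binary classification

For more complex relations, such as the different relations between people (parent, teacher-student, spouse, friend, colleague), is it better to build one multi-class classifier or several 0-1 classifiers?

Hello, how is this project related to knowledge graphs?

The project description is "attribute extraction for Baidu Encyclopedia person entries in a knowledge graph, using BERT-based fine-tuning and feature extraction", but everything I see uses BERT for training, evaluation, and prediction, and reports metrics such as precision and recall. It does not seem to have much to do with knowledge graphs.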

How do I install raw_bert?

I ran pip install raw_bert directly and found that no such library exists. So, in these three lines:
from raw_bert import modeling
from raw_bert import optimization
from raw_bert import tokenization
which raw_bert is meant?

why do we need feature-extraction?

Since attributes can be extracted by classifying directly, why do feature extraction as well? Does it give better results?

TypeError: 'NoneType' object is not iterable on birthplace dataset, raised on the second round

ub16c9@ub16c9-gpu:~/ub16_prj/BERT-AttributeExtraction/birthplace$ PYTHONPATH=../ python3 run_classifier.py --task_name=my --do_train=true --do_predict=true --data_dir=date/ --vocab_file=../../../data/bert/chinese_L-12_H-768_A-12/vocab.txt --bert_config_file=../../../data/bert/chinese_L-12_H-768_A-12/bert_config.json --init_checkpoint=../../../data/bert/chinese_L-12_H-768_A-12/bert_model.ckpt --max_seq_length=80 --train_batch_size=32 --learning_rate=2e-5 --num_train_epochs=1.0 --output_dir=./output
reading test data ...
8727
962
WARNING:tensorflow:Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7f59fc83bc80>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Using config: {'_save_summary_steps': 100, '_eval_distribute': None, '_evaluation_master': '', '_cluster': None, '_experimental_distribute': None, '_protocol': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f59fc4dab70>, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_task_id': 0, '_save_checkpoints_steps': 1000, '_global_id_in_cluster': 0, '_tf_random_seed': None, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_service': None, '_log_step_count_steps': None, '_num_worker_replicas': 1, '_train_distribute': None, '_tpu_config': TPUConfig(iterations_per_loop=1000, num_shards=8, num_cores_per_replica=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None), '_device_fn': None, '_save_checkpoints_secs': None, '_keep_checkpoint_max': 5, '_model_dir': './output', '_task_type': 'worker', '_keep_checkpoint_every_n_hours': 10000}
INFO:tensorflow:_TPUContext: eval_on_tpu True
WARNING:tensorflow:eval_on_tpu ignored because use_tpu is False.
INFO:tensorflow:Writing example 0 of 8727
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: 0-0
INFO:tensorflow:tokens: [CLS] 表大会常务委员会的决定任免驻外大使：任命王雪峰为中华人民共和国驻萨摩亚独立国特命全权大使。 [SEP]
INFO:tensorflow:input_ids: 101 6134 1920 833 2382 1218 1999 1447 833 4638 1104 2137 818 1048 7728 1912 1920 886 8038 818 1462 4374 7434 2292 711 704 1290 782 3696 1066 1469 1744 7728 5855 3040 762 4324 4989 1744 4294 1462 1059 3326 1920 886 511 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:label: 0 (id = 0)

INFO:tensorflow: name = bert/pooler/dense/kernel:0, shape = (768, 768), INIT_FROM_CKPT
INFO:tensorflow: name = bert/pooler/dense/bias:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = output_weights:0, shape = (2, 768)
INFO:tensorflow: name = output_bias:0, shape = (2,)
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
2019-01-02 22:28:29.025812: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-02 22:28:29.025856: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-02 22:28:29.025874: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-01-02 22:28:29.025880: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-01-02 22:28:29.025991: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10080 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from ./output/model.ckpt-272
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:prediction_loop marked as finished
INFO:tensorflow:prediction_loop marked as finished
k is 2
Traceback (most recent call last):
  File "run_classifier.py", line 850, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "run_classifier.py", line 704, in main
    processor = processors[task_name]()
  File "run_classifier.py", line 195, in __init__
    self.train, self.test = self._create_examples(data_process.data_gen, k)
  File "run_classifier.py", line 218, in _create_examples
    (train_x, test_x), (train_y, test_y) = f(k)
TypeError: 'NoneType' object is not iterable
ub16c9@ub16c9-gpu:~/ub16_prj/BERT-AttributeExtraction/birthplace$

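Hello, is data_dir = a at prediction time?

According to the code
predict_examples = processor.get_test_examples(FLAGS.data_dir), the examples are read from data_dir, and the run argument --data_dir=a \ sets data_dir to a.

But the project directory has no folder named a.

How should the data be labeled? What do the 1 and 0 in the labeled data mean? I don't quite understand.

Can this be used to extract arbitrary entity relations?

Could you release the dataset?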

comparison with relation extraction method by biGRU

https://github.com/crownpku/Information-Extraction-Chinese/tree/master/RE_BGRU_2ATT

That project uses a biGRU for relation extraction. Could you run your method on its dataset and compare the accuracy of the two approaches?

4 methods to process raw data

File "run_classifier.py", line 668
for k in range(1, 4):
"data_process.py" has 4 methods to process raw data, but I commented out 3 of them and forgot to change k.

Originally posted by @sakuranew in #2 (comment)
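The failure described above can be reproduced in miniature: a function that handles only some values of k falls off the end for the others and returns None, which then cannot be unpacked. The sketch below uses a stand-in `data_gen` (the real one lives in data_process.py and is not shown here) and a hypothetical `load_split` wrapper that fails with a clearer message.

```python
def data_gen(k):
    # Stand-in loader where only method 1 is still enabled.
    if k == 1:
        return (["train sentence"], ["test sentence"]), ([1], [0])
    # Any other k falls through and implicitly returns None.


def load_split(f, k):
    result = f(k)
    if result is None:
        # Fail early with a message that names the unsupported k.
        raise ValueError(f"data_gen has no processing method for k={k}")
    (train_x, test_x), (train_y, test_y) = result
    return train_x, test_x, train_y, test_y


split = load_split(data_gen, 1)

try:
    # Unpacking None directly raises the TypeError seen in the traceback.
    (train_x, test_x), (train_y, test_y) = data_gen(2)
except TypeError as err:
    unpack_error = err
```

Restricting the loop to the values of k that data_gen still handles avoids the error altogether.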

Do you mind explaining the differences among the four methods for processing raw data? What is the purpose of having four methods? And what is the difference between the strip version and the non-strip version? Thank you.

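How do I build a classifier for event extraction?

For example, given a passage like "On December 29, 2018, the AI startup 北京阿博茨科技有限公司 (hereinafter "阿博茨科技") announced the completion of a USD 30 million Series B funding round; the main investors in this round were Mindworks概念资本, SIG海纳亚洲, and 启明创投.", how would one build a classifier to extract this funding event?

The relations here include: "北京阿博茨科技有限公司 - funding date - December 29, 2018", "北京阿博茨科技有限公司 - funding amount - Series B", "北京阿博茨科技有限公司 - funding amount - USD 30 million", "北京阿博茨科技有限公司 - investors - Mindworks概念资本, SIG海纳亚洲, 启明创投", and so on.

Following your suggestion, would this also mean building multiple relation classifiers and checking each relation one by one?

raw_bert does not exist

Following the README, running run_classifier.py fails because from raw_bert cannot be found.
On inspection, raw_bert is listed in .gitignore.
Does that mean this folder was never committed, or is it hosted in another git repository?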