sakuranew / bert-attributeextraction

Using BERT for attribute extraction in a knowledge graph, via fine-tuning and feature extraction. Attributes are extracted from Baidu Encyclopedia (Baidu Baike) person entries.

Python 100.00%
bert deeplearning ai knowledge-graph relation-extraction nlp feature-extraction fine-tuning attribute-extraction

bert-attributeextraction's Introduction

BERT-Attribute-Extraction

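BERT-based attribute extraction for knowledge graphs

Using BERT for attribute extraction in a knowledge graph with two methods: fine-tuning and feature extraction.

The experiments extract attributes of person entries from Baidu Encyclopedia (Baidu Baike), using both BERT fine-tuning and BERT feature extraction.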

Prerequisites

TensorFlow >= 1.10
scikit-learn

Pre-trained models

BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

Installing

None

Dataset

The dataset is built from Baidu Encyclopedia person entries. Text that does not contain entities and attributes is filtered out.

Entities and attributes are obtained by named entity recognition.

Labels are obtained from the Baidu Encyclopedia infobox, and most of them are assigned manually, so some labels are inaccurate.
For example:

黄维#1904年#1#黄维(1904年-1989年),字悟我,出生于江西贵溪一农户家庭。        
陈昂#山东省滕州市#1#邀请担任诗词嘉宾。1992年1月26日,陈昂出生于山东省滕州市一个普通的知识分子家庭,其祖父、父亲都
陈伟庆#肇庆市鼎湖区#0#长。任免信息2016年10月21日下午,肇庆市鼎湖区八届人大一次会议胜利闭幕。陈伟庆当选区人民政府副区长。
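
Each sample line above appears to hold four `#`-separated fields: the entity, the candidate attribute value, a 0/1 label, and the source sentence. A minimal parsing sketch under that assumption follows; `Sample` and `parse_line` are illustrative names, not part of the repository.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    entity: str      # e.g. a person name
    attribute: str   # candidate attribute value found in the sentence
    label: int       # 1 if the sentence states the attribute, else 0
    text: str        # the source sentence


def parse_line(line: str) -> Sample:
    # Split on the first three '#' only, so a '#' inside the sentence survives.
    entity, attribute, label, text = line.rstrip("\n").split("#", 3)
    return Sample(entity, attribute, int(label), text.strip())


sample = parse_line("黄维#1904年#1#黄维(1904年-1989年),字悟我,出生于江西贵溪一农户家庭。")
print(sample.entity, sample.attribute, sample.label)
```

Limiting the split to three separators is a defensive choice: it keeps the sentence intact even if it happens to contain a `#` character.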

Getting Started

  • Run strip.py to produce the stripped data.
  • Run data_process.py to convert the data into NumPy input files.
  • The parameters file holds the parameters needed to run the model.

Running the tests

For example, with the birthplace dataset:

  • fine-tuning

    • Run run_classifier.py to get the predicted probabilities:
    python run_classifier.py \
            --task_name=my \
            --do_train=true \
            --do_predict=true \
            --data_dir=a \
            --vocab_file=/home/tiny/zhaomeng/bertmodel/vocab.txt \
            --bert_config_file=/home/tiny/zhaomeng/bertmodel/bert_config.json \
            --init_checkpoint=/home/tiny/zhaomeng/bertmodel/bert_model.ckpt \
            --max_seq_length=80 \
            --train_batch_size=32 \
            --learning_rate=2e-5 \
            --num_train_epochs=1.0 \
            --output_dir=./output
    • Then run proba2metrics.py to get the final metrics together with the misclassified examples.
  • feature-extraction

    • Run extract_features.py to get vector representations of the training and test data in JSON Lines format:
    python extract_features.py \
            --input_file=../data/birth_place_train.txt \
            --output_file=../data/birth_place_train.jsonl \
            --vocab_file=/home/tiny/zhaomeng/bertmodel/vocab.txt \
            --bert_config_file=/home/tiny/zhaomeng/bertmodel/bert_config.json \
            --init_checkpoint=/home/tiny/zhaomeng/bertmodel/bert_model.ckpt \
            --layers=-1 \
            --max_seq_length=80 \
            --batch_size=16
    • Then run json2vector.py to convert the JSON file into vector representations.
    • Finally, run run_classifier.py to classify the vectors with classical machine learning methods; an MLP usually performs best.
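
As a rough illustration of the JSON-to-vector step: in the original BERT release, extract_features.py writes one JSON object per line, with a `features` list whose entries carry a `token` and a list of `layers` (each with `index` and `values`). Assuming this repository's copy keeps that format, the sketch below pulls out the [CLS] vector of the last layer as a sentence representation. Using the [CLS] vector is an assumption made for illustration; json2vector.py may pool tokens differently, and `cls_vector` is a hypothetical helper name.

```python
import json
from typing import List


def cls_vector(jsonl_line: str, layer_index: int = -1) -> List[float]:
    """Return the [CLS] token vector for the requested layer from one JSONL record."""
    record = json.loads(jsonl_line)
    cls_feature = record["features"][0]  # the first token of every sequence is [CLS]
    for layer in cls_feature["layers"]:
        if layer["index"] == layer_index:
            return layer["values"]
    raise KeyError(f"layer {layer_index} not found in record")


# Tiny synthetic record (3 dimensions instead of 768) that mimics the file layout.
example = json.dumps({
    "linex_index": 0,
    "features": [
        {"token": "[CLS]", "layers": [{"index": -1, "values": [0.1, 0.2, 0.3]}]},
        {"token": "黄", "layers": [{"index": -1, "values": [0.4, 0.5, 0.6]}]},
    ],
})
vector = cls_vector(example)
print(vector)
```

Stacking one such vector per line yields a feature matrix that a scikit-learn classifier can consume.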

Result

The predicted results and the misclassified examples are saved in the result directory.

  • For example, fine-tuning on the birthplace dataset gives:

                precision    recall  f1-score   support
    
         0      0.963     0.967     0.965       573
         1      0.951     0.946     0.948       389
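
As a quick consistency check on the table, the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes it from the precision and recall columns above; the helper name `f1` is illustrative.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_class0 = round(f1(0.963, 0.967), 3)  # class 0 row
f1_class1 = round(f1(0.951, 0.946), 3)  # class 1 row
print(f1_class0, f1_class1)
```

Both values round to the f1-score column of the table (0.965 and 0.948).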
    

Authors

  • zhao meng

License

This project is licensed under the MIT License

Acknowledgments

  • etc

bert-attributeextraction's People

Contributors

sakuranew

bert-attributeextraction's Issues

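Multi-class vs. binary classification

For more complex relations, for example person-to-person relations such as parent, teacher-student, spouse, friend, or colleague, is it better to build a single multi-class classifier or several 0/1 classifiers?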

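Hello, how is this project related to knowledge graphs?

The project description is "Using BERT-based fine-tuning and feature extraction to extract attributes of Baidu Encyclopedia person entries for a knowledge graph", but as far as I can see the code only uses BERT for training, evaluation, and prediction, producing metrics such as precision and recall. It does not seem to have much to do with knowledge graphs.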

How do I install raw_bert?

I ran pip install raw_bert directly and found that no such package exists. In that case, which raw_bert do these three lines refer to?
from raw_bert import modeling
from raw_bert import optimization
from raw_bert import tokenization

TypeError: 'NoneType' object is not iterable on birthplace dataset, raised on the 2 round

ub16c9@ub16c9-gpu:~/ub16_prj/BERT-AttributeExtraction/birthplace$ PYTHONPATH=../ python3 run_classifier.py --task_name=my --do_train=true --do_predict=true --data_dir=date/ --vocab_file=../../../data/bert/chinese_L-12_H-768_A-12/vocab.txt --bert_config_file=../../../data/bert/chinese_L-12_H-768_A-12/bert_config.json --init_checkpoint=../../../data/bert/chinese_L-12_H-768_A-12/bert_model.ckpt --max_seq_length=80 --train_batch_size=32 --learning_rate=2e-5 --num_train_epochs=1.0 --output_dir=./output
reading test data ...
8727
962
WARNING:tensorflow:Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7f59fc83bc80>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Using config: {'_save_summary_steps': 100, '_eval_distribute': None, '_evaluation_master': '', '_cluster': None, '_experimental_distribute': None, '_protocol': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f59fc4dab70>, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_task_id': 0, '_save_checkpoints_steps': 1000, '_global_id_in_cluster': 0, '_tf_random_seed': None, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_service': None, '_log_step_count_steps': None, '_num_worker_replicas': 1, '_train_distribute': None, '_tpu_config': TPUConfig(iterations_per_loop=1000, num_shards=8, num_cores_per_replica=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None), '_device_fn': None, '_save_checkpoints_secs': None, '_keep_checkpoint_max': 5, '_model_dir': './output', '_task_type': 'worker', '_keep_checkpoint_every_n_hours': 10000}
INFO:tensorflow:_TPUContext: eval_on_tpu True
WARNING:tensorflow:eval_on_tpu ignored because use_tpu is False.
INFO:tensorflow:Writing example 0 of 8727
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: 0-0
INFO:tensorflow:tokens: [CLS] 表 大 会 常 务 委 员 会 的 决 定 任 免 驻 外 大 使 : 任 命 王 雪 峰 为 中 华 人 民 共 和 国 驻 萨 摩 亚 独 立 国 特 命 全 权 大 使 。 [SEP]
INFO:tensorflow:input_ids: 101 6134 1920 833 2382 1218 1999 1447 833 4638 1104 2137 818 1048 7728 1912 1920 886 8038 818 1462 4374 7434 2292 711 704 1290 782 3696 1066 1469 1744 7728 5855 3040 762 4324 4989 1744 4294 1462 1059 3326 1920 886 511 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:label: 0 (id = 0)

INFO:tensorflow: name = bert/pooler/dense/kernel:0, shape = (768, 768), INIT_FROM_CKPT
INFO:tensorflow: name = bert/pooler/dense/bias:0, shape = (768,), INIT_FROM_CKPT
INFO:tensorflow: name = output_weights:0, shape = (2, 768)
INFO:tensorflow: name = output_bias:0, shape = (2,)
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
2019-01-02 22:28:29.025812: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-02 22:28:29.025856: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-02 22:28:29.025874: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-01-02 22:28:29.025880: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-01-02 22:28:29.025991: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10080 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from ./output/model.ckpt-272
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:prediction_loop marked as finished
INFO:tensorflow:prediction_loop marked as finished
k is 2
Traceback (most recent call last):
  File "run_classifier.py", line 850, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "run_classifier.py", line 704, in main
    processor = processors[task_name]()
  File "run_classifier.py", line 195, in __init__
    self.train, self.test = self._create_examples(data_process.data_gen, k)
  File "run_classifier.py", line 218, in _create_examples
    (train_x, test_x), (train_y, test_y) = f(k)
TypeError: 'NoneType' object is not iterable
ub16c9@ub16c9-gpu:~/ub16_prj/BERT-AttributeExtraction/birthplace$

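Hello, is data_dir = a at prediction time?

According to the code,
predict_examples = processor.get_test_examples(FLAGS.data_dir) shows that the training examples are read from data_dir, and the run argument --data_dir=a \ shows that data_dir is a.

But there is no folder named a in the project directory.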

4 methods to process raw data

File "run_classifier.py", line 668
for k in range(1, 4):
"data_process.py" has 4 methods to process raw data, but I have commented out 3 of them and forgot to change the k

Originally posted by @sakuranew in #2 (comment)
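
The traceback in the earlier issue is consistent with this explanation: when a function has no branch for a given `k`, Python returns `None` implicitly, and tuple-unpacking `None` raises a `TypeError`. Below is a minimal reproduction with a guard. The `data_gen` here is a hypothetical stand-in with only one method left enabled, since the real `data_process.data_gen` is not shown on this page, and `create_examples` is likewise an illustrative name.

```python
def data_gen(k):
    """Hypothetical stand-in: only method 1 is still enabled."""
    if k == 1:
        return (["train text"], ["test text"]), ([1], [0])
    # No branch for other k: the function falls through and returns None.


def create_examples(f, k):
    result = f(k)
    if result is None:
        # Fail early with a clear message instead of an unpacking TypeError.
        raise ValueError(f"data_gen has no processing method for k={k}")
    (train_x, test_x), (train_y, test_y) = result
    return (train_x, train_y), (test_x, test_y)


train, test = create_examples(data_gen, 1)
print(train, test)

try:
    (train_x, test_x), (train_y, test_y) = data_gen(2)  # unguarded call
except TypeError as err:
    print("unguarded:", err)
```

Restricting the loop to the values of `k` that still have an enabled method avoids the error in the first place; the guard only makes the failure easier to diagnose.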

Do you mind explaining the differences among the four methods for processing raw data? What is the purpose of having four methods? And what is the difference between the strip version and the non-strip version? Thank you.

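How do I build a classifier for event extraction?

For example, given a passage like "2018年12月29日,人工智能创业公司北京阿博茨科技有限公司(以下简称“阿博茨科技”)宣布完成3000万美元B轮融资,此轮融资主要投资方为Mindworks概念资本、SIG海纳亚洲、启明创投。" (On December 29, 2018, the AI startup 北京阿博茨科技有限公司, hereafter "阿博茨科技", announced the completion of a USD 30 million Series B round, with Mindworks概念资本, SIG海纳亚洲, and 启明创投 as the main investors), how do I build a classifier to extract this funding event?

The relations involved include: "北京阿博茨科技有限公司 - funding date - 2018年12月29日", "北京阿博茨科技有限公司 - funding amount - B轮", "北京阿博茨科技有限公司 - funding amount - 3000万美元", "北京阿博茨科技有限公司 - investors - Mindworks概念资本、SIG海纳亚洲、启明创投", and so on.

Following your suggestion, should I also build several relation classifiers and check each relation one by one?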

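raw_bert does not exist

When I run run_classifier.py as described in the README, it reports that "from raw_bert" cannot be found.
On inspection, I found that raw_bert has been added to .gitignore.
Does this mean the folder was never committed, or is it hosted in another git repository?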
