xmunlp / tagger

Deep Semantic Role Labeling with Self-Attention

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%
deep-learning tagging semantic-role-labeling tensorflow srl srltagger

tagger's Introduction

Tagger

This is the source code for the paper "Deep Semantic Role Labeling with Self-Attention".

Contents

Basics

Notice

The original code used in the paper was implemented in TensorFlow 1.0, which is now obsolete. We have re-implemented our methods in PyTorch, based on THUMT. The differences are as follows:

  • Only the DeepAtt-FFN model is implemented
  • Model ensembling is currently not available

Please check the git history to use the TensorFlow implementation.
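
If you need the TensorFlow version, one way is to check out a revision from before the PyTorch rewrite; the commit reference below is a placeholder, not an actual hash:

# list earlier revisions and switch to one that still contains the TensorFlow code
git log --oneline
git checkout <COMMIT_BEFORE_PYTORCH_REWRITE>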

Prerequisites

  • Python 3
  • PyTorch
  • TensorFlow-2.0 (CPU version)
  • GloVe embeddings and srlconll scripts
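
A rough environment sketch, assuming pip (the README does not pin a PyTorch version, so the exact versions are up to you):

# hypothetical setup: PyTorch for the model, plus the CPU-only TensorFlow 2.0 build
pip install torch tensorflow==2.0.0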

Walkthrough

Data

Training Data

We follow the same procedures described in the deep_srl repository to convert the CoNLL datasets. The GloVe embeddings and srlconll scripts can also be found at that link.

If you follow these procedures, the processed data will have the following format:

2 My cats love hats . ||| B-A0 I-A0 B-V B-A1 O

Each line begins with the 0-based index of the predicate in the sentence (here, token 2 is "love"), followed by the tokens, a ||| separator, and the BIO labels for that predicate.

The CoNLL datasets are not publicly available, so we cannot provide them.

Vocabulary

You can use the build_vocab.py script to generate vocabularies. The command is as follows:

python tagger/scripts/build_vocab.py --limit LIMIT --lower TRAIN_FILE OUTPUT_DIR

where LIMIT specifies the vocabulary size. This command creates two vocabularies, vocab.txt and label.txt, in OUTPUT_DIR.
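
For example, a hypothetical invocation on the processed CoNLL-2005 training file (the paths and the 100000 limit are illustrative placeholders, not values from the paper):

# build word and label vocabularies from the processed training data
python tagger/scripts/build_vocab.py --limit 100000 --lower \
    data/conll05.train.txt data/vocab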

Training

Once you have finished the procedures described above, you can start the training stage.

Preparing the validation script

An external validation script is required to enable the validation functionality. Below is the validation script we used when training an FFN model on the CoNLL-2005 dataset. Save it as run.sh (the training command below refers to it via script=run.sh) and make sure it runs properly.

#!/usr/bin/env bash
SRLPATH=/PATH/TO/SRLCONLL
TAGGERPATH=/PATH/TO/TAGGER
DATAPATH=/PATH/TO/DATA
EMBPATH=/PATH/TO/GLOVE_EMBEDDING
DEVICE=0

export PYTHONPATH=$TAGGERPATH:$PYTHONPATH
export PERL5LIB="$SRLPATH/lib:$PERL5LIB"
export PATH="$SRLPATH/bin:$PATH"

python $TAGGERPATH/tagger/bin/predictor.py \
  --input $DATAPATH/conll05.devel.txt \
  --checkpoint train \
  --model deepatt \
  --vocab $DATAPATH/deep_srl/word_dict $DATAPATH/deep_srl/label_dict \
  --parameters=device=$DEVICE,embedding=$EMBPATH/glove.6B.100d.txt \
  --output tmp.txt

python $TAGGERPATH/tagger/scripts/convert_to_conll.py tmp.txt $DATAPATH/conll05.devel.props.gold.txt output
perl $SRLPATH/bin/srl-eval.pl $DATAPATH/conll05.devel.props.* output

Training command

The command below is what we used to train a model on the CoNLL-2005 dataset. The content of run.sh is described in the above section.

#!/usr/bin/env bash
SRLPATH=/PATH/TO/SRLCONLL
TAGGERPATH=/PATH/TO/TAGGER
DATAPATH=/PATH/TO/DATA
EMBPATH=/PATH/TO/GLOVE_EMBEDDING
DEVICE=[0]

export PYTHONPATH=$TAGGERPATH:$PYTHONPATH
export PERL5LIB="$SRLPATH/lib:$PERL5LIB"
export PATH="$SRLPATH/bin:$PATH"

python $TAGGERPATH/tagger/bin/trainer.py \
  --model deepatt \
  --input $DATAPATH/conll05.train.txt \
  --output train \
  --vocabulary $DATAPATH/deep_srl/word_dict $DATAPATH/deep_srl/label_dict \
  --parameters="save_summary=false,feature_size=100,hidden_size=200,filter_size=800,"`
               `"residual_dropout=0.2,num_hidden_layers=10,attention_dropout=0.1,"`
               `"relu_dropout=0.1,batch_size=4096,optimizer=adadelta,initializer=orthogonal,"`
               `"initializer_gain=1.0,train_steps=600000,"`
               `"learning_rate_schedule=piecewise_constant_decay,"`
               `"learning_rate_values=[1.0,0.5,0.25],"`
               `"learning_rate_boundaries=[400000,500000],device_list=$DEVICE,"`
               `"clip_grad_norm=1.0,embedding=$EMBPATH/glove.6B.100d.txt,script=run.sh"

Decoding

The following is the command used to generate outputs:

#!/usr/bin/env bash
SRLPATH=/PATH/TO/SRLCONLL
TAGGERPATH=/PATH/TO/TAGGER
DATAPATH=/PATH/TO/DATA
EMBPATH=/PATH/TO/GLOVE_EMBEDDING
DEVICE=0

python $TAGGERPATH/tagger/bin/predictor.py \
  --input $DATAPATH/conll05.test.wsj.txt \
  --checkpoint train/best \
  --model deepatt \
  --vocab $DATAPATH/deep_srl/word_dict $DATAPATH/deep_srl/label_dict \
  --parameters=device=$DEVICE,embedding=$EMBPATH/glove.6B.100d.txt \
  --output tmp.txt
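
The predictions are written to tmp.txt, just as during validation. To score them you can reuse the conversion and evaluation steps from the validation script; the gold props file names below mirror that script's naming and are an assumption for the WSJ test set. Remember to export PERL5LIB and PATH as in the validation script so that srl-eval.pl can find its libraries.

# convert the raw predictions to CoNLL props format and score them with srl-eval.pl
python $TAGGERPATH/tagger/scripts/convert_to_conll.py tmp.txt \
    $DATAPATH/conll05.test.wsj.props.gold.txt output
perl $SRLPATH/bin/srl-eval.pl $DATAPATH/conll05.test.wsj.props.* output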

Benchmarks

We performed 4 runs on the CoNLL-2005 dataset. The results are shown below.

Runs   Dev-P  Dev-R  Dev-F1  WSJ-P  WSJ-R  WSJ-F1  BROWN-P  BROWN-R  BROWN-F1
Paper  82.6   83.6   83.1    84.5   85.2   84.8    73.5     74.6     74.1
Run0   82.9   83.7   83.3    84.6   85.0   84.8    73.5     74.0     73.8
Run1   82.3   83.4   82.9    84.4   85.3   84.8    72.5     73.9     73.2
Run2   82.7   83.6   83.2    84.8   85.4   85.1    73.2     73.9     73.6
Run3   82.3   83.6   82.9    84.3   84.9   84.6    72.3     73.6     72.9

Pretrained Models

The pretrained models of the TensorFlow implementation can be downloaded from Google Drive.

LICENSE

BSD

Citation

If you use our code, please cite our paper:

@inproceedings{tan2018deep,
  title = {Deep Semantic Role Labeling with Self-Attention},
  author = {Tan, Zhixing and Wang, Mingxuan and Xie, Jun and Chen, Yidong and Shi, Xiaodong},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year = {2018}
}

Contact

This code was written by Zhixing Tan. If you have any problems, feel free to send an email.

tagger's Issues

Preparing the validation script

SRLPATH=/PATH/TO/SRLCONLL
TAGGERPATH=/PATH/TO/TAGGER
DATAPATH=/PATH/TO/DATA

export PERL5LIB="$SRLPATH/lib:$PERL5LIB"
export PATH="$SRLPATH/bin:$PATH"

python $TAGGERPATH/main.py predict --data_path $DATAPATH/conll05.devel.txt \
  --model_dir train --model_name deepatt \
  --vocab_path $DATAPATH/word_dict $DATAPATH/label_dict \
  --device_list 0 \
  --decoding_params="decode_batch_size=512" \
  --model_params="num_hidden_layers=10,feature_size=100,hidden_size=200,filter_size=800"
python $TAGGERPATH/scripts/convert_to_conll.py conll05.devel.txt.deepatt.decodes $DATAPATH/conll05.devel.props.gold.txt output
perl $SRLPATH/bin/srl-eval.pl $DATAPATH/conll05.devel.props. output

What formats do the highlighted file names above (conll05.devel.txt, conll05.devel.txt.deepatt.decodes, conll05.devel.props.gold.txt, conll05.devel.props.) correspond to? They don't seem to be provided in the project. Could you share them? Thanks!

Input file format when doing decoding only

I skimmed the code, and each line seems to start with a number. What does it mean?
I tried

4 john wants to go

and got an array index out of bounds error.
Changing it to

1 john wants to go

works fine.

Could you describe the data format for decoding in more detail?

How to set train path ?

Hi, thanks for your great work! I am wondering how to set the TRAIN_PATH mentioned in the training command. I tried setting it to the path that contains the processed tf.Record files but got this error:

ValueError: No data files found in ./data/processed*train* .

I have checked another closed thread, but did not figure out the correct way to set the path. What is the intended parameter here? And just to make sure: the conll05.devel.props.gold.txt and conll05.devel.props.* files mentioned in the validation script are in the BIO tagging format, right?

A bug

First of all, thank you for sharing the code.

I found a problem during testing with the following two cases:
1 Wait ! ! Wait '' ! ! Cried the guard who ran from the hut to shout to other men standing about outside . ||| O B-V O O O O O O O O O O O O O O O O O O O O O O O 4 Wait ! ! Wait '' ! ! Cried the guard who ran from the hut to shout to other men standing about outside . ||| O O O O B-V O O O O O O O O O O O O O O O O O O O O

For example, when predicting on these two cases, prediction stops as soon as "!" is encountered, so the output is truncated and the perl validation script then reports an error. Replacing "!" with "." makes the problem go away.

What does features contain?

I want to see what the features passed in during training contain.
When I tried print(features), it raised an error.
Maybe my train.txt is not right,
so I am trying to figure out exactly which two kinds of information a feature contains.
A concrete answer would be a great help!

So weird training result

When I trained the model with these parameters, the result was fine. Normally, the F1 score reaches 0.81 or better.

glove.6B.100d.txt
feature_size=100
hidden_size=200
filter_size=800

But when I tried the following hyper-parameters, the F1 score was always below 0.1. Specifically, it was 0.042365085, and the model only predicted the B-V label, no other labels at all. It is so weird. What's the problem?

glove.6B.200d.txt
feature_size=200
hidden_size=400
filter_size=1600

Decoding seems to be reloading the model...?

I'm using the provided command to generate outputs, the one in the section under "decoding." I have a few questions:

  1. It continuously prints things like:
2018-05-14 16:51:35.363220: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1306] Adding visible gpu devices: 0
INFO:tensorflow:Restoring parameters from models/conll05/single/checkpoint/model.ckpt-540432
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Graph was finalized.

Why doesn't it restore the parameters and figure out which GPUs are visible just once? Is the model being reloaded on every single iteration?

  2. Is there any way I can see how much of the input file has been decoded so far? It doesn't seem to write the output until the very end.

Thanks!

Question about the training script

python tagger/main.py train \
  --data_path TRAIN_PATH --model_dir train --model_name deepatt \
  --vocab_path word_dict label_dict --emb_path glove.6B.100d.txt \
  --model_params=feature_size=100,hidden_size=200,filter_size=800,residual_dropout=0.2,num_hidden_layers=10,attention_dropout=0.1,relu_dropout=0.1 \
  --training_params=batch_size=4096,eval_batch_size=1024,optimizer=Adadelta,initializer=orthogonal,use_global_initializer=false,initializer_gain=1.0,train_steps=600000,learning_rate_decay=piecewise_constant,learning_rate_values=[1.0,0.5,0.25],learning_rate_boundaries=[400000,500000],device_list=[0],clip_grad_norm=1.0 \
  --validation_params=script=run.sh

Which script is the run.sh passed as the last parameter? It doesn't seem to be in the repository. Thanks in advance.

A question about the FFN

In the paper, the FFN is FFN(x) = ReLU(xW1)W2. Why does the linear function used in _ffn_layer contain tf.nn.convolution? Only _linear_2d has no convolution function; all the others do.

How to run

How do I run it after making the changes?

losses_avg

In the loss you add:

with tf.variable_scope("losses_avg"):
    loss_moving_avg = tf.get_variable("training_loss",
                                      initializer=100.0,
                                      trainable=False)
    lm = loss_moving_avg.assign(loss_moving_avg * 0.9 + loss * 0.1)
    tf.summary.scalar("loss_avg/total_loss", lm)

    with tf.control_dependencies([lm]):
        loss = tf.identity(loss)

Why do you add a losses_avg?

Dimension mismatch

During initialization, running raises this error:
InvalidArgumentError (see above for traceback): logits and labels must be same size: logits_size=[4080,105] labels_size=[3978,105]
[[Node: tagger/softmax_cross_entropy_with_logits_sg = SoftmaxCrossEntropyWithLogits[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](tagger/softmax_cross_entropy_with_logits_sg/Reshape, tagger/softmax_cross_entropy_with_logits_sg/Reshape_1)]]
[[Node: training/global_norm_2/global_norm/_2135 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_9328_training/global_norm_2/global_norm", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Dataset: CoNLL-2005
Embeddings: GloVe 200d

About the data

Would it be possible to share the CoNLL-2005 and CoNLL-2012 data used in the paper?
Thanks in advance.

Missing embed.npy file

When running this project, a pretrained word-embedding file, embed.npy, is missing. Could you provide it, or explain how it was produced and in what format it should be saved? Thanks.

Error during running the pretrained model

Hello, I encountered an error when trying to run prediction with the provided pre-trained models.

I used the following command:

$ python main.py predict --data_path ./data/preprocessed.txt --model_dir pre-trained/conll05/ensemble/run0 --model_name deepatt --vocab_path ./pre-trained/conll05/ensemble/dict/word_dict0 ./pre-trained/conll05/ensemble/dict/label_dict --device_list 0 --emb_path ./data/glove/glove.6B.100d.txt

and got this error information:

Traceback (most recent call last):
File "main.py", line 895, in
predict(parsed_args)
File "main.py", line 616, in predict
as_iterable=True)
File "/Users/xiaotang/Documents/SRL/Tagger/venv/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
return func(*args, **kwargs)
File "/Users/xiaotang/Documents/SRL/Tagger/venv/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 670, in predict
iterate_batches=iterate_batches)
File "/Users/xiaotang/Documents/SRL/Tagger/venv/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 967, in _infer_model
features = self._get_features_from_input_fn(input_fn)
File "/Users/xiaotang/Documents/SRL/Tagger/venv/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 947, in _get_features_from_input_fn
result = input_fn()
File "/Users/xiaotang/Documents/SRL/Tagger/data/plain_text.py", line 43, in _decode_batch_input_fn
outputs = preprocess_fn(inputs)
File "/Users/xiaotang/Documents/SRL/Tagger/data/plain_text.py", line 175, in
lambda x: convert_text(x, vocab, params))
File "/Users/xiaotang/Documents/SRL/Tagger/data/plain_text.py", line 158, in convert_text
emb[i] = params.embedding[word]
ValueError: could not broadcast input array from shape (100) into shape (128)

Is there something wrong with my command? How can I run prediction with the pretrained models?

Checkpoint averaging

For the evaluation metrics you report in the paper, do you use any checkpoint averaging (Tagger/scripts/avg_checkpoints.py)? Thanks.

How to configure early stopping?

The readme says to point to the train/best directory for decoding a trained model, but the code doesn't seem to be saving any models to that directory. It is saving models to the train directory, which I can successfully evaluate. How can I configure training to save the best model during training?

Here is the command I am running to train:

python main.py train \
    --data_path data2/ \
    --model_dir train \
    --model_name deepatt \
    --vocab_path data/word_dict data/label_dict \
    --emb_path data/glove.6B.100d.txt \
    --model_params=feature_size=100,hidden_size=200,filter_size=800,residual_dropout=0.2,num_hidden_layers=10,attention_dropout=0.1,relu_dropout=0.1 \
    --training_params=batch_size=4096,eval_batch_size=1024,optimizer=Adadelta,initializer=orthogonal,use_global_initializer=false,initializer_gain=1.0,train_steps=600000,learning_rate_decay=piecewise_constant,learning_rate_values=[1.0,0.5,0.25],learning_rate_boundaries=[400000,500000],device_list=[0],clip_grad_norm=1.0 \
    --validation_params=script=run.sh
