qipeng / gcn-over-pruned-trees



Graph Convolution over Pruned Dependency Trees Improves Relation Extraction (authors' PyTorch implementation)

License: Other

Python 98.98% Shell 1.02%

information-extraction relation-extraction nlp natural-language-processing dependency-parsing dependency-parse-trees

gcn-over-pruned-trees's Issues

All predictions are “no_relation”: precision is 100%, recall is 0, and dev_f1 is 0

When I try to reproduce the results, every prediction is “no_relation”: precision is 100%, recall is 0, and dev_f1 is also 0.
Evaluating on dev set...
['no_relation', 'no_relation', 'no_relation', 'no_relation', 'no_relation', 'no_relation', 'no_relation', 'no_relation', 'no_relation', 'no_relation', 'no_relation', 'no_relation', 'no_relation', 'no_relation', 'no_relation', 'no_relation', 'no_relation', 'no_relation', 'no_relation', 'no_relation']
Precision (micro): 100.000%
Recall (micro): 0.000%
F1 (micro): 0.000%
epoch 19: train_loss = 0.852363, dev_loss = 5.416558, dev_f1 = 0.0000
model saved to ./saved_models/00/checkpoint_epoch_19.pt
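
The 100% / 0% / 0% pattern in this log is what a micro-averaged scorer reports if it defaults precision to 1.0 when the model never predicts anything other than no_relation. The sketch below assumes that convention; `micro_scores` is a hypothetical helper written for illustration, not the repository's scorer.

```python
from collections import Counter

NO_RELATION = "no_relation"


def micro_scores(gold, pred):
    """Micro-averaged precision, recall and F1 over the non-negative labels."""
    correct, guessed, actual = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if p != NO_RELATION:
            guessed[p] += 1          # the model committed to a relation
        if g != NO_RELATION:
            actual[g] += 1           # the gold label is a real relation
        if g == p and g != NO_RELATION:
            correct[g] += 1
    n_correct = sum(correct.values())
    n_guessed = sum(guessed.values())
    n_actual = sum(actual.values())
    # Assumed convention: with zero positive guesses, precision defaults to 1.0.
    precision = n_correct / n_guessed if n_guessed > 0 else 1.0
    recall = n_correct / n_actual if n_actual > 0 else 0.0
    total = precision + recall
    f1 = 2.0 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# A model that only ever answers "no_relation" (labels here are illustrative).
gold = ["per:title", "no_relation", "per:age"]
pred = ["no_relation", "no_relation", "no_relation"]
scores = micro_scores(gold, pred)
```

Under this convention the precision figure carries no information when nothing is guessed; recall and F1 are the numbers to watch.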

SemEval 2010 Task 8 Dataset is not used

Hi Yuhao,
Congratulations on the good work. In the original paper, you ran experiments on the SemEval 2010 Task 8 dataset, but I don't see that experiment in this code. Could you share your experiment code for this dataset? Thank you!

Which number of GCN layers gives the best performance?

Dear @yuhaozhang ,

In Appendix A.1 of your paper, you say that 2 GCN layers give the best performance. But when I run your source code and tune the number of GCN layers, I find that 1 layer works best, matching the performance you reported (P=0.695 R=0.636 F=0.664). With 2 GCN layers I get P=0.68 R=0.62 F=0.644. Is there a typo in the Appendix, or which number of GCN layers gives the best performance? Thanks!

Do I need both a CPU and a GPU to run this implementation?

I am getting strange results such as precision=100%, R=0, F=0.
You can see them here:
Per-relation statistics:
org:top_members/employees P: 100.00% R: 0.00% F1: 0.00% #: 1
per:age P: 100.00% R: 0.00% F1: 0.00% #: 1
per:origin P: 100.00% R: 0.00% F1: 0.00% #: 1
per:title P: 100.00% R: 0.00% F1: 0.00% #: 1

Final Score:
Precision (micro): 100.000%
Recall (micro): 0.000%
F1 (micro): 0.000%
test set evaluate result: 1.00 0.00 0.00
Evaluation ended.

A bug: subj_mask and obj_mask don't mask the padding tokens

Hi, thanks for sharing your code. I noticed a bug that would affect the experimental results.

The line of code below constructs subj_mask and obj_mask according to whether subj_pos or obj_pos is 0. But in the DataLoader, shorter sequences are also padded with 0 in their subj_pos and obj_pos values, so subj_mask and obj_mask don't mask the padding tokens.

gcn-over-pruned-trees/model/gcn.py

Line 88 in db7c128

 subj_mask, obj_mask = subj_pos.eq(0).eq(0).unsqueeze(2), obj_pos.eq(0).eq(0).unsqueeze(2) # invert mask 

This affects the subsequent subject and object pooling operations, because the representation vectors for padding tokens are not 0 (for example, a linear transformation adds a bias term to these vectors).

Changing it to the following would fix the problem:

subj_mask, obj_mask = subj_pos.eq(0).eq(0), obj_pos.eq(0).eq(0) # invert mask
subj_mask = (subj_mask | masks).unsqueeze(2)  # logical or with word masks
obj_mask = (obj_mask | masks).unsqueeze(2)
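
To see the effect described in this issue without a GPU or PyTorch, here is a minimal pure-Python sketch of masked max pooling over a single padded sequence. The scalar hidden values, the sequence, and the helper `masked_max_pool` are all made up for illustration; the mask semantics (True means "ignore this position") follow the snippets quoted above.

```python
NEG_INF = float("-inf")


def masked_max_pool(values, mask):
    """Max over positions whose mask entry is False; masked positions are ignored."""
    kept = [v for v, m in zip(values, mask) if not m]
    return max(kept) if kept else NEG_INF


# One sequence of real length 3, padded to length 5.
# Position ids are 0 on the subject token, and padding is also filled with 0.
subj_pos = [-1, 0, 1, 0, 0]                    # token 1 is the subject
pad_mask = [False, False, False, True, True]   # True marks padding
hidden = [0.2, 0.5, 0.1, 0.9, 0.9]             # padding is non-zero (e.g. from a bias)

# Original mask: True wherever the token is NOT the subject.
subj_mask = [p != 0 for p in subj_pos]
# Proposed mask: additionally ignore padding (logical OR with the padding mask).
fixed_mask = [s or m for s, m in zip(subj_mask, pad_mask)]

buggy = masked_max_pool(hidden, subj_mask)   # padding leaks into the pooled value
fixed = masked_max_pool(hidden, fixed_mask)  # only the real subject token is pooled
```

With the original mask the pooled subject value comes from a padding position; with the OR-ed mask it comes from the subject token itself.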

Meaning of "stanford_head" in given dataset

I see that "head" in tree.py is used to construct the dependency parse tree, and that it corresponds to "stanford_head" in the given dataset. "stanford_deprel" is also in the dataset but is not used. I don't understand what "stanford_head" represents in the dependency parse. Could you please explain it, or point me to any information about it? Thank you!
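
A reading that is consistent with the `while h > 0` and `h-1` indexing quoted elsewhere on this page is the usual CoNLL-style convention: for each token, the field holds the 1-based index of its syntactic head, and 0 marks the root. The sketch below assumes that convention; the sentence and the helper `children_from_heads` are illustrative only.

```python
def children_from_heads(heads):
    """Turn 1-based head indices (0 = root) into a root index and 0-based child lists."""
    children = [[] for _ in heads]
    root = None
    for i, h in enumerate(heads):
        if h == 0:
            root = i                    # this token has no head: it is the root
        else:
            children[h - 1].append(i)   # convert the 1-based head to a 0-based index
    return root, children


tokens = ["She", "founded", "Acme"]
heads = [2, 0, 2]   # "She" and "Acme" both attach to "founded", which is the root
root, children = children_from_heads(heads)
```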

A little confused about the code in the function head_to_tree in tree.py

In the code:

gcn-over-pruned-trees/model/tree.py

Line 80 in db7c128

while h > 0:

I wonder whether this loop can run forever.
For example, if h = 3 and head[2] = 3, the loop keeps adding 2 to tmp and never stops.
That is what happens on my dataset, so I wonder whether this method has a problem.
Thank you
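
The loop terminates only if every chain of heads eventually reaches 0, so one option is to validate the head arrays before training. The following is a minimal sketch, assuming 1-based heads with 0 as the root and all values in range; `find_head_cycle` is a hypothetical helper, not part of the repository.

```python
def find_head_cycle(heads):
    """Return the 0-based index of a token whose head chain never reaches the root.

    `heads` holds 1-based head indices with 0 for the root. Returns None if
    every chain terminates.
    """
    n = len(heads)
    for start in range(n):
        h = heads[start]
        steps = 0
        while h > 0:
            steps += 1
            if steps > n:      # a valid chain can visit at most n tokens
                return start
            h = heads[h - 1]
    return None


ok = find_head_cycle([2, 0, 2])    # every token reaches the root
bad = find_head_cycle([2, 0, 3])   # token 3 is its own head: head[2] == 3
```

Running this over a dataset once and printing the offending examples is usually cheaper than debugging a hang in the middle of an epoch.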

Why is your model stable every time it runs, while mine is unstable? I have set the seed.

Higher Loss Value

I am training the model with prune_k=-1, but I am getting very high loss values. What could be the problem?
Kindly give some insight into it.

Using fine-tuned word embeddings in the evaluation phase

Sorry to disturb you. I assume that your model fine-tunes all embeddings during the training phase. However, I have a question about how the fine-tuned word embedding layer is used in the test phase.

On line 36 of eval.py, trainer = GCNTrainer(opt), you initialize GCNTrainer without providing the argument emb_matrix, i.e. with emb_matrix=None. The model will therefore initialize gcn_model.emb.weight.data uniformly in the test phase. Shouldn't it use the gcn_model.emb.weight.data learned during the training phase?

Thank you very much for your time!

Influence of fine-tuning word embeddings

Dear authors,

Did you try _not_ fine-tuning the word embeddings, and what was the effect? Thanks.

eval.py error

When I run eval.py, an error is raised:

Loading model from saved_models/00/best_model.pt
[ Fail: model loading failed. ]
Traceback (most recent call last):
  File "eval.py", line 35, in <module>
    opt = torch_utils.load_config(model_file)
  File "/home/penzm/gcn-over-pruned-trees/utils/torch_utils.py", line 161, in load_config
    return dump['config']
UnboundLocalError: local variable 'dump' referenced before assignment
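
The traceback suggests that the real loading error was caught and replaced by the "[ Fail: model loading failed. ]" message, after which `dump` was never assigned. The sketch below reproduces that pattern in isolation and shows one way to keep the original cause visible. It uses `pickle` as a stand-in for `torch.load` so it runs without PyTorch, and both function names are hypothetical.

```python
import pickle


def load_config_fragile(path):
    """Swallows the loading error, then fails later with UnboundLocalError."""
    try:
        with open(path, "rb") as f:
            dump = pickle.load(f)
    except Exception:
        print("[ Fail: model loading failed. ]")
    return dump["config"]   # `dump` is unbound if loading failed


def load_config_strict(path):
    """Re-raises with the original exception attached as the cause."""
    try:
        with open(path, "rb") as f:
            dump = pickle.load(f)
    except Exception as exc:
        raise RuntimeError(f"could not load checkpoint {path!r}") from exc
    return dump["config"]
```

With the strict variant, a missing file or an incompatible checkpoint shows up directly in the traceback, which is usually the first thing to check here (for example, whether saved_models/00/best_model.pt exists).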

Missing line for --no-attn in train.py

There is a line missing here that should have added the --no-attn key. I copied it from here for myself.

Inconsistency between the dataset and the loading method

In two places that access tokens from the TACRED dataset (here and here), the wrong key is used. The data in this repository uses tokens instead of token as the key for the list of tokens. This looks like a corrected typo, since a previous project used token as the key, which would have worked with this code.

I've fixed the code for myself but you may decide whether to fix the code or the data.

P.S. thanks for your code, which is otherwise quite nice and tidy! ☺️
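
For the token / tokens mismatch described in this issue, a third option besides editing the code or the data is a small accessor that accepts either key. This is only a sketch; `get_tokens` is a hypothetical helper and the example record is made up.

```python
def get_tokens(example):
    """Return the token list of an example stored under either 'token' or 'tokens'."""
    for key in ("token", "tokens"):
        if key in example:
            return example[key]
    raise KeyError("example has neither a 'token' nor a 'tokens' field")


old_style = {"token": ["She", "founded", "Acme"]}
new_style = {"tokens": ["She", "founded", "Acme"]}
same = get_tokens(old_style) == get_tokens(new_style)
```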

Memory issue

I am using a Jupyter notebook on a server.
I have preprocessed a new dataset into the required format (i.e. the TACRED dataset format), but I get a MemoryError while training:
bash train_gcn.sh 0
Here are some examples from my dataset:

Here is the error I got:
Finetune all embeddings.
Traceback (most recent call last):
File "train.py", line 142, in
loss = trainer.update(batch)
File "/home/hz071/gcn-over-pruned-trees/model/trainer.py", line 80, in update
logits, pooling_output = self.model(inputs)
File "/home/hz071/.conda/envs/AG-GCNN/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/hz071/gcn-over-pruned-trees/model/gcn.py", line 27, in forward
outputs, pooling_output = self.gcn_model(inputs)
File "/home/hz071/.conda/envs/AG-GCNN/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/hz071/gcn-over-pruned-trees/model/gcn.py", line 84, in forward
adj = inputs_to_tree_reps(head.data, words.data, l, self.opt['prune_k'], subj_pos.data, obj_pos.data)
File "/home/hz071/gcn-over-pruned-trees/model/gcn.py", line 78, in inputs_to_tree_reps
trees = [head_to_tree(head[i], words[i], l[i], prune, subj_pos[i], obj_pos[i]) for i in range(len(l))]
File "/home/hz071/gcn-over-pruned-trees/model/gcn.py", line 78, in
trees = [head_to_tree(head[i], words[i], l[i], prune, subj_pos[i], obj_pos[i]) for i in range(len(l))]
File "/home/hz071/gcn-over-pruned-trees/model/tree.py", line 81, in head_to_tree
tmp += [h-1]
MemoryError

After that, the run stopped.
Can you give me any idea what went wrong? I used the stanza pipeline (with the spaCy tokenizer) for preprocessing.
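
The traceback ends at the `tmp += [h-1]` line inside head_to_tree, and a list that grows until memory runs out is what a head chain that never reaches 0 would produce. One way such a chain can arise, offered here as an assumption rather than a diagnosis of this dataset, is a parser convention in which heads are 0-based and the root token is its own head (as with spaCy's `token.head`). The sketch below converts that convention to 1-based heads with 0 for the root; `to_one_based_heads` is a hypothetical helper.

```python
def to_one_based_heads(head_indices):
    """Convert 0-based heads where the root points to itself into 1-based heads (0 = root)."""
    return [0 if h == i else h + 1 for i, h in enumerate(head_indices)]


# "She founded Acme": every token points at index 1, including the root itself.
self_rooted = [1, 1, 1]
converted = to_one_based_heads(self_rooted)
```

If the heads already come from a pipeline that reports 0 for the root, no conversion is needed, and the head arrays should be checked for other kinds of cycles instead.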

About the SemEval 2010 Task 8 dataset

Hello, you mention in the paper that the model also performs very well on the SemEval dataset. Could you please share the code that tests this network on SemEval 2010 Task 8?

Hello, I would like to ask what "head" means in the dataset?

Question on SemEval2010_task8 dataset

May I ask what hyperparameter settings you used on the SemEval2010_task8 dataset?
I can't reproduce the best performance reported in the paper on SemEval2010_task8.
Thank you so much!

Hi, I just want to ask: what does head mean in the dataset?

Speed up training 10 times with one line of code

Hi Yuhao,
The code is very well written, but it is rather slow. I found that you are doing all tree manipulations on the GPU. Just add

head, words, subj_pos, obj_pos = head.cpu().numpy(), words.cpu().numpy(), subj_pos.cpu().numpy(), obj_pos.cpu().numpy()

at line 77 in gcn.py, and the time per batch drops from 0.12 sec to 0.016 sec on my machine.

Can the model be adjusted for sentence classification instead of sequence labeling with a [CLS] token?

The model is designed for Named Entity Recognition with a sequence labeling approach.
The named entities correspond to relations.

I would like to use this model for sentence classification, which needs a sentence representation token [CLS] somewhere in the input.

How can I adjust the model?

Implementation of the self loop

Dear authors:
I have a question about the implementation of the self loop.
In tree.py, lines 173-175, adj[i, i] is set to 1:

if self_loop:
    for i in idx:
        ret[i, i] = 1

I think the information of the node itself is then already included by the bmm operation on adj and gcn_inputs, as in gcn.py, line 166:

Ax = adj.bmm(gcn_inputs)

So why is gcn.py, line 168 still used to add a self loop?

AxW = AxW + self.W[l](gcn_inputs)
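
One way to read the two quoted lines, ignoring bias terms and normalization, is that adding W applied to the inputs is algebraically the same as adding the identity to the adjacency matrix. Here A stands for adj, X for gcn_inputs and W for self.W[l]; the symbol \tilde{A} is introduced only for this note.

```latex
\tilde{A} = A + I
\quad\Longrightarrow\quad
\tilde{A} X W = (A + I)\, X W = A X W + X W
```

Under this reading the extra term is the self loop, which is only needed when adj has a zero diagonal. If adj already had ones on its diagonal, each node would be counted twice. Whether tree_to_adj is called with self_loop enabled is not shown on this page, so that is worth checking at the call site in gcn.py.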
