
lexiconaugmentedner's Introduction

LexiconAugmentedNER

This is the implementation of our arXiv paper "Simplify the Usage of Lexicon in Chinese NER", which rejects complicated operations for incorporating a word lexicon into Chinese NER. We show that incorporating a lexicon into Chinese NER can be quite simple and, at the same time, effective.

Source code description

Requirements:

Python 3.6, PyTorch 0.4.1

Input format:

CoNLL format, with each character and its label separated by whitespace, one character per line. The "BMES" tag scheme is preferred.

别 O

错 O

过 O

邻 O

近 O

大 B-LOC

鹏 M-LOC

湾 E-LOC

的 O

湿 O

地 O
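
For illustration, here is a minimal sketch (not part of this repository) of reading this character-per-line format; the function name read_bmes and the assumption that sentences are separated by blank lines are my own.

def read_bmes(path):
    sentences, chars, labels = [], [], []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # a blank line ends the current sentence (assumption)
                if chars:
                    sentences.append((chars, labels))
                    chars, labels = [], []
                continue
            char, label = line.split()  # e.g. "大 B-LOC"
            chars.append(char)
            labels.append(label)
    if chars:  # handle a final sentence with no trailing blank line
        sentences.append((chars, labels))
    return sentences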

Pretrained embeddings:

The pretrained embeddings (word embedding, char embedding, and bichar embedding) are the same as those used in Lattice LSTM.

Run the code:

  1. Download the character embeddings and word embeddings from Lattice LSTM and put them in the data folder.
  2. Download the four datasets and put them in data/MSRANER, data/OntoNotesNER, data/ResumeNER, and data/WeiboNER, respectively.
  3. To train on the four datasets:
  • To train on OntoNotes:

python main.py --train data/OntoNotesNER/train.char.bmes --dev data/OntoNotesNER/dev.char.bmes --test data/OntoNotesNER/test.char.bmes --modelname OntoNotes --savedset data/OntoNotes.dset

  • To train on Resume:

python main.py --train data/ResumeNER/train.char.bmes --dev data/ResumeNER/dev.char.bmes --test data/ResumeNER/test.char.bmes --modelname Resume --savedset data/Resume.dset --hidden_dim 200

  • To train on Weibo:

python main.py --train data/WeiboNER/train.all.bmes --dev data/WeiboNER/dev.all.bmes --test data/WeiboNER/test.all.bmes --modelname Weibo --savedset data/Weibo.dset --lr=0.005 --hidden_dim 200

  • To train on MSRA:

python main.py --train data/MSRANER/train.char.bmes --dev data/MSRANER/dev.char.bmes --test data/MSRANER/test.char.bmes --modelname MSRA --savedset data/MSRA.dset

  4. To train/test on your own data: replace the file paths in the commands above with your own and run.

lexiconaugmentedner's People

Contributors

rtmaww, v-mipeng


lexiconaugmentedner's Issues

About mask settings

for idx in range(w_length):
    gazmask = []
    gazcharmask = []

    for label in range(4):  ## the four B/M/E/S word sets of character idx
        label_len = len(gazs[idx][label])
        count_set = set(gazs_count[idx][label])
        if len(count_set) == 1 and 0 in count_set:
            gazs_count[idx][label] = [1]*label_len

        mask = label_len*[0]  ## 0 marks a real matched word
        mask += (max_gazlist-label_len)*[1]  ## 1 marks a padded slot

        gazs[idx][label] += (max_gazlist-label_len)*[0]  ## padding
        gazs_count[idx][label] += (max_gazlist-label_len)*[0]  ## padding

        char_mask = []
        for g in range(len(gaz_char_Id[idx][label])):
            glen = len(gaz_char_Id[idx][label][g])
            charmask = glen*[0]  ## same 0/1 convention at the character level
            charmask += (max_gazcharlen-glen) * [1]
            char_mask.append(charmask)
            gaz_char_Id[idx][label][g] += (max_gazcharlen-glen) * [0]
        gaz_char_Id[idx][label] += (max_gazlist-label_len)*[[0 for i in range(max_gazcharlen)]]
        char_mask += (max_gazlist-label_len)*[[1 for i in range(max_gazcharlen)]]

        gazmask.append(mask)
        gazcharmask.append(char_mask)
    layergazmasks.append(gazmask)
    gazchar_masks.append(gazcharmask)

This is the part of the code that builds the masks. I don't quite follow the logic of the mask construction, and the resulting masks look strange to me. For example:
gazs: [[2, 0], [0, 0], [0, 0], [0, 0]]
gazs_count: [[3], [1], [1], [1]]
mask: [[0, 1], [0, 1], [0, 1], [0, 1]]
Shouldn't the mask be [[1, 0], [0, 0], [0, 0], [0, 0]]?

On adding test-set entity information to the gaz lexicon

Hello, while reading the code I noticed that the data_initialization function in the main code contains data.build_gaz_alphabet(test_file, count=True), which also adds entity information from test_file to the gaz lexicon. In reality, however, we should not know the entity information of the test set, so doesn't this cause an information leakage problem?

I'm not sure whether my understanding is correct; please correct me if not. Thanks.

Lexicon construction

Could you explain how the lexicon is constructed?

Processing the OntoNotes dataset

Hi authors, I have obtained the OntoNotes 4.0 license, but processing the dataset is tricky. Could you share a processed copy with me? My email is [email protected]

Here is a screenshot of my license (not reproduced here).
Thank you very much.

Complexity of soft lexicon extraction

Hi, I have another question. The main reason for using lexicon augmentation is to approach BERT's performance while being faster than BERT. Maybe I haven't fully understood how the soft lexicon extracts the B/M/E/S word lists for each character, but at first glance this extraction step looks close to O(N^2) in the sentence length N, because every substring sentence[i:j] has to be checked against the lexicon. So although the model itself is lighter at inference time, wouldn't building the model input become the more time-consuming step in online inference?
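
For reference, here is a minimal sketch (not the repository's code) of soft-lexicon extraction, assuming the lexicon is a plain Python set and that lexicon words have a bounded maximum length max_len; under that bound the scan is roughly O(N * max_len) per sentence rather than O(N^2).

def soft_lexicon_features(sentence, lexicon, max_len=5):
    n = len(sentence)
    # one B/M/E/S bucket of matched lexicon words per character
    feats = [{"B": [], "M": [], "E": [], "S": []} for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, min(i + max_len, n) + 1):
            word = sentence[i:j]
            if word not in lexicon:
                continue
            if j - i == 1:                      # single-character word
                feats[i]["S"].append(word)
            else:
                feats[i]["B"].append(word)      # word starts at character i
                feats[j - 1]["E"].append(word)  # word ends at character j-1
                for k in range(i + 1, j - 1):   # characters strictly inside the word
                    feats[k]["M"].append(word)
    return feats

With a trie over the lexicon the inner loop can stop early once no word shares the current prefix, but even the plain set lookup above only examines substrings up to max_len characters, so the cost grows linearly with sentence length for a fixed maximum word length.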

How are the train.xxx.bmes and test.xxx.bmes files in the datasets produced?

Hi, I downloaded the dataset from the official OntoNotes website, but I don't know how to obtain data in the train.xxx.bmes / test.xxx.bmes format; I also downloaded MSRANER and WeiboNER, but neither is in that format. Could you explain how to obtain the datasets in this format?
Thanks 🥰

About the OntoNotes dataset

Hi, I am an independent researcher and cannot obtain the OntoNotes data. Could you send me a copy of the data you used? It will be used only to reproduce the paper. My email: [email protected]. Thanks!

How can I add my own word embeddings?

I trained 50-dimensional word vectors with word2vec. After changing the path and running, I got this error:
Traceback (most recent call last):
File "main.py", line 526, in
data.build_word_pretrain_emb(char_emb)
File "/home/admin/lc/各种模型/融合词典的中文分词/LexiconAugmentedNER-master/utils/data.py", line 262, in build_word_pretrain_emb
self.pretrain_word_embedding, self.word_emb_dim = build_pretrain_embedding(emb_path, self.word_alphabet, self.word_emb_dim, self.norm_word_emb)
File "/home/admin/lc/各种模型/融合词典的中文分词/LexiconAugmentedNER-master/utils/functions.py", line 188, in build_pretrain_embedding
embedd_dict, embedd_dim = load_pretrain_emb(embedding_path)
File "/home/admin/lc/各种模型/融合词典的中文分词/LexiconAugmentedNER-master/utils/functions.py", line 233, in load_pretrain_emb
assert (embedd_dim + 1 == len(tokens))
AssertionError

What is the cause of this error?
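
The assertion in load_pretrain_emb checks that each line contains a token followed by exactly embedd_dim values. As a hedged illustration (this is not the repository's loader), the sketch below reads the same plain-text layout; one plausible cause of the failure is a word2vec text file saved with a leading header line (vocabulary size and dimension), which such a check would reject, so removing or skipping that header is one thing to try.

def load_plain_text_emb(path):
    emb, dim = {}, None
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            tokens = line.rstrip().split()
            if not tokens:
                continue
            if len(tokens) == 2:  # likely a word2vec header line "vocab_size dim"
                continue          # skip it instead of letting the assertion fail
            if dim is None:
                dim = len(tokens) - 1
            assert dim + 1 == len(tokens), "inconsistent embedding dimension"
            emb[tokens[0]] = [float(v) for v in tokens[1:]]
    return emb, dim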

A problem with model_type = 'transformer'

When using the Transformer as the encoder, why is no mask applied to the input? Also, how much does masking versus not masking affect NER performance?

About reproducing the results

Hello, how were the experiments with BERT in the paper configured? Taking the note4 (OntoNotes) dataset as an example, with your code I can only reach 80.2, far below the best reported 82.8. Did you use biword embeddings in that experiment?

No module named 'transformers'

File "/Users/ling/Program/LexiconAugmentedNER/model/gazlstm.py", line 8, in
from transformers.modeling_bert import BertModel
ModuleNotFoundError: No module named 'transformers'
Do I need to install a pip package?

test bert

Hello:
Have you tested it with BERT?

Measures to address underfitting

Hello, training on my own corpus does not work well: the training set reaches an accuracy of 1 within 10-20 epochs, but the best dev F1 stays around 0.7 (I would prefer not to use BERT or the Transformer). In my corpus, sentences shorter than 20 characters make up a large proportion, but there is also a non-negligible number of sentences of length 20-60. Do you have any suggestions for improvement? Thanks!

How to combine the model with pretrained BERT

Hi 👋 sorry to bother you! Is setting use_bert to True all it takes to combine the model with BERT? (The BERT-based models I have run before all required configuring related files, so I am confused whether simply changing it to True is enough.)
After changing it to True, I get the error: Process finished with exit code 139 (interrupted by signal 11: SIGSEGV). Have you encountered this?

The code cannot reach the results reported in the paper

Hi, I ran your code on the Weibo and Resume datasets, but the results are not as high as those in the paper. What could be the reason?
The results are as follows: save_model/Weibo
Best dev score: p:0.7023121387283237, r:0.6246786632390745, f:0.6612244897959183

Test score: p:0.6656534954407295, r:0.5289855072463768, f:0.5895020188425303

save_model/Resume
Best dev score: p:0.9511705685618729, r:0.9498997995991983, f:0.9505347593582887

Test score: p:0.9524984577421345, r:0.947239263803681, f:0.9498615810519839

On Weibo the F1 is 58.95 and on Resume 94.99, while the paper reports 61.42 and 95.53, respectively.

I changed two things in the code: first, I changed all mask tensors from .byte() to .bool();
second, the original dataset paths in the code point to .CNNER/data/…, but the downloaded code has no .CNNER directory, so I changed them to ./data/…

I would be very grateful if you could find the time to advise. Thank you!

Output dimension of the fully connected layer

Hello, what is the meaning of adding 2 to the output dimension of the fully connected layer? Why not use the number of labels directly as the output dimension?

KeyError: 1

I built a simple medical-domain dataset and wanted to test it directly with an already-trained model, but on this data the "BME" labels cannot be produced normally: an error always occurs when a character has multiple values in its matched list (e.g. matched_Id [1, 2723], matched_list ['指数', '指']).
Testing on the originally provided data works fine.

gazs_count[idx][0].append(gaz_count[matched_Id[w]])
KeyError: 1

A question about gaz tokenization

Hi, I see the BERT tokenizer only tokenizes the text. If the tokenizer splits, say, 1994 into 19 and ##94, while the gaz B/M/E/S words are matched per character 1/9/9/4, won't that cause an input mismatch?

Some questions about using BERT

If BERT is used for the input, is that equivalent to concatenating the BERT character embeddings with the giga character embeddings? When using BERT, do the training epochs, learning rate, and other parameters need to be adjusted?

Obtaining the OntoNotes dataset

I have already obtained the OntoNotes license. Could you send me a copy of the processed data and the processing code, as well as the data for the other three datasets? Thank you. My email:
[email protected]

Related questions about batch size setting

During training, with batch_size set to 1 the training is too slow, so I increased it. With batch_size set to 8 or 16 there is not much change, but when batch_size exceeds 32 something strange happens: in the first few epochs the precision is -1, the recall is 0, and the F1 is -1, and the final result is worse. What is the reason for this, and does batch_size have a large impact on the NER task? Finally, if I want to use a large batch_size, how should I tune the parameters to reach accuracy close to that reported in your paper?

Errors when training on a new dataset

Hello:
If I train on a new dataset (whose labels differ from the original training data), errors occur. Which parts of the source code do I need to change?
The error occurs when calling the following function:
gaz_list, batch_word, batch_biword, batch_wordlen, batch_label, layer_gaz, gaz_count, gaz_chars, gaz_mask, gazchar_mask, mask, batch_bert, bert_mask = batchify_with_label(instance, data.HP_gpu,data.HP_num_layer)
The line bert_seq_tensor[b, :seqlen+2] = torch.LongTensor(bert_id) then raises
TypeError: an integer is required (got type NoneType)

About OntoNotes data processing

I would like to ask how to process OntoNotes so that it matches the format your code needs for training data. Everything I find when searching is in English, not Chinese.

About BERT

Why doesn't the paper list experimental results that use BERT?

Where does gaz_file come from?

char_emb = "../CNNNERmodel/data/gigaword_chn.all.a2b.uni.ite50.vec"
bichar_emb = "../CNNNERmodel/data/gigaword_chn.all.a2b.bi.ite50.vec"
gaz_file = "../CNNNERmodel/data/ctb.50d.vec"

Where can these files used in main.py be downloaded?

Tagging scheme

Hi, are the datasets used in your experiments annotated with the BIOES scheme or the BMES scheme? Also, could you release a copy of the OntoNotes dataset for use? Thanks.

About GPU utilization, and about the LSTM being "faster" than the Transformer

Hi 👋 sorry to bother you! I found that when the code runs on a GPU server the utilization is only about 30%; I wonder whether this is why the overall running speed is slow.
Another question is that the sequence encoding layer is actually faster with the LSTM than with the Transformer (I noticed from the Speed and Time printed by your code that the LSTM is faster than the Transformer; the concrete runs are shown below).

LSTM: (screenshot of timing output omitted)

Transformer: (screenshot of timing output omitted)

GPU usage when running the Transformer: (screenshot omitted)

One last question: the inference speed in the Computational Efficiency Study of your paper is not the Speed printed here, right?

(Your paper is really great! I hope to have the chance to discuss these questions with you 🙏🏻. Many thanks!)
