luopeixiang / named_entity_recognition

Chinese named entity recognition (including concrete implementations of several models: HMM, CRF, BiLSTM, and BiLSTM+CRF)

named-entity-recognition nlp sequence-labeling pytorch-nlp hmm crf bi-lstm-crf ner bi-lstm pytorch-ner

named_entity_recognition's Issues

Question about list vs. lists

Hello, after reading the tags or words into a list, why do you append that list to lists wherever there is a line break?
Why not store all the tags or words, line breaks included, in a single list and do without lists?

Error opening train, dev, and test

A file-opening error occurs at runtime:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x98 in position 2: illegal multibyte sequence
Since I'm not familiar with the build_corpus function, I don't know what its parameters are and can't switch it to utf-8.
I'd like to ask the author what to do. Thanks for answering!
If convenient, could the author give me a contact method, or reach me by email? My personal address is [email protected]
Thanks again!
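
For reference, the usual fix (a minimal sketch; the exact call inside build_corpus may differ) is to pass an explicit encoding to open(), so that Windows does not fall back to its default gbk codec:

    from os.path import join

    data_dir = "./ResumeNER"   # data directory used by build_corpus
    split = "train"

    # Decode the file as UTF-8 instead of the platform default (gbk on Windows).
    with open(join(data_dir, split + ".char.bmes"), "r", encoding="utf-8") as f:
        lines = f.readlines()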

The ResumeNER data seems to have a small issue

The ResumeNER data uploaded to this repo has a problem: when running, data.py line 16 raises an error, probably because of an empty line. With the original author's data the problem doesn't occur. (The environment is fine; the uploaded copy most likely contains a stray empty line.)
Thanks again for open-sourcing this!

BiLSTM-CRF

Why does the loss turn negative after two or three epochs? Shouldn't it keep approaching 0?

cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Training and evaluating the bidirectional LSTM model...
Traceback (most recent call last):
  File "main.py", line 73, in <module>
    main()
  File "main.py", line 43, in main
    crf=False
  File "/home/l1/NER/named_entity_recognition/evaluate.py", line 64, in bilstm_train_and_eval
    bilstm_model = BILSTM_Model(vocab_size, out_size, crf=crf)
  File "/home/l1/NER/named_entity_recognition/models/bilstm_crf.py", line 31, in __init__
    self.hidden_size, out_size).to(self.device)
  File "/home/l1/anaconda3/envs/NER/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in to
    return self._apply(convert)
  File "/home/l1/anaconda3/envs/NER/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/home/l1/anaconda3/envs/NER/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 117, in _apply
    self.flatten_parameters()
  File "/home/l1/anaconda3/envs/NER/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 113, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

I set up the environment following requirements.txt, with Python 3.7, CUDA 10.0, and cuDNN v7.6.5, and I keep hitting the error above. Has anyone run into the same thing?

crf log exp sum

Is the summation dimension in models/util.py line 146 problematic?
Should the dimension be set to 2 (since we want to sum over the tag space of the previous step, t-1)?
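
For context, here is a minimal sketch of the CRF forward recursion (my own illustration, not the code in models/util.py): at each step, the score of arriving at the current tag is a log-sum-exp over the previous step's tag dimension, so which dim index is correct depends entirely on how the (batch, prev_tag, cur_tag) tensor is laid out.

    import torch

    B, T, N = 2, 5, 4                      # batch, sequence length, tag-set size
    emissions = torch.randn(B, T, N)       # emission scores from the BiLSTM
    transitions = torch.randn(N, N)        # transitions[i][j]: score of tag i -> tag j

    alpha = emissions[:, 0]                # (B, N) scores at step 0
    for t in range(1, T):
        # scores[b, i, j] = alpha[b, i] + transitions[i, j] + emissions[b, t, j]
        scores = (alpha.unsqueeze(2)
                  + transitions.unsqueeze(0)
                  + emissions[:, t].unsqueeze(1))   # (B, N_prev, N_cur)
        # Sum out the previous tag i, which is dim=1 in this layout.
        alpha = torch.logsumexp(scores, dim=1)

    log_partition = torch.logsumexp(alpha, dim=1)  # (B,)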

Environment setup fails

I hit this error:
AttributeError: 'LSTM' object has no attribute '_flat_weights'
Searching around, it's a torch version problem; I'm on the latest release, 1.11.0.
When I tried to downgrade, I found that 1.0.1.post2 doesn't exist:
ERROR: Could not find a version that satisfies the requirement torch==1.0.1.post2 (from versions: 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2, 1.11.0)
ERROR: No matching distribution found for torch==1.0.1.post2
Further searching suggested Windows doesn't support that version,
so I then tried installing on a server,
but it still isn't available, so I'm stuck and can't take this project any further.

Is this code redundant?

    def train_step(self, batch_sents, batch_tags, word2id, tag2id):
        self.model.train()
        self.step += 1
        # prepare the data
        tensorized_sents, lengths = tensorized(batch_sents, word2id)
        tensorized_sents = tensorized_sents.to(self.device)
        targets, lengths = tensorized(batch_tags, tag2id)
        targets = targets.to(self.device)

        # forward
        scores = self.model(tensorized_sents, lengths)

        # compute the loss and update the parameters
        self.optimizer.zero_grad()
        loss = self.cal_loss_func(scores, targets, tag2id).to(self.device)
        loss.backward()
        self.optimizer.step()

        return loss.item()

The code above is from the models/bilstm_crf.py file.
I'd like to ask: what does self.model.train() inside the train_step function actually do?
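
For reference (standard PyTorch behavior, not anything specific to this repo): model.train() puts the module and all of its children into training mode, which changes the behavior of layers such as Dropout and BatchNorm, while model.eval() switches them to inference behavior. A minimal illustration:

    import torch
    from torch import nn

    model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
    x = torch.ones(1, 4)

    model.train()              # training mode: dropout randomly zeroes activations
    y_train = model(x)

    model.eval()               # evaluation mode: dropout is a no-op
    with torch.no_grad():
        y_eval = model(x)      # deterministic output

So the call is not redundant: without it, a model left in eval mode by an earlier evaluation pass would train without dropout.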

Error running main.py: not enough values to unpack

Running 'python main.py' produces the following error:

>python main.py
读取数据...
Traceback (most recent call last):
  File "main.py", line 73, in <module>
    main()
  File "main.py", line 14, in main
    build_corpus("train")
  File "data.py", line 16, in build_corpus
    word, tag = line.strip('\n').split()
ValueError: not enough values to unpack (expected 2, got 0)

Data question

Isn't the data supposed to be annotated with BIOES? Why is everything under the ResumeNER folder in BMES?

potential fix in build_corpus

I changed an if-else block to a try-except block and it worked. Machine: Windows 10, Python 3.7.
I also needed to install the sklearn package separately after installing requirements.txt.
I think this is caused by how the Windows file reader handles the line endings of the BMES files, but I'm not sure.

from os.path import join

def build_corpus(split, make_vocab=True, data_dir="./ResumeNER"):
    """Read the data."""
    assert split in ['train', 'dev', 'test']

    word_lists = []
    tag_lists = []
    with open(join(data_dir, split+".char.bmes"), 'r', encoding='utf-8') as f:
        word_list = []
        tag_list = []
        for line in f:
            try:
                # A data line is "<char> <tag>"; blank lines separate sentences.
                word, tag = line.strip('\n').split()
                word_list.append(word)
                tag_list.append(tag)
            except ValueError:
                # Blank line: close off the current sentence.
                word_lists.append(word_list)
                tag_lists.append(tag_list)
                word_list = []
                tag_list = []
    # Keep the last sentence if the file does not end with a blank line.
    if word_list:
        word_lists.append(word_list)
        tag_lists.append(tag_list)

    # If make_vocab is True, also return word2id and tag2id
    if make_vocab:
        word2id = build_map(word_lists)
        tag2id = build_map(tag_lists)
        return word_lists, tag_lists, word2id, tag2id
    else:
        return word_lists, tag_lists

Data question

After replacing the data with my own BMES-tagged data, I get a "division by zero" error when computing recall. What's going on? Are there any requirements on the data format?

Pretrained word embeddings

Hello, can the word vectors be replaced with pretrained embeddings?
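
In principle yes, assuming the model looks words up in a torch nn.Embedding layer (as BiLSTM taggers typically do). A minimal sketch of swapping in pretrained vectors; the vocabulary and weight matrix below are hypothetical placeholders:

    import torch
    from torch import nn

    word2id = {"<pad>": 0, "中": 1, "国": 2}   # hypothetical vocabulary
    embed_dim = 50

    # Pretrained matrix: row i is the vector of the word whose id is i.
    # In practice you would fill it from word2vec/GloVe lookups.
    pretrained = torch.randn(len(word2id), embed_dim)

    embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)
    # freeze=False lets the vectors keep training with the rest of the model.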

A few thoughts on the code

First of all, I really admire the author for writing this code. Just understanding it takes me a long time, and after going through all of it there are still many places I don't understand and will need time to digest. Still, a few thoughts came up while reading that I'd like to discuss with the repo owner.

  1. On the test functions of HMM and BiLSTM-CRF: the test interface requires word2id and tag2id as parameters. When a trained model is used to tag unlabeled sequences, that is, when it is moved to a new environment, these two parameters are hard to supply. My suggestion is to pass word2id and tag2id to __init__ instead, so that neither train nor test needs them, which makes the model easier to migrate (see the sketch after this list).

  2. The BiLSTM model's test function also requires a tag_lists parameter. That works in the author's setting, because the test set is labeled and the tags are only used to check performance, but it cannot be supplied when predicting on genuinely unlabeled sequences, and the function doesn't handle that case; sort_by_lengths and preprocess_data_for_lstmcrf would need small corresponding changes as well.

These are just some personal thoughts that I hope to discuss with the author. Thanks again for your code!
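
A minimal sketch of the refactor suggested in point 1 (my own illustration; the class and method names are hypothetical, and the model call assumes the repo's model(sents, lengths) signature):

    import torch

    class TaggerWrapper:
        def __init__(self, model, word2id, tag2id, device="cpu"):
            # Store the vocabularies once, so callers never pass them again.
            self.model = model
            self.word2id = word2id
            self.id2tag = {i: t for t, i in tag2id.items()}
            self.device = device

        def predict(self, word_lists):
            """Tag unlabeled sentences; no tag_lists parameter needed."""
            self.model.eval()
            unk = self.word2id.get("<unk>", 0)
            results = []
            with torch.no_grad():
                for words in word_lists:
                    ids = torch.tensor(
                        [[self.word2id.get(w, unk) for w in words]],
                        device=self.device)
                    scores = self.model(ids, [len(words)])   # (1, L, n_tags)
                    pred = scores.argmax(dim=-1).squeeze(0)
                    results.append([self.id2tag[int(i)] for i in pred])
            return results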

Model training speed / tuning

Why is it that, on my data, HMM and CRF produce results quickly, while BiLSTM and BiLSTM-CRF train extremely slowly, taking two to three days, with final results that are also not great? I previously tried increasing the learning rate and reducing the number of epochs, but training speed didn't improve noticeably. Is this a problem with my data, or with the model settings?

Why does the bilstm_crf model call sort_by_lengths(word_lists, tag_lists)?

Why does the bilstm_crf model call sort_by_lengths(word_lists, tag_lists), and what is it for? What does it buy during training? Would it also be fine not to sort?


def sort_by_lengths(word_lists, tag_lists):
    pairs = list(zip(word_lists, tag_lists))
    # Indices that order the sentences by descending length.
    indices = sorted(range(len(pairs)),
                     key=lambda k: len(pairs[k][0]),
                     reverse=True)
    pairs = [pairs[i] for i in indices]
    # pairs.sort(key=lambda pair: len(pair[0]), reverse=True)

    word_lists, tag_lists = list(zip(*pairs))

    # indices is also returned so predictions can later be
    # restored to the original sentence order.
    return word_lists, tag_lists, indices
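
For context: sorting by descending length is what torch.nn.utils.rnn.pack_padded_sequence required by default in older PyTorch versions (the enforce_sorted=False escape hatch arrived later), and packing lets the LSTM skip computation on padding. A minimal sketch, not the repo's exact call:

    import torch
    from torch.nn.utils.rnn import pack_padded_sequence

    # Three padded sentences, already sorted by descending length (4, 2, 1).
    batch = torch.zeros(3, 4, dtype=torch.long)    # (batch, max_len) word ids
    lengths = [4, 2, 1]

    embeds = torch.nn.Embedding(10, 8)(batch)      # (3, 4, 8)
    packed = pack_padded_sequence(embeds, lengths, batch_first=True)
    # packed can now be fed to an LSTM without wasting compute on padding.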

Questions about the dataset

Hello, I still have a few questions about the dataset; please help me out:
1. What are the dev, test, and train files in the project each used for?
2. Are the annotations in these datasets labeled by hand?
3. How is the accuracy of the trained model evaluated?
Thanks!

Is there anything to watch out for in the dataset?

Training on my own dataset goes fine, but evaluation fails with this error:
Traceback (most recent call last):
  File "C:/Users/gavin/PycharmProjects/named_entity_recognition-master/test.py", line 68, in <module>
    main()
  File "C:/Users/gavin/PycharmProjects/named_entity_recognition-master/test.py", line 55, in main
    crf_word2id, crf_tag2id)
  File "C:\Users\gavin\PycharmProjects\named_entity_recognition-master\models\bilstm_crf.py", line 170, in test
    pred_tag_lists = [pred_tag_lists[i] for i in indices]
  File "C:\Users\gavin\PycharmProjects\named_entity_recognition-master\models\bilstm_crf.py", line 170, in <listcomp>
    pred_tag_lists = [pred_tag_lists[i] for i in indices]
IndexError: list index out of range

recall

Hello, your project is excellent. One question: I used it for Chinese word segmentation and the accuracy is quite good, but the recall is very low. What could be the reason?

Question about the initial-state computation in HMM decoding

Hi, in the HMM decoding part under models, when computing the tag probabilities for the first element of a sequence, the initial-state probabilities Pi are directly added to the element's emission probabilities bt. Doesn't that make the probabilities no longer sum to 1? Hoping you can help explain, thanks!
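
For context, a likely explanation (standard Viterbi practice; I have not verified this against the repo's HMM code): decoders usually work in log space, where multiplying probabilities becomes adding log-probabilities, so log delta_1(i) = log pi_i + log b_i(o_1) is just the product pi_i * b_i(o_1) in disguise, and nothing about normalization is violated. A minimal sketch:

    import numpy as np

    n_tags = 3
    pi = np.log(np.full(n_tags, 1.0 / n_tags))    # initial log-probabilities
    b_first = np.log(np.array([0.7, 0.2, 0.1]))   # emission log-probs, first token

    # Addition in log space == multiplication of the probabilities.
    delta_1 = pi + b_first
    assert np.allclose(np.exp(delta_1), np.exp(pi) * np.exp(b_first))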

Download problem

Hello, why is the folder I downloaded missing the later content? Many thanks for your reply.

The epoch display seems off

As shown below, epoch 11 has already trained to completion; why does epoch 12 start again from 8.33%???
Epoch 12, step/total_step: 5/60 8.33% Loss:5.1469
Epoch 12, step/total_step: 10/60 16.67% Loss:3.1716

Saving model... Epoch 10, Val Loss:4.1249
Epoch 11, step/total_step: 5/60 8.33% Loss:5.9976
Epoch 11, step/total_step: 10/60 16.67% Loss:3.6216
Epoch 11, step/total_step: 15/60 25.00% Loss:2.9838
Epoch 11, step/total_step: 20/60 33.33% Loss:1.7743
Epoch 11, step/total_step: 25/60 41.67% Loss:2.1327
Epoch 11, step/total_step: 30/60 50.00% Loss:1.2837
Epoch 11, step/total_step: 35/60 58.33% Loss:1.3841
Epoch 11, step/total_step: 40/60 66.67% Loss:1.2403
Epoch 11, step/total_step: 45/60 75.00% Loss:1.0821
Epoch 11, step/total_step: 50/60 83.33% Loss:0.9138
Epoch 11, step/total_step: 55/60 91.67% Loss:0.6943
Epoch 11, step/total_step: 60/60 100.00% Loss:0.5065
Saving model... Epoch 11, Val Loss:4.0510
Epoch 12, step/total_step: 5/60 8.33% Loss:5.1469
Epoch 12, step/total_step: 10/60 16.67% Loss:3.1716
Epoch 12, step/total_step: 15/60 25.00% Loss:2.4593
Epoch 12, step/total_step: 20/60 33.33% Loss:1.4299
Epoch 12, step/total_step: 25/60 41.67% Loss:1.9214
Epoch 12, step/total_step: 30/60 50.00% Loss:1.2415
Epoch 12, step/total_step: 35/60 58.33% Loss:1.3120
Epoch 12, step/total_step: 40/60 66.67% Loss:1.2194
Epoch 12, step/total_step: 45/60 75.00% Loss:0.9205
Epoch 12, step/total_step: 50/60 83.33% Loss:0.8615
Epoch 12, step/total_step: 55/60 91.67% Loss:0.6182
Epoch 12, step/total_step: 60/60 100.00% Loss:0.4272

Question about HMM learning

Hi, how does the HMM's parameter learning reflect maximum likelihood or the EM algorithm? In the train function, the three parameter matrices are merely normalized by tag frequency, and I don't see any learning process. I'm just getting started; pointers would be appreciated!
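
For context (standard HMM theory, my own sketch rather than the repo's train function): when the tag sequences are fully observed, the maximum-likelihood estimates of the HMM parameters have a closed form, namely normalized counts, so no iterative learning is needed; EM (Baum-Welch) only comes in when the tags are hidden. Counting and normalizing therefore is the learning step:

    import numpy as np

    # Toy labeled corpus: tag-id sequences for two sentences.
    tag_seqs = [[0, 1, 1], [1, 0, 0]]
    n_tags = 2

    A = np.zeros((n_tags, n_tags))   # transition counts
    pi = np.zeros(n_tags)            # initial-state counts

    for seq in tag_seqs:
        pi[seq[0]] += 1
        for prev, cur in zip(seq, seq[1:]):
            A[prev, cur] += 1

    # Normalizing the counts gives the maximum-likelihood estimates.
    pi /= pi.sum()
    A /= A.sum(axis=1, keepdims=True)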

NER should be evaluated at the entity level, not at the level of individual tags

Thanks for using our paper and data.

I noticed that in your evaluation function, precision/recall/F1 are averages over all tags. Actual NER evaluation, however, is done per entity, not per tag. An example:

美 B-LOC
国 E-LOC
的 O
华 B-PER
莱 I-PER
士 E-PER

我 O
跟 O
他 O
谈 O
笑 O
风 O
生 O

Here all we actually care about is whether the two entities 美国 (LOC) and 华莱士 (PER) were predicted correctly. So evaluation first needs to extract the positions and types of the recognized entities, and if any of these is inconsistent, the prediction must be counted as wrong.

You can refer to https://github.com/jiesutd/NCRFpp/blob/master/utils/metric.py for an implementation of the NER evaluation function.
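
To make the point concrete, a minimal sketch of entity-level matching (my own illustration, not the NCRFpp implementation): entities are extracted as (type, start, end) spans, and a prediction only counts if the whole span matches exactly.

    def extract_entities(tags):
        """Collect (type, start, end) spans from a BIOES/BMES tag sequence."""
        entities, start = [], None
        for i, tag in enumerate(tags):
            prefix, _, etype = tag.partition('-')
            if prefix == 'S':
                entities.append((etype, i, i))
            elif prefix == 'B':
                start = i
            elif prefix == 'E' and start is not None:
                entities.append((etype, start, i))
                start = None
        return entities

    gold = ['B-LOC', 'E-LOC', 'O', 'B-PER', 'I-PER', 'E-PER']
    pred = ['B-LOC', 'E-LOC', 'O', 'B-PER', 'E-PER', 'O']

    g, p = set(extract_entities(gold)), set(extract_entities(pred))
    precision = len(g & p) / len(p)   # 0.5: the PER span boundary is wrong
    recall = len(g & p) / len(g)      # 0.5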

Why are tags still needed when running inference with bilstm+crf?

print("加载并评估bilstm+crf模型...")
crf_word2id, crf_tag2id = extend_maps(word2id, tag2id, for_crf=True)
bilstm_model = load_model(BiLSTMCRF_MODEL_PATH)
bilstm_model.model.bilstm.bilstm.flatten_parameters()  # remove warning
test_word_lists, test_tag_lists = prepocess_data_for_lstmcrf(
    test_word_lists, test_tag_lists, test=True
)
lstmcrf_pred, target_tag_list = bilstm_model.test(test_word_lists, test_tag_lists,
                                                  crf_word2id, crf_tag2id)

I only need lstmcrf_pred, so why does test require the test_tag_lists argument?

Replacing the labels

Hello, I think your code is very good and I'd like to learn from it by switching to my own dataset, but I can't find where to substitute my own labels. Could you help me out? Many thanks!

Error opening the trained models in ckpts

Hi, when I open the trained models with pickle.load(), I get an error like No module named 'models.hmm'; every model fails the same way. Could you explain why?
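
For context (standard pickle behavior, not specific to this repo): pickle records each class by its import path, so the models package must be importable under exactly that name when loading. Running the unpickling script from the repo root, or putting the root on sys.path first, is the usual fix. A minimal sketch; the checkpoint filename is a hypothetical placeholder:

    import pickle
    import sys

    # Make the repo root importable so pickle can resolve models.hmm etc.
    sys.path.insert(0, "/path/to/named_entity_recognition")

    with open("ckpts/hmm.pkl", "rb") as f:   # hypothetical checkpoint path
        model = pickle.load(f)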

Training bilstm_crf shouldn't require appending <end> to the annotations

Thanks to the author for sharing!
In prepocess_data_for_lstmcrf, I noticed that the author appends an end marker to every sentence and its tags.
Running the code on my own dataset, the val_loss never goes negative, and it doesn't work.
My understanding is that this effectively yields two end markers, so when the CRF transition matrix is trained, the end->end transition has to take the largest value at the final step, which feels wrong. In my opinion there is no need to append this end tail to the words and tags in the data annotation; the start and end tags are added for use by the CRF's matrix.

What is used to convert words into vectors?

What turns the words into vectors here — word2vec? Others I've seen just assign each word a unique id via word2id; is there no need to train word vectors at all? I'm a bit confused on this point.
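
For reference (standard PyTorch practice, assuming the model uses an nn.Embedding layer, which I have not verified against this repo): word2id only maps words to integer indices; an embedding layer then maps each index to a vector that is learned jointly with the rest of the network, so no separate word2vec training is required. A minimal sketch:

    import torch
    from torch import nn

    word2id = {"<pad>": 0, "中": 1, "国": 2}   # toy vocabulary
    embedding = nn.Embedding(num_embeddings=len(word2id), embedding_dim=8)

    ids = torch.tensor([[word2id["中"], word2id["国"]]])   # (1, 2) word ids
    vectors = embedding(ids)                               # (1, 2, 8), trainable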

