gaohongkui / globalpointer_pytorch

A PyTorch implementation of GlobalPointer, which handles nested and non-nested NER in a unified way

Python 100.00%

ner chinese-ner

globalpointer_pytorch's Introduction

GlobalPointer_pytorch

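If you like this project, please click the star in the upper-right corner. Thank you to everyone who does.

Project Introduction

The model in this project follows Su Jianlin's article "GlobalPointer: handling nested and non-nested NER in a unified way" and is implemented in PyTorch.

GlobalPointer's design is similar to that of TPLinker-NER, but the implementation differs in the following ways:

Additive vs. multiplicative attention

TPLinker uses additive attention on the multi-head part:

whereas GlobalPointer uses multiplicative attention: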
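
As a sketch in the notation of Su Jianlin's GlobalPointer article, the two scoring functions for entity type α and token pair (i, j) can be written as follows. The symbols h_i (encoder output), q, k and the weight names are assumptions taken from that article, not from this repository's code.

```latex
% Additive attention (TPLinker-style)
s_\alpha(i,j) = W_{o,\alpha}\,\tanh\!\big(W_h\,[h_i; h_j] + b_h\big) + b_{o,\alpha}

% Multiplicative attention (GlobalPointer)
s_\alpha(i,j) = q_{i,\alpha}^{\top} k_{j,\alpha},
\qquad q_{i,\alpha} = W_{q,\alpha} h_i + b_{q,\alpha},
\qquad k_{j,\alpha} = W_{k,\alpha} h_j + b_{k,\alpha}
```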

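Positional encoding

GlobalPointer also adds Rotary Position Embedding (RoPE) to the model. RoPE is a way of "implementing relative positional encoding by means of absolute positional encoding", and it makes a clear difference in this model.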
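
The following is a minimal pure-Python sketch of the idea, not the code used in this repository. Each vector is rotated by an angle that depends only on its own absolute position, and the resulting inner product depends only on the relative offset. The function names `rope` and `dot`, the sample vectors, and the base of 10000 are assumptions made for illustration.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle pos * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]   # illustrative query vector
k = [1.1, 0.4, -0.6, 0.9]   # illustrative key vector

# Same relative offset (3) at two different absolute positions.
score_a = dot(rope(q, 2), rope(k, 5))
score_b = dot(rope(q, 10), rope(k, 13))
print(abs(score_a - score_b) < 1e-9)
```

The two scores agree up to floating-point error because each position only applies a rotation, and the product of the two rotations depends on the difference of the positions.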

Usage

Environment

The experiments were run with Python 3.6. The other main third-party libraries are:

pytorch==1.8.1
wandb==0.10.26 #for logging the result
transformers==4.1.1
tqdm==4.54.1

Download the pretrained model

Download the Chinese BERT pretrained model bert-base-chinese, place it under pretrained_models/, and set bert_path correctly in config.py.

Train

python train.py

Evaluation

python evaluate.py

Results

Default configuration (the hyperparameters are in config.py); the dataset is CLUENER.

Validation set best F1: 0.7966

globalpointer_pytorch's Issues

IndexError: list index out of range

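Hello, I ran into the same problem when running evaluate.py. The error message is below; could you give me some pointers?

IndexError: list index out of range

What causes this error when running evaluate.py? I have not modified anything.

raise UsageError("api_key not configured (no-tty). call " + directive) wandb.errors.UsageError: api_key not configured (no-tty). call wandb.login(key=[your_api_key])
How can I solve this? I have tried all sorts of things without success. I have already registered and obtained an api_key.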
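
One possible approach, sketched below under assumptions, is to expose the key through the WANDB_API_KEY environment variable before the training script initialises wandb. The helper name `set_wandb_api_key` is hypothetical and is not part of this repository or of wandb.

```python
import os

def set_wandb_api_key(api_key, env=os.environ):
    """Store the key in WANDB_API_KEY so wandb can find it without a terminal prompt."""
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    env["WANDB_API_KEY"] = api_key
    return env
```

Calling wandb.login(key=...) at the top of the script, as the error message itself suggests, is the other option.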

Results

Hello, what results did you get on the CLUENER dataset? Thanks!

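The argument order of the loss needs fixing

Wrong argument order when calling multilabel_categorical_crossentropy: train.py / line 188

Wrong argument order when calling loss_fun: train.py / line 159

The computed loss is correct overall, but only because the arguments are passed in the wrong order a second time further down, so the two mistakes cancel out.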
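
To illustrate why the argument order matters, here is a minimal pure-Python sketch of the multilabel categorical cross-entropy from Su Jianlin's article, log(1 + Σ_neg e^s) + log(1 + Σ_pos e^(-s)). It is not the repository's tensor implementation, and the signature `(y_true, y_pred)` and the sample values are assumptions for illustration.

```python
import math

def multilabel_categorical_crossentropy(y_true, y_pred):
    """y_true: 0/1 labels; y_pred: raw scores (no sigmoid), same length."""
    neg = sum(math.exp(s) for t, s in zip(y_true, y_pred) if t == 0)
    pos = sum(math.exp(-s) for t, s in zip(y_true, y_pred) if t == 1)
    return math.log(1.0 + neg) + math.log(1.0 + pos)

labels = [1, 0, 0]
scores = [2.0, -1.0, -3.0]

loss = multilabel_categorical_crossentropy(labels, scores)
swapped = multilabel_categorical_crossentropy(scores, labels)
print(loss, swapped)
```

In this sketch the two calls give different values, so a single swap changes the result, while two swaps along the call chain restore the intended order.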

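Dataset split

The dataset has dev, train and test files, and test has no labels. Which file is the labelled test set used to evaluate test results? Is the dev file the validation set? What does evaluate.py do? Is it used both for evaluating on the test set and for predicting on unlabelled data?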

Memory usage

In generate_inputs you add all the labels at once. Won't that blow up memory? Take the CMeEE dataset, for example: 13000 × 9 × 256 × 256 × 8 = 66 GB
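
The estimate can be reproduced with a few lines of arithmetic. This sketch assumes the factors in the question mean samples × entity types × seq_len × seq_len × 8 bytes per value (for example int64); the function name is hypothetical.

```python
def dense_label_bytes(num_samples, num_types, seq_len, bytes_per_value):
    """Bytes needed to hold one seq_len x seq_len label map per type and per sample."""
    return num_samples * num_types * seq_len * seq_len * bytes_per_value

total = dense_label_bytes(13000, 9, 256, 8)
per_sample = dense_label_bytes(1, 9, 256, 8)
print(total / 1e9, per_sample / 1e6)
```

The product comes to about 61 GB in total, a little under the figure quoted above, and about 4.7 MB per sample. Building the label tensor per batch, for example in the collate step, would only keep a batch's worth of it in memory at any time.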

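When batch_size is increased (e.g. to 512), the validation metrics in the first epoch are all 0

When I increase batch_size, for example to 512, the validation metrics in the first epoch are all 0. What is the reason?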

valid: f1: 0.00000, precision: 1.00000, recall: 0.00000, best f1: 0.00000

Dimension used in split

Su Jianlin's original TF version:

def __init__(
        self,
        heads,
        head_size,
        RoPE=True,
        use_bias=True,
        kernel_initializer='glorot_uniform',
        **kwargs
    ):
        super(GlobalPointer, self).__init__(**kwargs)
        self.heads = heads
        self.head_size = head_size
        ...

def call(self, inputs, mask=None):
        # input transformation
        inputs = self.dense(inputs)
        inputs = tf.split(inputs, self.heads, axis=-1)
        ...

Your version:

def __init__(self, encoder, ent_type_size, inner_dim, RoPE=True):
    super().__init__()
    self.encoder = encoder
    self.ent_type_size = ent_type_size  # number of entity types
    self.inner_dim = inner_dim  # head_size??? the dimension of each head???
    self.hidden_size = encoder.config.hidden_size
    self.dense = nn.Linear(self.hidden_size, self.ent_type_size * self.inner_dim * 2)
    ......

def forward(self, input_ids, attention_mask, token_type_ids):
    .......
    outputs = self.dense(last_hidden_state)
    outputs = torch.split(outputs, self.inner_dim * 2, dim=-1)

My understanding of Su Jianlin's version is that head_size is the size of each head and heads is the number of heads, i.e. the number of entity types, and that the split below unfolds along the entity-type dimension.
In your version torch.split splits by head_size * 2. Is my understanding wrong, or is this a mistake? Any pointers would be appreciated, thanks!
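
One point that may explain the difference: when given an integer, tf.split interprets it as the number of pieces, whereas torch.split interprets it as the size of each piece. The sketch below mimics both semantics on a plain list; `split_into_n` and `split_by_size` are hypothetical stand-ins, not library calls, and the sizes are illustrative.

```python
def split_into_n(values, n):
    """Mimics tf.split(values, n): n equal pieces."""
    size = len(values) // n
    return [values[i * size:(i + 1) * size] for i in range(n)]

def split_by_size(values, size):
    """Mimics torch.split(values, size): pieces of length `size`."""
    return [values[i:i + size] for i in range(0, len(values), size)]

heads, head_size = 3, 4                       # illustrative values
last_dim = list(range(heads * head_size * 2))  # stand-in for the last tensor axis

tf_style = split_into_n(last_dim, heads)
torch_style = split_by_size(last_dim, head_size * 2)
print(tf_style == torch_style)
```

With a last dimension of heads × head_size × 2, both calls produce one chunk of size head_size × 2 per entity type.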

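Predictions contain too many, overly long entities

Has the author tried the GP model on other datasets?

I have a question. I collected some corpus data myself and trained on it with the BERT base pretrained model. When predicting on the test set, the model outputs a large number of very long entities.

One sample from the test set: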

{
        "text": "**政府宣布2019年国防开支将比前一年增长7.5%，超过预计今年的经济增长率。第十三届全国人大第二次会议星期二（2019年3月5日）在开幕时公布的政府预算报告显示，今年的国防开支将达到11899亿元人民币，相当于大约1780亿美元。外界一般认为，**实际的军事开支可能高出政府公开的国防预算金额。**国防部公布的消息说，今年的国防预算将重点支持国防和军队改革，全面推动国防和军队现代化建设。**每年一度的国防开支预告一直受到国际广泛关注。各国试图从中了解**战略意图的变化和发展。",
        "entities": [
            {
                "start_idx": 0,
                "end_idx": 4,
                "type": "ORG",
                "entity": "**政府"
            },
            {
                "start_idx": 6,
                "end_idx": 11,
                "type": "TIM",
                "entity": "2019年"
            },
            {
                "start_idx": 18,
                "end_idx": 20,
                "type": "NUM",
                "entity": "一年"
            },
            {
                "start_idx": 57,
                "end_idx": 66,
                "type": "TIM",
                "entity": "2019年3月5日"
            },
            {
                "start_idx": 93,
                "end_idx": 103,
                "type": "NUM",
                "entity": "11899亿元人民币"
            },
            {
                "start_idx": 109,
                "end_idx": 116,
                "type": "NUM",
                "entity": "1780亿美元"
            },
            {
                "start_idx": 124,
                "end_idx": 126,
                "type": "LOC",
                "entity": "**"
            },
            {
                "start_idx": 149,
                "end_idx": 154,
                "type": "ORG",
                "entity": "**国防部"
            },
            {
                "start_idx": 196,
                "end_idx": 198,
                "type": "LOC",
                "entity": "**"
            },
            {
                "start_idx": 228,
                "end_idx": 230,
                "type": "LOC",
                "entity": "**"
            }
        ]
}

Predictions:

{
        "text": "**政府宣布2019年国防开支将比前一年增长7.5%，超过预计今年的经济增长率。第十三届全国人大第二次会议星期二（2019年3月5日）在开幕时公布的政府预算报告显示，今年的国防开支将达到11899亿元人民币，相当于大约1780亿美元。外界一般认为，**实际的军事开支可能高出政府公开的国防预算金额。**国防部公布的消息说，今年的国防预算将重点支持国防和军队改革，全面推动国防和军队现代化建设。**每年一度的国防开支预告一直受到国际广泛关注。各国试图从中了解**战略意图的变化和发展。",
        "pred_entities": [
            {
                "start_idx": 0,
                "end_idx": 1,
                "type": "TIM",
                "entity": "中"
            },
            {
                "start_idx": 0,
                "end_idx": 10,
                "type": "TIM",
                "entity": "**政府宣布2019"
            },
            {
                "start_idx": 0,
                "end_idx": 12,
                "type": "TIM",
                "entity": "**政府宣布2019年国"
            },
            {
                "start_idx": 0,
                "end_idx": 24,
                "type": "TIM",
                "entity": "**政府宣布2019年国防开支将比前一年增长7."
            },
            {
                "start_idx": 0,
                "end_idx": 34,
                "type": "TIM",
                "entity": "**政府宣布2019年国防开支将比前一年增长7.5%，超过预计今年的"
            },
            {
                "start_idx": 0,
                "end_idx": 46,
                "type": "TIM",
                "entity": "**政府宣布2019年国防开支将比前一年增长7.5%，超过预计今年的经济增长率。第十三届全国"
            },
            ...
            {
                "start_idx": 236,
                "end_idx": 240,
                "type": "WEA",
                "entity": "化和发展"
            },
            {
                "start_idx": 237,
                "end_idx": 240,
                "type": "WEA",
                "entity": "和发展"
            }
       ]
}

I then looked into the code and found that in decode_ent the array of predicted entity start and end indices is very large:

d = np.where(pred_matrix > threshold)

print(np.array(d).shape)
# Out[4]: (3, 112304)

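It looks like the model is not predicting entity boundaries well. I have checked the annotated entities and their indices are correct. Has the author run into anything similar? Or does the GP model simply behave this way when entities are long, nesting is deep, or the context is rich? Thanks!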
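
For reference, decoding from a score tensor can be sketched as below. This is a simplified stand-in for decode_ent, not the repository's code: it keeps only spans with start <= end and optionally caps the span length. The name `decode_spans`, the `max_len` parameter and the sample scores are assumptions. A length cap only filters the output and does not address an under-trained model.

```python
def decode_spans(scores, threshold=0.0, max_len=None):
    """scores[t][i][j] is the score of tokens i..j (inclusive) being an entity of type t."""
    spans = []
    for t, matrix in enumerate(scores):
        for i, row in enumerate(matrix):
            for j in range(i, len(row)):          # only start <= end
                if max_len is not None and j - i + 1 > max_len:
                    break
                if row[j] > threshold:
                    spans.append((t, i, j))
    return spans

scores = [[[-5.0, 2.0, 3.0],
           [4.0, -5.0, -5.0],
           [-5.0, -5.0, 1.5]]]

all_spans = decode_spans(scores)
short_spans = decode_spans(scores, max_len=2)
print(all_spans, short_spans)
```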

Can the [CLS] and [SEP] separators be left out? Is this intentional?

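Can a model with sequence length 1024 be trained?

Hello, I see you use Rotary Position Embedding (RoPE). Does that mean a model with sequence length 1024 can be trained?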

Why are f1, precision and recall accumulated per step?

Computed this way, the result seems to differ from the usual definition of F1.
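
The difference can be seen in a small sketch that compares the average of per-batch F1 scores with the F1 computed from counts accumulated over all batches. The batch numbers are made up for illustration and the function name `prf` is hypothetical.

```python
def prf(tp, n_pred, n_gold):
    """Precision, recall and F1 from true positives, predicted count and gold count."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# (true positives, predicted entities, gold entities) per batch; made-up numbers
batches = [(1, 1, 10), (8, 10, 10)]

mean_batch_f1 = sum(prf(*b)[2] for b in batches) / len(batches)
micro_f1 = prf(*(sum(col) for col in zip(*batches)))[2]
print(mean_batch_f1, micro_f1)
```

Summing the counts first and computing F1 once at the end gives the usual micro-averaged entity-level F1, which generally differs from the mean of per-step values.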

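Number of labels

Hello, your work solves my nested-label problem very well, but my task has close to ten thousand labels (very fine-grained). This turns self.dense into a linear layer of nearly 4 GB, and since every label takes up its own (1, seq_len, seq_len) space, training costs a lot of time and GPU memory. Does the author know of an NER model suited to such fine-grained labels? Many thanks!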
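
The size quoted above can be checked against the layer shape nn.Linear(hidden_size, ent_type_size * inner_dim * 2). This sketch assumes hidden_size = 768 (BERT base), inner_dim = 64, float32 weights and a sequence length of 512; the last three values are assumptions, and the function name is hypothetical.

```python
def dense_layer_params(hidden_size, ent_type_size, inner_dim):
    """Parameter count of a linear layer hidden_size -> ent_type_size * inner_dim * 2."""
    out_features = ent_type_size * inner_dim * 2
    return hidden_size * out_features + out_features   # weights + bias

params = dense_layer_params(hidden_size=768, ent_type_size=10000, inner_dim=64)
size_gb = params * 4 / 1e9            # float32 weights, in GB
logit_values = 10000 * 512 * 512      # one seq_len x seq_len score map per label, per sample
print(params, size_gb, logit_values)
```

Under these assumptions the layer has roughly 0.98 billion parameters, which is consistent with the "nearly 4 GB" figure in the question.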

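When train_step calls loss_fun the first argument is pred, but loss_fun is defined with label as its first argument

When train_step calls loss_fun, the first argument is pred:

GlobalPointer_pytorch/train.py

Line 159 in d32f84b

loss = criterion(logits, batch_labels)

But in the definition the first argument is the label:
def loss_fun(y_true, y_pred):

The loss computation is symmetric, so nothing actually goes wrong, but I would still suggest that the author fix it.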

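Removing the padding part, and the acc computed at the end

I see that Su Jianlin's code removes the padding part. Also, the acc computed at the end divides by y_pred.sum(), so what you compute is really precision, right? Since everything is at the entity level, acc seems unnecessary.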

Are these the results you got in your own tests too?

After training for 33 epochs I got the following results:
avg_precision: 0.7733637138826565, avg_recall: 0.7902737446924301, avg_f1: 0.7806660288039
Best F1: 0.7875311899437892

Isn't this result still slightly worse than BERT+CRF?
I mean the BERT+CRF results in https://github.com/lonePatient/BERT-NER-Pytorch:

Accuracy (entity) | Recall (entity) | F1 score (entity)
0.7977 | 0.8177 | 0.8076

Can this replace CRF for information extraction tasks? What would need to be changed?

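ValueError: pretrained model problem

With the transformers version you specify, I cannot download the bert-base-chinese model you link on Hugging Face, and the same error appears wherever transformers is used in other code. Could you advise?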


"Connection error, and we cannot find the requested files in the cached path."
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.
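
A common workaround is to download the model files manually and point bert_path at the local directory, as the README suggests. The sketch below checks that the directory contains the files a PyTorch BERT checkpoint normally needs. The helper `missing_model_files` is hypothetical, and the file list assumes the standard layout of the bert-base-chinese checkpoint.

```python
from pathlib import Path

# File names assumed from the standard bert-base-chinese checkpoint layout.
REQUIRED_FILES = ("config.json", "vocab.txt", "pytorch_model.bin")

def missing_model_files(bert_path):
    """Return the required files that are absent from the local model directory."""
    root = Path(bert_path)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

If the returned list is empty, passing the same directory to from_pretrained should load the model from disk without contacting the network.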

How do I predict on unlabelled data?

Which part of the code handles prediction on unlabelled data, and how do I run it? Thanks.

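There seems to be a bug?

The implementation referenced below looks up the token_span, but it does not seem to handle entities with the same surface form. Take the example "张三传是由张三在2021年拍摄" ("The Story of Zhang San was filmed by Zhang San in 2021"): the first 张三 may belong to a movie entity, while the second 张三 is a director entity.

However, the while loop there has a break that exits on the first match. In the example above, a search for the second 张三 seems to stop as soon as the first 张三 is matched.

Note: the code is in common/utils.py, class Preprocessor, function get_ent2token_spans.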
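
One way to avoid the ambiguity, sketched here under assumptions, is to map the annotated character offsets to token indices rather than searching for the entity text. This is not the repository's get_ent2token_spans. The function name is hypothetical, the end offset is taken as exclusive, and the per-character tokenisation with a leading [CLS] is a simplification; fast tokenizers in transformers can return a comparable offset list via return_offsets_mapping=True.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """offsets[i] = (char_start, char_end) of token i, with exclusive ends."""
    start_tok = end_tok = None
    for idx, (s, e) in enumerate(offsets):
        if s == e:                 # special tokens such as [CLS] map to (0, 0)
            continue
        if s == start_char:
            start_tok = idx
        if e == end_char:
            end_tok = idx
    if start_tok is None or end_tok is None:
        return None                # the span does not align with token boundaries
    return start_tok, end_tok

text = "张三传是由张三在2021年拍摄"
# One token per character with [CLS] at index 0 (assumed tokenisation for the sketch).
offsets = [(0, 0)] + [(i, i + 1) for i in range(len(text))]

movie = char_span_to_token_span(offsets, 0, 3)      # "张三传"
director = char_span_to_token_span(offsets, 5, 7)   # the second "张三"
print(movie, director)
```

Because the lookup is driven by offsets, the two occurrences of 张三 resolve to different token spans.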

@gaohongkui

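for epoch in range(hyper_parameters["epochs"]):
    train(model, train_dataloader, epoch)

Judging from this code, every epoch trains the model with a new optimizer and the initial learning_rate, which differs from the usual training setup. Is this some trick of the author's?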

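Hello, after switching to another dataset I seem to hit a problem with texts being too long. The error is below. I hope you can answer.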

RuntimeError: The size of tensor a (1087) must match the size of tensor b (512) at non-singleton dimension 1
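
The 512 in the message matches the maximum number of positions of bert-base-chinese, so longer inputs have to be truncated or split. Below is a minimal sketch of splitting a token list into overlapping windows. The function name, the window size of 510 (leaving room for [CLS] and [SEP]) and the overlap of 128 are assumptions; entity offsets would still need to be shifted by each window's start.

```python
def sliding_windows(tokens, max_len=510, overlap=128):
    """Return (start_offset, window) pairs that cover `tokens` with overlapping windows."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    windows = []
    start = 0
    while True:
        windows.append((start, tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break
        start += step
    return windows

tokens = list(range(1085))   # stand-in for a tokenised text of 1085 tokens
windows = sliding_windows(tokens)
print([start for start, _ in windows])
```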

Why does GP divide by a square root at the end when computing the logits?

GlobalPointer_pytorch/models/GlobalPointer.py

Line 208 in d32f84b

return logits/self.inner_dim**0.5
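
This is the same scaling used in scaled dot-product attention. As a sketch of the usual argument, assume the components of q and k are independent with zero mean and unit variance, and write d for inner_dim. Both the assumption and the symbol d are introduced here for illustration; whether this was the author's motivation is not stated.

```latex
\operatorname{Var}\!\left(q^{\top}k\right)
  = \sum_{i=1}^{d} \operatorname{Var}(q_i k_i)
  = d
\quad\Longrightarrow\quad
\operatorname{Var}\!\left(\frac{q^{\top}k}{\sqrt{d}}\right) = 1
```

Dividing by the square root of d keeps the scale of the logits independent of the head dimension.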

Thanks.
