sunyilgdx / sifrank_zh

Keyphrase or Keyword Extraction: a Chinese keyword extraction method based on a pre-trained model (the Chinese-language code for the paper "SIFRank: A New Baseline for Unsupervised Keyphrase Extraction Based on Pre-trained Language Model").

Python 100.00%
sifrank keyphrase-extraction keyword-extraction elmo pre-trained-language-models sif word-embeddings sentence-embeddings python36

sifrank_zh's Introduction

SIFRank_zh

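This is the code accompanying our paper, "SIFRank: A New Baseline for Unsupervised Keyphrase Extraction Based on Pre-trained Language Model". The paper extracts keyphrases from English text; this repository ports the method to Chinese, with changes to parts of the pipeline. The original English version is here.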

Version history

  • 2020/03/03: initial version. It contains only the most basic functionality; some details still need to be optimized and extended.

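Core algorithm

Pre-trained model ELMo + sentence embedding model SIF

Advantages of ELMo word embeddings: 1) they are pre-trained on large-scale corpora, so they carry more semantic information than statistical and graph-based methods such as TFIDF and TextRank; 2) ELMo is contextual (dynamic), which helps with polysemy; 3) ELMo encodes words with a Char-CNN, so it handles rare words well; 4) different ELMo layers capture different levels of information.

Advantages of SIF sentence embeddings: 1) word vectors are weighted by smooth inverse frequency derived from word frequency, which better captures the central topic of a sentence; 2) common words are filtered out more effectively.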
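The smooth inverse frequency weighting can be sketched as follows. This is a minimal illustration, not the repository's implementation: the weight a / (a + p(w)) is the standard SIF form, while the function names, the toy 2-dimensional vectors, the toy word frequencies, and the value a = 1e-3 are assumptions made for the example.

```python
from typing import Dict, List


def sif_weight(word: str, freq: Dict[str, float], a: float = 1e-3) -> float:
    """Smooth inverse frequency weight a / (a + p(w)); unseen words get weight 1."""
    return a / (a + freq.get(word, 0.0))


def sif_sentence_embedding(
    tokens: List[str],
    vectors: Dict[str, List[float]],
    freq: Dict[str, float],
    a: float = 1e-3,
) -> List[float]:
    """Weighted average of word vectors, down-weighting frequent words."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for tok in tokens:
        w = sif_weight(tok, freq, a)
        for i, x in enumerate(vectors[tok]):
            total[i] += w * x
    n = max(len(tokens), 1)  # avoid division by zero on empty input
    return [x / n for x in total]


# Toy data (hypothetical): a very common word and a rare, topical word.
vectors = {"的": [1.0, 0.0], "关键词": [0.0, 1.0]}
freq = {"的": 0.05, "关键词": 1e-5}
emb = sif_sentence_embedding(["的", "关键词"], vectors, freq)
```

In this toy case the rare word dominates the sentence vector, which is the effect the two advantages above describe.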

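Candidate keyphrase identification

The sentence is first segmented into words and POS-tagged; a regular expression then identifies noun phrases (for example, adjective + noun), which serve as candidate keyphrases.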
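A minimal sketch of this step, assuming the input is already a list of (word, tag) pairs. The tag convention ("a" for adjectives, tags starting with "n" for nouns) is a simplified, THULAC-like assumption, and the pattern "zero or more adjectives followed by one or more nouns" is only an illustrative grammar, not necessarily the one used in this repository.

```python
import re
from typing import List, Tuple


def extract_candidates(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return noun-phrase candidates matching (adjective)* (noun)+."""
    # Encode every token as one character so a regex can scan the tag sequence:
    # A = adjective, N = noun-like tag, O = anything else.
    codes = "".join(
        "A" if tag == "a" else "N" if tag.startswith("n") else "O"
        for _, tag in tagged
    )
    candidates = []
    for match in re.finditer(r"A*N+", codes):
        words = [word for word, _ in tagged[match.start():match.end()]]
        candidates.append("".join(words))  # Chinese words are joined without spaces
    return candidates


# Hypothetical tagger output for illustration.
tagged = [("高级", "a"), ("数据库", "n"), ("系统", "n"),
          ("和", "c"), ("机器", "n"), ("学习", "v")]
print(extract_candidates(tagged))
```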

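Ranking candidate keyphrases by importance (SIFRank)

The same algorithm is used to compute sentence embeddings for the whole document (or sentence) and for each candidate keyphrase; the cosine similarity between each candidate and the document is then computed and used as the importance score.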
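The ranking step can be sketched as below. The function names and the toy embeddings are hypothetical; in the real pipeline the vectors would come from the ELMo + SIF embedding step rather than being typed in by hand.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(
    doc_emb: List[float],
    cand_embs: Dict[str, List[float]],
    top_n: int = 15,
) -> List[Tuple[str, float]]:
    """Score every candidate against the document and keep the top_n."""
    scored = [(phrase, cosine_similarity(doc_emb, emb))
              for phrase, emb in cand_embs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Toy embeddings (hypothetical values).
doc = [1.0, 1.0]
cands = {"close": [1.0, 1.0], "partial": [1.0, 0.0], "opposite": [-1.0, -1.0]}
ranked = rank_candidates(doc, cands, top_n=3)
```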

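Document segmentation (DS) + embeddings alignment (EA)

DS: the document is split into short, complete sentences (around 16 words each) so that ELMo can be computed in parallel and therefore faster. EA: anchor word embeddings are used to align the embeddings of the same word across different sentences, which stabilizes a word's representation when it occurs in the same context.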
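A rough sketch of the DS idea only (EA is not covered here). It assumes the document is already tokenized and closes a segment at a punctuation token once the segment holds at least min_len tokens. The function name, the punctuation set, and the exact cut rule are assumptions for illustration; the repository's own segmentation may differ.

```python
from typing import List, Sequence


def segment_tokens(
    tokens: Sequence[str],
    min_len: int = 16,
    breakers: Sequence[str] = ("。", ",", "!", "?", ";", ".", ","),
) -> List[List[str]]:
    """Split tokens into segments that end on punctuation and have >= min_len tokens.

    The final segment may be shorter, since it simply collects what is left.
    """
    segments: List[List[str]] = []
    current: List[str] = []
    for tok in tokens:
        current.append(tok)
        # Cut only at a punctuation mark, and only once the segment is long enough.
        if tok in breakers and len(current) >= min_len:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


# Toy input: the first "。" comes too early to cut, the later "," does cut.
tokens = ["w"] * 5 + ["。"] + ["w"] * 20 + [","] + ["w"] * 3
segments = segment_tokens(tokens)
```

Each resulting segment can then be passed to ELMo as one item of a batch, which is where the speed-up described above would come from.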

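Position bias weighting (SIFRank+)

Core idea: in long texts, words that appear earlier tend to be more important.

The position p of each word's first occurrence is therefore used to produce a weight, 1/(p + u), which is then passed through a softmax; u is a hyperparameter, set to 3.4 based on experiments.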
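The weighting can be sketched as follows, assuming positions are counted from 1 and the softmax is taken over all distinct words in the document. Both are assumptions of this example, as is the function name.

```python
import math
from typing import Dict, Sequence


def position_weights(tokens: Sequence[str], u: float = 3.4) -> Dict[str, float]:
    """Softmax over 1 / (p + u), where p is the first-occurrence position of each word."""
    if not tokens:
        return {}
    first_pos: Dict[str, int] = {}
    for i, tok in enumerate(tokens, start=1):  # 1-based positions (assumption)
        first_pos.setdefault(tok, i)           # keep only the first occurrence
    raw = {tok: 1.0 / (p + u) for tok, p in first_pos.items()}
    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(raw.values())
    exps = {tok: math.exp(v - peak) for tok, v in raw.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


weights = position_weights(["a", "b", "a", "c"])
```

Words that first appear earlier receive a larger share of the total weight, and the weights sum to 1.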

Environment

Python 3.6
nltk 3.4.3
elmoformanylangs 0.0.3
thulac 0.2.1
torch 1.1.0

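Tips

elmoformanylangs 0.0.3 from HIT (Harbin Institute of Technology) has a fairly obvious bug: the code that is meant to return the embeddings of all layers is wrong. With output_layer=-2 it does not return the vectors of all layers, only those of the second-to-last layer. The problem is discussed in #31.

elmo.sents2elmo(sents_tokened,output_layer=-2)

We suggest modifying the code in class Embedder(object) in elmo.py as follows.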

Original code:

if output_layer == -1:
    payload = np.average(data, axis=0)
else:
    payload = data[output_layer]

Modified code:

if output_layer == -1:
    payload = np.average(data, axis=0)
# code changed here
elif output_layer == -2:
    payload = data
else:
    payload = data[output_layer]
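The effect of the change can be seen in a small stand-alone sketch. Plain Python lists stand in for the NumPy array here, and select_payload is a hypothetical helper that mirrors only the three branches shown above; it is not part of elmoformanylangs.

```python
from typing import List, Union

Layer = List[List[float]]  # one vector per token


def select_payload(data: List[Layer], output_layer: int) -> Union[Layer, List[Layer]]:
    """Mimic the patched branch logic on nested lists shaped [layer][token][dim]."""
    if output_layer == -1:
        # Average over the layer axis, token by token and dimension by dimension.
        n_layers = len(data)
        return [
            [sum(layer[t][d] for layer in data) / n_layers
             for d in range(len(data[0][t]))]
            for t in range(len(data[0]))
        ]
    elif output_layer == -2:
        return data  # all layers
    else:
        return data[output_layer]  # one specific layer


# Three layers, one token, two dimensions (toy values).
data = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]]

all_layers = select_payload(data, -2)  # patched behaviour: every layer is returned
unpatched = data[-2]                   # unpatched behaviour: second-to-last layer only
```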

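Download

  • HIT ELMo zhs.model: download it from here, extract it, and save it under the auxiliary_data/ directory (note that the config file must be changed as its instructions require). Some of the files are already included in this project; please add the larger model files, encoder.pkl and token_embedder.pkl, yourself.
  • Tsinghua word segmentation toolkit THULAC thulac.models: download it from here, extract it, and save it under the auxiliary_data/ directory.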

Usage

from embeddings import sent_emb_sif, word_emb_elmo
from model.method import SIFRank, SIFRank_plus
import thulac

#download from https://github.com/HIT-SCIR/ELMoForManyLangs
model_file = r'../auxiliary_data/zhs.model/'

ELMO = word_emb_elmo.WordEmbeddings(model_file)
SIF = sent_emb_sif.SentEmbeddings(ELMO, lamda=1.0)
#download from http://thulac.thunlp.org/
zh_model = thulac.thulac(model_path=r'../auxiliary_data/thulac.models/',user_dict=r'../auxiliary_data/user_dict.txt')
elmo_layers_weight = [0.5, 0.5, 0.0]

text = "计算机科学与技术(Computer Science and Technology)是国家一级学科,下设信息安全、软件工程、计算机软件与理论、计算机系统结构、计算机应用技术、计算机技术等专业。 [1]主修大数据技术导论、数据采集与处理实践(Python)、Web前/后端开发、统计与数据分析、机器学习、高级数据库系统、数据可视化、云计算技术、人工智能、自然语言处理、媒体大数据案例分析、网络空间安全、计算机网络、数据结构、软件工程、操作系统等课程,以及大数据方向系列实验,并完成程序设计、数据分析、机器学习、数据可视化、大数据综合应用实践、专业实训和毕业设计等多种实践环节。"
keyphrases = SIFRank(text, SIF, zh_model, N=15,elmo_layers_weight=elmo_layers_weight)
keyphrases_ = SIFRank_plus(text, SIF, zh_model, N=15, elmo_layers_weight=elmo_layers_weight)
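The elmo_layers_weight argument above has one entry per ELMo layer. How the repository applies these weights is not shown here; a natural reading, and the assumption behind this sketch, is a per-layer weighted sum of the token vectors. combine_layers is a hypothetical helper written with plain lists.

```python
from typing import List, Sequence


def combine_layers(
    layers: Sequence[Sequence[Sequence[float]]],
    weights: Sequence[float],
) -> List[List[float]]:
    """Weighted sum over the layer axis of data shaped [layer][token][dim]."""
    if len(layers) != len(weights):
        raise ValueError("need exactly one weight per layer")
    n_tokens = len(layers[0])
    dim = len(layers[0][0])
    return [
        [sum(w * layer[t][d] for w, layer in zip(weights, layers))
         for d in range(dim)]
        for t in range(n_tokens)
    ]


# Three layers, one token, two dimensions (toy values).
layers = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]]
mixed = combine_layers(layers, [0.5, 0.5, 0.0])
```

With weights [0.5, 0.5, 0.0], the last layer is ignored and the first two contribute equally.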

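Example

We took a passage from Baidu Baike describing "计算机科学与技术" (Computer Science and Technology) as the extraction target (shown below) and look at the Top 10 results.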

text = "计算机科学与技术(Computer Science and Technology)是国家一级学科,下设信息安全、软件工程、计算机软件与理论、计算机系统结构、计算机应用技术、计算机技术等专业。 [1]主修大数据技术导论、数据采集与处理实践(Python)、Web前/后端开发、统计与数据分析、机器学习、高级数据库系统、数据可视化、云计算技术、人工智能、自然语言处理、媒体大数据案例分析、网络空间安全、计算机网络、数据结构、软件工程、操作系统等课程,以及大数据方向系列实验,并完成程序设计、数据分析、机器学习、数据可视化、大数据综合应用实践、专业实训和毕业设计等多种实践环节。"

  • SIFRank_zh results
Keyphrase       Score
大数据技术导论  0.9346
计算机软件      0.9211
计算机系统结构  0.9182
高级数据库系统  0.9022
计算机网络      0.8998
媒体大数据案例  0.8997
数据结构       0.8971
软件工程       0.8955
大数据         0.8907
计算机技术     0.8838
  • SIFRank+_zh results
Keyphrase       Score
计算机软件       0.9396
计算机科学与技术  0.9286
计算机系统结构    0.9245
大数据技术导论    0.9222
软件工程         0.9213
信息             0.8787
计算机技术       0.8778
高级数据库系统    0.8770
computer        0.8717
媒体大数据案例    0.8687
  • jieba segmentation + TFIDF results
Keyword         Score
数据         0.8808
可视化       0.5891
技术         0.3726
机器         0.3496
毕业设计     0.3369
专业         0.3260
网络空间     0.3235
数据库系统   0.2983
数据结构     0.2801
计算技术     0.2738
  • jieba segmentation + TextRank results
Keyword         Score
数据        1.0000
技术        0.4526
可视化      0.3170
计算机系统  0.2488
机器        0.2420
结构        0.2371
计算机      0.2365
专业        0.2121
网络空间    0.2103
计算技术    0.1954

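Analysis

Our SIFRank and SIFRank+ use the contextual pre-trained word embedding model ELMo and the sentence embedding model SIF to extract keyphrases in a fully unsupervised way. Compared with jieba's TFIDF and TextRank algorithms, the extracted keyphrases are more complete, and because pre-trained knowledge is brought in, the relationships among the keyphrases are richer and no longer limited to the structure of the sentences themselves.

In addition, Tsinghua's segmentation model supports a custom user dictionary, which keeps proper nouns intact, and ELMo's CNN encoding layer improves the recognition and encoding of proper nouns.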

Citation

If you use this code, please cite this paper

@article{DBLP:journals/access/SunQZWZ20,
  author    = {Yi Sun and
               Hangping Qiu and
               Yu Zheng and
               Zhongwei Wang and
               Chaoran Zhang},
  title     = {SIFRank: {A} New Baseline for Unsupervised Keyphrase Extraction Based
               on Pre-Trained Language Model},
  journal   = {{IEEE} Access},
  volume    = {8},
  pages     = {10896--10906},
  year      = {2020},
  url       = {https://doi.org/10.1109/ACCESS.2020.2965087},
  doi       = {10.1109/ACCESS.2020.2965087},
  timestamp = {Fri, 07 Feb 2020 12:04:22 +0100},
  biburl    = {https://dblp.org/rec/journals/access/SunQZWZ20.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

sifrank_zh's Issues

Error when running test.py: size mismatch for word_emb_layer.embedding.weight

Hello, I got the following error when running the test case test.py that you provide:

2021-08-15 18:28:10,586 INFO: char embedding size: 6169
2021-08-15 18:28:10,924 INFO: word embedding size: 71222
2021-08-15 18:28:16,333 INFO: Model(
  (token_embedder): ConvTokenEmbedder(
    (word_emb_layer): EmbeddingLayer(
      (embedding): Embedding(71222, 100, padding_idx=3)
    )
    (char_emb_layer): EmbeddingLayer(
      (embedding): Embedding(6169, 50, padding_idx=6166)
    )
    (convolutions): ModuleList(
      (0): Conv1d(50, 32, kernel_size=(1,), stride=(1,))
      (1): Conv1d(50, 32, kernel_size=(2,), stride=(1,))
      (2): Conv1d(50, 64, kernel_size=(3,), stride=(1,))
      (3): Conv1d(50, 128, kernel_size=(4,), stride=(1,))
      (4): Conv1d(50, 256, kernel_size=(5,), stride=(1,))
      (5): Conv1d(50, 512, kernel_size=(6,), stride=(1,))
      (6): Conv1d(50, 1024, kernel_size=(7,), stride=(1,))
    )
    (highways): Highway(
      (_layers): ModuleList(
        (0): Linear(in_features=2048, out_features=4096, bias=True)
        (1): Linear(in_features=2048, out_features=4096, bias=True)
      )
    )
    (projection): Linear(in_features=2148, out_features=512, bias=True)
  )
  (encoder): ElmobiLm(
    (forward_layer_0): LstmCellWithProjection(
      (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
      (state_linearity): Linear(in_features=512, out_features=16384, bias=True)
      (state_projection): Linear(in_features=4096, out_features=512, bias=False)
    )
    (backward_layer_0): LstmCellWithProjection(
      (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
      (state_linearity): Linear(in_features=512, out_features=16384, bias=True)
      (state_projection): Linear(in_features=4096, out_features=512, bias=False)
    )
    (forward_layer_1): LstmCellWithProjection(
      (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
      (state_linearity): Linear(in_features=512, out_features=16384, bias=True)
      (state_projection): Linear(in_features=4096, out_features=512, bias=False)
    )
    (backward_layer_1): LstmCellWithProjection(
      (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
      (state_linearity): Linear(in_features=512, out_features=16384, bias=True)
      (state_projection): Linear(in_features=4096, out_features=512, bias=False)
    )
  )
)
Traceback (most recent call last):
  File "/Users/xing.sun/PycharmProjects/SIFRank_zh/test/test.py", line 14, in <module>
    ELMO = word_emb_elmo.WordEmbeddings(model_file)
  File "/Users/xing.sun/PycharmProjects/SIFRank_zh/embeddings/word_emb_elmo.py", line 22, in __init__
    self.elmo = Embedder(model_path)
  File "/Users/xing.sun/opt/anaconda3/envs/bert4keras36/lib/python3.6/site-packages/elmoformanylangs/elmo.py", line 106, in __init__
    self.model, self.config = self.get_model()
  File "/Users/xing.sun/opt/anaconda3/envs/bert4keras36/lib/python3.6/site-packages/elmoformanylangs/elmo.py", line 182, in get_model
    model.load_model(self.model_dir)
  File "/Users/xing.sun/opt/anaconda3/envs/bert4keras36/lib/python3.6/site-packages/elmoformanylangs/frontend.py", line 207, in load_model
    map_location=lambda storage, loc: storage))
  File "/Users/xing.sun/opt/anaconda3/envs/bert4keras36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 777, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ConvTokenEmbedder:
	size mismatch for word_emb_layer.embedding.weight: copying a param with shape torch.Size([140384, 100]) from checkpoint, the shape in current model is torch.Size([71222, 100]).
	size mismatch for char_emb_layer.embedding.weight: copying a param with shape torch.Size([15889, 50]) from checkpoint, the shape in current model is torch.Size([6169, 50]).

My environment is set up as specified in your README.

Looking forward to your reply. Thank you.

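Error at runtime

ValueError: could not broadcast input array from shape (3,41,1024) into shape (3)

The sentence will not be split by '。'

If you do split it, there will be another error.

Author, please take a look: this Chinese version of SIF has real problems.
The input is not split into sentences at the Chinese full stop; without a '.' everything is treated as one sentence.
If the text is split into sentences, the Tensors in the list returned by the embedding step have inconsistent dimensions (they depend on the number of words). How should pad_sequence be done here?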
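One way to pad variable-length sentence embeddings to a common length is sketched below, using plain nested lists shaped [sentence][token][dim]. pad_to_max_len is a hypothetical helper, not code from this repository. Because the target length is taken from the embeddings themselves, the number of padding rows can never be negative.

```python
from typing import List

Sentence = List[List[float]]  # one vector per token


def pad_to_max_len(embeddings: List[Sentence], pad_value: float = 0.0) -> List[Sentence]:
    """Right-pad every sentence with constant vectors up to the longest sentence."""
    max_len = max((len(sent) for sent in embeddings), default=0)
    # Vector size is read from the first non-empty sentence (0 if there is none).
    dim = next((len(sent[0]) for sent in embeddings if sent), 0)
    padded = []
    for sent in embeddings:
        pad_rows = [[pad_value] * dim for _ in range(max_len - len(sent))]
        padded.append([list(vec) for vec in sent] + pad_rows)
    return padded


# Two sentences with 1 and 3 tokens (toy values).
embs = [[[1.0, 1.0]], [[2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]]
padded = pad_to_max_len(embs)
```

With PyTorch tensors, torch.nn.utils.rnn.pad_sequence serves the same purpose for tensors whose first dimension is the sequence length.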

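Extracted keywords tend to contain English letters

Hi!
I used this code to extract keywords from the full text of the book 《大话数据结构》 and found that most of the resulting keywords contain letters and do not look much like words, as in the figure below.
How can I improve this?

[image: SIFRank keywords]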

COMPARE WITH OTHER PRE-TRAINED LANGUAGE MODELS

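Hi, thanks for open-sourcing the code.
The paper mentions "We compare the effect of replacing ELMo with word embeddings of GloVe and BERT".
How should the code be modified to replace ELMo with another pre-trained model such as BERT?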

Cannot run: ValueError: index can't contain negative values

2022-01-10 17:05:17,709 INFO: char embedding size: 6169
2022-01-10 17:05:17,918 INFO: word embedding size: 71222
2022-01-10 17:05:21,442 INFO: Model(
  (token_embedder): ConvTokenEmbedder(
    (word_emb_layer): EmbeddingLayer(
      (embedding): Embedding(71222, 100, padding_idx=3)
    )
    (char_emb_layer): EmbeddingLayer(
      (embedding): Embedding(6169, 50, padding_idx=6166)
    )
    (convolutions): ModuleList(
      (0): Conv1d(50, 32, kernel_size=(1,), stride=(1,))
      (1): Conv1d(50, 32, kernel_size=(2,), stride=(1,))
      (2): Conv1d(50, 64, kernel_size=(3,), stride=(1,))
      (3): Conv1d(50, 128, kernel_size=(4,), stride=(1,))
      (4): Conv1d(50, 256, kernel_size=(5,), stride=(1,))
      (5): Conv1d(50, 512, kernel_size=(6,), stride=(1,))
      (6): Conv1d(50, 1024, kernel_size=(7,), stride=(1,))
    )
    (highways): Highway(
      (_layers): ModuleList(
        (0): Linear(in_features=2048, out_features=4096, bias=True)
        (1): Linear(in_features=2048, out_features=4096, bias=True)
      )
    )
    (projection): Linear(in_features=2148, out_features=512, bias=True)
  )
  (encoder): ElmobiLm(
    (forward_layer_0): LstmCellWithProjection(
      (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
      (state_linearity): Linear(in_features=512, out_features=16384, bias=True)
      (state_projection): Linear(in_features=4096, out_features=512, bias=False)
    )
    (backward_layer_0): LstmCellWithProjection(
      (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
      (state_linearity): Linear(in_features=512, out_features=16384, bias=True)
      (state_projection): Linear(in_features=4096, out_features=512, bias=False)
    )
    (forward_layer_1): LstmCellWithProjection(
      (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
      (state_linearity): Linear(in_features=512, out_features=16384, bias=True)
      (state_projection): Linear(in_features=4096, out_features=512, bias=False)
    )
    (backward_layer_1): LstmCellWithProjection(
      (input_linearity): Linear(in_features=512, out_features=16384, bias=False)
      (state_linearity): Linear(in_features=512, out_features=16384, bias=True)
      (state_projection): Linear(in_features=4096, out_features=512, bias=False)
    )
  )
)
Model loaded succeed
2022-01-10 17:05:24,990 INFO: 1 batches, avg len: 77.5
Traceback (most recent call last):
  File "/Users/hellozhang/Desktop/dj/SIFRank_关键词提取/test/test.py", line 21, in <module>
    keyphrases = SIFRank(text, SIF, zh_model, N=5,elmo_layers_weight=elmo_layers_weight)
  File "/Users/hellozhang/Desktop/dj/SIFRank_关键词提取/model/method.py", line 179, in SIFRank
    sent_embeddings, candidate_embeddings_list = SIF.get_tokenized_sent_embeddings(text_obj,if_DS=if_DS,if_EA=if_EA)
  File "/Users/hellozhang/Desktop/dj/SIFRank_关键词提取/embeddings/sent_emb_sif.py", line 48, in get_tokenized_sent_embeddings
    elmo_embeddings = self.word_embeddor.get_tokenized_words_embeddings(tokens_segmented)
  File "/Users/hellozhang/Desktop/dj/SIFRank_关键词提取/embeddings/word_emb_elmo.py", line 29, in get_tokenized_words_embeddings
    elmo_embedding = [np.pad(emb, pad_width=((0,0),(0,max_len-emb.shape[1]),(0,0)) , mode='constant') for emb in elmo_embedding]
  File "/Users/hellozhang/Desktop/dj/SIFRank_关键词提取/embeddings/word_emb_elmo.py", line 29, in <listcomp>
    elmo_embedding = [np.pad(emb, pad_width=((0,0),(0,max_len-emb.shape[1]),(0,0)) , mode='constant') for emb in elmo_embedding]
  File "<array_function internals>", line 6, in pad
  File "/Users/hellozhang/opt/anaconda3/envs/textrank/lib/python3.7/site-packages/numpy/lib/arraypad.py", line 748, in pad
    pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
  File "/Users/hellozhang/opt/anaconda3/envs/textrank/lib/python3.7/site-packages/numpy/lib/arraypad.py", line 519, in _as_pairs
    raise ValueError("index can't contain negative values")
ValueError: index can't contain negative values
How should I deal with this problem?

Error when executing test.py

Traceback (most recent call last):
  File "D:/Codes/Information-extraction/keyword-extraction/SIFRank_zh/test/test.py", line 16, in <module>
    ELMO = word_emb_elmo.WordEmbeddings(model_file)
  File "D:\Codes\Information-extraction\keyword-extraction\SIFRank_zh\embeddings\word_emb_elmo.py", line 19, in __init__
    self.elmo = Embedder(model_path)
  File "D:\Codes\Information-extraction\keyword-extraction\SIFRank_zh\elmoformanylangs\elmo.py", line 107, in __init__
    self.model, self.config = self.get_model()
  File "D:\Codes\Information-extraction\keyword-extraction\SIFRank_zh\elmoformanylangs\elmo.py", line 163, in get_model
    model.load_model(self.model_dir)
  File "D:\Codes\Information-extraction\keyword-extraction\SIFRank_zh\elmoformanylangs\frontend.py", line 206, in load_model
    map_location=lambda storage, loc: storage))
  File "D:\softwares\miniconda\envs\torch1.8\lib\site-packages\torch\nn\modules\module.py", line 1224, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ConvTokenEmbedder:
	size mismatch for word_emb_layer.embedding.weight: copying a param with shape torch.Size([140384, 100]) from checkpoint, the shape in current model is torch.Size([71222, 100]).
	size mismatch for char_emb_layer.embedding.weight: copying a param with shape torch.Size([15889, 50]) from checkpoint, the shape in current model is torch.Size([6169, 50]).

The stanfordcorenlp import in model/input_representation.py is not commented out

The stanfordcorenlp import in model/input_representation.py is not commented out, which makes the run fail.
Traceback (most recent call last):
  File "test.py", line 7, in <module>
    from model.method import SIFRank, SIFRank_plus
  File "/apdcephfs/private_markowu/proj/SIFRank_zh-master/model/method.py", line 9, in <module>
    from model import input_representation
  File "/apdcephfs/private_markowu/proj/SIFRank_zh-master/model/input_representation.py", line 9, in <module>
    from stanfordcorenlp import StanfordCoreNLP
ModuleNotFoundError: No module named 'stanfordcorenlp'
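A possible workaround is to make the import optional, sketched below. It assumes the Chinese pipeline does not actually need StanfordCoreNLP, which the report implies but does not confirm; optional_import is a hypothetical helper, not code from this repository. Installing the stanfordcorenlp package is the other obvious option.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return None


# None when the package is missing, so importing this file no longer fails.
stanfordcorenlp = optional_import("stanfordcorenlp")
StanfordCoreNLP = getattr(stanfordcorenlp, "StanfordCoreNLP", None)
```

Code paths that really use StanfordCoreNLP would then need to check for None before calling it.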

Error during matching

Hello again! This time I am getting the following error:
[image]

Error when running test/test.py

The error message is as follows:
Model loaded succeed
2020-03-03 14:04:31,192 INFO: 1 batches, avg len: 153.0
Traceback (most recent call last):
  File "D:/my_code/github项目/MeteorMan's nlp_lib/关键词抽取/SIFRank_zh-master/test/test.py", line 22, in <module>
    keyphrases = SIFRank(text, SIF, zh_model, N=15,elmo_layers_weight=elmo_layers_weight)
  File "D:\my_code\github项目\MeteorMan's nlp_lib\关键词抽取\SIFRank_zh-master\model\method.py", line 179, in SIFRank
    sent_embeddings, candidate_embeddings_list = SIF.get_tokenized_sent_embeddings(text_obj,if_DS=if_DS,if_EA=if_EA)
  File "D:\my_code\github项目\MeteorMan's nlp_lib\关键词抽取\SIFRank_zh-master\embeddings\sent_emb_sif_backup.py", line 49, in get_tokenized_sent_embeddings
    elmo_embeddings = context_embeddings_alignment(elmo_embeddings, tokens_segmented)
  File "D:\my_code\github项目\MeteorMan's nlp_lib\关键词抽取\SIFRank_zh-master\embeddings\sent_emb_sif_backup.py", line 90, in context_embeddings_alignment
    emb = elmo_embeddings[i, 1, j, :]
IndexError: too many indices for tensor of dimension 3

Is there a problem with how elmo_embeddings is handled here?

Running test.py produces the following error message:

Traceback (most recent call last):
  File "test.py", line 22, in <module>
    keyphrases = SIFRank(text, SIF, zh_model, N=15,elmo_layers_weight=elmo_layers_weight)
  File "../model/method.py", line 179, in SIFRank
    sent_embeddings, candidate_embeddings_list = SIF.get_tokenized_sent_embeddings(text_obj,if_DS=if_DS,if_EA=if_EA)
  File "../embeddings/sent_emb_sif.py", line 49, in get_tokenized_sent_embeddings
    elmo_embeddings = context_embeddings_alignment(elmo_embeddings, tokens_segmented)
  File "../embeddings/sent_emb_sif.py", line 90, in context_embeddings_alignment
    emb = elmo_embeddings[i, 1, j, :]
IndexError: too many indices for tensor of dimension 3

ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (3,2) and requested shape (2,2)

Model loaded succeed
2022-06-10 00:47:25,759 INFO: 1 batches, avg len: 77.5
Traceback (most recent call last):
  File "D:\BaDouAI\SIFRank_zh-master\main.py", line 16, in <module>
    keyphrases = SIFRank(text, SIF, zh_model, N=15,elmo_layers_weight=elmo_layers_weight)
  File "D:\BaDouAI\SIFRank_zh-master\model\method.py", line 179, in SIFRank
    sent_embeddings, candidate_embeddings_list = SIF.get_tokenized_sent_embeddings(text_obj,if_DS=if_DS,if_EA=if_EA)
  File "D:\BaDouAI\SIFRank_zh-master\embeddings\sent_emb_sif.py", line 49, in get_tokenized_sent_embeddings
    elmo_embeddings = self.word_embeddor.get_tokenized_words_embeddings(tokens_segmented)
  File "D:\BaDouAI\SIFRank_zh-master\embeddings\word_emb_elmo.py", line 30, in get_tokenized_words_embeddings
    elmo_embedding = [np.pad(emb, pad_width=((0,0),(0,abs(max_len-emb.shape[1])),(0,0)) , mode='constant') for emb in elmo_embedding]
  File "D:\BaDouAI\SIFRank_zh-master\embeddings\word_emb_elmo.py", line 30, in <listcomp>
    elmo_embedding = [np.pad(emb, pad_width=((0,0),(0,abs(max_len-emb.shape[1])),(0,0)) , mode='constant') for emb in elmo_embedding]
  File "<array_function internals>", line 6, in pad
  File "C:\Users\limuo\Anaconda3\envs\KeyWordsExtraction\lib\site-packages\numpy\lib\arraypad.py", line 746, in pad
    pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
  File "C:\Users\limuo\Anaconda3\envs\KeyWordsExtraction\lib\site-packages\numpy\lib\arraypad.py", line 521, in _as_pairs
    return np.broadcast_to(x, (ndim, 2)).tolist()
  File "<array_function internals>", line 6, in broadcast_to
  File "C:\Users\limuo\Anaconda3\envs\KeyWordsExtraction\lib\site-packages\numpy\lib\stride_tricks.py", line 180, in broadcast_to
    return _broadcast_to(array, shape, subok=subok, readonly=True)
  File "C:\Users\limuo\Anaconda3\envs\KeyWordsExtraction\lib\site-packages\numpy\lib\stride_tricks.py", line 125, in _broadcast_to
    op_flags=['readonly'], itershape=shape, order='C')
ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (3,2) and requested shape (2,2)

Visualization

Hello, could you tell me how the visualization in Figure 4 of the paper was produced? Could you share the code?

Code error

The np.pad call in word_emb_elmo.py raises an error saying that the index is negative

index can't contain negative values

Traceback (most recent call last):
  File "D:/algo/SIFRank_zh-master/embeddings/word_emb_elmo.py", line 38, in <module>
    embs = elmo.get_tokenized_words_embeddings(sents)
  File "D:/algo/SIFRank_zh-master/embeddings/word_emb_elmo.py", line 29, in get_tokenized_words_embeddings
    elmo_embedding = [np.pad(emb, pad_width=((0,0),(0,max_len-emb.shape[1]),(0,0)) , mode='constant') for emb in elmo_embedding]
  File "D:/algo/SIFRank_zh-master/embeddings/word_emb_elmo.py", line 29, in <listcomp>
    elmo_embedding = [np.pad(emb, pad_width=((0,0),(0,max_len-emb.shape[1]),(0,0)) , mode='constant') for emb in elmo_embedding]
  File "<array_function internals>", line 6, in pad
  File "D:\Anaconda\envs\baidu\lib\site-packages\numpy\lib\arraypad.py", line 748, in pad
    pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
  File "D:\Anaconda\envs\baidu\lib\site-packages\numpy\lib\arraypad.py", line 519, in _as_pairs
    raise ValueError("index can't contain negative values")
ValueError: index can't contain negative values

Could you tell me what causes this error?

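Batch processing of sentences

First of all, thank you for sharing your work.
Is there an interface for processing sentences in batches, that is, submitting them batch-style rather than one sentence at a time?
Looking forward to your reply.

Data processing

Hello, how long would it take to run on the text of an entire book?

Is there other similar work?

Hello! I read your paper on unsupervised keyphrase extraction. From the related work and the model comparisons, I see that your work is the first to bring a pre-trained model into unsupervised keyphrase extraction. Has any other similar work appeared since? And how do you see this direction developing in the future?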
