649453932 / Chinese-Text-Classification-Pytorch
Chinese text classification with TextCNN, TextRNN, FastText, TextRCNN, BiLSTM_Attention, DPCNN and Transformer, based on PyTorch, ready to use out of the box.
License: MIT License
Hi guys, I really appreciate the algorithms you have provided.
Could you please use utf-8 encoding everywhere?
E.g., lines 16-17 of FastText.py should be the following:
self.class_list = [x.strip() for x in open(
    dataset + '/data/class.txt', encoding='utf-8').readlines()]
Otherwise I get the following error:
dataset + '/data/class.txt').readlines()]  # list of class names
File "C:\Python37\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 51: character maps to <undefined>
Could you please add it to each algorithm?
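A minimal sketch of the same fix in context-manager form, which could be applied wherever the repo opens a text file (assuming the identical two-line pattern in each model's Config):

import os  # only needed if you also build paths here

with open(dataset + '/data/class.txt', encoding='utf-8') as f:
    self.class_list = [x.strip() for x in f.readlines()]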
In the multi-head attention, the view() applied after computing the attention output seems to get the ordering wrong.
I tried it myself, assuming h=3, batch=1, sentence length 4, and word-vector dimension 5.
a = torch.randn(3,4,5)
a
tensor([[[-0.1241, 0.0364, 1.2337, -0.5907, 0.8305],
[-0.0610, -0.9682, 0.7830, 1.5998, -0.6637],
[ 0.1863, -1.2179, 0.0710, 0.6962, -0.0442],
[ 0.0584, -0.5964, 0.8453, -1.3244, -0.0499]],
[[ 2.7228, 0.6973, -1.2440, 1.8854, 2.3017],
[-0.1034, -1.7281, -1.1495, -0.2478, -0.8541],
[-0.2823, -0.3416, -1.3749, 0.2995, -0.1860],
[-1.1601, 0.9876, 0.2881, -1.8866, -1.3901]],
[[-1.1265, 1.2683, -0.7065, 0.0946, 0.3501],
[-0.1266, 1.2834, -1.2694, 1.1730, -0.3443],
[ 1.4679, 2.1238, 0.2405, -0.4388, 0.8566],
[ 1.8933, 0.4461, 2.2419, 0.6118, -1.5001]]])
a.view(1, -1, 15)
tensor([[[-0.1241, 0.0364, 1.2337, -0.5907, 0.8305, -0.0610, -0.9682,
0.7830, 1.5998, -0.6637, 0.1863, -1.2179, 0.0710, 0.6962,
-0.0442],
[ 0.0584, -0.5964, 0.8453, -1.3244, -0.0499, 2.7228, 0.6973,
-1.2440, 1.8854, 2.3017, -0.1034, -1.7281, -1.1495, -0.2478,
-0.8541],
[-0.2823, -0.3416, -1.3749, 0.2995, -0.1860, -1.1601, 0.9876,
0.2881, -1.8866, -1.3901, -1.1265, 1.2683, -0.7065, 0.0946,
0.3501],
[-0.1266, 1.2834, -1.2694, 1.1730, -0.3443, 1.4679, 2.1238,
0.2405, -0.4388, 0.8566, 1.8933, 0.4461, 2.2419, 0.6118,
-1.5001]]])
As you can see, view() just stitches the values together in memory order; it does not actually concatenate the heads.
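A minimal sketch of the concatenation the comment is after, assuming the h heads are stacked along dim 0 as in the example above: expose the head axis, move it next to the feature axis, and only then flatten.

import torch

h, batch, seq_len, d = 3, 1, 4, 5
a = torch.randn(h * batch, seq_len, d)        # heads stacked along dim 0, as above

# a.view(batch, -1, h * d) only reinterprets memory and mixes positions across heads;
# to concatenate the heads per position, reorder the axes before flattening:
concat = (a.view(h, batch, seq_len, d)        # [h, batch, seq_len, d]
            .permute(1, 2, 0, 3)              # [batch, seq_len, h, d]
            .contiguous()
            .view(batch, seq_len, h * d))     # [batch, seq_len, h*d]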
When DatasetIterater is constructed in utils.py, the check if len(batches) % self.n_batches != 0 raises an error because self.n_batches is 0. Is this related to the amount of data? My own task has fairly little training data.
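If I read this right, self.n_batches = len(batches) // batch_size comes out as 0 when the dataset holds fewer samples than one batch, so taking len(batches) % self.n_batches divides by zero; lowering batch_size below the dataset size avoids it. A sketch of the constructor with the remainder taken against batch_size instead (my rewrite, reusing the field names that appear in the iterator code quoted further below):

self.batch_size = batch_size
self.batches = batches
self.index = 0
self.n_batches = len(batches) // batch_size
# take the remainder against batch_size (not n_batches); this also avoids the
# division by zero when the dataset is smaller than a single batch, leaving
# n_batches = 0 plus one residue batch for the iterator to return
self.residue = len(batches) % batch_size != 0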
Hello, is the Sogou word-vector npz file the raw file, or does it need further processing? I would like to know the file format.
I am now working with English data and using Stanford's GloVe vectors, which come as a txt file; how should I adapt things?
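The .npz files used by the repo are trimmed embedding matrices rather than the raw vector files; utils.py builds them by looking each vocabulary entry up in a plain-text vector file (one token per line, the token followed by its values), which is also the format of a GloVe .txt file. A rough sketch of building such a matrix from GloVe, assuming the vocab pickle and the 'embeddings' key used elsewhere in the repo (the GloVe path is a placeholder):

import pickle as pkl
import numpy as np

emb_dim = 300                                             # must match the GloVe file you downloaded
vocab = pkl.load(open('THUCNews/data/vocab.pkl', 'rb'))
embeddings = np.random.rand(len(vocab), emb_dim)          # random init for words GloVe doesn't cover

with open('glove.6B.300d.txt', encoding='utf-8') as f:    # hypothetical GloVe file name
    for line in f:
        parts = line.rstrip().split(' ')
        word, values = parts[0], parts[1:]
        if word in vocab:
            embeddings[vocab[word]] = np.asarray(values, dtype='float32')

np.savez_compressed('THUCNews/data/embedding_glove', embeddings=embeddings)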
I hit this error when running utils.py; what is the sgns.sogou.char file supposed to be?
I noticed you use the Sogou and Tencent embedding sets; is there an equivalent for English?
Thanks a lot for sharing!
When I ran the short texts that come with the code, my results matched yours. I then used the model on long texts, still the THUCNews dataset, using the data from https://github.com/gaussic/text-classification-cnn-rnn, download link: https://pan.baidu.com/s/1hugrfRu password: qfud
Setting pad_size to 600 or 1000 doesn't help either; the log is as follows:
Loading data...
50000it [00:27, 1849.42it/s]
5000it [00:02, 1879.22it/s]
10000it [00:05, 1667.15it/s]
Time usage: 0:00:36
<bound method Module.parameters of Model(
(embedding): Embedding(6206, 300)
(convs): ModuleList(
(0): Conv2d(1, 256, kernel_size=(2, 300), stride=(1, 1))
(1): Conv2d(1, 256, kernel_size=(3, 300), stride=(1, 1))
(2): Conv2d(1, 256, kernel_size=(4, 300), stride=(1, 1))
)
(dropout): Dropout(p=0.5)
(fc): Linear(in_features=768, out_features=10, bias=True)
)>
Epoch [1/20]
Iter: 0, Train Loss: 2.2, Train Acc: 17.97%, Val Loss: 3.5, Val Acc: 10.00%, Time: 0:00:03 *
Iter: 100, Train Loss: 1.4e-06, Train Acc: 100.00%, Val Loss: 2.3e+01, Val Acc: 10.00%, Time: 0:00:41
Iter: 200, Train Loss: 1.2e+01, Train Acc: 0.00%, Val Loss: 2.1e+01, Val Acc: 10.00%, Time: 0:01:19
Iter: 300, Train Loss: 1.3e-06, Train Acc: 100.00%, Val Loss: 3.3e+01, Val Acc: 10.00%, Time: 0:01:58
Epoch [2/20]
Iter: 400, Train Loss: 1.8, Train Acc: 45.31%, Val Loss: 1.2e+01, Val Acc: 14.48%, Time: 0:02:35
Iter: 500, Train Loss: 2.4e-05, Train Acc: 100.00%, Val Loss: 2.2e+01, Val Acc: 10.00%, Time: 0:03:13
Iter: 600, Train Loss: 1.0, Train Acc: 68.75%, Val Loss: 3.2, Val Acc: 15.80%, Time: 0:03:51 *
Iter: 700, Train Loss: 0.06, Train Acc: 98.44%, Val Loss: 8.1, Val Acc: 10.00%, Time: 0:04:29
Epoch [3/20]
Iter: 800, Train Loss: 0.0094, Train Acc: 100.00%, Val Loss: 1.6e+01, Val Acc: 10.00%, Time: 0:05:06
Iter: 900, Train Loss: 2.1e+01, Train Acc: 0.00%, Val Loss: 2.3e+01, Val Acc: 10.00%, Time: 0:05:45
Iter: 1000, Train Loss: 0.25, Train Acc: 99.22%, Val Loss: 3.5, Val Acc: 11.32%, Time: 0:06:23
Iter: 1100, Train Loss: 3.9, Train Acc: 0.00%, Val Loss: 3.9, Val Acc: 13.44%, Time: 0:07:01
Epoch [4/20]
Iter: 1200, Train Loss: 0.0073, Train Acc: 100.00%, Val Loss: 1.1e+01, Val Acc: 11.70%, Time: 0:07:38
Iter: 1300, Train Loss: 0.47, Train Acc: 85.94%, Val Loss: 7.9, Val Acc: 18.34%, Time: 0:08:17
Iter: 1400, Train Loss: 0.063, Train Acc: 100.00%, Val Loss: 4.9, Val Acc: 12.82%, Time: 0:08:55
Iter: 1500, Train Loss: 0.47, Train Acc: 89.84%, Val Loss: 3.1, Val Acc: 23.12%, Time: 0:09:33 *
Epoch [5/20]
Iter: 1600, Train Loss: 0.0037, Train Acc: 100.00%, Val Loss: 7.4, Val Acc: 15.36%, Time: 0:10:10
Iter: 1700, Train Loss: 0.027, Train Acc: 100.00%, Val Loss: 1.3e+01, Val Acc: 10.34%, Time: 0:10:49
Iter: 1800, Train Loss: 3.7, Train Acc: 3.91%, Val Loss: 5.2, Val Acc: 11.98%, Time: 0:11:27
Iter: 1900, Train Loss: 0.1, Train Acc: 98.44%, Val Loss: 4.0, Val Acc: 24.80%, Time: 0:12:05
Epoch [6/20]
Iter: 2000, Train Loss: 0.72, Train Acc: 76.56%, Val Loss: 4.2, Val Acc: 28.34%, Time: 0:12:42
Iter: 2100, Train Loss: 0.034, Train Acc: 99.22%, Val Loss: 8.1, Val Acc: 13.46%, Time: 0:13:20
Iter: 2200, Train Loss: 0.4, Train Acc: 92.97%, Val Loss: 4.2, Val Acc: 33.14%, Time: 0:13:59
Iter: 2300, Train Loss: 0.2, Train Acc: 96.09%, Val Loss: 4.6, Val Acc: 22.86%, Time: 0:14:36
Epoch [7/20]
Iter: 2400, Train Loss: 0.16, Train Acc: 96.88%, Val Loss: 3.7, Val Acc: 35.28%, Time: 0:15:13
Iter: 2500, Train Loss: 0.014, Train Acc: 99.22%, Val Loss: 8.2, Val Acc: 11.10%, Time: 0:15:52
No optimization for a long time, auto-stopping...
/usr/local/lib/python3.5/dist-packages/sklearn/metrics/classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
Test Loss: 3.3, Test Acc: 23.61%
Precision, Recall and F1-Score...
precision recall f1-score support
a 0.8361 0.0510 0.0961 1000
b 1.0000 0.0140 0.0276 1000
c 1.0000 0.0020 0.0040 1000
d 0.0000 0.0000 0.0000 1000
e 0.0000 0.0000 0.0000 1000
f 0.1711 0.7450 0.2783 1000
g 0.8728 0.5490 0.6740 1000
h 0.0791 0.0510 0.0620 1000
i 0.2523 0.9490 0.3986 1000
j 0.0000 0.0000 0.0000 1000
avg / total 0.4211 0.2361 0.1541 10000
Confusion Matrix...
[[ 51 0 0 0 0 325 0 0 91 533]
[ 3 14 0 0 0 627 1 159 196 0]
[ 1 0 2 0 0 182 3 415 397 0]
[ 2 0 0 0 0 458 56 16 468 0]
[ 3 0 0 0 0 866 4 4 123 0]
[ 0 0 0 0 0 745 12 0 243 0]
[ 1 0 0 0 0 55 549 0 395 0]
[ 0 0 0 0 0 891 0 51 58 0]
[ 0 0 0 0 0 50 1 0 949 0]
[ 0 0 0 0 0 155 3 0 842 0]]
Time usage: 0:00:06
When the model runs on long-text data the loss fluctuates too much and it never really trains. Why is that, and how should I change things? Thanks!
I am currently using Flask, but I don't know how to load the model in Flask and expose it as a service. Please advise, thanks.
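For what it's worth, a bare-bones sketch of wrapping a trained model in a Flask endpoint; load_model and predict_one are hypothetical helpers you would write around the repo's existing preprocessing and Model classes, nothing like this ships with the repo:

from flask import Flask, jsonify, request

app = Flask(__name__)
model = load_model()          # hypothetical: build Model(config) and load_state_dict(...)

@app.route('/predict', methods=['POST'])
def predict():
    text = request.json.get('text', '')
    label = predict_one(model, text)    # hypothetical: tokenize -> pad -> tensor -> argmax
    return jsonify({'label': label})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)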
if pad_size:
    if len(token) < pad_size:
        token.extend([vocab.get(PAD)] * (pad_size - len(token)))
    else:
        token = token[:pad_size]
        seq_len = pad_size
# word to id
for word in token:
    words_line.append(vocab.get(word, vocab.get(UNK)))
contents.append((words_line, int(label), seq_len))
Here, vocab.get(word, vocab.get(UNK)) is applied to the PAD ids inserted by the padding above; those ids are not keys in the vocabulary, so they all end up as the UNK id.
use_word is misspelled as ues_word.
In the load_dataset function in utils.py, line 56:
token.extend([vocab.get(PAD)] * (pad_size - len(token)))
This pads token with the id of PAD, but the code below then maps words to ids:
for word in token:
    words_line.append(vocab.get(word, vocab.get(UNK)))
That way, because PAD's id is not in the vocabulary, every PAD becomes the UNK id, right? So I think line 56 should instead be:
token.extend([PAD] * (pad_size - len(token)))
TypeError: Parameter to MergeFrom() must be instance of same class: expected Summary got Summary. for field Event.summary
First, one detail: if I first run with --word False (the default, i.e. char level), the corresponding char vocab is generated. If I then switch to --word True to run at the word level, a word-level vocab is not generated again, because this code checks whether vocab.pkl already exists and simply loads it if so. You therefore have to delete vocab.pkl manually first.
if os.path.exists(config.vocab_path):
    vocab = pkl.load(open(config.vocab_path, 'rb'))
else:
    vocab = build_vocab(config.train_path, tokenizer=tokenizer, max_size=MAX_VOCAB_SIZE, min_freq=1)
    pkl.dump(vocab, open(config.vocab_path, 'wb'))
Also, when running with --word True, I found that the Chinese word segmentation on the test set is not very good; many entries are whole phrases. I may need to run a segmentation library over the data myself (see the sketch after the vocabulary excerpt below).
{'': 0, 'ThinkPad': 1, 'LG': 2, '2011': 3, 'CJ': 4, '明日股市三大猜想及应对策略': 5, 'HTC': 6, '不派息': 7, '图文-火箭常规训练': 8, '2010': 9, '每日晚间实力机构点评热门个股精选': 10, 'E3': 11, 'IdeaPad': 12, '十大机构看后市': 13, '股海导航': 14, '盘面解读:八大机构预测今日市场走向': 15, 'iPhone': 16 ...
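If word-level vocab entries like these are the issue, one option the comment hints at is to plug a real segmenter into the tokenizer instead of splitting on spaces; a sketch using jieba (jieba is my assumption, it is not part of the repo):

import jieba

# replace the space-splitting word-level tokenizer with a real segmenter;
# the char-level tokenizer can stay as it is in the repo
tokenizer = lambda x: list(jieba.cut(x))        # word level
# tokenizer = lambda x: [y for y in x]          # char level, as in utils.py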
parser = argparse.ArgumentParser(description='Chinese Text Classification')
parser.add_argument('--model', type=str, required=True, help='TextRNN_Att')
parser.add_argument('--embedding', default='pre_trained', type=str, help='random ')
parser.add_argument('--word', default=False, type=bool, help='True for word, False for char')
args = parser.parse_args()
usage: run.py [-h] --model MODEL [--embedding EMBEDDING] [--word WORD]
run.py: error: the following arguments are required: --model
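For the record, this usage error just means run.py was started without the mandatory flag; naming one of the models makes it go away, e.g.:

python run.py --model TextCNN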
Hi,
Thank you for your great work.
I have trained FastText with your code. Now I want to fine-tune it for a binary text classification task. Does the pre-trained model support fine-tuning, and how can I do that?
Thanks!
Dropout should not be applied at test time, right?
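Right; in PyTorch this is handled by switching the module to evaluation mode before testing, which disables dropout (the evaluate step in train_eval.py should already do this, but for reference):

model.eval()               # dropout layers act as identity in eval mode
with torch.no_grad():      # optionally also skip gradient tracking for the test pass
    outputs = model(test_batch)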
Hello, when defining FastText's network structure, why do you concatenate the word, bigram, and trigram embeddings along the embedding dimension (expanding the 300-dimensional vectors to 900) instead of along the sentence dimension (expanding the length-32 sentence to 96)? I have just started learning this, so I don't quite understand; I hope you can explain.
def forward(self, x):
    out_word = self.embedding(x[0])
    out_bigram = self.embedding_ngram2(x[2])
    out_trigram = self.embedding_ngram3(x[3])
    out = torch.cat((out_word, out_bigram, out_trigram), -1)
    out = out.mean(dim=1)
The embedding file name and size at the link don't match the ones referenced here...
if pad_size:
    if len(token) < pad_size:
        token.extend([PAD] * (pad_size - len(token)))
    else:
        token = token[:pad_size]
        seq_len = pad_size
If the text is longer than pad_size, token = token[:pad_size] here; doesn't that mean everything after pad_size is simply not used? I don't quite understand this part.
Should it be >= 2?
What should I do if my own texts are not split into nine classes and the labels only cover three or four classes?
def biGramHash(sequence, t, buckets):
    t1 = sequence[t - 1] if t - 1 >= 0 else 0
    return (t1 * 14918087) % buckets

def triGramHash(sequence, t, buckets):
    t1 = sequence[t - 1] if t - 1 >= 0 else 0
    t2 = sequence[t - 2] if t - 2 >= 0 else 0
    return (t2 * 14918087 * 18408749 + t1 * 14918087) % buckets
Hello, a few questions:
1. buckets = self.n_gram_vocab = 250499; where does the value 250499 come from?
2. Where do the values 4918087, 14918087 and 18408749 come from?
3. I don't quite understand this way of computing the return value; could you give me a reference link to study from?
Thank you very much!
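For context (my reading, not the author's reply): this looks like the n-gram hashing trick from the fastText paper. 250499 is simply the configured number of n-gram buckets, and the large constants act as mixing multipliers so that different (t1, t2) pairs tend to land in different buckets before the modulo. A tiny worked example:

buckets = 250499                     # n_gram_vocab in the config; any sufficiently large value works
sequence = [101, 7, 52, 52]          # token ids of one padded sentence (made-up values)

t = 2
t1 = sequence[t - 1]                 # 7
t2 = sequence[t - 2]                 # 101
bigram_bucket = (t1 * 14918087) % buckets                               # bucket for the bigram ending at t
trigram_bucket = (t2 * 14918087 * 18408749 + t1 * 14918087) % buckets   # bucket for the trigram ending at t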
Loading data...
Vocab size: 4762
180000it [02:11, 1369.46it/s]
10000it [00:07, 1290.27it/s]
10000it [00:11, 862.36it/s]
Time usage: 0:02:31
<bound method Module.parameters of Model(
(embedding): Embedding(4762, 300)
(lstm): LSTM(300, 128, num_layers=2, batch_first=True, dropout=0.5, bidirectional=True)
(tanh1): Tanh()
(tanh2): Tanh()
(fc1): Linear(in_features=256, out_features=64, bias=True)
(fc): Linear(in_features=64, out_features=10, bias=True)
)>
Traceback (most recent call last):
File "run.py", line 53, in
train(config, model, train_iter, dev_iter, test_iter)
File "/home/zgy/wll/Chinese-Text-Classification-Pytorch-master/train_eval.py", line 40, in train
writer = SummaryWriter(log_dir=config.log_path + '/' + time.strftime('%m-%d_%H.%M', time.localtime()))
AttributeError: 'Config' object has no attribute 'log_path'
Hello, your code has been a great help. About the Transformer I'd like to ask: a [batch_size, seq_len, embed_size] tensor is still [batch_size, seq_len, embed_size] after the Transformer encoder; can it then be summed over the second dimension to get [batch_size, embed_size] and used for the classification task?
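That pooling step is a common pattern; a minimal sketch of what the question describes (sum or mean over the sequence dimension followed by a linear head — my illustration, not necessarily how this repo's Transformer model finishes):

import torch
import torch.nn as nn

batch_size, seq_len, embed_size, num_classes = 128, 32, 300, 10
encoded = torch.randn(batch_size, seq_len, embed_size)   # stand-in for the encoder output

pooled = encoded.sum(dim=1)                # or encoded.mean(dim=1) -> [batch_size, embed_size]
classifier = nn.Linear(embed_size, num_classes)
logits = classifier(pooled)                # [batch_size, num_classes]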
def forward(self, x):
    x, _ = x
    out = self.embedding(x)  # [batch_size, seq_len, embedding] = [128, 32, 300]
    out, _ = self.lstm(out)
    out = self.fc(out[:, -1, :])  # hidden state at the sentence's last time step
    return out
Shouldn't out[:, -1, :] here be the output corresponding to the last position (word) of the sentence? The comment calls it the hidden state.
Would you consider adding Chinese implementations of RoBERTa and ALBERT?
I used a new dataset: vocabulary 5300, training set 750000, validation set 8000, test set 83599. The results are not good.
Epoch [1/20]
Iter: 0, Train Loss: 3.7, Train Acc: 0.00%, Val Loss: 3.1, Val Acc: 4.48%, Time: 0:01:14 *
Iter: 100, Train Loss: 2.7e-05, Train Acc: 100.00%, Val Loss: 2.3e+01, Val Acc: 4.46%, Time: 0:05:59
Iter: 200, Train Loss: 9.7e-08, Train Acc: 100.00%, Val Loss: 2.4e+01, Val Acc: 4.47%, Time: 0:10:55
Iter: 300, Train Loss: 4.6e-06, Train Acc: 100.00%, Val Loss: 3.6e+01, Val Acc: 0.96%, Time: 0:15:56
Iter: 400, Train Loss: 0.0001, Train Acc: 100.00%, Val Loss: 2.3e+01, Val Acc: 2.36%, Time: 0:20:57
Iter: 500, Train Loss: 0.0016, Train Acc: 100.00%, Val Loss: 9.8, Val Acc: 18.74%, Time: 0:26:16
Iter: 600, Train Loss: 0.0025, Train Acc: 100.00%, Val Loss: 1.1e+01, Val Acc: 18.76%, Time: 0:32:22
Iter: 700, Train Loss: 0.00082, Train Acc: 100.00%, Val Loss: 1.2e+01, Val Acc: 18.76%, Time: 0:38:37
Iter: 800, Train Loss: 0.00033, Train Acc: 100.00%, Val Loss: 1.4e+01, Val Acc: 18.75%, Time: 0:44:52
Iter: 900, Train Loss: 5e-06, Train Acc: 100.00%, Val Loss: 1.5e+01, Val Acc: 18.75%, Time: 0:51:07
Iter: 1000, Train Loss: 1e-06, Train Acc: 100.00%, Val Loss: 1.5e+01, Val Acc: 18.75%, Time: 0:57:40
Hi, I'd like to ask: after substituting my own data, running FastText gives the following error:
RuntimeError: Assertion `cur_target >= 0 && cur_target < n_classes' failed. at c:\n\pytorch_1559129895673\work\aten\src\thnn\generic/ClassNLLCriterion.c:93
This error is usually caused by num_classes not matching the number of labels, or by labels not starting from 0. I noticed this and confirmed my data's labels match class.txt, but the error persists. How can I fix it?
Hi, a question:
In utils.py, main() takes the characters/words that appear in the training set and regenerates vocab.pkl and the corresponding embedding npz file. Characters/words that never appear in the training set therefore become UNK: they are not in vocab.pkl and have no pretrained embedding.
Why is this step necessary?
Without it, using all characters/words from the Sogou-news pretrained embeddings, the model can also be trained and used for prediction without problems.
1. Is restricting to the characters/words seen in training meant to improve accuracy, or is there another reason?
2. Also, vocab.pkl adds the extra tokens UNK and PAD, but the embedding npz file does not seem to contain embedding vectors at the ids of UNK and PAD?
Thank you.
@649453932
Something along these lines:
if __name__ == '__main__':
    cnn_model = CnnModel()
    test_demo = ['三星ST550以全新的拍摄方式超越了以往任何一款数码相机',
                 '热火vs骑士前瞻:皇帝回乡二番战 东部次席唾手可得新浪体育讯北京时间3月30日7:00']
    for i in test_demo:
        print(cnn_model.predict(i))
[zgy@localhost Chinese-Text-Classification-Pytorch-master]$ python run.py --model TextCNN
Traceback (most recent call last):
File "run.py", line 30, in
x = import_module('models.' + model_name)
File "/home/zgy/miniconda3/lib/python3.6/importlib/init.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 994, in _gcd_import
File "", line 971, in _find_and_load
File "", line 941, in _find_and_load_unlocked
File "", line 219, in _call_with_frames_removed
File "", line 994, in _gcd_import
File "", line 971, in _find_and_load
File "", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'models'
def __next__(self):
    if self.residue and self.index == self.n_batches:
        batches = self.batches[self.index * self.batch_size: len(self.batches)]
        self.index += 1
        batches = self._to_tensor(batches)
        return batches
    elif self.index > self.n_batches:
        self.index = 0
        raise StopIteration
    else:
        batches = self.batches[self.index * self.batch_size: (self.index + 1) * self.batch_size]
        self.index += 1
        batches = self._to_tensor(batches)
        return batches
Shouldn't this be elif self.index >= self.n_batches?
As in the title: following the author's workflow I can now train and save a model from the command line, but if I then want to predict on new data, how do I proceed?
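The repo doesn't ship a predict script, but a rough single-text predictor built on the saved weights and vocab could look like the sketch below. The '<PAD>'/'<UNK>' tokens, the config attributes and the (ids, seq_len) input format mirror utils.py and the models, but treat all of it as my assumptions; FastText would additionally need its bigram/trigram features.

import pickle as pkl
import torch

def predict(text, model, config, vocab, pad_size=32):
    # char-level tokenization, padding/truncation and id lookup, mirroring utils.load_dataset
    tokens = [ch for ch in text][:pad_size]
    seq_len = len(tokens)
    tokens += ['<PAD>'] * (pad_size - len(tokens))
    ids = [vocab.get(t, vocab.get('<UNK>')) for t in tokens]
    x = torch.LongTensor([ids]).to(config.device)
    lengths = torch.LongTensor([seq_len]).to(config.device)
    model.eval()
    with torch.no_grad():
        out = model((x, lengths))
    return config.class_list[int(out.argmax(dim=1))]

# usage sketch:
# vocab = pkl.load(open(config.vocab_path, 'rb'))
# model.load_state_dict(torch.load(config.save_path, map_location=config.device))
# print(predict('三星ST550以全新的拍摄方式超越了以往任何一款数码相机', model, config, vocab))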
Traceback (most recent call last):
File "/home/mli/.pyenv/versions/miniconda-latest/envs/li/lib/python3.7/site-packages/numpy/lib/npyio.py", line 460, in load
return pickle.load(fid, **pickle_kwargs)
_pickle.UnpicklingError: invalid load key, '\x0a'.
File "/home/mli/.pyenv/versions/miniconda-latest/envs/li/lib/python3.7/site-packages/numpy/lib/npyio.py", line 463, in load
"Failed to interpret file %s as a pickle" % repr(file))
OSError: Failed to interpret file 'THUCNews/data/embedding_SougouNews.npz' as a pickle
Could someone help explain?
I can't find the BiLSTM model; has it been removed?
Hello, is there an inference (forward) module available for use?
I ran it on my own dataset and the accuracy isn't great. I'd like to look at the misclassified examples in detail; does anyone have ideas on how to go about this?
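One way to do that is to collect predictions over the test iterator and line them up with the raw test file (a sketch under the assumptions that the test iterator is built without shuffling and that config exposes test_path and class_list):

import torch

model.eval()
preds, trues = [], []
with torch.no_grad():
    for texts, labels in test_iter:          # texts is the (ids, seq_len) pair the models expect
        out = model(texts)
        preds.extend(out.argmax(dim=1).tolist())
        trues.extend(labels.tolist())

lines = open(config.test_path, encoding='utf-8').read().splitlines()
for line, p, t in zip(lines, preds, trues):
    if p != t:
        print(config.class_list[t], '->', config.class_list[p], '|', line.split('\t')[0])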