lujunru / sentences_pair_similarity_calculation_siamese_lstm

A Keras Implementation of Attention-based Siamese Manhattan LSTM

Topics: keras, siamese-lstm, attention, manhattan-distance
sentences_pair_similarity_calculation_siamese_lstm's Introduction

Siamese LSTM Network (Siamese-LSTM)

This project implements sentence-pair similarity computation with a Siamese LSTM network, an attention mechanism, and the Manhattan distance.
The Chinese training data is the Ant Financial sentence-pair dataset: about 40,000 pairs with a positive-to-negative ratio of 1:3.6. The English training data is the Quora sentence-pair dataset from Kaggle: about 400,000 pairs with a positive-to-negative ratio of 1:1.7. A translated dataset is also included: the Quora data translated into Chinese with Google Translate.
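The Manhattan-distance scoring used in MaLSTM-style models maps the L1 distance between the two LSTM sentence encodings into (0, 1]. A minimal NumPy sketch of that similarity function (illustrative names, not the repository's actual code):

```python
import numpy as np

def manhattan_similarity(h1, h2):
    # exp(-||h1 - h2||_1): 1.0 for identical encodings, tending to 0 as they diverge
    return float(np.exp(-np.sum(np.abs(np.asarray(h1) - np.asarray(h2)))))

# Identical encodings score 1.0; distant encodings score near 0.
same = manhattan_similarity([0.2, 0.5, 0.1], [0.2, 0.5, 0.1])
far = manhattan_similarity([0.0, 0.0, 0.0], [2.0, 2.0, 2.0])
print(same, far)  # 1.0 and roughly 0.0025 (= exp(-6))
```

Because the output is already squashed into (0, 1], it can be trained directly against 0/1 similarity labels without a final sigmoid.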

References

Usage

Training

$ python3 train.py
# type cn for the Chinese data or en for the English data

Validation

$ python3 predict.py
# type cn for the Chinese data or en for the English data

Prediction

$ python3 score.py
# type cn for the Chinese data or en for the English data

Results

Given the class ratios, the majority-class baseline accuracy is 0.783 for the Chinese training set and 0.630 for the English and translated data.

Ten-fold cross-validation accuracy on the training set:

Data        Epochs   Random vectors   Pretrained vectors
Chinese     5        0.778            0.789 (CN120G)
English     5        0.774            0.771 (Google)
Translated  5        0.755            0.756 (CN120G)
Chinese     8        0.777            0.787 (CN120G)
English     8        0.774            0.778 (Google)
Translated  8        0.786            0.786 (CN120G)

Summary: (1) pretrained word vectors barely change the results; (2) training on the Chinese data brings almost no improvement, in sharp contrast to English. This is likely because the Ant Financial sentence pairs are too similar to one another, or because the dataset is too small; the experiments on the translated dataset support this.
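The baseline figures above are simply the majority-class accuracy implied by each dataset's positive-to-negative ratio:

```python
def majority_baseline(pos, neg):
    # accuracy of always predicting the majority class, given a pos:neg ratio
    return max(pos, neg) / (pos + neg)

print(f"{majority_baseline(1, 3.6):.3f}")  # 0.783  (Ant Financial, 1:3.6)
print(f"{majority_baseline(1, 1.7):.3f}")  # 0.630  (Quora, 1:1.7)
```

A trained model is only useful to the extent that it beats these numbers, which is why the Chinese results (0.777 to 0.789 against a 0.783 baseline) indicate almost no learning.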

sentences_pair_similarity_calculation_siamese_lstm's People

Contributors

lujunru


sentences_pair_similarity_calculation_siamese_lstm's Issues

Dataset request

The file referenced by embedding_path = 'CnCorpus-vectors-negative64.bin'.
Could you please find time to upload the dataset? If it is too large, a Baidu Cloud link would work.
Thanks!

Some questions

Thanks for sharing! Don't the "translated" and "Chinese" datasets here both refer to the Chinese version of the Quora data?
Also, did you try translating the Ant Financial data as well?

A Keras Implementation of Attention_based Siamese Manhattan LSTM

Hey Junru

My name is Sun Haoran. While looking into how to build a Siamese model, I found your code on GitHub. You implemented it in Keras. I basically copied your code and ran it locally, with some differences.
Instead of an embedding layer, I feed the input directly into:
activations = Bidirectional(LSTM(n_hidden, return_sequences=True), merge_mode='concat')(_input)
activations = Bidirectional(LSTM(n_hidden, return_sequences=True), merge_mode='concat')(activations)
I will explain why shortly. I turned the input data directly into [[0.21, 0.2135, ..., 0.2354], [0.33, 0.26, ..., 0.25], [0.235, 0.235, ..., 0.235]] using glove100d. In short, I moved the embedding computation outside the model and converted words to vectors directly.
Validation accuracy came out around 80%, about the same as your original code. I then saved the model and, since I have glove100d, fed in a few sentences I wrote on the spot, converted to the same input shape:
model = load_model('my_10length_model.h5', custom_objects={'ManDist': ManDist})
texttarget = ['How do I make friends?']
text1 = ['How to make friends?']
text2 = ['Can you tell me how to play the piano?']
text3 = ['Can you tell me the truth of the computer games?']
texttarget = prepare_data(texttarget,MAX_SEQ_LENGTH,embeddings_index)
candidate_1 = prepare_data(text1,MAX_SEQ_LENGTH,embeddings_index)
candidate_2 = prepare_data(text2,MAX_SEQ_LENGTH,embeddings_index)
candidate_3 = prepare_data(text3,MAX_SEQ_LENGTH,embeddings_index)

print(texttarget)

result1 = model.predict([texttarget,candidate_1])
result2 = model.predict([texttarget,candidate_2])
result3 = model.predict([texttarget,candidate_3])
print(result1,result2,result3)
The results were:
[[0.5311898]] [[0.60761184]] [[0.42349315]]
The prepare_data function converts an arbitrary sentence, via GloVe, into a list of MAX_SEQ_LENGTH 100-dimensional vectors.
From the results, the model considers the second sentence (the piano one) the most similar to my target sentence. This suggests the model does not really generalize and cannot compute similarity for arbitrary sentence pairs.
The reason I am opening this issue is to ask: do you have a better way to improve the model's similarity accuracy on sentence pairs outside the training set?

cheers
Haoran
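The prepare_data helper described in this issue can be sketched as follows (a hypothetical reconstruction: the tokenization, padding policy, and names are assumptions, not the issue author's actual code):

```python
import numpy as np

MAX_SEQ_LENGTH = 10
EMBEDDING_DIM = 100  # glove100d

def prepare_data(sentences, max_len, embeddings_index):
    # Map each sentence to a (max_len, EMBEDDING_DIM) matrix of word vectors,
    # zero-padding short sentences and truncating long ones.
    batch = np.zeros((len(sentences), max_len, EMBEDDING_DIM), dtype="float32")
    for i, sentence in enumerate(sentences):
        tokens = sentence.lower().rstrip("?.!").split()[:max_len]
        for j, word in enumerate(tokens):
            batch[i, j] = embeddings_index.get(word, np.zeros(EMBEDDING_DIM))
    return batch

# Toy embedding table standing in for the GloVe index.
toy_index = {"how": np.ones(100), "to": np.full(100, 0.5)}
x = prepare_data(["How to make friends?"], MAX_SEQ_LENGTH, toy_index)
print(x.shape)  # (1, 10, 100)
```

Out-of-vocabulary words fall back to the zero vector here, which is one plausible reason such a model scores unseen sentence pairs poorly.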
