lujunru / sentences_pair_similarity_calculation_siamese_lstm

A Keras Implementation of Attention-based Siamese Manhattan LSTM

Topics: keras, siamese-lstm, attention, manhattan-distance
sentences_pair_similarity_calculation_siamese_lstm's Introduction

Siamese LSTM Network (Siamese-LSTM)

This project implements sentence-pair similarity computation with a Siamese LSTM network, an attention mechanism, and the Manhattan distance.
The Chinese training data is the Ant Financial sentence-pair dataset: about 40,000 pairs with a positive-to-negative ratio of 1:3.6. The English training data is the Quora sentence-pair dataset from Kaggle: about 400,000 pairs with a positive-to-negative ratio of 1:1.7. A translated dataset is also included: the Quora data translated into Chinese with Google Translate.
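The Manhattan-distance scoring used in MaLSTM-style models maps the L1 distance between the two LSTM sentence encodings into (0, 1]. A minimal NumPy sketch of that similarity function (illustrative names, not the repository's actual code):

```python
import numpy as np

def manhattan_similarity(h1, h2):
    # exp(-||h1 - h2||_1): 1.0 for identical encodings, tending to 0 as they diverge
    return float(np.exp(-np.sum(np.abs(np.asarray(h1) - np.asarray(h2)))))

# Identical encodings score 1.0; distant encodings score near 0.
same = manhattan_similarity([0.2, 0.5, 0.1], [0.2, 0.5, 0.1])
far = manhattan_similarity([0.0, 0.0, 0.0], [2.0, 2.0, 2.0])
print(same, far)  # 1.0 and roughly 0.0025 (= exp(-6))
```

Because the output is already squashed into (0, 1], it can be trained directly against 0/1 similarity labels without a final sigmoid.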

References

Usage

Training

$ python3 train.py
# type cn for the Chinese data or en for the English data

Validation

$ python3 predict.py
# type cn for the Chinese data or en for the English data

Prediction

$ python3 score.py
# type cn for the Chinese data or en for the English data

Results

Given the class ratios, the majority-class baseline accuracy is 0.783 for the Chinese training set and 0.630 for the English and translated data.

Ten-fold cross-validation accuracy on the training set:

Data        Epochs   Random vectors   Pretrained vectors
Chinese     5        0.778            0.789 (CN120G)
English     5        0.774            0.771 (Google)
Translated  5        0.755            0.756 (CN120G)
Chinese     8        0.777            0.787 (CN120G)
English     8        0.774            0.778 (Google)
Translated  8        0.786            0.786 (CN120G)

Summary: (1) pretrained word vectors barely change the results; (2) training on the Chinese data brings almost no improvement, in sharp contrast to English. This is likely because the Ant Financial sentence pairs are too similar to one another, or because the dataset is too small; the experiments on the translated dataset support this.
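The baseline figures above are simply the majority-class accuracy implied by each dataset's positive-to-negative ratio:

```python
def majority_baseline(pos, neg):
    # accuracy of always predicting the majority class, given a pos:neg ratio
    return max(pos, neg) / (pos + neg)

print(f"{majority_baseline(1, 3.6):.3f}")  # 0.783  (Ant Financial, 1:3.6)
print(f"{majority_baseline(1, 1.7):.3f}")  # 0.630  (Quora, 1:1.7)
```

A trained model is only useful to the extent that it beats these numbers, which is why the Chinese results (0.777 to 0.789 against a 0.783 baseline) indicate almost no learning.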

sentences_pair_similarity_calculation_siamese_lstm's People

Contributors

lujunru


sentences_pair_similarity_calculation_siamese_lstm's Issues

Dataset request

The file referenced by embedding_path = 'CnCorpus-vectors-negative64.bin'.
Could you please find time to upload the dataset? If it is too large, a Baidu Cloud link would work.
Thanks!

Some questions

Thanks for sharing! Don't the "translated" and "Chinese" datasets here both refer to the Chinese version of the Quora data?
Also, did you try translating the Ant Financial data as well?

A Keras Implementation of Attention_based Siamese Manhattan LSTM

Hey Junru

My name is Sun Haoran. While looking into how to build a Siamese model, I found your code on GitHub. You implemented it in Keras. I basically copied your code and ran it locally, with some differences.
Instead of an embedding layer, I feed the input directly into:
activations = Bidirectional(LSTM(n_hidden, return_sequences=True), merge_mode='concat')(_input)
activations = Bidirectional(LSTM(n_hidden, return_sequences=True), merge_mode='concat')(activations)
I will explain why shortly. I turned the input data directly into [[0.21, 0.2135, ..., 0.2354], [0.33, 0.26, ..., 0.25], [0.235, 0.235, ..., 0.235]] using glove100d. In short, I moved the embedding computation outside the model and converted words to vectors directly.
Validation accuracy came out around 80%, about the same as your original code. I then saved the model and, since I have glove100d, fed in a few sentences I wrote on the spot, converted to the same input shape:
model = load_model('my_10length_model.h5', custom_objects={'ManDist': ManDist})
texttarget = ['How do I make friends?']
text1 = ['How to make friends?']
text2 = ['Can you tell me how to play the piano?']
text3 = ['Can you tell me the truth of the computer games?']
texttarget = prepare_data(texttarget,MAX_SEQ_LENGTH,embeddings_index)
candidate_1 = prepare_data(text1,MAX_SEQ_LENGTH,embeddings_index)
candidate_2 = prepare_data(text2,MAX_SEQ_LENGTH,embeddings_index)
candidate_3 = prepare_data(text3,MAX_SEQ_LENGTH,embeddings_index)

print(texttarget)

result1 = model.predict([texttarget,candidate_1])
result2 = model.predict([texttarget,candidate_2])
result3 = model.predict([texttarget,candidate_3])
print(result1,result2,result3)
The results were:
[[0.5311898]] [[0.60761184]] [[0.42349315]]
The prepare_data function converts an arbitrary sentence, via GloVe, into a list of MAX_SEQ_LENGTH 100-dimensional vectors.
From the results, the model considers the second sentence (the piano one) the most similar to my target sentence. This suggests the model does not really generalize and cannot compute similarity for arbitrary sentence pairs.
The reason I am opening this issue is to ask: do you have a better way to improve the model's similarity accuracy on sentence pairs outside the training set?

cheers
Haoran
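The prepare_data helper described in this issue can be sketched as follows (a hypothetical reconstruction: the tokenization, padding policy, and names are assumptions, not the issue author's actual code):

```python
import numpy as np

MAX_SEQ_LENGTH = 10
EMBEDDING_DIM = 100  # glove100d

def prepare_data(sentences, max_len, embeddings_index):
    # Map each sentence to a (max_len, EMBEDDING_DIM) matrix of word vectors,
    # zero-padding short sentences and truncating long ones.
    batch = np.zeros((len(sentences), max_len, EMBEDDING_DIM), dtype="float32")
    for i, sentence in enumerate(sentences):
        tokens = sentence.lower().rstrip("?.!").split()[:max_len]
        for j, word in enumerate(tokens):
            batch[i, j] = embeddings_index.get(word, np.zeros(EMBEDDING_DIM))
    return batch

# Toy embedding table standing in for the GloVe index.
toy_index = {"how": np.ones(100), "to": np.full(100, 0.5)}
x = prepare_data(["How to make friends?"], MAX_SEQ_LENGTH, toy_index)
print(x.shape)  # (1, 10, 100)
```

Out-of-vocabulary words fall back to the zero vector here, which is one plausible reason such a model scores unseen sentence pairs poorly.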
