pengming617 / text_matching

466.0 9.0 119.0 10.25 MB

Text-matching models including DSSM, ESIM, ABCNN, and BIMPM; the dataset is the official LCQMC data.

Python 100.00%

text_matching's Introduction

This project implements several semantic matching models:

DSSM Learning Deep Structured Semantic Models for Web Search using Clickthrough Data

ESIM Enhanced LSTM for Natural Language Inference

Pair-CNN Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks

ABCNN ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs

BIMPM Bilateral Multi-Perspective Matching for Natural Language Sentences

The papers are included in the repository folder.

text_matching's Issues

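Dataset

Hello, your code is very clearly written, and infer.py is exactly what I need. Where did you obtain the dataset txt files? I found a CSV version in another repository but am not sure how to convert it. I would appreciate a hint, thank you.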

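Hello, thank you very much for sharing the code. Before running infer.py, how do you adjust the threshold that decides whether the output is 0 or 1? And does the second item in the output, array([[0.768, 0.231]]), have some other meaning?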

Originally posted by @ljh9999 in #1 (comment)
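The question above is not answered in this thread. As a minimal sketch only, assuming the printed array holds softmax probabilities in the order [P(label 0), P(label 1)] (an assumption, not confirmed by the repository), a decision threshold could be applied to the second column after inference. The function name `predict_label` and the `threshold` parameter are hypothetical.

```python
import numpy as np

def predict_label(probs, threshold=0.5):
    """Return 1 where the assumed P(label=1) column reaches the threshold, else 0.

    Assumption: probs has shape [batch, 2] with columns [P(label 0), P(label 1)].
    """
    probs = np.asarray(probs, dtype=float)
    return (probs[:, 1] >= threshold).astype(int)

# The array quoted in the question above.
probs = np.array([[0.768, 0.231]])
print(predict_label(probs))                 # default threshold of 0.5
print(predict_label(probs, threshold=0.2))  # a lower threshold flips the decision
```

Lowering the threshold trades precision for recall on the positive class, so it would normally be chosen on a validation set rather than fixed by hand.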

Would you mind sharing the scores of the models you implemented?

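Validation accuracy stays stuck around 0.5

I ran your ESIM model on your data. Training accuracy reaches above 0.9, but validation accuracy has been around 0.5 from the very start. What is going on?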

Found Inf or NaN global norm. : Tensor had NaN values

ub16c9@ub16c9-gpu:/media/ub16c9/fcd84300-9270-4bbd-896a-5e04e79203b7/ub16_prj/text_matching$ PYTHONPATH=. python3.6 abcnn/train.py
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.791 seconds.
Prefix dict has been built succesfully.
{'1': 138574, '0': 100192}
WARNING:tensorflow:From /media/ub16c9/fcd84300-9270-4bbd-896a-5e04e79203b7/ub16_prj/text_matching/data_prepare.py:54: VocabularyProcessor.init (from tensorflow.contrib.learn.python.learn.preprocessing.text) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tensorflow/transform or tf.data.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/contrib/learn/python/learn/preprocessing/text.py:154: CategoricalVocabulary.init (from tensorflow.contrib.learn.python.learn.preprocessing.categorical_vocabulary) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tensorflow/transform or tf.data.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/contrib/learn/python/learn/preprocessing/text.py:170: tokenizer (from tensorflow.contrib.learn.python.learn.preprocessing.text) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tensorflow/transform or tf.data.
{'1': 4402, '0': 4400}
WARNING:tensorflow:From /media/ub16c9/fcd84300-9270-4bbd-896a-5e04e79203b7/ub16_prj/text_matching/abcnn/abcnn_mdoel.py:99: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
2019-06-21 12:59:01.301229: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-06-21 12:59:01.402848: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-21 12:59:01.403378: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6575
pciBusID: 0000:01:00.0
totalMemory: 10.92GiB freeMemory: 8.62GiB
2019-06-21 12:59:01.403403: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-06-21 12:59:01.801177: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-21 12:59:01.801215: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-06-21 12:59:01.801223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-06-21 12:59:01.801521: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8324 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
training 1>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
0it [00:00, ?it/s]2019-06-21 12:59:04.872182: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x7f5822809400 = {1, 0} Found Inf or NaN global norm.

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
[[{{node model/VerifyFinite/CheckNumerics}} = CheckNumericsT=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
[[{{node model/clip_by_global_norm/mul_1/_359}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_5200_model/clip_by_global_norm/mul_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "abcnn/train.py", line 122, in <module>
train.trainModel()
File "abcnn/train.py", line 77, in trainModel
_, cost, accuracy = sess.run([model.train_op, model.loss, model.accuracy], feed_dict)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
[[node model/VerifyFinite/CheckNumerics (defined at /media/ub16c9/fcd84300-9270-4bbd-896a-5e04e79203b7/ub16_prj/text_matching/abcnn/abcnn_mdoel.py:226) = CheckNumericsT=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
[[{{node model/clip_by_global_norm/mul_1/_359}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_5200_model/clip_by_global_norm/mul_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Caused by op 'model/VerifyFinite/CheckNumerics', defined at:
File "abcnn/train.py", line 122, in <module>
train.trainModel()
File "abcnn/train.py", line 56, in trainModel
d0=con.embedding_size, di=50, num_classes=2, num_layers=2)
File "/media/ub16c9/fcd84300-9270-4bbd-896a-5e04e79203b7/ub16_prj/text_matching/abcnn/abcnn_mdoel.py", line 226, in __init__
grads, _ = tf.clip_by_global_norm(tf.gradients(self.loss, tvars), 5)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/clip_ops.py", line 265, in clip_by_global_norm
"Found Inf or NaN global norm.")
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/numerics.py", line 47, in verify_tensor_all_finite
verify_input = array_ops.check_numerics(t, message=msg)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 817, in check_numerics
"CheckNumerics", tensor=tensor, message=message, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had NaN values
[[node model/VerifyFinite/CheckNumerics (defined at /media/ub16c9/fcd84300-9270-4bbd-896a-5e04e79203b7/ub16_prj/text_matching/abcnn/abcnn_mdoel.py:226) = CheckNumericsT=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
[[{{node model/clip_by_global_norm/mul_1/_359}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_5200_model/clip_by_global_norm/mul_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

ub16c9@ub16c9-gpu:/media/ub16c9/fcd84300-9270-4bbd-896a-5e04e79203b7/ub16_prj/text_matching$
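The traceback shows the error being raised by the finiteness check inside tf.clip_by_global_norm, which means at least one gradient already contained NaN or Inf before clipping. The thread does not identify the root cause. The NumPy sketch below only illustrates, under that reading, how a single non-finite gradient poisons the global norm and how one might locate the offending tensor; `clip_by_global_norm` here is a simplified re-implementation for illustration and `non_finite_indices` is a hypothetical helper, neither is the repository's code.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Simplified NumPy version of global-norm clipping (illustration only)."""
    # Global norm is the L2 norm over all gradient entries taken together.
    global_norm = np.sqrt(sum(np.sum(np.square(g)) for g in grads))
    if not np.isfinite(global_norm):
        # Mirrors the check that produced the message in the log above.
        raise FloatingPointError("Found Inf or NaN global norm.")
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

def non_finite_indices(grads):
    """Return the positions of gradients that contain NaN or Inf."""
    return [i for i, g in enumerate(grads) if not np.all(np.isfinite(g))]

good = [np.array([3.0, 4.0]), np.array([0.0])]
clipped, norm = clip_by_global_norm(good, clip_norm=1.0)
print(norm, clipped[0])

bad = [np.array([3.0, 4.0]), np.array([np.nan])]
print(non_finite_indices(bad))  # points at the second gradient
```

In practice the same idea applies to the real graph: checking the loss and each gradient for finiteness on the first batch narrows down which operation produced the NaN.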

Was this dataset published with the official owners' permission? If not, I suggest removing it.

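How labels are built during data preprocessing

Hi,
I noticed that during data preprocessing the label is first initialized to [0 0]; if the sample's label is 1 it becomes [0 1], and if the label is 0 it becomes [1 0]. Is there some trick behind doing it this way?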
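The scheme described above matches ordinary one-hot encoding, where the class id is used as the index of the single 1. The sketch below reproduces that mapping in NumPy; `to_one_hot` is a hypothetical name used for illustration and is not taken from the repository's data_prepare.py.

```python
import numpy as np

def to_one_hot(label, num_classes=2):
    """Start from all zeros and set the position given by the class id to 1."""
    vec = np.zeros(num_classes, dtype=int)
    vec[int(label)] = 1  # int() also accepts string labels such as '0' and '1'
    return vec

print(to_one_hot(1))    # label 1
print(to_one_hot('0'))  # label 0
```

With this layout, taking the argmax of a two-way softmax output recovers the original 0/1 label directly.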

What's the version of tensorflow?

The train.py provided for the DSSM model is wrong; could the author update it?

The train.py under dssm appears to be Pair-CNN's train.py.

ESIM model is different from the original paper

since the concatenation we described above to compute m_a and m_b can significantly increase the overall parameter size to potentially overfit the models.

Before feeding $m_a$ or $m_b$ into the BiLSTM, the authors reduce the dimension of these concatenated vectors, as mentioned in the original paper.
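As a rough illustration of the step this issue refers to, the sketch below builds the enhanced vector and then projects it to a smaller size with a single ReLU layer before it would go into the composition BiLSTM. The composition of m_a as [a_bar; a_hat; a_bar - a_hat; a_bar * a_hat] and the one-layer ReLU projection follow my reading of the ESIM paper; the function name, weight shapes, and random initialization are assumptions made for the example.

```python
import numpy as np

def enhance_and_project(a_bar, a_hat, weight, bias):
    """Concatenate local inference features, then reduce their dimension.

    a_bar, a_hat: [batch, length, d]
    weight:       [4 * d, hidden]  (hypothetical projection parameters)
    bias:         [hidden]
    """
    # Enhanced local inference features, four times the encoder width.
    m_a = np.concatenate([a_bar, a_hat, a_bar - a_hat, a_bar * a_hat], axis=-1)
    # One feed-forward layer with ReLU shrinks 4*d back to `hidden`.
    return np.maximum(m_a @ weight + bias, 0.0)

rng = np.random.default_rng(0)
batch, length, d, hidden = 2, 5, 8, 6
a_bar = rng.normal(size=(batch, length, d))
a_hat = rng.normal(size=(batch, length, d))
weight = rng.normal(size=(4 * d, hidden)) * 0.1
bias = np.zeros(hidden)

projected = enhance_and_project(a_bar, a_hat, weight, bias)
print(projected.shape)  # the BiLSTM input width is now `hidden`, not 4 * d
```

The same projection would be applied to m_b with shared parameters, which keeps the composition layer's parameter count independent of the fourfold concatenation.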

esim question

# Compute the similarity between every word of a_bar and every word of b_bar
with tf.name_scope('word_similarity'):
    attention_weights = tf.matmul(a_bar, tf.transpose(b_bar, [0, 2, 1]))
    attentionsoft_a = tf.nn.softmax(attention_weights)
    attentionsoft_b = tf.nn.softmax(tf.transpose(attention_weights))
    attentionsoft_b = tf.transpose(attentionsoft_b)
    a_hat = tf.matmul(attentionsoft_a, b_bar)
    b_hat = tf.matmul(attentionsoft_b, a_bar)

I think the logic for attentionsoft_b in this code is not quite right. In my view it should be:
attentionsoft_b = tf.nn.softmax(tf.transpose(attention_weights, [0, 2, 1]))
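To make the shape argument concrete, here is a NumPy sketch of the soft alignment with the change proposed in this issue: the similarity matrix is transposed only over its last two axes, so the softmax for b normalizes over the tokens of a rather than over the batch axis. This is an illustration written for this discussion, not the repository's code; it assumes the second transpose in the original snippet is dropped so that the result has shape [batch, len_b, len_a].

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def soft_align(a_bar, b_bar):
    """a_bar: [batch, len_a, d], b_bar: [batch, len_b, d]."""
    scores = a_bar @ np.transpose(b_bar, (0, 2, 1))   # [batch, len_a, len_b]
    attn_a = softmax(scores, axis=-1)                 # each a token attends over b
    # Swap only the two sequence axes; the batch axis stays in front.
    attn_b = softmax(np.transpose(scores, (0, 2, 1)), axis=-1)  # [batch, len_b, len_a]
    a_hat = attn_a @ b_bar                            # [batch, len_a, d]
    b_hat = attn_b @ a_bar                            # [batch, len_b, d]
    return a_hat, b_hat, attn_a, attn_b

rng = np.random.default_rng(0)
a_bar = rng.normal(size=(2, 3, 4))
b_bar = rng.normal(size=(2, 5, 4))
a_hat, b_hat, attn_a, attn_b = soft_align(a_bar, b_bar)
print(a_hat.shape, b_hat.shape)
```

Using different lengths for the two sentences in the example makes any axis mix-up fail immediately with a shape error, which is a cheap way to check the alignment code.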

Is the DSSM model missing?

Is the DSSM model missing? Thanks.

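Accuracy is low when I run it

Hello, why does my ESIM model's accuracy stay low, around 0.5 to 0.6? I read in the comments that setting l2 to 0 improves it, but after making that change I saw no difference at all.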

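How are the MAP and MRR metrics in the papers computed?

Many semantic matching papers use these two metrics for comparison, but I do not know how to compute them.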
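For reference, a minimal sketch of the two metrics in their usual form: each query has a list of candidates sorted by model score, with a 0/1 relevance flag per candidate. Average precision here is normalized by the number of relevant items present in the ranked list, which assumes every relevant candidate appears in that list. The function names are chosen for this example only.

```python
def average_precision(relevance):
    """relevance: 0/1 flags of one query's candidates, in ranked order."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant position
    return total / hits if hits else 0.0

def reciprocal_rank(relevance):
    """1 / rank of the first relevant candidate, or 0 if there is none."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def mean_over_queries(metric, ranked_lists):
    """MAP or MRR: the per-query metric averaged over all queries."""
    return sum(metric(r) for r in ranked_lists) / len(ranked_lists)

# Two queries; candidates are already sorted by descending model score.
ranked = [[0, 1, 1], [1, 0, 0]]
print("MAP:", mean_over_queries(average_precision, ranked))
print("MRR:", mean_over_queries(reciprocal_rank, ranked))
```

Both metrics need several candidates per query, so they apply to ranking-style datasets rather than to pairwise 0/1 classification data such as a sentence-pair matching set.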

Where does the dataset come from?

What's the actual acc for ESIM model?

Training on the dataset included in the project, the actual test results fall short of what the paper reports.

For the text matching problem, what is the difference between question-question matching and question-answer matching?

I know question-question matching is a text similarity problem.
What about question-answer matching or question-document matching? It is used in information retrieval.
Question-question matching is indeed text similarity. But how do you define question-answer similarity?
Thank you!!

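There seems to be a typo in the text matching notes document

For letter-level trigrams, your example #query#, which includes the start and end markers, should be split into #qu, que, uer, ery, ry#. You have a few extra ones in the middle.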
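The split described above can be checked with a few lines of Python. This is a small sketch of letter-trigram extraction as used for word hashing in DSSM; `letter_trigrams` is a name chosen for this example.

```python
def letter_trigrams(word, boundary="#"):
    """Wrap the word in boundary markers and slide a 3-character window over it."""
    marked = f"{boundary}{word}{boundary}"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]

print(letter_trigrams("query"))
```

A marked word of length n yields n - 2 trigrams, so #query# (7 characters) produces exactly five.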

What are the best results each of these models achieves on the LCQMC dataset? Urgent!! Many thanks!!
