ntmc-community / matchzoo-py

Facilitating the design, comparison and sharing of deep text matching models.

License: Apache License 2.0

Python 99.61% Makefile 0.39%
text matching deep-learning text-matching neural-network natural-language-processing pytorch

matchzoo-py's Introduction

MatchZoo-py

PyTorch version of MatchZoo.

Facilitating the design, comparison and sharing of deep text matching models.
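MatchZoo is a general-purpose text matching toolkit that aims to make it easy to quickly implement, compare, and share the latest deep text matching models.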

The goal of MatchZoo is to provide a high-quality codebase for deep text matching research, such as document retrieval, question answering, conversational response ranking, and paraphrase identification. With a unified data processing pipeline, simplified model configuration, and automatic hyper-parameter tuning, MatchZoo is flexible and easy to use.

| Tasks | Text 1 | Text 2 | Objective |
| --- | --- | --- | --- |
| Paraphrase Identification | string 1 | string 2 | classification |
| Textual Entailment | text | hypothesis | classification |
| Question Answer | question | answer | classification/ranking |
| Conversation | dialog | response | classification/ranking |
| Information Retrieval | query | document | ranking |

Get Started in 60 Seconds

To train a Deep Semantic Structured Model, make use of MatchZoo customized loss functions and evaluation metrics to define a task:

import torch
import matchzoo as mz

ranking_task = mz.tasks.Ranking(losses=mz.losses.RankCrossEntropyLoss(num_neg=4))
ranking_task.metrics = [
    mz.metrics.NormalizedDiscountedCumulativeGain(k=3),
    mz.metrics.MeanAveragePrecision()
]
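
To make the two metrics above concrete, here is a minimal pure-Python sketch of NDCG@k and average precision for a single query. It uses one common textbook formulation (exponential gain, log2 discount) and is an assumption for illustration, not MatchZoo's own implementation; the function names are hypothetical.

```python
import math


def rank_labels(labels, scores):
    """Return the labels ordered by descending score."""
    order = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return [label for _, label in order]


def dcg_at_k(ranked_labels, k):
    """Discounted cumulative gain with exponential gain and log2 discount."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(ranked_labels[:k])
    )


def ndcg_at_k(labels, scores, k):
    """DCG of the predicted ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(labels, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(rank_labels(labels, scores), k) / ideal_dcg


def average_precision(labels, scores):
    """Mean of precision values taken at the rank of each relevant item."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rank_labels(labels, scores), start=1):
        if rel > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Hypothetical labels and model scores for one query.
labels = [0, 1, 0, 1]
scores = [0.9, 0.8, 0.1, 0.7]
print(ndcg_at_k(labels, scores, k=3), average_precision(labels, scores))
```

Mean average precision is then the mean of `average_precision` over all queries.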

Prepare input data:

train_pack = mz.datasets.wiki_qa.load_data('train', task=ranking_task)
valid_pack = mz.datasets.wiki_qa.load_data('dev', task=ranking_task)

Preprocess your input data in three lines of code, and keep track of the parameters to be passed into the model:

preprocessor = mz.models.ArcI.get_default_preprocessor()
train_processed = preprocessor.fit_transform(train_pack)
valid_processed = preprocessor.transform(valid_pack)

Generate pair-wise training data on-the-fly:

trainset = mz.dataloader.Dataset(
    data_pack=train_processed,
    mode='pair',
    num_dup=1,
    num_neg=4,
    batch_size=32
)
validset = mz.dataloader.Dataset(
    data_pack=valid_processed,
    mode='point',
    batch_size=32
)
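
As a rough illustration of what pair-wise mode produces, the sketch below groups each positive document with `num_neg` sampled negatives for the same query, repeated `num_dup` times. It assumes binary labels and sampling without replacement; the helper name and the toy relations are hypothetical, and MatchZoo's actual sampling logic may differ.

```python
import random


def make_pairwise_groups(relations, num_dup=1, num_neg=4, seed=0):
    """Build (query, positive, negatives) groups from (id_left, id_right, label) rows."""
    rng = random.Random(seed)
    by_query = {}
    for id_left, id_right, label in relations:
        by_query.setdefault(id_left, []).append((id_right, label))

    groups = []
    for id_left, docs in by_query.items():
        positives = [doc for doc, label in docs if label > 0]
        negatives = [doc for doc, label in docs if label <= 0]
        if len(negatives) < num_neg:
            continue  # simplification: skip queries with too few negatives
        for pos in positives:
            for _ in range(num_dup):
                groups.append((id_left, pos, rng.sample(negatives, num_neg)))
    return groups


# Toy relations: q1 has one positive and five negatives, q2 has no positive.
relations = [("q1", "d0", 1)] + [("q1", f"d{i}", 0) for i in range(1, 6)]
relations += [("q2", "d9", 0)]
for group in make_pairwise_groups(relations):
    print(group)
```

Each group contributes `num_neg + 1` rows to a batch, which is why the loss in the first step is configured with the same `num_neg`.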

Define padding callback and generate data loader:

padding_callback = mz.models.ArcI.get_default_padding_callback()

trainloader = mz.dataloader.DataLoader(
    dataset=trainset,
    stage='train',
    callback=padding_callback
)
validloader = mz.dataloader.DataLoader(
    dataset=validset,
    stage='dev',
    callback=padding_callback
)
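
The idea behind a padding callback can be sketched in a few lines: every sequence in a batch is brought to the same length so the batch can be stacked into one tensor. The function below is a hypothetical stand-in, not the MatchZoo callback; the `'pre'`/`'post'` modes and the truncation of over-long sequences are assumptions made for illustration.

```python
def pad_batch(sequences, pad_value=0, mode="pre", fixed_length=None):
    """Pad (or truncate) each sequence to a common length."""
    if fixed_length is not None:
        target = fixed_length
    else:
        target = max(len(seq) for seq in sequences)

    padded = []
    for seq in sequences:
        seq = list(seq)[:target]  # assumption: keep the first `target` tokens
        fill = [pad_value] * (target - len(seq))
        padded.append(fill + seq if mode == "pre" else seq + fill)
    return padded


batch = [[1, 2, 3], [4]]
print(pad_batch(batch))               # padding placed before the tokens
print(pad_batch(batch, mode="post"))  # padding placed after the tokens
```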

Initialize the model, fine-tune the hyper-parameters:

model = mz.models.ArcI()
model.params['task'] = ranking_task
model.params['embedding_output_dim'] = 100
model.params['embedding_input_dim'] = preprocessor.context['embedding_input_dim']
model.guess_and_fill_missing_params()
model.build()

Trainer is used to control the training flow:

optimizer = torch.optim.Adam(model.parameters())

trainer = mz.trainers.Trainer(
    model=model,
    optimizer=optimizer,
    trainloader=trainloader,
    validloader=validloader,
    epochs=10
)

trainer.run()

References

Tutorials

English Documentation

If you're interested in cutting-edge research progress, please take a look at awesome neural models for semantic match.

Install

MatchZoo-py is dependent on PyTorch. Two ways to install MatchZoo-py:

Install MatchZoo-py from PyPI:

pip install matchzoo-py

Install MatchZoo-py from the GitHub source:

git clone https://github.com/NTMC-Community/MatchZoo-py.git
cd MatchZoo-py
python setup.py install

Models

Citation

If you use MatchZoo in your research, please use the following BibTeX entry.

@inproceedings{Guo:2019:MLP:3331184.3331403,
 author = {Guo, Jiafeng and Fan, Yixing and Ji, Xiang and Cheng, Xueqi},
 title = {MatchZoo: A Learning, Practicing, and Developing System for Neural Text Matching},
 booktitle = {Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval},
 series = {SIGIR'19},
 year = {2019},
 isbn = {978-1-4503-6172-9},
 location = {Paris, France},
 pages = {1297--1300},
 numpages = {4},
 url = {http://doi.acm.org/10.1145/3331184.3331403},
 doi = {10.1145/3331184.3331403},
 acmid = {3331403},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {matchzoo, neural network, text matching},
} 

Development Team

​ ​ ​ ​

| GitHub handle | Name | Role | Affiliation |
| --- | --- | --- | --- |
| faneshion | Yixing Fan | Core Dev | ASST PROF, ICT |
| Chriskuei | Jiangui Chen | Core Dev | PhD. ICT |
| caiyinqiong | Yinqiong Cai | Core Dev | M.S. ICT |
| pl8787 | Liang Pang | Core Dev | ASST PROF, ICT |
| lixinsu | Lixin Su | Dev | PhD. ICT |
| ChrisRBXiong | Ruibin Xiong | Dev | M.S. ICT |
| dyuyang | Yuyang Ding | Dev | M.S. ICT |
| rgtjf | Junfeng Tian | Dev | M.S. ECNU |
| wqh17101 | Qinghua Wang | Documentation | B.S. Shandong Univ. |

Contribution

Please make sure to read the Contributing Guide before creating a pull request. If you have a MatchZoo-related paper/project/component/tool, send a pull request to this awesome list!

Thank you to all the people who already contributed to MatchZoo!

Bo Wang, Zeyi Wang, Liu Yang, Zizhen Wang, Zhou Yang, Jianpeng Hou, Lijuan Chen, Yukun Zheng, Niuguo Cheng, Dai Zhuyun, Aneesh Joshi, Zeno Gantner, Kai Huang, stanpcf, ChangQF, Mike Kellogg

Project Organizers

  • Jiafeng Guo
    • Institute of Computing Technology, Chinese Academy of Sciences
    • Homepage
  • Yanyan Lan
    • Institute of Computing Technology, Chinese Academy of Sciences
    • Homepage
  • Xueqi Cheng
    • Institute of Computing Technology, Chinese Academy of Sciences
    • Homepage

License

Apache-2.0

Copyright (c) 2019-present, Yixing Fan (faneshion)

matchzoo-py's People

Contributors

albert-ma, caiyinqiong, chriskuei, chrisrbxiong, dyuyang, faneshion, lixinsu, matthew-z, rgtjf, wqh17101

matchzoo-py's Issues

Dataset Builder creates duplicate query-document pairs & model predictions are odd

I have the following issue, which is really odd and affects the evaluation of the neural models.
I build my data using the auto preparer, and I realized that when I make predictions on the test set, some document-query pairs are duplicated.
I am not sure why this is happening. My first guess was that examples are repeated to fill up the last batch to the batch size, but this does not seem to be the case.

Here's most of my code:

model, prpr, dsb, dlb = preparer.prepare(model_class, train_pack)

train_prepr = prpr.transform(train_pack)
valid_prepr = prpr.transform(valid_pack)
test_prepr = prpr.transform(test_pack)

mz.dataloader.dataset_builder.DatasetBuilder()
train_dataset = dsb.build(train_prepr)
valid_dataset = dsb.build(valid_prepr)
test_dataset = dsb.build(test_prepr)

train_dl = dlb.build(train_dataset, stage='train')
valid_dl = dlb.build(valid_dataset, stage='dev')
test_dl = dlb.build(test_dataset, stage='test')

# training the model etc....

test_preds = pd.DataFrame(trainer.predict(test_dl), columns=['pred'])
test_preds['id_left'] = test_dl.id_left
test_preds['id_right'] = test_dl._dataset[:][0]['id_right']
test_preds['length_right'] = test_dl._dataset[:][0]['length_right']

Now, it seems that the duplicates are created through the dataset builder, but I don't understand why.

    test_dataset._data_pack.frame().duplicated(['id_left', 'id_right']).sum() 
>> 297
    test_pack.frame().duplicated(['id_left', 'id_right']).sum() 
>>0
    test_prepr.frame().duplicated(['id_left', 'id_right']).sum()
>> 0

Even more odd is the fact that the duplicated document-query pairs receive different scores. The scores are not always close to each other, so this cannot be a rounding error. How is it possible to get such different predictions for the same query-document pair at inference time, without re-training the model?


    print(test_preds[test_preds.duplicated(['id_right', 'id_left'],
                                           keep=False)].sort_values(['id_left', 'id_right'])
          )

>>
            pred  id_left  id_right  length_right
466   -10.889746   33-1-1  47-07395           896
499    -9.492123   33-1-1  47-07395           896
677    -6.880966   33-1-1  47-07395           896
496   -10.781660   33-1-1  98-33779           535
678    -7.954109   33-1-1  98-33779           535
1044  -11.102488   33-1-1  98-33779           535
508    -6.497414   33-1-1  95-23333           244
1326   -7.466503   33-1-1  95-23333           244
In this replicated example the model used was KNRM, but I think this happens in other models too.
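
For readers who want to reproduce the check on their own predictions without pandas, a minimal sketch follows. It counts repeated (id_left, id_right) pairs and reports how far apart the scores of each repeated pair are. The rows are copied from the output above; the helper names are hypothetical.

```python
from collections import Counter


def duplicated_pairs(rows):
    """Return {(id_left, id_right): count} for pairs that occur more than once."""
    counts = Counter((row["id_left"], row["id_right"]) for row in rows)
    return {pair: n for pair, n in counts.items() if n > 1}


def score_spread(rows):
    """Return max(pred) - min(pred) for every repeated pair."""
    by_pair = {}
    for row in rows:
        by_pair.setdefault((row["id_left"], row["id_right"]), []).append(row["pred"])
    return {pair: max(p) - min(p) for pair, p in by_pair.items() if len(p) > 1}


rows = [
    {"pred": -10.889746, "id_left": "33-1-1", "id_right": "47-07395"},
    {"pred": -9.492123, "id_left": "33-1-1", "id_right": "47-07395"},
    {"pred": -6.880966, "id_left": "33-1-1", "id_right": "47-07395"},
    {"pred": -6.497414, "id_left": "33-1-1", "id_right": "95-23333"},
    {"pred": -7.466503, "id_left": "33-1-1", "id_right": "95-23333"},
]
print(duplicated_pairs(rows))
print(score_spread(rows))
```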

Question about wiki qa dataset

I did some analysis of the WikiQA dataset:

  • training set:
    Left num: 2118; Right num: 18841; Relation num: 20360; positive example (with label 1) num: 1040 (5.1%)
  • dev set:
    Left num: 296; Right num: 2708; Relation num: 2733; positive example num: 140 (5.12%)
  • test set:
    Left num: 633; Right num: 5961; Relation num: 6165; positive example num: 293 (4.75%)

I wonder whether this is the official way to combine questions and answers, because the proportion of positive examples in all three sets is only about 5%. Does that mean a model that always outputs 0 can achieve 95% accuracy? The best performance of BERT on this dataset is also just 95%. Aren't the positive and negative examples too imbalanced?
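
The arithmetic behind this concern can be checked directly from the counts listed above. The sketch below computes the accuracy of a hypothetical classifier that always predicts the negative class; the function name is illustrative only.

```python
# (relation count, positive count) per split, taken from the list above.
splits = {
    "train": (20360, 1040),
    "dev": (2733, 140),
    "test": (6165, 293),
}


def always_negative_accuracy(num_relations, num_positive):
    """Accuracy of a model that labels every pair as negative."""
    return (num_relations - num_positive) / num_relations


for name, (num_relations, num_positive) in splits.items():
    positive_rate = num_positive / num_relations
    accuracy = always_negative_accuracy(num_relations, num_positive)
    print(f"{name}: positive rate {positive_rate:.2%}, baseline accuracy {accuracy:.2%}")
```

With this level of imbalance, accuracy says little about model quality, which is one reason ranking metrics such as MAP and NDCG are used for this kind of task.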

cuDNN error when run sample code in README

Hi everyone,

I ran the sample code in the README, but got the following error:

: block: [80,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1573049310284/work/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [80,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
  0%|                                                                                                                                         | 0/32 [00:14<?, ?it/s]
Traceback (most recent call last):
  File "test_matchzoo.py", line 62, in <module>
    trainer.run()
  File "/home/wup/MatchZoo-py/matchzoo/trainers/trainer.py", line 227, in run
    self._run_epoch()
  File "/home/wup/MatchZoo-py/matchzoo/trainers/trainer.py", line 252, in _run_epoch
    outputs = self._model(inputs)
  File "/public/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wup/MatchZoo-py/matchzoo/models/arci.py", line 187, in forward
    conv_left = self.conv_left(embed_left)
  File "/public/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/public/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/public/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/public/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/public/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/public/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 202, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

My os is ubuntu 18.04
pytorch 1.3.1
cuda 10.0
cudnn 7.6.4

Other PyTorch code runs fine on this server. Could you please help me?
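
The `srcIndex < srcSelectDimSize` assertion at the top of the log is commonly raised when an index-select or embedding lookup receives an index outside the table, and the later cuDNN error can be a follow-on failure. Assuming that is what happened here (the thread does not confirm it), a quick check is to compare the largest token id in the data with the embedding table size. The helper below is a hypothetical pure-Python sketch, not part of MatchZoo.

```python
def find_out_of_range_ids(batches, embedding_input_dim):
    """Return (batch_index, token_id) for every id outside [0, embedding_input_dim)."""
    bad = []
    for batch_index, batch in enumerate(batches):
        for sequence in batch:
            for token_id in sequence:
                if not 0 <= token_id < embedding_input_dim:
                    bad.append((batch_index, token_id))
    return bad


# Toy batches of token ids; 120 does not fit a table with 100 rows.
batches = [[[1, 5, 99]], [[3, 120, 7]]]
print(find_out_of_range_ids(batches, embedding_input_dim=100))
```

Running the model once on the CPU is another way to surface such an indexing problem as a readable Python exception.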

bert model

I get an error when I run your BERT model example. Could you tell me how to solve it? Thanks!

padding_callback = mz.models.Bert.get_default_padding_callback()
trainloader = mz.dataloader.DataLoader(
    dataset=trainset,
    batch_size=20,
    stage='train',
    resample=True,
    sort=False,
    callback=padding_callback
)
testloader = mz.dataloader.DataLoader(
    dataset=testset,
    batch_size=20,
    stage='dev',
    callback=padding_callback
)

TypeError Traceback (most recent call last)
in
6 resample=True,
7 sort=False,
----> 8 callback=padding_callback
9 )
10 testloader = mz.dataloader.DataLoader(

TypeError: __init__() got an unexpected keyword argument 'batch_size'

Error during training because of float length of sequence(?)

Describe the bug

Hi, I have the following issue:
when I am trying to train my model using trainer.run(), I get the following error:

Traceback (most recent call last):
  File "/Users/xx/.conda/envs/QL_QA/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3331, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-67-041e2033e90a>", line 1, in <module>
    trainer.run()
  File "/Users/xx/.conda/envs/QL_QA/lib/python3.7/site-packages/matchzoo/trainers/trainer.py", line 227, in run
    self._run_epoch()
  File "/Users/xx/.conda/envs/QL_QA/lib/python3.7/site-packages/matchzoo/trainers/trainer.py", line 251, in _run_epoch
    for step, (inputs, target) in pbar:
  File "/Users/xx/.conda/envs/QL_QA/lib/python3.7/site-packages/tqdm/std.py", line 1091, in __iter__
    for obj in iterable:
  File "/Users/xx/.conda/envs/QL_QA/lib/python3.7/site-packages/matchzoo/dataloader/dataloader.py", line 112, in __iter__
    self._handle_callbacks_on_batch_unpacked(x, y)
  File "/Users/xx/.conda/envs/QL_QA/lib/python3.7/site-packages/matchzoo/dataloader/dataloader.py", line 134, in _handle_callbacks_on_batch_unpacked
    self._callback.on_batch_unpacked(x, y)
  File "/Users/xx/.conda/envs/QL_QA/lib/python3.7/site-packages/matchzoo/dataloader/callbacks/padding.py", line 158, in on_batch_unpacked
    self._pad_word_value, dtype=dtype)
  File "/Users/xx/.conda/envs/QL_QA/lib/python3.7/site-packages/numpy/core/numeric.py", line 325, in full
    a = empty(shape, dtype, order)
TypeError: 'numpy.float64' object cannot be interpreted as an integer

I am not 100% sure, but it seems to me that the error occurs because length_right is a float instead of an int in my preprocessed datapack, whereas it is an int in the toy datasets.

>> toy_datapack.frame()[['length_right','length_left']]
Out[13]: 
    length_right  length_left
0             58           29
1             41           29
2             41           29
3             61           29
4            128           29
5            126           85
6            128           85

while

train_pack = mz.DataPack(relation=relation[relation.id_left.isin(qids['train'])].reset_index(drop=True),
                             left=left[left.index.isin(qids['train'])],
                             # right=right_train,
                             right=right_dict['train'],
                             )
    train_pack.frame().head().dtypes

Out[78]: 
id_left          object
text_left        object
id_right         object
text_right       object
length_right    float64
label           float64
dtype: object

It also seems weird to me that this is happening, since to my understanding, the built-in python len function should return an int.

right_train['length_right'] = right_train.text_right.apply(len)
Out[15]: 

                                                                  text_right  length_right
id_right                                                                                  
clueweb09-en0007-21-42346  Welcome | Logout Log In | Sign Up The Huffingt...          4039
clueweb09-enwp03-01-16807  Ann Dunham From Wikipedia, the free encycloped...         32225
clueweb09-en0010-93-11767  Home Contact Us Bookmark Us Receive Family Tre...          5112
clueweb09-enwp01-36-17161  Maya Soetoro-Ng From Wikipedia, the free encyc...          8279
clueweb09-enwp00-34-05344  Barack Obama, Sr. From Wikipedia, the free enc...         14448
clueweb09-enwp00-34-05347  Barack Obama, Sr. From Wikipedia, the free enc...         14478

I am preparing my data using mz.autoprepare and the models I've tried to use are KNRM and DRMM, but the same issue still occurs.

My matchzoo.__version__ is 1.1.1.
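
One possible explanation, offered as an assumption since the thread does not confirm it: pandas stores an integer column as float64 as soon as it contains missing values, which would happen if the relation table refers to an id_right that has no row in the right table. The hypothetical helper below lists such ids using plain Python.

```python
def missing_right_ids(relation_pairs, right_ids):
    """Return the id_right values used in relations but absent from the right table."""
    known = set(right_ids)
    return sorted({id_right for _, id_right in relation_pairs if id_right not in known})


# Toy data: the relation for q2 points at a document that was never loaded.
relation_pairs = [("q1", "d1"), ("q1", "d2"), ("q2", "d9")]
right_ids = ["d1", "d2"]
print(missing_right_ids(relation_pairs, right_ids))
```

If the list is non-empty, dropping or filling those rows before building the DataPack keeps the length column integral.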

How to use gpu to run trainer?

In MatchZoo-py/tutorials/ranking/drmm.ipynb, I set the parameter 'device' to torch.device('cuda') in both trainloader = mz.dataloader.DataLoader and trainer = mz.trainers.Trainer. nvidia-smi shows that a GPU is available, but training still seems to run on the CPU. Can you help me with this? Thank you!

KeyError:'ngram_left'

Hi, I ran into a data processing problem. I am using the DIIN model (diin.py) with its default padding callback:

@classmethod
def get_default_padding_callback(
    cls,
    fixed_length_left: int = 10,
    fixed_length_right: int = 30,
    pad_word_value: typing.Union[int, str] = 0,
    pad_word_mode: str = 'pre',
    with_ngram: bool = True,
    fixed_ngram_length: int = None,
    pad_ngram_value: typing.Union[int, str] = 0,
    pad_ngram_mode: str = 'pre'
) -> BaseCallback:
    """
    Model default padding callback.

    The padding callback's on_batch_unpacked would pad a batch of data to
    a fixed length.

    :return: Default padding callback.
    """
    return callbacks.BasicPadding(
        fixed_length_left=fixed_length_left,
        fixed_length_right=fixed_length_right,
        pad_word_value=pad_word_value,
        pad_word_mode=pad_word_mode,
        with_ngram=with_ngram,
        fixed_ngram_length=fixed_ngram_length,
        pad_ngram_value=pad_ngram_value,
        pad_ngram_mode=pad_ngram_mode
    )

The batch is then processed in padding.py:

if self._with_ngram:
    ngram_length_left = max([len(w) for k in x['ngram_left'] for w in k])
    ngram_length_right = max([len(w) for k in x['ngram_right'] for w in k])

But here I get an error:
KeyError: 'ngram_left'
Do you know how this should be solved?

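Why does the acc value from the demo never change? Is there a discussion group?

Is there a discussion group where users can share their experience?
I would also like to ask why, when I run the project's ESIM demo example, the acc value stays the same over multiple epochs. I switched to Chinese data and the behavior is the same: the code runs, but the validation acc of every epoch never changes.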

RuntimeError: The size of tensor a (68) must match the size of tensor b (67) at non-singleton dimension 0

Hi!

I encountered the error in the title when running dssm.ipynb on my local machine. My environment is Windows 10 + PyTorch 1.2 + CUDA 9.0. Could you please tell me how to solve it?

Thanks.


RuntimeError Traceback (most recent call last)
in
----> 1 trainer.run()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\matchzoo\trainers\trainer.py in run(self)
230 for epoch in range(self._start_epoch, self._epochs + 1):
231 self._epoch = epoch
--> 232 self._run_epoch()
233 self._run_scheduler()
234 if self._early_stopping.should_stop_early:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\matchzoo\trainers\trainer.py in _run_epoch(self)
258 # Caculate all losses and sum them up
259 loss = torch.sum(
--> 260 *[c(outputs, target) for c in self._criterions]
261 )
262 self._backward(loss)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\matchzoo\trainers\trainer.py in <listcomp>(.0)
258 # Caculate all losses and sum them up
259 loss = torch.sum(
--> 260 *[c(outputs, target) for c in self._criterions]
261 )
262 self._backward(loss)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
545 result = self._slow_forward(*input, **kwargs)
546 else:
--> 547 result = self.forward(*input, **kwargs)
548 for hook in self._forward_hooks.values():
549 hook_result = hook(self, input, result)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\matchzoo\losses\rank_hinge_loss.py in forward(self, y_pred, y_true)
63 y_pos, y_neg, y_true,
64 margin=self.margin,
---> 65 reduction=self.reduction
66 )
67

~\AppData\Local\Continuum\anaconda3\lib\site-packages\torch\nn\functional.py in margin_ranking_loss(input1, input2, target, margin, size_average, reduce, reduction)
2206 raise RuntimeError(("margin_ranking_loss does not support scalars, got sizes: "
2207 "input1: {}, input2: {}, target: {} ".format(input1.size(), input2.size(), target.size())))
-> 2208 return torch.margin_ranking_loss(input1, input2, target, margin, reduction_enum)
2209
2210

RuntimeError: The size of tensor a (68) must match the size of tensor b (67) at non-singleton dimension 0
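
One way such an off-by-one mismatch can arise is sketched below. It assumes the loss separates interleaved predictions by strided slicing, with each positive followed by its `num_neg` negatives, and that a batch arrives with a length that is not a multiple of `num_neg + 1`. The batch size of 135 and the function name are hypothetical, chosen only because they reproduce the 68 versus 67 split from the error message.

```python
def split_pos_neg(y_pred, num_neg=1):
    """Split interleaved predictions into positives and per-offset negatives."""
    step = num_neg + 1
    y_pos = y_pred[::step]
    y_negs = [y_pred[offset::step] for offset in range(1, step)]
    return y_pos, y_negs


# 135 predictions: the last positive has no matching negative.
y_pos, y_negs = split_pos_neg(list(range(135)), num_neg=1)
print(len(y_pos), len(y_negs[0]))

# 136 predictions: positives and negatives line up.
y_pos_ok, y_negs_ok = split_pos_neg(list(range(136)), num_neg=1)
print(len(y_pos_ok), len(y_negs_ok[0]))
```

Under this assumption, checking that the dataset's `num_neg` matches the loss's `num_neg`, and that the data loader runs in pair mode, is a reasonable first step.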

TypeError when use sample code in README.md

Thanks for your project. I installed this package and ran the code in README.md, but I got an error. The detailed log follows:

Processing text_left with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|████████████████████████████████████████████████████████████████████████████████| 2118/2118 [00:00<00:00, 10964.68it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████████████████████████████████████████████████████████████████████████| 18841/18841 [00:03<00:00, 6015.32it/s]
Processing text_right with append:   0%|                                                                                                                                             | 0/18841 [00:00<?, ?it/s]Traceback (most recent call last):
  File "test_matchzoo.py", line 15, in <module>
    train_processed = preprocessor.fit_transform(train_pack)
  File "/public/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1-py3.7.egg/matchzoo/engine/base_preprocessor.py", line 97, in fit_transform
    return self.fit(data_pack, verbose=verbose) \
  File "/public/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1-py3.7.egg/matchzoo/preprocessors/basic_preprocessor.py", line 110, in fit
    verbose=verbose)
  File "/public/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1-py3.7.egg/matchzoo/preprocessors/build_unit_from_data_pack.py", line 32, in build_unit_from_data_pack
    data_pack.apply_on_text(corpus.append, mode=mode, verbose=verbose)
  File "/public/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1-py3.7.egg/matchzoo/data_pack/data_pack.py", line 246, in wrapper
    func(target, *args, **kwargs)
  File "/public/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1-py3.7.egg/matchzoo/data_pack/data_pack.py", line 401, in apply_on_text
    self._apply_on_text_right(func, rename, verbose=verbose)
  File "/public/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1-py3.7.egg/matchzoo/data_pack/data_pack.py", line 410, in _apply_on_text_right
    self._right[name] = self._right['text_right'].progress_apply(func)
  File "/public/anaconda3/lib/python3.7/site-packages/tqdm/std.py", line 730, in inner
    func = df._is_builtin_func(func)
  File "/public/anaconda3/lib/python3.7/site-packages/pandas/core/base.py", line 660, in _is_builtin_func
    return self._builtin_table.get(arg, arg)
TypeError: unhashable type: 'list'
Processing text_right with append:   0%|                                                                                                                                             | 0/18841 [00:00<?, ?it/s]

Matchzoo version is 1.1

mz.auto.Tuner , Validator not satifised.

Describe the Question

I am on matchzoo-py 1.1.1, and when I try the code mz.auto.Tuner, I get an error when trying to tune the model.

Describe your attempts

  • I walked through the tutorials
    Please provide a model_tuning tutorial for MatchZoo-py. When I use mz.auto.Tuner() to tune my model, I get the following error:

tuner = mz.auto.Tuner(
    params=model.params,
    train_data=train,
    test_data=test,
    num_runs=5
)

ValueError: Validator not satifised.
The validator's definition is as follows:
validator=lambda x: isinstance(x, np.ndarray)

  • I checked the documentation
  • I checked to make sure that this is not a duplicate question


Add metric learning facilities

I have a task in which I need to match several texts against a huge database (~1M texts). Even though the available models are pretty fast, this is not fast enough. The best way to tackle this problem is to learn embeddings of the texts and match them using a k-d tree or a similar index. But as far as I understand, all models accept two texts and map them to class probabilities or a ranking value.

Is it possible to use MatchZoo in the metric learning paradigm already, or will this be supported in the future?
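
The workflow described here can be sketched as follows: embed every database text once, then answer each query by comparing its embedding with the stored ones. The vectors below are made up, the brute-force scan stands in for a k-d tree or another index, and nothing in this sketch is a MatchZoo API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Hypothetical precomputed text embeddings.
index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
print(top_k([1.0, 0.0], index, k=2))
```

The point of the design is that the expensive model runs once per text rather than once per text pair.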

DIIN model empty sequence error

I'm running the DIIN model for a document classification task and ran into ValueError: max() arg is an empty sequence, which points to matchzoo/dataloader/callbacks/padding.py, line 114:

ngram_length_left = max([len(w)
                         for k in x['ngram_left'] for w in k])
ngram_length_right = max([len(w)
                          for k in x['ngram_right'] for w in k])

It seems that x['ngram_left'] is an empty list. I wondered whether it should be x['text_left'] instead, so I tried that, which resulted in TypeError: object of type 'numpy.int64' has no len(). I then took a peek at the value types of both but still have no clue.
Any help will be appreciated.
MZ version: 1.1
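
The ValueError itself is plain Python behavior: `max()` raises when its iterable is empty unless a `default` is given. The sketch below shows the difference with a hypothetical helper; it illustrates the failure mode only and is not a patch for padding.py, since an empty n-gram field more likely means the preprocessing step did not produce n-grams.

```python
def longest_ngram_length(ngram_batch, default=0):
    """Length of the longest word-level n-gram list, or `default` if there is none."""
    return max((len(word) for text in ngram_batch for word in text), default=default)


# Two texts; each word is a list of n-gram ids.
batch = [[[1, 2], [3]], [[4, 5, 6]]]
print(longest_ngram_length(batch))

# An empty batch no longer raises.
print(longest_ngram_length([]))

try:
    max([len(word) for text in [] for word in text])
except ValueError as error:
    print("ValueError:", error)
```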

Inefficient DataLoader

  • I checked to make sure that this is not a duplicate issue
  • I'm submitting the request to the correct repository (for model requests, see here)

Is your feature request related to a problem? Please describe.

The way the DataLoader reads elements from the dataset and datapack is very inefficient in the PyTorch version of MatchZoo. The DataLoader retrieves each element from the datapack separately and then groups them into a batch.

However, the __getitem__ of datapack is quite expensive. To return a batch of 1000 elements, the __getitem__ method of datapack will be called 1000 times (the argument is an index at each time), and it usually takes 5-10 seconds to produce a single batch (when the batch size = 1000).

In contrast, in the TF version of Matchzoo, the __getitem__ method of datapack will be called with a list of indices together, so each batch will only call it once. As a result, the DataGenerator is much more efficient than the DataLoader.

Describe the solution you'd like

Avoid using the auto-batching provided by DataLoader. Instead, the dataset should pass batches to Dataloader: https://pytorch.org/docs/stable/data.html#disable-automatic-batching

Use a DataGenerator to sample and batch the data, and use the DataLoader only to accelerate data transfer to the GPU.
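
The difference between the two access patterns can be illustrated with a toy container that counts how often `__getitem__` is called. `CountingPack` and both batching functions are hypothetical names for this sketch; they are not MatchZoo or PyTorch classes.

```python
class CountingPack:
    """Toy stand-in for a datapack that counts __getitem__ calls."""

    def __init__(self, rows):
        self._rows = rows
        self.calls = 0

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, index):
        self.calls += 1
        if isinstance(index, int):
            return self._rows[index]
        return [self._rows[i] for i in index]  # list of indices: one call per batch


def batches_item_by_item(pack, batch_size):
    """One __getitem__ call per element, as with automatic batching."""
    for start in range(0, len(pack), batch_size):
        stop = min(start + batch_size, len(pack))
        yield [pack[i] for i in range(start, stop)]


def batches_by_index_list(pack, batch_size):
    """One __getitem__ call per batch, passing all indices together."""
    for start in range(0, len(pack), batch_size):
        stop = min(start + batch_size, len(pack))
        yield pack[list(range(start, stop))]


slow = CountingPack(list(range(1000)))
fast = CountingPack(list(range(1000)))
slow_batches = list(batches_item_by_item(slow, batch_size=100))
fast_batches = list(batches_by_index_list(fast, batch_size=100))
print(slow.calls, fast.calls)
```

Both functions yield the same batches; only the number of calls into the container differs, which matters when each call is expensive.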

Auto-prepare does not work with some models

Hello everyone, and thanks a lot for this amazing framework!

Auto-preparer does not seem to work with some specific models, in particular: DSSM and CDSSM. I am not sure whether this is an exhaustive list, but I tried quite a few more (ArcI, DRMM, KNRM, ConvKNRM, MatchSRNN & MVLSTM) and they seemed to be working as expected.

To Reproduce

import torch
import matchzoo as mz

task = mz.tasks.Ranking(losses=mz.losses.RankCrossEntropyLoss(num_neg=4))
task.metrics = [
    mz.metrics.NormalizedDiscountedCumulativeGain(k=20),
    mz.metrics.MeanAveragePrecision()
]

# Prepare input data
train_pack = mz.datasets.wiki_qa.load_data('train', task=task)
valid_pack = mz.datasets.wiki_qa.load_data('dev', task=task)

# Auto prepare model etc.

preparer = mz.auto.Preparer(task)

model_class = mz.models.DSSM
# or
model_class = mz.models.CDSSM

model, prpr, dsb, dlb = preparer.prepare(model_class,
                                         train_pack
                                         )



train_prepr = prpr.transform(train_pack)
valid_prepr = prpr.transform(valid_pack)

train_dataset = dsb.build(train_prepr)
valid_dataset = dsb.build(valid_prepr)

train_dl = dlb.build(train_dataset)
valid_dl = dlb.build(valid_dataset)


# make it (T)rain
optimizer = torch.optim.Adam(model.parameters())

trainer = mz.trainers.Trainer(
    model=model,
    optimizer=optimizer,
    trainloader=train_dl,
    validloader=valid_dl,
    epochs=2
)

trainer.run()

For CDSSM I get RuntimeError: Given groups=1, weight of size 3 419 3, expected input[40, 9654, 11] to have 419 channels, but got 9654 channels instead.

My matchzoo.__version__ == 1.1.1

AttributeError: type object 'Path' has no attribute 'expanduser'

Hi,
I have Python 2.7 and Python 3.6 installed.
When I import torch and then import matchzoo as mz in Python 2.7, I receive this error:
"AttributeError: type object 'Path' has no attribute 'expanduser'"
and in Python 3 I receive this error:
"ImportError: cannot import name 'ensure_object'"
even though matchzoo-py installed successfully.
I would appreciate any guidance.

Many thanks in advance

DSSM tutorial code: RuntimeError Expected object of scalar type Float but got scalar type Double

Describe the bug

When I run the tutorial code for DSSM, the following error appears:
RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 'mat1' in call to _th_addmm

please find the details below:


RuntimeError Traceback (most recent call last)
in
----> 1 trainer.run()

~/opt/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1.1-py3.7.egg/matchzoo/trainers/trainer.py in run(self)
225 for epoch in range(self._start_epoch, self._epochs + 1):
226 self._epoch = epoch
--> 227 self._run_epoch()
228 self._run_scheduler()
229 if self._early_stopping.should_stop_early:

~/opt/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1.1-py3.7.egg/matchzoo/trainers/trainer.py in _run_epoch(self)
250 disable=not self._verbose) as pbar:
251 for step, (inputs, target) in pbar:
--> 252 outputs = self._model(inputs)
253 # Caculate all losses and sum them up
254 loss = torch.sum(

~/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
720 result = self._slow_forward(*input, **kwargs)
721 else:
--> 722 result = self.forward(*input, **kwargs)
723 for hook in itertools.chain(
724 _global_forward_hooks.values(),

~/opt/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1.1-py3.7.egg/matchzoo/models/dssm.py in forward(self, inputs)
89
90 # print (inputs['ngram_left'])
---> 91 # print (inputs['ngram_right'])
92 # Process left & right input.
93 input_left, input_right = inputs['ngram_left'], inputs['ngram_right'].float()

~/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
720 result = self._slow_forward(*input, **kwargs)
721 else:
--> 722 result = self.forward(*input, **kwargs)
723 for hook in itertools.chain(
724 _global_forward_hooks.values(),

~/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
115 def forward(self, input):
116 for module in self:
--> 117 input = module(input)
118 return input
119

~/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
720 result = self._slow_forward(*input, **kwargs)
721 else:
--> 722 result = self.forward(*input, **kwargs)
723 for hook in itertools.chain(
724 _global_forward_hooks.values(),

~/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
115 def forward(self, input):
116 for module in self:
--> 117 input = module(input)
118 return input
119

~/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
720 result = self._slow_forward(*input, **kwargs)
721 else:
--> 722 result = self.forward(*input, **kwargs)
723 for hook in itertools.chain(
724 _global_forward_hooks.values(),

~/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/linear.py in forward(self, input)
89
90 def forward(self, input: Tensor) -> Tensor:
---> 91 return F.linear(input, self.weight, self.bias)
92
93 def extra_repr(self) -> str:

~/opt/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py in linear(input, weight, bias)
1672 if input.dim() == 2 and bias is not None:
1673 # fused op is marginally faster
-> 1674 ret = torch.addmm(bias, input, weight.t())
1675 else:
1676 output = input.matmul(weight.t())

RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 'mat1' in call to _th_addmm

Describe your attempts

I am trying to address this problem according to the following post:
https://stackoverflow.com/questions/56741087/how-to-fix-runtimeerror-expected-object-of-scalar-type-float-but-got-scalar-typ/56741419

But adding float() does not seem to fix it.

Context

  • OS: macOS 10.15.6
  • Hardware: CPU only
  • MatchZoo-py version: 1.1.1

Use of trainer/runner and number of training epochs

Hi, I have a question regarding choosing the epochs and doing hyperparameter tuning in general.

I am currently using matchzoo.trainers.trainer to train my models with the default number of epochs(=10).

Does training always end at epoch 10, or does the trainer keep checkpoints and then restore the model from the epoch where the validation result was best? This is not clear to me from the documentation, and the different tutorials and documentation for matchzoo and matchzoo-py add to the confusion.

Apart from that, my question is:

  • If training always stops at the 10th epoch, how can I make it stop early and restore the model that achieves the best result on a validation metric? Ideally, I would like to do this with checkpoints, rather than using matchzoo.auto.tuner.tuner and re-training the model over and over, or some other hacky solution. I assume something is already in place for this.

  • If the trainer indeed restores the checkpoint with the highest score, after the 10 epochs are finished running: Which metric is used to determine the highest score? Is it just the first metric in the list of task.metrics?

Thank you for your help!
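
Whether the trainer restores the best epoch is not settled in this thread. Independent of MatchZoo-py, the behaviour being asked for (keep the best state by one validation metric, stop after a few epochs without improvement) can be sketched in plain Python. The names fit_with_best_checkpoint, train_one_epoch and validate are hypothetical and are not part of the MatchZoo-py API; the "model" is a plain dict standing in for real weights.

```python
import copy
from typing import Callable, Dict, Tuple


def fit_with_best_checkpoint(
    state: Dict[str, int],
    train_one_epoch: Callable[[Dict[str, int], int], None],
    validate: Callable[[Dict[str, int]], float],
    epochs: int = 10,
    patience: int = 3,
) -> Tuple[Dict[str, int], int, float]:
    """Train for up to `epochs` epochs and return the best state seen.

    `validate` returns a single number where higher is better, for
    example the first metric of a ranking task (an assumption here).
    """
    best_state = copy.deepcopy(state)
    best_epoch, best_score = 0, float("-inf")
    for epoch in range(1, epochs + 1):
        train_one_epoch(state, epoch)
        score = validate(state)
        if score > best_score:
            # New best: snapshot the state (the "checkpoint").
            best_state, best_epoch, best_score = copy.deepcopy(state), epoch, score
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop early.
            break
    return best_state, best_epoch, best_score


# Toy run: validation scores per epoch are fixed in advance.
toy_scores = [0.50, 0.62, 0.60, 0.61, 0.59, 0.70]


def toy_train(state: Dict[str, int], epoch: int) -> None:
    state["epoch"] = epoch


def toy_validate(state: Dict[str, int]) -> float:
    return toy_scores[state["epoch"] - 1]


best_state, best_epoch, best_score = fit_with_best_checkpoint(
    {"epoch": 0}, toy_train, toy_validate, epochs=6, patience=3
)
```

In the toy run the loop stops at epoch 5 and hands back the snapshot from epoch 2, so the later, better epoch 6 is never reached; that trade-off is what the patience value controls.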

A Compare-Aggregate Model with Dynamic-Clip Attention for Answer Selection

  • I checked to make sure that this is not a duplicate issue
  • I'm submitting the request to the correct repository (for model requests, see here)


Load_glove_embedding with toy dataset

Hi,

When I use the toy dataset, I see the following error on this line:
glove_embedding = mz.datasets.embeddings.load_glove_embedding(dimension=300)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6148: character maps to <undefined>

I would appreciate guidance on how to solve this problem.
A quick reply would be very welcome, as I am short on time :)

Thanks
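
The 'charmap' codec in the message suggests the GloVe file was read with a Windows locale encoding such as cp1252 instead of UTF-8. The sketch below reproduces that failure with the standard library only; it does not call MatchZoo-py, and whether load_glove_embedding exposes an encoding option is not confirmed here. The file name toy_glove.txt is made up for the example.

```python
import os
import tempfile

# A GloVe-style line whose token is a right double quotation mark (U+201D).
# Its UTF-8 encoding is E2 80 9D, and byte 0x9D has no mapping in cp1252.
line = "\u201d 0.1 0.2 0.3\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_glove.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(line)

    # Reading with the wrong codec fails on the 0x9D byte.
    try:
        with open(path, encoding="cp1252") as f:
            f.read()
        cp1252_failed = False
    except UnicodeDecodeError as err:
        cp1252_failed = True
        failing_byte = err.object[err.start]

    # Reading with an explicit UTF-8 encoding succeeds.
    with open(path, encoding="utf-8") as f:
        token, *values = f.read().split()
```

If the library call cannot be given an encoding, Python's UTF-8 mode (setting the environment variable PYTHONUTF8=1 on Python 3.7 or later) changes the default used by open() and may be worth trying.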

bert processor

When I use the BERT preprocessor to transform my dataset, the following warning appears:

Token indices sequence length is longer than the specified maximum sequence length for this model (694 > 512). Running this sequence through the model will result in indexing errors.

But my dataset does not contain sequences that long!
It also leads to an error during training. Do you know how to solve it? Thanks!
Matchzoo version 1.1.1
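
A subword tokenizer can split one word into several pieces, so the token count may exceed the word count of the raw text. One common way to avoid the indexing error is to truncate the token IDs before they reach the model. The sketch below is plain Python under stated assumptions: truncate_pair is a hypothetical helper, not a MatchZoo-py or tokenizer API, and the three reserved positions stand for the usual [CLS] and two [SEP] tokens of a sentence pair.

```python
from typing import List, Tuple


def truncate_pair(
    left: List[int],
    right: List[int],
    max_length: int = 512,
    num_special: int = 3,
) -> Tuple[List[int], List[int]]:
    """Trim two token-ID lists so that, with special tokens, they fit."""
    budget = max_length - num_special
    left, right = list(left), list(right)
    while len(left) + len(right) > budget:
        # Always drop the last token of the currently longer side.
        if len(left) > len(right):
            left.pop()
        else:
            right.pop()
    return left, right


# Toy input: 40 + 654 = 694 token IDs, mirroring the count in the warning.
left_ids = list(range(40))
right_ids = list(range(654))
short_left, short_right = truncate_pair(left_ids, right_ids)
```

Trimming the longer side first keeps short queries intact and cuts only the long document, which is usually the less harmful choice for matching tasks.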

KeyError:'ngram_left'

Hi, I ran into a data processing problem. When I use the model in diin.py, I use its default padding callback:
@classmethod
def get_default_padding_callback(
    cls,
    fixed_length_left: int = 10,
    fixed_length_right: int = 30,
    pad_word_value: typing.Union[int, str] = 0,
    pad_word_mode: str = 'pre',
    with_ngram: bool = True,
    fixed_ngram_length: int = None,
    pad_ngram_value: typing.Union[int, str] = 0,
    pad_ngram_mode: str = 'pre'
) -> BaseCallback:
    return callbacks.BasicPadding(
        fixed_length_left=fixed_length_left,
        fixed_length_right=fixed_length_right,
        pad_word_value=pad_word_value,
        pad_word_mode=pad_word_mode,
        with_ngram=with_ngram,
        fixed_ngram_length=fixed_ngram_length,
        pad_ngram_value=pad_ngram_value,
        pad_ngram_mode=pad_ngram_mode
    )

The batch is then processed in padding.py:

if self._with_ngram:
    ngram_length_left = max([len(w) for k in x['ngram_left'] for w in k])
    ngram_length_right = max([len(w) for k in x['ngram_right'] for w in k])

But here I encountered an error:
KeyError:'ngram_left'
Do you know how this should be solved?
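
The error suggests that the batch x carries no 'ngram_left' field while the padding callback was built with with_ngram=True. Given the signature quoted above, one thing to try is requesting the callback with with_ngram=False, or using a preprocessor that actually produces the n-gram fields; neither is confirmed in this thread. The guard logic can be sketched in plain Python; max_ngram_lengths is a hypothetical helper, not MatchZoo-py code.

```python
from typing import Dict, List, Optional, Tuple

# batch -> texts -> words -> n-gram IDs of one word
Batch = Dict[str, List[List[List[int]]]]


def max_ngram_lengths(x: Batch, with_ngram: bool) -> Optional[Tuple[int, int]]:
    """Return the longest n-gram list on each side, or None if disabled."""
    if not with_ngram:
        return None
    missing = [key for key in ("ngram_left", "ngram_right") if key not in x]
    if missing:
        # Fail with an actionable message instead of a bare KeyError.
        raise KeyError(
            f"with_ngram=True but the batch lacks {missing}; "
            "disable n-gram padding or add n-gram features upstream"
        )
    left = max(len(w) for k in x["ngram_left"] for w in k)
    right = max(len(w) for k in x["ngram_right"] for w in k)
    return left, right


batch_with_ngrams: Batch = {
    "ngram_left": [[[1, 2], [3]], [[4, 5, 6]]],
    "ngram_right": [[[7]], [[8, 9]]],
}
batch_without_ngrams: Batch = {"text_left": [[[1], [2]]]}

lengths = max_ngram_lengths(batch_with_ngrams, with_ngram=True)
skipped = max_ngram_lengths(batch_without_ngrams, with_ngram=False)
```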

For drmm model, dense_output should be flipped at dim=-1?

Hi,

In drmm model,

Should it be
x = torch.einsum('bl,bl->b', torch.flip(dense_output,(-1,)), attention_probs)

Instead of
x = torch.einsum('bl,bl->b', dense_output, attention_probs)

After making this change, the training loss decreases much faster than in the tutorial:
https://github.com/NTMC-Community/MatchZoo-py/blob/master/tutorials/ranking/drmm.ipynb

Below are my training logs:

Epoch 1/10: 100%|██████████| 319/319 [02:36<00:00, 6.02it/s, loss=2.167][Iter-319 Loss-2.134]:
Validation: normalized_discounted_cumulative_gain@3(0.0): 0.5825 - normalized_discounted_cumulative_gain@5(0.0): 0.6421 - mean_average_precision(0.0): 0.6019
Epoch 1/10: 100%|██████████| 319/319 [02:44<00:00, 1.93it/s, loss=2.167]
Epoch 2/10: 100%|█████████▉| 318/319 [01:11<00:00, 6.27it/s, loss=1.729][Iter-638 Loss-1.776]:
Epoch 2/10: 100%|██████████| 319/319 [01:19<00:00, 4.02it/s, loss=0.877]
0%| | 0/319 [00:00<?, ?it/s] Validation: normalized_discounted_cumulative_gain@3(0.0): 0.5726 - normalized_discounted_cumulative_gain@5(0.0): 0.6363 - mean_average_precision(0.0): 0.589
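
For readers unfamiliar with the notation, 'bl,bl->b' is a row-wise dot product, and flipping one operand along the last dimension changes which positions are paired. The plain-Python sketch below only illustrates that arithmetic with made-up numbers; rowwise_dot is a hypothetical stand-in for the einsum call, and it does not decide which variant is correct for DRMM.

```python
from typing import List

Matrix = List[List[float]]


def rowwise_dot(a: Matrix, b: Matrix) -> List[float]:
    """Equivalent of einsum('bl,bl->b', a, b) for nested lists."""
    return [sum(x * y for x, y in zip(row_a, row_b)) for row_a, row_b in zip(a, b)]


dense_output = [[1.0, 2.0, 3.0]]      # one score per query term
attention_probs = [[0.0, 0.0, 1.0]]   # all weight on the last position

# Position i of dense_output is weighted by position i of attention_probs.
plain = rowwise_dot(dense_output, attention_probs)

# After the flip, position i is weighted by position L-1-i instead.
flipped = rowwise_dot([row[::-1] for row in dense_output], attention_probs)
```

Here the unflipped product picks the value 3.0 and the flipped one picks 1.0, so the two variants agree only if the two tensors are already ordered in opposite directions, for instance because of different padding modes.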

How to load large embedding efficiently?

Describe the Question

I tried to load the 840B-token, 300-dimensional GloVe embedding using mz.embedding.load_from_file. However, it uses more than 60 GB of memory, which looks abnormal.

from pathlib import Path
import matchzoo as mz


_glove_6B_embedding_url = "http://nlp.stanford.edu/data/glove.6B.zip"
_glove_840B_embedding_url = "http://nlp.stanford.edu/data/glove.840B.300d.zip"


def load_glove_embedding(dimension: int = 50, size="6B") -> mz.embedding.Embedding:
    """
    Return the pretrained glove embedding.

    :param dimension: the size of embedding dimension, the value can only be
        50, 100, or 300.
    :return: The :class:`mz.embedding.Embedding` object.
    """
    file_name = 'glove.{}.{}d.txt'.format(size, dimension)
    file_path = (Path(mz.USER_DATA_DIR) / 'glove').joinpath(file_name)

    if not file_path.exists():
        if size=="6B":
            url = _glove_6B_embedding_url
        elif size == "840B":
            url = _glove_840B_embedding_url
        else:
            # size is a string, so format it with %s (%d would raise TypeError)
            raise ValueError("Incorrect Size for GloVe: %s" % size)

        mz.utils.get_file('glove_embedding',
                          url,
                          extract=True,
                          cache_dir=mz.USER_DATA_DIR,
                          cache_subdir='glove')

    return mz.embedding.load_from_file(file_path=str(file_path), mode='glove')

embedding = load_glove_embedding(300, "840B")

Describe your attempts

The TensorFlow version of MatchZoo uses pandas to read the GloVe file and requires much less memory.
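
One way to keep memory down, independent of how load_from_file is implemented, is to stream the file and keep only the rows for terms that occur in the task vocabulary, stored as compact float arrays. The sketch below uses only the standard library; load_vectors_for_vocab is a hypothetical helper, not a MatchZoo-py function. It splits from the right because tokens in large GloVe files can themselves contain spaces.

```python
import io
from array import array
from typing import Dict, Iterable, Set


def load_vectors_for_vocab(
    lines: Iterable[str], vocab: Set[str], dimension: int
) -> Dict[str, array]:
    """Stream GloVe-style lines and keep vectors only for terms in vocab."""
    vectors: Dict[str, array] = {}
    for line in lines:
        # The last `dimension` fields are numbers; the rest is the token.
        parts = line.rstrip("\n").rsplit(" ", dimension)
        if len(parts) != dimension + 1:
            continue  # skip malformed lines
        term = parts[0]
        if term in vocab:
            # array('f') stores 4 bytes per value instead of a Python float object.
            vectors[term] = array("f", map(float, parts[1:]))
    return vectors


# Toy "file" with three rows; only two terms are in the vocabulary.
toy_file = io.StringIO(
    "the 0.5 0.25 -1.0\n"
    "unused 1.0 2.0 3.0\n"
    "new york 0.0 1.5 2.5\n"
)
vectors = load_vectors_for_vocab(toy_file, {"the", "new york"}, dimension=3)
```

With a real file, the same function can be fed an open file handle, for example `with open(path, encoding="utf-8") as f: load_vectors_for_vocab(f, vocab, 300)`, so the full embedding table is never held in memory at once.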

Add on-the-fly sampling

As far as I understand, text pairs are currently picked directly from the DataPack samples. However, it is almost impossible to mine "hard" negative samples before training. Simply adding random non-matching pairs as class 0 / rank 0.0 is wasteful and inefficient.
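
The idea can be sketched as follows: at each training step, draw a small pool of non-matching candidates and keep the ones the current model scores highest. This is a plain-Python illustration of the request, not an existing MatchZoo-py feature; sample_hard_negatives and the word-overlap score_fn are hypothetical stand-ins for a sampler and a model.

```python
import random
from typing import Callable, List, Optional, Sequence, Set


def sample_hard_negatives(
    query: str,
    positives: Set[str],
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
    num_neg: int = 4,
    pool_size: int = 16,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick the num_neg highest-scoring non-matching documents from a random pool."""
    rng = rng or random.Random()
    negatives = [doc for doc in candidates if doc not in positives]
    # Draw a fresh random pool on every call, i.e. on the fly.
    pool = rng.sample(negatives, min(pool_size, len(negatives)))
    # "Hard" negatives are the ones the current model ranks highest.
    pool.sort(key=lambda doc: score_fn(query, doc), reverse=True)
    return pool[:num_neg]


def overlap_score(query: str, doc: str) -> float:
    """Toy model: number of shared words."""
    return float(len(set(query.split()) & set(doc.split())))


query = "how tall is mount everest"
positive = "mount everest is 8849 metres tall"
documents = [
    positive,
    "mount fuji is a volcano in japan",
    "how tall is the eiffel tower",
    "bananas are rich in potassium",
    "everest base camp",
]
hard = sample_hard_negatives(
    query, {positive}, documents, overlap_score,
    num_neg=2, pool_size=10, rng=random.Random(0),
)
```

Because the pool is redrawn and rescored on every call, the negatives track the model as it improves, which a fixed set prepared before training cannot do.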

How to get the actual rank using trainer.predict()?

Describe the Question

I am trying to get rankings out of a trained model (using the trainer). However, when I call trainer.predict(), I get back a numpy array of shape num_qids x 1. The number of rows that predict returns depends on the dataloader dl passed to trainer.predict(dl).

In other words, as I understand I get a score (probably the first metric I've defined on metrics?) for each query id. However, what I need is a ranked list of documents for each query id, rather than a single score.

How can I get that? I could find no solution through the tutorials.

My code looks like:


    trainer.run()

    # Evaluation
    print('Validation results:')
    print(trainer.evaluate(valid_dl))
    print('Test results:')
    print(trainer.evaluate(test_dl))


    val_preds = trainer.predict(valid_dl)
    test_preds = trainer.predict(test_dl)

val_preds.shape
>> Out[18]: (150, 1)
valid_dl.label.shape
>> Out[19]: (150,)
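
The matching shapes of the predictions and the labels are consistent with one score per (query, document) row rather than one score per query, although the thread does not confirm this. Under that assumption, and assuming the scores come back in the same order as the rows of the data, a ranked list per query is a group-and-sort. The sketch below is plain Python; rank_documents is a hypothetical helper, and the ID lists stand for whatever query and document identifiers accompany each row.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rank_documents(
    query_ids: Sequence[str],
    doc_ids: Sequence[str],
    scores: Sequence[float],
) -> Dict[str, List[Tuple[str, float]]]:
    """Group scores by query and sort each group from best to worst."""
    per_query: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for query_id, doc_id, score in zip(query_ids, doc_ids, scores):
        per_query[query_id].append((doc_id, score))
    return {
        query_id: sorted(pairs, key=lambda pair: pair[1], reverse=True)
        for query_id, pairs in per_query.items()
    }


# Toy data: five rows, two queries.
query_ids = ["Q1", "Q1", "Q1", "Q2", "Q2"]
doc_ids = ["D1", "D2", "D3", "D4", "D5"]
scores = [0.25, 0.75, 0.5, 0.125, 0.875]

ranking = rank_documents(query_ids, doc_ids, scores)
```

A prediction array of shape (n, 1) would first need flattening to n values, for example with `val_preds[:, 0]`, before being passed as scores.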

Evaluation mode

Should the evaluation function call model.eval() before running?

Load Data

Hi,

I have the Robust04 data for testing DRMM, but I do not know how to load it.
I would appreciate any guidance.

Many thanks in advance.

bugs of preprocessor in version 1.1.1

Running the example in the README file produces the following error:

Traceback (most recent call last):
File "", line 1, in
File "/root/download/MatchZoo-py/matchzoo/engine/base_preprocessor.py", line 97, in fit_transform
return self.fit(data_pack, verbose=verbose)
File "/root/download/MatchZoo-py/matchzoo/preprocessors/basic_preprocessor.py", line 110, in fit
verbose=verbose)
File "/root/download/MatchZoo-py/matchzoo/preprocessors/build_unit_from_data_pack.py", line 32, in build_unit_from_data_pack
data_pack.apply_on_text(corpus.append, mode=mode, verbose=verbose)
File "/root/download/MatchZoo-py/matchzoo/data_pack/data_pack.py", line 246, in wrapper
func(target, *args, **kwargs)
File "/root/download/MatchZoo-py/matchzoo/data_pack/data_pack.py", line 401, in apply_on_text
self._apply_on_text_right(func, rename, verbose=verbose)
File "/root/download/MatchZoo-py/matchzoo/data_pack/data_pack.py", line 410, in _apply_on_text_right
self._right[name] = self._right['text_right'].progress_apply(func)
File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 733, in inner
func = df._is_builtin_func(func)
File "/opt/conda/lib/python3.7/site-packages/pandas-0.24.2-py3.7-linux-x86_64.egg/pandas/core/base.py", line 660, in _is_builtin_func
return self._builtin_table.get(arg, arg)
TypeError: unhashable type: 'list'

TypeError: unhashable type: 'list'

Describe the bug

Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 96%|█████████▌| 18113/18841 [00:10<00:00, 2020.19it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 97%|█████████▋| 18317/18841 [00:10<00:00, 2011.95it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 98%|█████████▊| 18520/18841 [00:10<00:00, 1890.53it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 18841/18841 [00:11<00:00, 1703.43it/s]

Processing text_right with append: 0%| | 0/18841 [00:00<?, ?it/s]

TypeError Traceback (most recent call last)
in ()
1 preprocessor = mz.models.ArcI.get_default_preprocessor()
----> 2 train_processed = preprocessor.fit_transform(train_pack)
3 valid_processed = preprocessor.transform(valid_pack)

/home/anaconda3/lib/python3.6/site-packages/matchzoo/engine/base_preprocessor.py in fit_transform(self, data_pack, verbose)
95 :param verbose: Verbosity.
96 """
---> 97 return self.fit(data_pack, verbose=verbose)
98 .transform(data_pack, verbose=verbose)
99

/home/anaconda3/lib/python3.6/site-packages/matchzoo/preprocessors/basic_preprocessor.py in fit(self, data_pack, verbose)
108 flatten=False,
109 mode='right',
--> 110 verbose=verbose)
111 data_pack = data_pack.apply_on_text(fitted_filter_unit.transform,
112 mode='right', verbose=verbose)

/home/anaconda3/lib/python3.6/site-packages/matchzoo/preprocessors/build_unit_from_data_pack.py in build_unit_from_data_pack(unit, data_pack, mode, flatten, verbose)
30 data_pack.apply_on_text(corpus.extend, mode=mode, verbose=verbose)
31 else:
---> 32 data_pack.apply_on_text(corpus.append, mode=mode, verbose=verbose)
33 if verbose:
34 description = 'Building ' + unit.class.name + \

/home/anaconda3/lib/python3.6/site-packages/matchzoo/data_pack/data_pack.py in wrapper(self, inplace, *args, **kwargs)
244 target = self.copy()
245
--> 246 func(target, *args, **kwargs)
247
248 if not inplace:

/home/anaconda3/lib/python3.6/site-packages/matchzoo/data_pack/data_pack.py in apply_on_text(self, func, mode, rename, verbose)
399 self._apply_on_text_left(func, rename, verbose=verbose)
400 elif mode == 'right':
--> 401 self._apply_on_text_right(func, rename, verbose=verbose)
402 else:
403 raise ValueError(f"{mode} is not a valid mode type."

/home/anaconda3/lib/python3.6/site-packages/matchzoo/data_pack/data_pack.py in _apply_on_text_right(self, func, rename, verbose)
408 if verbose:
409 tqdm.pandas(desc="Processing " + name + " with " + func.name)
--> 410 self._right[name] = self._right['text_right'].progress_apply(func)
411 else:
412 self._right[name] = self._right['text_right'].apply(func)

/home/anaconda3/lib/python3.6/site-packages/tqdm/std.py in inner(df, func, *args, **kwargs)

/home/anaconda3/lib/python3.6/site-packages/pandas/core/base.py in _is_builtin_func(self, arg)
658 otherwise return the arg
659 """
--> 660 return self._builtin_table.get(arg, arg)
661
662

TypeError: unhashable type: 'list'
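
The last frame shows pandas looking func up in a dictionary, which requires hashing it. Here func is corpus.append, a bound method of a list. On Python versions before 3.8, hashing a built-in bound method appears to hash the object it is bound to, and a list is not hashable, which would explain the message. One possible workaround, assuming the call site can be edited, is to pass a plain function that wraps the append. The sketch below uses a made-up builtin_table as a stand-in for the pandas lookup table and does not call pandas or MatchZoo-py.

```python
from typing import Callable, Dict, List

corpus: List[List[str]] = []


def append_to_corpus(tokens: List[str]) -> None:
    """Plain function wrapper: hashable on every Python version."""
    corpus.append(tokens)


# Stand-in for the lookup table that maps built-ins to optimised versions.
builtin_table: Dict[Callable, str] = {sum: "sum"}

# The lookup hashes its argument; a plain function hashes by identity.
resolved = builtin_table.get(append_to_corpus, append_to_corpus)
resolved(["how", "are", "you"])
```

Upgrading or pinning tqdm and pandas to a compatible pair of versions may also avoid the lookup, but which versions work is not established in these reports.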
