ntmc-community / matchzoo-py

Facilitating the design, comparison and sharing of deep text matching models.

License: Apache License 2.0

Python 99.61% Makefile 0.39%

deep-learning matching natural-language-processing neural-network pytorch text text-matching

matchzoo-py's Issues

Load_glove_embedding with toy dataset

Hi,

When I use the toy dataset, the following line raises an error:
glove_embedding = mz.datasets.embeddings.load_glove_embedding(dimension=300)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6148: character maps to <undefined>

I would appreciate guidance on how to solve this problem.
I would also be grateful for a quick reply, since I am short on time :)

Thanks
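
A possible cause, assuming the code runs on Windows: the word 'charmap' in the message suggests that a legacy code page (such as cp1252) is being used to decode the UTF-8 GloVe file. The sketch below reproduces the same kind of error on a made-up toy file and shows that passing encoding='utf-8' explicitly avoids it. The file name and the helper `read_first_token` are hypothetical and not part of MatchZoo.

```python
import os
import tempfile


def read_first_token(path, encoding):
    """Return the first space-separated token of the first line of a text file."""
    with open(path, encoding=encoding) as f:
        return f.readline().split(" ")[0]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_glove.txt")  # hypothetical stand-in for a GloVe file
    # U+201D (right double quotation mark) is encoded in UTF-8 as E2 80 9D;
    # byte 0x9D has no mapping in cp1252.
    with open(path, "wb") as f:
        f.write("\u201d 0.1 0.2 0.3\n".encode("utf-8"))

    try:
        read_first_token(path, "cp1252")
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

    # Decoding with the encoding the file was written in works.
    token = read_first_token(path, "utf-8")
```

If this is indeed the cause, the place to pass the encoding is wherever the embedding file is opened in the installed version.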

A Compare-Aggregate Model with Dynamic-Clip Attention for Answer Selection

Evaluation mode

Should the evaluation function call model.eval() before running?

TypeError: unhashable type: 'list'

Describe the bug

Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 96%|█████████▌| 18113/18841 [00:10<00:00, 2020.19it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 97%|█████████▋| 18317/18841 [00:10<00:00, 2011.95it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 98%|█████████▊| 18520/18841 [00:10<00:00, 1890.53it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 18841/18841 [00:11<00:00, 1703.43it/s]

Processing text_right with append: 0%| | 0/18841 [00:00<?, ?it/s]

TypeError Traceback (most recent call last)
in ()
1 preprocessor = mz.models.ArcI.get_default_preprocessor()
----> 2 train_processed = preprocessor.fit_transform(train_pack)
3 valid_processed = preprocessor.transform(valid_pack)

/home/anaconda3/lib/python3.6/site-packages/matchzoo/engine/base_preprocessor.py in fit_transform(self, data_pack, verbose)
95 :param verbose: Verbosity.
96 """
---> 97 return self.fit(data_pack, verbose=verbose)
98 .transform(data_pack, verbose=verbose)
99

/home/anaconda3/lib/python3.6/site-packages/matchzoo/preprocessors/basic_preprocessor.py in fit(self, data_pack, verbose)
108 flatten=False,
109 mode='right',
--> 110 verbose=verbose)
111 data_pack = data_pack.apply_on_text(fitted_filter_unit.transform,
112 mode='right', verbose=verbose)

/home/anaconda3/lib/python3.6/site-packages/matchzoo/preprocessors/build_unit_from_data_pack.py in build_unit_from_data_pack(unit, data_pack, mode, flatten, verbose)
30 data_pack.apply_on_text(corpus.extend, mode=mode, verbose=verbose)
31 else:
---> 32 data_pack.apply_on_text(corpus.append, mode=mode, verbose=verbose)
33 if verbose:
34 description = 'Building ' + unit.class.name + \

/home/anaconda3/lib/python3.6/site-packages/matchzoo/data_pack/data_pack.py in wrapper(self, inplace, *args, **kwargs)
244 target = self.copy()
245
--> 246 func(target, *args, **kwargs)
247
248 if not inplace:

/home/anaconda3/lib/python3.6/site-packages/matchzoo/data_pack/data_pack.py in apply_on_text(self, func, mode, rename, verbose)
399 self._apply_on_text_left(func, rename, verbose=verbose)
400 elif mode == 'right':
--> 401 self._apply_on_text_right(func, rename, verbose=verbose)
402 else:
403 raise ValueError(f"{mode} is not a valid mode type."

/home/anaconda3/lib/python3.6/site-packages/matchzoo/data_pack/data_pack.py in _apply_on_text_right(self, func, rename, verbose)
408 if verbose:
409 tqdm.pandas(desc="Processing " + name + " with " + func.name)
--> 410 self._right[name] = self._right['text_right'].progress_apply(func)
411 else:
412 self._right[name] = self._right['text_right'].apply(func)

/home/anaconda3/lib/python3.6/site-packages/tqdm/std.py in inner(df, func, *args, **kwargs)

/home/anaconda3/lib/python3.6/site-packages/pandas/core/base.py in _is_builtin_func(self, arg)
658 otherwise return the arg
659 """
--> 660 return self._builtin_table.get(arg, arg)
661
662

TypeError: unhashable type: 'list'
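
A possible reading of the traceback, offered as a sketch rather than a confirmed diagnosis: the last frame is a dict lookup, `self._builtin_table.get(arg, arg)`, which has to hash the callable it receives. On Python 3.7 and earlier (the versions shown in these logs), hashing a bound built-in method such as `corpus.append` hashes the underlying list, which is not hashable. The stand-in table and the wrapper `collect` below are hypothetical; they only illustrate why a plain function wrapper sidesteps the lookup problem.

```python
corpus = []

# Stand-in for the table of builtins that pandas consults (hypothetical contents).
builtin_table = {sum: "sum", max: "max"}


def lookup(func):
    """Mimic the failing call: a dict lookup that must hash its argument."""
    return builtin_table.get(func, func)


# A list can never be used as a dict key, which is the error in the log.
try:
    builtin_table.get(corpus, corpus)
    list_lookup_failed = False
except TypeError:
    list_lookup_failed = True


# Workaround sketch: a plain function is hashed by identity, whatever it wraps.
def collect(tokens):
    corpus.append(tokens)


resolved = lookup(collect)
for tokens in (["how", "are", "you"], ["fine"]):
    resolved(tokens)
```

Pinning the library versions that the MatchZoo release was tested with may be the simpler route, but that is a guess.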

Implement DIIN model

For drmm model, can I use other value as label?

I am wondering if I can use label = 0.5 or other float value as label to train and test drmm model? How about duet and matchpyramid?

Implement MatchSRNN model

Implement HCRN model

Question about wiki qa dataset

I did some analysis of the WikiQA dataset:

training set:
Left num: 2118; Right num: 18841; Relation num: 20360; positive examples (label 1): 1040 (5.1%)
dev set:
Left num: 296; Right num: 2708; Relation num: 2733; positive examples: 140 (5.12%)
test set:
Left num: 633; Right num: 5961; Relation num: 6165; positive examples: 293 (4.75%)

I wonder whether this is the official way to pair questions and answers. Positive examples make up only about 5% of each of the three sets, which means a model that always outputs 0 would reach about 95% accuracy, and the best performance of BERT on this dataset is also just 95%. Is the ratio of positive to negative examples too imbalanced?
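
The arithmetic behind the "always output 0" baseline can be checked directly from the counts above. This is only a worked calculation; the function name is made up for illustration.

```python
# Counts copied from the analysis above.
splits = {
    "train": {"relations": 20360, "positives": 1040},
    "dev": {"relations": 2733, "positives": 140},
    "test": {"relations": 6165, "positives": 293},
}


def all_negative_accuracy(relations, positives):
    """Accuracy of a classifier that predicts label 0 for every pair."""
    return (relations - positives) / relations


baseline = {name: all_negative_accuracy(**counts) for name, counts in splits.items()}
for name, acc in baseline.items():
    print(f"{name}: always-0 accuracy = {acc:.2%}")
```

Since the trivial baseline already sits near 95%, accuracy says little here. The training logs elsewhere on this page report ranking metrics (NDCG, MAP) instead, which are computed per question and do not reward predicting 0 everywhere.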

DRMM

Add the year and source of papers in the response retrieval module

Add the year and source of each paper in the response retrieval module to make the paper information more complete.

bert model

I get an error when I run your BERT model example. Could you tell me how to solve it? Thanks!

padding_callback = mz.models.Bert.get_default_padding_callback()
trainloader = mz.dataloader.DataLoader(
    dataset=trainset,
    batch_size=20,
    stage='train',
    resample=True,
    sort=False,
    callback=padding_callback
)
testloader = mz.dataloader.DataLoader(
    dataset=testset,
    batch_size=20,
    stage='dev',
    callback=padding_callback
)

TypeError Traceback (most recent call last)
in
6 resample=True,
7 sort=False,
----> 8 callback=padding_callback
9 )
10 testloader = mz.dataloader.DataLoader(

TypeError: __init__() got an unexpected keyword argument 'batch_size'

mz.auto.Tuner , Validator not satifised.

Describe the Question

I am on matchzoo-py 1.1.1, and when I try mz.auto.Tuner I get an error while tuning the model.

Describe your attempts

I walked through the tutorials.
Please provide a model_tuning tutorial for matchzoo-py. When I use mz.auto.Tuner() to tune my model, I get the following error:

tuner = mz.auto.Tuner(
    params=model.params,
    train_data=train,
    test_data=test,
    num_runs=5
)

ValueError: Validator not satifised.
The validator's definition is as follows:
validator=lambda x: isinstance(x, np.ndarray)

I checked the documentation
I checked to make sure that this is not a duplicate question

DIIN model empty sequence error

I'm running the DIIN model for a document classification task and ran into ValueError: max() arg is an empty sequence, which points to matchzoo/dataloader/callbacks/padding.py, line 114.

MatchZoo-py/matchzoo/dataloader/callbacks/padding.py

Lines 136 to 139 in 49548ad

    ngram_length_left = max([len(w)
                             for k in x['ngram_left'] for w in k])
    ngram_length_right = max([len(w)
                              for k in x['ngram_right'] for w in k])

It seems that the value of x['ngram_left'] is an empty list. I wondered whether it should be x['text_left'] instead and tried that, which results in TypeError: object of type 'numpy.int64' has no len(). I then took a peek at the value types of both but still have no clue.
Any help would be appreciated.
MZ version: 1.1
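
The error itself is easy to reproduce in isolation: `max()` over an empty sequence raises ValueError unless a `default` is given. The sketch below assumes the n-gram batch is nested as [sample][word][ngram ids]; the helper `longest_ngram` is hypothetical and is not the library's padding code.

```python
def longest_ngram(ngram_batch, default=0):
    """Length of the longest per-word n-gram list in a nested batch."""
    return max((len(word) for sample in ngram_batch for word in sample),
               default=default)


empty_batch = []                             # what x['ngram_left'] appears to be
normal_batch = [[[1, 2, 3], [4]], [[5, 6]]]  # two samples, made-up n-gram ids

# Same expression as in padding.py, applied to the empty batch.
try:
    max([len(w) for k in empty_batch for w in k])
    raised = False
except ValueError:
    raised = True

guarded_empty = longest_ngram(empty_batch)
guarded_normal = longest_ngram(normal_batch)
```

A default only hides the symptom. If the n-gram field is empty, the more likely fix is upstream, for example making sure the preprocessor used actually produces n-gram inputs for this model.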

DSSM tutorial code: RuntimeError Expected object of scalar type Float but got scalar type Double

Describe the bug

When I run the tutorial code for DSSM, the following error appears:
RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 'mat1' in call to _th_addmm

please find the details below:

RuntimeError Traceback (most recent call last)
in
----> 1 trainer.run()

~/opt/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1.1-py3.7.egg/matchzoo/trainers/trainer.py in run(self)
225 for epoch in range(self._start_epoch, self._epochs + 1):
226 self._epoch = epoch
--> 227 self._run_epoch()
228 self._run_scheduler()
229 if self._early_stopping.should_stop_early:

~/opt/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1.1-py3.7.egg/matchzoo/trainers/trainer.py in _run_epoch(self)
250 disable=not self._verbose) as pbar:
251 for step, (inputs, target) in pbar:
--> 252 outputs = self._model(inputs)
253 # Caculate all losses and sum them up
254 loss = torch.sum(

~/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
720 result = self._slow_forward(*input, **kwargs)
721 else:
--> 722 result = self.forward(*input, **kwargs)
723 for hook in itertools.chain(
724 _global_forward_hooks.values(),

~/opt/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1.1-py3.7.egg/matchzoo/models/dssm.py in forward(self, inputs)
89
90 # print (inputs['ngram_left'])
---> 91 # print (inputs['ngram_right'])
92 # Process left & right input.
93 input_left, input_right = inputs['ngram_left'], inputs['ngram_right'].float()

~/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
115 def forward(self, input):
116 for module in self:
--> 117 input = module(input)
118 return input
119

~/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/linear.py in forward(self, input)
89
90 def forward(self, input: Tensor) -> Tensor:
---> 91 return F.linear(input, self.weight, self.bias)
92
93 def extra_repr(self) -> str:

~/opt/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py in linear(input, weight, bias)
1672 if input.dim() == 2 and bias is not None:
1673 # fused op is marginally faster
-> 1674 ret = torch.addmm(bias, input, weight.t())
1675 else:
1676 output = input.matmul(weight.t())

RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 'mat1' in call to _th_addmm

Describe your attempts

I tried to address this problem following this post:
https://stackoverflow.com/questions/56741087/how-to-fix-runtimeerror-expected-object-of-scalar-type-float-but-got-scalar-typ/56741419

But adding float() does not seem to work.

Context

OS: MAC OS 10.15.6
Hardware: CPU only
MatchZoo-py version: 1.1.1

Add on-the-fly sampling

As far as I understand, text pairs are currently picked directly from DataPack samples. However, it is almost impossible to mine "hard" negative samples before training. Simply adding random non-matching pairs as class 0 / rank 0.0 is very wasteful and inefficient.

DRMM Tutorial

TypeError when use sample code in README.md

Thanks for your project. I installed this package and ran the code in README.md, but I got an error. The detailed log follows:

Processing text_left with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|████████████████████████████████████████████████████████████████████████████████| 2118/2118 [00:00<00:00, 10964.68it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████████████████████████████████████████████████████████████████████████| 18841/18841 [00:03<00:00, 6015.32it/s]
Processing text_right with append:   0%|                                                                                                                                             | 0/18841 [00:00<?, ?it/s]Traceback (most recent call last):
  File "test_matchzoo.py", line 15, in <module>
    train_processed = preprocessor.fit_transform(train_pack)
  File "/public/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1-py3.7.egg/matchzoo/engine/base_preprocessor.py", line 97, in fit_transform
    return self.fit(data_pack, verbose=verbose) \
  File "/public/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1-py3.7.egg/matchzoo/preprocessors/basic_preprocessor.py", line 110, in fit
    verbose=verbose)
  File "/public/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1-py3.7.egg/matchzoo/preprocessors/build_unit_from_data_pack.py", line 32, in build_unit_from_data_pack
    data_pack.apply_on_text(corpus.append, mode=mode, verbose=verbose)
  File "/public/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1-py3.7.egg/matchzoo/data_pack/data_pack.py", line 246, in wrapper
    func(target, *args, **kwargs)
  File "/public/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1-py3.7.egg/matchzoo/data_pack/data_pack.py", line 401, in apply_on_text
    self._apply_on_text_right(func, rename, verbose=verbose)
  File "/public/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1-py3.7.egg/matchzoo/data_pack/data_pack.py", line 410, in _apply_on_text_right
    self._right[name] = self._right['text_right'].progress_apply(func)
  File "/public/anaconda3/lib/python3.7/site-packages/tqdm/std.py", line 730, in inner
    func = df._is_builtin_func(func)
  File "/public/anaconda3/lib/python3.7/site-packages/pandas/core/base.py", line 660, in _is_builtin_func
    return self._builtin_table.get(arg, arg)
TypeError: unhashable type: 'list'
Processing text_right with append:   0%|                                                                                                                                             | 0/18841 [00:00<?, ?it/s]

Matchzoo version is 1.1

Use of trainer/runer and number of training epochs

Hi, I have a question regarding choosing the epochs and doing hyperparameter tuning in general.

I am currently using matchzoo.trainers.trainer to train my models with the default number of epochs(=10).

Does training always end at epoch 10, or does it keep some sort of checkpoints and then restore the checkpoint/model from the epoch where the validation result is best? This is not very clear to me from the documentation, and there is a lot of confusion because matchzoo and matchzoo-py have different tutorials and documentation.

Apart from that, my question is:

If training stops always on the 10th epoch, how can I make it stop and restore the model that achieves the best results based on a metric from the validation score? Ideally, I would like to do this with checkpoints, rather than using matchzoo.auto.tuner.tuner and re-training the model over and over, or some sort of other hacky solution. I guess there should be already something in place to do that.
If the trainer indeed restores the checkpoint with the highest score, after the 10 epochs are finished running: Which metric is used to determine the highest score? Is it just the first metric in the list of task.metrics?

Thank you for your help!
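
For reference, the generic "keep the best checkpoint" pattern being asked about looks roughly like the sketch below. It is not a description of what matchzoo's Trainer does; the tracebacks on this page only show that the trainer checks `self._early_stopping.should_stop_early` after each epoch. All names here (`train_with_best_checkpoint`, `fake_train`, `fake_validate`) are hypothetical, and the validation scores are made up.

```python
import copy


def train_with_best_checkpoint(train_one_epoch, validate, state, epochs=10, patience=3):
    """Train up to `epochs` epochs and return the state with the best validation score.

    Stops early once the score has not improved for `patience` consecutive epochs.
    """
    best_score = float("-inf")
    best_state = copy.deepcopy(state)
    best_epoch = 0
    stale = 0
    for epoch in range(1, epochs + 1):
        train_one_epoch(state)
        score = validate(state)          # e.g. the first metric of the task
        if score > best_score:
            best_score, best_epoch = score, epoch
            best_state = copy.deepcopy(state)   # in-memory "checkpoint"
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_state, best_epoch, best_score


# Toy stand-ins for a real model and validation set.
scores = iter([0.50, 0.58, 0.61, 0.60, 0.59, 0.57, 0.62])


def fake_train(state):
    state["steps"] += 1


def fake_validate(state):
    return next(scores)


best_state, best_epoch, best_score = train_with_best_checkpoint(
    fake_train, fake_validate, {"steps": 0})
```

In this toy run the score peaks at epoch 3 and training stops after three epochs without improvement, so the state from epoch 3 is returned.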

For drmm model, dense_output should be flipped at dim=-1?

Hi,

In the DRMM model, should it be
x = torch.einsum('bl,bl->b', torch.flip(dense_output,(-1,)), attention_probs)

instead of
x = torch.einsum('bl,bl->b', dense_output, attention_probs)

After I made this change, the training loss decreased much faster than in
https://github.com/NTMC-Community/MatchZoo-py/blob/master/tutorials/ranking/drmm.ipynb

Below are my training logs:

Epoch 1/10: 100%|██████████| 319/319 [02:36<00:00, 6.02it/s, loss=2.167][Iter-319 Loss-2.134]:
Validation: normalized_discounted_cumulative_gain@3(0.0): 0.5825 - normalized_discounted_cumulative_gain@5(0.0): 0.6421 - mean_average_precision(0.0): 0.6019
Epoch 1/10: 100%|██████████| 319/319 [02:44<00:00, 1.93it/s, loss=2.167]
Epoch 2/10: 100%|█████████▉| 318/319 [01:11<00:00, 6.27it/s, loss=1.729][Iter-638 Loss-1.776]:
Epoch 2/10: 100%|██████████| 319/319 [01:19<00:00, 4.02it/s, loss=0.877]
0%| | 0/319 [00:00<?, ?it/s] Validation: normalized_discounted_cumulative_gain@3(0.0): 0.5726 - normalized_discounted_cumulative_gain@5(0.0): 0.6363 - mean_average_precision(0.0): 0.589
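
To make the two variants concrete: `einsum('bl,bl->b', a, b)` is a row-wise dot product, so flipping one argument along the last dimension changes which query-term score is paired with which attention weight. The pure-Python sketch below uses made-up numbers; whether the flip is the right fix depends on whether the two tensors index query terms in the same order, which this issue does not establish.

```python
def rowwise_dot(a, b):
    """Pure-Python equivalent of einsum('bl,bl->b', a, b) for nested lists."""
    return [sum(x * y for x, y in zip(row_a, row_b)) for row_a, row_b in zip(a, b)]


def flip_last_dim(a):
    """Equivalent of torch.flip(a, (-1,)) for a 2-D nested list."""
    return [list(reversed(row)) for row in a]


dense_output = [[1.0, 2.0, 3.0]]      # one score per query term (made up)
attention_probs = [[0.7, 0.2, 0.1]]   # one weight per query term (made up)

original = rowwise_dot(dense_output, attention_probs)
flipped = rowwise_dot(flip_last_dim(dense_output), attention_probs)
```

Here the original pairing gives 1.0*0.7 + 2.0*0.2 + 3.0*0.1, while the flipped pairing gives 3.0*0.7 + 2.0*0.2 + 1.0*0.1.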

How to use gpu to run trainer?

In MatchZoo-py/tutorials/ranking/drmm.ipynb, I changed the 'device' parameter of trainloader = mz.dataloader.DataLoader and trainer = mz.trainers.Trainer to device = torch.device('cuda'). nvidia-smi shows that I have an available GPU, but training still seems to run on the CPU. Can you help me with this? Thank you!

Update bert ranking tutorial

The tutorial in /ranking/bert.ipynb is out of date!

KeyError:'ngram_left'

Hi, I encountered a data processing problem. When I use the diin.py model, I use the model's default padding callback:

    @classmethod
    def get_default_padding_callback(
        cls,
        fixed_length_left: int = 10,
        fixed_length_right: int = 30,
        pad_word_value: typing.Union[int, str] = 0,
        pad_word_mode: str = 'pre',
        with_ngram: bool = True,
        fixed_ngram_length: int = None,
        pad_ngram_value: typing.Union[int, str] = 0,
        pad_ngram_mode: str = 'pre'
    ) -> BaseCallback:
        """
        Model default padding callback.

        The padding callback's on_batch_unpacked would pad a batch of data to
        a fixed length.

        :return: Default padding callback.
        """
        return callbacks.BasicPadding(
            fixed_length_left=fixed_length_left,
            fixed_length_right=fixed_length_right,
            pad_word_value=pad_word_value,
            pad_word_mode=pad_word_mode,
            with_ngram=with_ngram,
            fixed_ngram_length=fixed_ngram_length,
            pad_ngram_value=pad_ngram_value,
            pad_ngram_mode=pad_ngram_mode
        )

Then it is processed in the padding.py file:

    if self._with_ngram:
        ngram_length_left = max([len(w) for k in x['ngram_left'] for w in k])
        ngram_length_right = max([len(w) for k in x['ngram_right'] for w in k])

But here I encountered an error:
KeyError: 'ngram_left'
Do you know how this should be solved?

Implement MatchPyramid model

The num of batches produced by a dataloader is wrong when num_workers > 0

Describe the bug

The num of batches produced by a dataloader is wrong when num_workers > 0

To Reproduce

Colab: https://colab.research.google.com/drive/1keblQN7GL_R9r1T6_82zU2AMP-4hjnJX

With batch_size = 1, the dataloader should produce 100 batches.
However, it produces num_workers * 100 batches.
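
One plausible explanation, assuming the matchzoo Dataset is an iterable-style PyTorch dataset: with num_workers > 0 each worker process receives its own copy of an iterable-style dataset, so every worker yields the full set of samples unless the dataset shards itself by worker id (in PyTorch this id is available through `torch.utils.data.get_worker_info()`). The simulation below uses only the standard library to show the difference; `iterate_dataset` is a hypothetical helper, not matchzoo code.

```python
from itertools import islice


def iterate_dataset(samples, worker_id=0, num_workers=1):
    """Yield only the samples assigned to this worker (round-robin sharding)."""
    return islice(samples, worker_id, None, num_workers)


samples = list(range(100))
num_workers = 4

# Without sharding: every worker iterates over the whole dataset.
unsharded = [s for _ in range(num_workers) for s in samples]

# With sharding: each worker iterates over a disjoint slice.
sharded = [s for w in range(num_workers)
           for s in iterate_dataset(samples, w, num_workers)]
```

The unsharded case yields num_workers * 100 items, matching the behaviour reported above, while the sharded case yields each sample exactly once.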

How to print all the parameters of a pre-trained model?

For example, if I have a model.pt file of a DRMM model, how can I print all the parameters of each layer?

cuDNN error when run sample code in README

Hi everyone,

I ran the sample code in the README but got the following error:

: block: [80,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1573049310284/work/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [80,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
  0%|                                                                                                                                         | 0/32 [00:14<?, ?it/s]
Traceback (most recent call last):
  File "test_matchzoo.py", line 62, in <module>
    trainer.run()
  File "/home/wup/MatchZoo-py/matchzoo/trainers/trainer.py", line 227, in run
    self._run_epoch()
  File "/home/wup/MatchZoo-py/matchzoo/trainers/trainer.py", line 252, in _run_epoch
    outputs = self._model(inputs)
  File "/public/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wup/MatchZoo-py/matchzoo/models/arci.py", line 187, in forward
    conv_left = self.conv_left(embed_left)
  File "/public/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/public/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/public/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/public/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/public/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/public/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 202, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

My OS is Ubuntu 18.04
PyTorch 1.3.1
CUDA 10.0
cuDNN 7.6.4

Other PyTorch code runs fine on this server. Could you please help me?
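
A hedged observation on the log: the first failure is the assertion `srcIndex < srcSelectDimSize` in an index-select kernel, which usually means that some index passed to a lookup (for example a token id passed to an embedding layer) is not smaller than the number of rows. The cuDNN error may just be a follow-on failure. The check below is a minimal sketch with made-up data; `find_out_of_range` is a hypothetical helper.

```python
def find_out_of_range(batch_indices, num_rows):
    """Return the sorted set of indices that fall outside [0, num_rows)."""
    return sorted({idx for row in batch_indices for idx in row
                   if not 0 <= idx < num_rows})


embedding_rows = 5                # rows in the embedding matrix (made up)
batch = [[0, 3, 4], [2, 7, 1]]    # token ids in one batch (made up)

bad = find_out_of_range(batch, embedding_rows)
```

If a check like this reports anything, the embedding matrix and the vocabulary built by the preprocessor probably disagree in size. Running one batch on the CPU typically gives a clearer error message for the same problem.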

AttributeError: type object 'Path' has no attribute 'expanduser'

Hi,
My Python versions are 2.7 and 3.6.
When I import torch and then import matchzoo as mz in Python 2.7, I receive this error:
"AttributeError: type object 'Path' has no attribute 'expanduser'"
and in Python 3 I receive this error:
"ImportError: cannot import name 'ensure_object'"
even though matchzoo-py installed successfully.
I would appreciate your guidance.

Many thanks in advance.

How to get the actual rank using trainer.predict()?

Describe the Question

I am trying to get the rank out of a trained model (using trainer). However, when I call trainer.predict() I get back a numpy array of shape num_qids x 1. The number of rows that .predict returns depends on the dataloader dl passed to trainer.predict(dl).

In other words, as I understand I get a score (probably the first metric I've defined on metrics?) for each query id. However, what I need is a ranked list of documents for each query id, rather than a single score.

How can I get that? I could find no solution through the tutorials.

My code looks like:


    trainer.run()

    # Evaluation
    print('Validation results:')
    print(trainer.evaluate(valid_dl))
    print('Test results:')
    print(trainer.evaluate(test_dl))


    val_preds = trainer.predict(valid_dl)
    test_preds = trainer.predict(train_dl)

val_preds.shape
>> Out[18]: (150, 1)
valid_dl.label.shape
>> Out[19]: (150,)
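
Since `val_preds` has the same number of rows as `valid_dl.label`, one reading is that each row is the score of one (query, document) pair rather than one value per query. Under that assumption, and assuming the pair ids are available in the same order as the predictions, a ranked list per query is a group-and-sort. The sketch below uses made-up ids and scores; `rank_documents` is a hypothetical helper.

```python
from collections import defaultdict


def rank_documents(id_left, id_right, scores):
    """Group (document, score) pairs by query id and sort by descending score."""
    grouped = defaultdict(list)
    for query_id, doc_id, score in zip(id_left, id_right, scores):
        grouped[query_id].append((doc_id, float(score[0])))  # scores have shape (n, 1)
    return {query_id: sorted(docs, key=lambda pair: pair[1], reverse=True)
            for query_id, docs in grouped.items()}


# Made-up data standing in for the dataloader ids and trainer.predict() output.
id_left = ["q1", "q1", "q2", "q1", "q2"]
id_right = ["d1", "d2", "d3", "d4", "d5"]
scores = [[0.2], [0.9], [0.4], [0.5], [0.7]]

ranking = rank_documents(id_left, id_right, scores)
```

The position of a document in each list is its rank for that query.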

How to load large embedding efficiently?

Describe the Question

I tried to load the 840B 300d GloVe embedding using mz.embedding.load_from_file. However, it uses more than 60 GB of memory, which looks abnormal.

from pathlib import Path
import matchzoo as mz


_glove_6B_embedding_url = "http://nlp.stanford.edu/data/glove.6B.zip"
_glove_840B_embedding_url = "http://nlp.stanford.edu/data/glove.840B.300d.zip"


def load_glove_embedding(dimension: int = 50, size="6B") -> mz.embedding.Embedding:
    """
    Return the pretrained glove embedding.

    :param dimension: the size of embedding dimension, the value can only be
        50, 100, or 300.
    :return: The :class:`mz.embedding.Embedding` object.
    """
    file_name = 'glove.{}.{}d.txt'.format(size, dimension)
    file_path = (Path(mz.USER_DATA_DIR) / 'glove').joinpath(file_name)

    if not file_path.exists():
        if size=="6B":
            url = _glove_6B_embedding_url
        elif size == "840B":
            url = _glove_840B_embedding_url
        else:
            raise ValueError("Incorrect Size for GloVe: %s" % size)

        mz.utils.get_file('glove_embedding',
                          url,
                          extract=True,
                          cache_dir=mz.USER_DATA_DIR,
                          cache_subdir='glove')

    return mz.embedding.load_from_file(file_path=str(file_path), mode='glove')

embedding = load_glove_embedding(300, "840B")

Describe your attempts

The TF version of MatchZoo uses pandas to read the GloVe file and requires much less memory.
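
One memory-friendly alternative, sketched under the assumption that only the vectors for the task vocabulary are needed: stream the text file line by line and keep just those rows, stored as compact float32 arrays. This is not MatchZoo's loader; `load_vectors_for_vocab` is a hypothetical helper, and an in-memory toy file stands in for the real GloVe file.

```python
import io
from array import array


def load_vectors_for_vocab(lines, vocab, dimension):
    """Stream 'term v1 v2 ...' lines and keep vectors only for terms in `vocab`."""
    vectors = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        # Take the last `dimension` fields as the vector, so terms that
        # themselves contain spaces are still parsed correctly.
        term = " ".join(parts[:-dimension])
        if term in vocab:
            vectors[term] = array("f", map(float, parts[-dimension:]))
    return vectors


toy_file = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\nzebra 0.7 0.8 0.9\n")
vectors = load_vectors_for_vocab(toy_file, vocab={"the", "zebra"}, dimension=3)
```

With a real file, `lines` would be an open file handle, so only one line plus the retained vectors is in memory at any time.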

Inefficient DataLoader

I checked to make sure that this is not a duplicate issue
I'm submitting the request to the correct repository (for model requests, see here)

Is your feature request related to a problem? Please describe.

The way the dataloader reads elements from the dataset and datapack is very inefficient in the PyTorch version of MatchZoo. The dataloader retrieves each element from the datapack separately and groups them into a batch.

However, the __getitem__ of datapack is quite expensive. To return a batch of 1000 elements, the __getitem__ method of datapack will be called 1000 times (the argument is an index at each time), and it usually takes 5-10 seconds to produce a single batch (when the batch size = 1000).

In contrast, in the TF version of Matchzoo, the __getitem__ method of datapack will be called with a list of indices together, so each batch will only call it once. As a result, the DataGenerator is much more efficient than the DataLoader.

Describe the solution you'd like

Avoid using the auto-batching provided by DataLoader. Instead, the dataset should pass batches to Dataloader: https://pytorch.org/docs/stable/data.html#disable-automatic-batching

Use a DataGenerator to sample and batch data, and use DataLoader only to accelerate data transfer to the GPU.
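
The difference between the two access patterns can be illustrated without PyTorch. In the sketch below, `CountingPack` is a hypothetical stand-in for a datapack whose `__getitem__` accepts either one index or a list of indices, and it simply counts how often it is called.

```python
class CountingPack:
    """Toy container that counts __getitem__ calls (stand-in for a datapack)."""

    def __init__(self, rows):
        self._rows = rows
        self.calls = 0

    def __getitem__(self, index):
        self.calls += 1
        if isinstance(index, int):
            return self._rows[index]
        return [self._rows[i] for i in index]


def batch_indices(n, batch_size):
    """Yield lists of consecutive indices, one list per batch."""
    for start in range(0, n, batch_size):
        yield list(range(start, min(start + batch_size, n)))


rows = list(range(2500))

# Per-item access: one __getitem__ call per element.
per_item = CountingPack(rows)
batches_a = [[per_item[i] for i in idx] for idx in batch_indices(len(rows), 1000)]

# Batched access: one __getitem__ call per batch.
batched = CountingPack(rows)
batches_b = [batched[idx] for idx in batch_indices(len(rows), 1000)]
```

Both variants produce the same batches, but the batched one needs 3 calls instead of 2500. In PyTorch, passing `batch_size=None` to `DataLoader` disables automatic batching, so a dataset that already returns whole batches can be used this way, as described in the documentation linked above.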

Implement aNMM model

Implement the BiMPM model

How can I run models on GPU?

How can I run models on GPU?

Dataset Builder creates duplicate query-document pairs & model predictions are odd

I have the following issue, which is really odd and affects the evaluation of the neural models.
I build my data using the auto preparer and realized that, when I make predictions on the test set, some document-query pairs are duplicated.
I am not sure why this is happening. My first guess was that missing examples are filled in to reach the batch size, but this does not seem to be the case.

Here's most of my code:

    model, prpr, dsb, dlb = preparer.prepare(model_class,
                                             train_pack
                                             )

    train_prepr = prpr.transform(train_pack)
    valid_prepr = prpr.transform(valid_pack)
    test_prepr = prpr.transform(test_pack)

    mz.dataloader.dataset_builder.DatasetBuilder()
    train_dataset = dsb.build(train_prepr)
    valid_dataset = dsb.build(valid_prepr)
    test_dataset = dsb.build(test_prepr)

    train_dl = dlb.build(train_dataset, stage='train')
    valid_dl = dlb.build(valid_dataset, stage='dev')
    test_dl = dlb.build(test_dataset, stage='test')

    # training the model etc....

    test_preds = pd.DataFrame(trainer.predict(test_dl), columns=['pred'])
    test_preds['id_left'] = test_dl.id_left
    test_preds['id_right'] = test_dl._dataset[:][0]['id_right']
    test_preds['length_right'] = test_dl._dataset[:][0]['length_right']

Now, it seems that the duplicates are created through the dataset builder, but I don't understand why.

    test_dataset._data_pack.frame().duplicated(['id_left', 'id_right']).sum() 
>> 297
    test_pack.frame().duplicated(['id_left', 'id_right']).sum() 
>>0
    test_prepr.frame().duplicated(['id_left', 'id_right']).sum()
>> 0

Even more odd is the fact that those predictions have different scores for the same document-query pairs, and the scores are not even always close to each other, so this can't be a rounding error. How is it possible that, without re-training the model, I get such different predictions for the same query-document pairs at inference time?


    print(test_preds[test_preds.duplicated(['id_right', 'id_left'],
                                           keep=False)].sort_values(['id_left', 'id_right'])
          )

>>
           pred id_left  id_right  length_right
466  -10.889746  33-1-1  47-07395           896
499   -9.492123  33-1-1  47-07395           896
677   -6.880966  33-1-1  47-07395           896
496  -10.781660  33-1-1  98-33779           535
678   -7.954109  33-1-1  98-33779           535
1044 -11.102488  33-1-1  98-33779           535
508   -6.497414  33-1-1  95-23333           244
1326  -7.466503  33-1-1  95-23333           244

In this replicated example the model used was KNRM, but I think this happens in other models too.

Implement Duet Model!

Add duet!

Add metric learning facilities

I have a task in which I need to match several texts against a huge (~1M texts) database. Even though the available models are pretty fast, this is not enough. The best way to tackle this problem is to learn text embeddings and match them using a k-d tree or a similar structure. But as far as I understand, all models accept two texts and map them to class probabilities or a rank value.

Is it already possible to use matchzoo in the metric learning paradigm, or will it be in the future?

RuntimeError: The size of tensor a (68) must match the size of tensor b (67) at non-singleton dimension 0

Hi!

I encountered the error in the title when running dssm.ipynb on my local machine. My environment is Windows 10 + PyTorch 1.2 + CUDA 9.0. Could you please tell me the solution?

Thanks.

RuntimeError Traceback (most recent call last)
in
----> 1 trainer.run()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\matchzoo\trainers\trainer.py in run(self)
230 for epoch in range(self._start_epoch, self._epochs + 1):
231 self._epoch = epoch
--> 232 self._run_epoch()
233 self._run_scheduler()
234 if self._early_stopping.should_stop_early:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\matchzoo\trainers\trainer.py in _run_epoch(self)
258 # Caculate all losses and sum them up
259 loss = torch.sum(
--> 260 *[c(outputs, target) for c in self._criterions]
261 )
262 self._backward(loss)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\matchzoo\trainers\trainer.py in (.0)
258 # Caculate all losses and sum them up
259 loss = torch.sum(
--> 260 *[c(outputs, target) for c in self._criterions]
261 )
262 self._backward(loss)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\torch\nn\modules\module.py in call(self, *input, **kwargs)
545 result = self._slow_forward(*input, **kwargs)
546 else:
--> 547 result = self.forward(*input, **kwargs)
548 for hook in self._forward_hooks.values():
549 hook_result = hook(self, input, result)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\matchzoo\losses\rank_hinge_loss.py in forward(self, y_pred, y_true)
63 y_pos, y_neg, y_true,
64 margin=self.margin,
---> 65 reduction=self.reduction
66 )
67

~\AppData\Local\Continuum\anaconda3\lib\site-packages\torch\nn\functional.py in margin_ranking_loss(input1, input2, target, margin, size_average, reduce, reduction)
2206 raise RuntimeError(("margin_ranking_loss does not support scalars, got sizes: "
2207 "input1: {}, input2: {}, target: {} ".format(input1.size(), input2.size(), target.size())))
-> 2208 return torch.margin_ranking_loss(input1, input2, target, margin, reduction_enum)
2209
2210

RuntimeError: The size of tensor a (68) must match the size of tensor b (67) at non-singleton dimension 0
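
One plausible cause, stated as an assumption rather than a confirmed diagnosis: pairwise ranking losses typically split a batch into positives and negatives with strided slicing, which only yields equal-length parts when the batch length is a multiple of `num_neg + 1`. The sketch below uses plain lists and a hypothetical `split_pos_neg` helper to show how 135 scores with one negative per positive produce the 68-versus-67 mismatch from the error message.

```python
def split_pos_neg(scores, num_neg=1):
    """Split a flat score list laid out as [pos, neg, ..., pos, neg, ...].

    Hypothetical helper: every (num_neg + 1)-th score is a positive,
    and the scores in between are its negatives.
    """
    step = num_neg + 1
    positives = scores[::step]
    negatives = [scores[offset::step] for offset in range(1, step)]
    return positives, negatives


def is_complete(scores, num_neg=1):
    """True when the batch holds only whole (pos, neg...) groups."""
    return len(scores) % (num_neg + 1) == 0


# A trailing batch with an odd number of scores and num_neg=1.
ragged = list(range(135))
pos, negs = split_pos_neg(ragged, num_neg=1)
sizes = (len(pos), len(negs[0]))  # lengths differ, so an element-wise loss fails

# Dropping the incomplete group restores matching lengths.
trimmed = ragged[: len(ragged) - len(ragged) % 2]
pos_ok, negs_ok = split_pos_neg(trimmed, num_neg=1)
```

If this is indeed the cause, checking the length of the final batch against `num_neg + 1` would confirm it.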

Auto-prepare does not work with some models

Hello everyone, and thanks a lot for this amazing framework!

Auto-preparer does not seem to work with some specific models, in particular: DSSM and CDSSM. I am not sure whether this is an exhaustive list, but I tried quite a few more (ArcI, DRMM, KNRM, ConvKNRM, MatchSRNN & MVLSTM) and they seemed to be working as expected.

To Reproduce

import torch
import matchzoo as mz

task = mz.tasks.Ranking(losses=mz.losses.RankCrossEntropyLoss(num_neg=4))
task.metrics = [
    mz.metrics.NormalizedDiscountedCumulativeGain(k=20),
    mz.metrics.MeanAveragePrecision()
]

# Prepare input data
train_pack = mz.datasets.wiki_qa.load_data('train', task=task)
valid_pack = mz.datasets.wiki_qa.load_data('dev', task=task)

# Auto prepare model etc.

preparer = mz.auto.Preparer(task)

model_class = mz.models.DSSM
# or
model_class = mz.models.CDSSM

model, prpr, dsb, dlb = preparer.prepare(model_class,
                                         train_pack
                                         )



train_prepr = prpr.transform(train_pack)
valid_prepr = prpr.transform(valid_pack)

train_dataset = dsb.build(train_prepr)
valid_dataset = dsb.build(valid_prepr)

train_dl = dlb.build(train_dataset)
valid_dl = dlb.build(valid_dataset)


# make it (T)rain
optimizer = torch.optim.Adam(model.parameters())

trainer = mz.trainers.Trainer(
    model=model,
    optimizer=optimizer,
    trainloader=train_dl,
    validloader=valid_dl,
    epochs=2
)

trainer.run()

For CDSSM I get RuntimeError: Given groups=1, weight of size 3 419 3, expected input[40, 9654, 11] to have 419 channels, but got 9654 channels instead.

My matchzoo.version == 1.1.1

Implement MVLSTM model

Error during training because of float length of sequence(?)

Describe the bug

Hi, I have the following issue:
when I am trying to train my model using trainer.run(), I get the following error:

Traceback (most recent call last):
  File "/Users/xx/.conda/envs/QL_QA/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3331, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-67-041e2033e90a>", line 1, in <module>
    trainer.run()
  File "/Users/xx/.conda/envs/QL_QA/lib/python3.7/site-packages/matchzoo/trainers/trainer.py", line 227, in run
    self._run_epoch()
  File "/Users/xx/.conda/envs/QL_QA/lib/python3.7/site-packages/matchzoo/trainers/trainer.py", line 251, in _run_epoch
    for step, (inputs, target) in pbar:
  File "/Users/xx/.conda/envs/QL_QA/lib/python3.7/site-packages/tqdm/std.py", line 1091, in __iter__
    for obj in iterable:
  File "/Users/xx/.conda/envs/QL_QA/lib/python3.7/site-packages/matchzoo/dataloader/dataloader.py", line 112, in __iter__
    self._handle_callbacks_on_batch_unpacked(x, y)
  File "/Users/xx/.conda/envs/QL_QA/lib/python3.7/site-packages/matchzoo/dataloader/dataloader.py", line 134, in _handle_callbacks_on_batch_unpacked
    self._callback.on_batch_unpacked(x, y)
  File "/Users/xx/.conda/envs/QL_QA/lib/python3.7/site-packages/matchzoo/dataloader/callbacks/padding.py", line 158, in on_batch_unpacked
    self._pad_word_value, dtype=dtype)
  File "/Users/xx/.conda/envs/QL_QA/lib/python3.7/site-packages/numpy/core/numeric.py", line 325, in full
    a = empty(shape, dtype, order)
TypeError: 'numpy.float64' object cannot be interpreted as an integer

I am not 100% sure, but it seems to me that the error is caused by the fact that in my preprocessed datapack, length_right is a float instead of an int (as it is in the toy datasets).

>> toy_datapack.frame()[['length_right','length_left']]
Out[13]: 
    length_right  length_left
0             58           29
1             41           29
2             41           29
3             61           29
4            128           29
5            126           85
6            128           85

while

train_pack = mz.DataPack(relation=relation[relation.id_left.isin(qids['train'])].reset_index(drop=True),
                             left=left[left.index.isin(qids['train'])],
                             # right=right_train,
                             right=right_dict['train'],
                             )
    train_pack.frame().head().dtypes

Out[78]: 
id_left          object
text_left        object
id_right         object
text_right       object
length_right    float64
label           float64
dtype: object

It also seems weird to me that this is happening, since to my understanding, the built-in python len function should return an int.

right_train['length_right'] = right_train.text_right.apply(len)
Out[15]: 

                                                                  text_right  length_right
id_right                                                                                  
clueweb09-en0007-21-42346  Welcome | Logout Log In | Sign Up The Huffingt...          4039
clueweb09-enwp03-01-16807  Ann Dunham From Wikipedia, the free encycloped...         32225
clueweb09-en0010-93-11767  Home Contact Us Bookmark Us Receive Family Tre...          5112
clueweb09-enwp01-36-17161  Maya Soetoro-Ng From Wikipedia, the free encyc...          8279
clueweb09-enwp00-34-05344  Barack Obama, Sr. From Wikipedia, the free enc...         14448
clueweb09-enwp00-34-05347  Barack Obama, Sr. From Wikipedia, the free enc...         14478

I am preparing my data using mz.autoprepare and the models I've tried to use are KNRM and DRMM, but the same issue still occurs.

My matchzoo.version = 1.1.1
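
A small sketch of the failure mode suspected above, independent of MatchZoo. In pandas, an integer column that gains missing values (for example after a join on ids that do not all match) is upcast to float64, and a float cannot be used where an integer size is required. The `to_int_length` helper is hypothetical; it shows one way to validate and convert lengths before padding.

```python
import math


def to_int_length(value):
    """Convert a length to int, rejecting NaN and fractional values."""
    number = float(value)
    if math.isnan(number):
        raise ValueError("length is missing (NaN); check that every id has a text")
    if not number.is_integer():
        raise ValueError("length is not a whole number: %r" % (value,))
    return int(number)


lengths = [58.0, 41.0, 128.0]            # floats, as in the float64 column
clean = [to_int_length(v) for v in lengths]
padded = [[0] * n for n in clean]        # works: sizes are ints now

try:
    [0] * lengths[0]                     # a float size is rejected by Python
except TypeError as exc:
    float_error = str(exc)
```

A NaN raised by the helper would point to rows whose right-hand text is missing, which is worth checking before casting the column.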

RE2 (Simple and Effective Text Matching with Richer Alignment Features)

As the title mentions, it's a new model for sentence matching. FYI.

How to load a pretrained model?

After using trainer.save() to save the model, how can I load the model and evaluate it? Thanks!

bert processor

When I use the BERT processor to transform my dataset, a warning appears:

Token indices sequence length is longer than the specified maximum sequence length for this model (694 > 512). Running this sequence through the model will result in indexing errors.

But my dataset doesn't contain sequences that long!
It also leads to an error when training. Do you know how to solve it? Thanks!
MatchZoo version 1.1.1
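
A hedged sketch of one common workaround: truncating each tokenised sequence to the model limit before it reaches the model. It assumes the warning counts subword tokens, which can considerably outnumber whitespace-separated words, so a text that looks short may still exceed 512 tokens. `truncate_ids`, the `MAX_LEN` constant, and the example ids are illustrative and are not part of MatchZoo or any tokenizer API.

```python
MAX_LEN = 512  # limit quoted in the warning


def truncate_ids(token_ids, max_len=MAX_LEN):
    """Keep at most max_len ids.

    Assumes the sequence starts and ends with a special token, and keeps
    both while cutting the tail of the body.
    """
    if len(token_ids) <= max_len:
        return list(token_ids)
    return [token_ids[0]] + list(token_ids[1:max_len - 1]) + [token_ids[-1]]


# 694 placeholder ids, matching the length reported in the warning.
long_ids = [101] + list(range(1000, 1692)) + [102]
short_ids = truncate_ids(long_ids)
```

Truncation discards the end of long texts, so it is worth checking which field (left or right text) is the long one before applying it.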

For the text matching problem, what is the difference between question-question matching and question-answer matching?

I know question-question matching is a text similarity problem.
What about question-answer matching or question-document matching, as used in information retrieval?
Question-question matching is indeed text similarity, but how do you define question-answer similarity?
Thank you!

bugs of preprocessor in version 1.1.1

Running the example in the README file produced the following error:

Traceback (most recent call last): 0%| | 0/18841 [00:00<?, ?it/s]
File "", line 1, in
File "/root/download/MatchZoo-py/matchzoo/engine/base_preprocessor.py", line 97, in fit_transform
return self.fit(data_pack, verbose=verbose)
File "/root/download/MatchZoo-py/matchzoo/preprocessors/basic_preprocessor.py", line 110, in fit
verbose=verbose)
File "/root/download/MatchZoo-py/matchzoo/preprocessors/build_unit_from_data_pack.py", line 32, in build_unit_from_data_pack
data_pack.apply_on_text(corpus.append, mode=mode, verbose=verbose)
File "/root/download/MatchZoo-py/matchzoo/data_pack/data_pack.py", line 246, in wrapper
func(target, *args, **kwargs)
File "/root/download/MatchZoo-py/matchzoo/data_pack/data_pack.py", line 401, in apply_on_text
self._apply_on_text_right(func, rename, verbose=verbose)
File "/root/download/MatchZoo-py/matchzoo/data_pack/data_pack.py", line 410, in _apply_on_text_right
self._right[name] = self._right['text_right'].progress_apply(func)
File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 733, in inner
func = df._is_builtin_func(func)
File "/opt/conda/lib/python3.7/site-packages/pandas-0.24.2-py3.7-linux-x86_64.egg/pandas/core/base.py", line 660, in _is_builtin_func
return self._builtin_table.get(arg, arg)
TypeError: unhashable type: 'list'
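
A possible reading of the traceback, offered as an assumption: the callable passed to `progress_apply` is `corpus.append`, a bound method of a list, and the dictionary lookup in the last frame has to hash it. On some Python versions, hashing such a bound method hashes the underlying list, which fails. The sketch wraps the call in a plain function, which is hashable by identity; `collect` and `builtin_table` are illustrative names, not pandas or MatchZoo internals.

```python
corpus = []


def collect(tokens):
    """Plain-function wrapper around corpus.append (hashable by identity)."""
    corpus.append(tokens)


# Stand-in for a lookup table keyed by callables.
builtin_table = {sum: "sum", max: "max"}

# Looking up the wrapper hashes the function object, not the list behind it.
resolved = builtin_table.get(collect, collect)

for tokens in (["hello", "world"], ["text", "matching"]):
    resolved(tokens)
```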

Load Data

Hi,

I have the Robust04 data for testing DRMM, but I do not know how to load it.
I would appreciate your guidance.

Many thanks in advance.
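
As a starting point, here is a hedged sketch that parses relevance judgments in the standard TREC qrels layout (topic, iteration, document id, relevance) into rows named after the `id_left`, `id_right`, and `label` columns that appear in the DataPack output elsewhere on this page. `parse_qrels` is a hypothetical helper, the sample lines are made up for illustration, and the query and document texts would still need to be loaded separately.

```python
def parse_qrels(lines):
    """Turn TREC-style qrels lines into relation rows."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, doc_id, relevance = parts
        rows.append({
            "id_left": topic,
            "id_right": doc_id,
            "label": int(relevance),
        })
    return rows


# Made-up sample lines; real document ids will look different.
sample = [
    "301 0 DOC-A 1",
    "301 0 DOC-B 0",
    "",
    "302 0 DOC-C 2",
]
relation = parse_qrels(sample)
```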

Loading word2vec embedding exceeds the memory limit

Describe the bug

My machine restarts when loading the word2vec embedding; this memory issue has occurred several times.
Is it a bug?

This problem also occurs in MatchZoo (the TensorFlow version).
MatchZoo 807
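
One way to reduce memory use, sketched under assumptions: read the embedding file line by line and keep only the vectors for words that occur in the task vocabulary. The sketch assumes a plain-text format with one word followed by its values per line (the word2vec text format also starts with a header line of two integers, which is skipped here); `load_filtered` is a hypothetical helper, not a MatchZoo function.

```python
import io


def load_filtered(handle, vocab):
    """Stream an embedding text file and keep only words in vocab."""
    vectors = {}
    for line_no, line in enumerate(handle):
        parts = line.rstrip().split(" ")
        if line_no == 0 and len(parts) == 2:
            continue  # header line: "<vocab_size> <dimension>"
        word, values = parts[0], parts[1:]
        if word in vocab:
            vectors[word] = [float(v) for v in values]
    return vectors


# Tiny in-memory file standing in for a multi-gigabyte embedding file.
fake_file = io.StringIO(
    "3 2\n"
    "query 0.1 0.2\n"
    "document 0.3 0.4\n"
    "unused 0.5 0.6\n"
)
kept = load_filtered(fake_file, vocab={"query", "document"})
```

With a real file, `handle` would come from `open(path, encoding="utf-8")`, so only one line is held in memory at a time in addition to the kept vectors.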

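Why does the acc value from the demo never change? Is there a discussion group?

I would like to ask whether there is a discussion group where users can exchange experience.
I would also like to ask why, when I run the project's ESIM demo example, the acc value stays the same across multiple epochs. I switched to Chinese data as well and the same thing happens: it runs, but the validation acc is unchanged for every epoch.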

ngram_length_left = max([len(w)
                         for k in x['ngram_left'] for w in k])
ngram_length_right = max([len(w)
                          for k in x['ngram_right'] for w in k])
