ntmc-community / matchzoo-py
Facilitating the design, comparison and sharing of deep text matching models.
License: Apache License 2.0
Hi,
When I use the toy dataset, I see the following error on this line:
glove_embedding = mz.datasets.embeddings.load_glove_embedding(dimension=300)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6148: character maps to <undefined>
I would appreciate it if you could guide me on how to solve this problem.
I would also appreciate a somewhat faster reply, since time passes quickly :)
Thanks
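For what it's worth, 'charmap' codec errors usually mean the GloVe file is being read with the Windows default encoding rather than UTF-8. Below is a minimal workaround sketch, assuming the file was already downloaded into MatchZoo's data directory; the path and the idea of reading the file yourself are assumptions, not confirmed library behavior.
from pathlib import Path
import matchzoo as mz
# Hypothetical workaround: read the vectors with an explicit UTF-8 encoding
# instead of the platform default ('charmap'/cp1252 on Windows).
glove_path = Path(mz.USER_DATA_DIR) / 'glove' / 'glove.6B.300d.txt'  # assumed location
embedding_data = {}
with open(glove_path, encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        embedding_data[parts[0]] = [float(v) for v in parts[1:]]
If this reads cleanly, the underlying fix is probably to pass encoding='utf-8' to the open() call inside MatchZoo's embedding loader (matchzoo/embedding/embedding.py, assuming that is where the default-encoded open lives).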
Do you think the evaluation function should call model.eval()?
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 96%|█████████▌| 18113/18841 [00:10<00:00, 2020.19it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 97%|█████████▋| 18317/18841 [00:10<00:00, 2011.95it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 98%|█████████▊| 18520/18841 [00:10<00:00, 1890.53it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████| 18841/18841 [00:11<00:00, 1703.43it/s]
Processing text_right with append: 0%| | 0/18841 [00:00<?, ?it/s]
TypeError Traceback (most recent call last)
in ()
1 preprocessor = mz.models.ArcI.get_default_preprocessor()
----> 2 train_processed = preprocessor.fit_transform(train_pack)
3 valid_processed = preprocessor.transform(valid_pack)
/home/anaconda3/lib/python3.6/site-packages/matchzoo/engine/base_preprocessor.py in fit_transform(self, data_pack, verbose)
95 :param verbose: Verbosity.
96 """
---> 97 return self.fit(data_pack, verbose=verbose)
98 .transform(data_pack, verbose=verbose)
99
/home/anaconda3/lib/python3.6/site-packages/matchzoo/preprocessors/basic_preprocessor.py in fit(self, data_pack, verbose)
108 flatten=False,
109 mode='right',
--> 110 verbose=verbose)
111 data_pack = data_pack.apply_on_text(fitted_filter_unit.transform,
112 mode='right', verbose=verbose)
/home/anaconda3/lib/python3.6/site-packages/matchzoo/preprocessors/build_unit_from_data_pack.py in build_unit_from_data_pack(unit, data_pack, mode, flatten, verbose)
30 data_pack.apply_on_text(corpus.extend, mode=mode, verbose=verbose)
31 else:
---> 32 data_pack.apply_on_text(corpus.append, mode=mode, verbose=verbose)
33 if verbose:
34 description = 'Building ' + unit.__class__.__name__ + \
/home/anaconda3/lib/python3.6/site-packages/matchzoo/data_pack/data_pack.py in wrapper(self, inplace, *args, **kwargs)
244 target = self.copy()
245
--> 246 func(target, *args, **kwargs)
247
248 if not inplace:
/home/anaconda3/lib/python3.6/site-packages/matchzoo/data_pack/data_pack.py in apply_on_text(self, func, mode, rename, verbose)
399 self._apply_on_text_left(func, rename, verbose=verbose)
400 elif mode == 'right':
--> 401 self._apply_on_text_right(func, rename, verbose=verbose)
402 else:
403 raise ValueError(f"{mode} is not a valid mode type."
/home/anaconda3/lib/python3.6/site-packages/matchzoo/data_pack/data_pack.py in _apply_on_text_right(self, func, rename, verbose)
408 if verbose:
409 tqdm.pandas(desc="Processing " + name + " with " + func.__name__)
--> 410 self._right[name] = self._right['text_right'].progress_apply(func)
411 else:
412 self._right[name] = self._right['text_right'].apply(func)
/home/anaconda3/lib/python3.6/site-packages/tqdm/std.py in inner(df, func, *args, **kwargs)
/home/anaconda3/lib/python3.6/site-packages/pandas/core/base.py in _is_builtin_func(self, arg)
658 otherwise return the arg
659 """
--> 660 return self._builtin_table.get(arg, arg)
661
662
TypeError: unhashable type: 'list'
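This error is reproducible outside MatchZoo. In recent pandas versions, progress_apply routes the function through _is_builtin_func, which does a dict lookup on it; a bound method of a list is unhashable because hashing a bound method hashes its instance. A minimal sketch of what appears to be the root cause:
# Hashing a list's bound method raises exactly the error seen above,
# because the hash of a bound method is derived from its instance.
corpus = []
builtin_table = {}
builtin_table.get(corpus.append, corpus.append)
# TypeError: unhashable type: 'list'
This suggests a workaround of wrapping the method in a plain function (e.g. lambda unit: corpus.append(unit)) inside build_unit_from_data_pack, or pinning pandas/tqdm to versions that do not hit this code path.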
Implement DIIN model
I am wondering if I can use label = 0.5 or another float value as the label to train and test the DRMM model. What about Duet and MatchPyramid?
Implement MatchSRNN model
Implement HCRN model
I did some analysis on the WikiQA dataset:
I wonder if this is the official way to combine questions and answers, because the proportion of positive examples in all three sets is only about 5%, which means that a model that always outputs 0 can achieve 95% accuracy. And the best performance of BERT on this dataset is just 95%. Isn't the proportion of positive and negative examples too imbalanced?
Add the year and source of papers in the response retrieval module, to make the paper information more complete.
I got an error when I ran your BERT model example; could you tell me how to solve it? Thanks!
padding_callback = mz.models.Bert.get_default_padding_callback()
trainloader = mz.dataloader.DataLoader(
dataset=trainset,
batch_size=20,
stage='train',
resample=True,
sort=False,
callback=padding_callback
)
testloader = mz.dataloader.DataLoader(
dataset=testset,
batch_size=20,
stage='dev',
callback=padding_callback
)
TypeError Traceback (most recent call last)
in
6 resample=True,
7 sort=False,
----> 8 callback=padding_callback
9 )
10 testloader = mz.dataloader.DataLoader(
TypeError: __init__() got an unexpected keyword argument 'batch_size'
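The batching options appear to have moved from DataLoader to Dataset in recent MatchZoo-py versions; the tutorials on master pass batch_size, resample and sort to the Dataset. A hedged sketch of the newer calling convention (verify against your installed version):
# Sketch assuming the current tutorial API, where batch_size/resample/sort
# belong to the Dataset rather than the DataLoader.
trainset = mz.dataloader.Dataset(
    data_pack=train_processed,   # your preprocessed DataPack (assumed name)
    mode='pair',
    batch_size=20,
    resample=True,
    sort=False,
    shuffle=True
)
trainloader = mz.dataloader.DataLoader(
    dataset=trainset,
    stage='train',
    callback=padding_callback
)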
I am on matchzoo-py 1.1.1, and when I try to play with mz.auto.Tuner, I get an error when trying to tune the model.
tuner = mz.auto.Tuner(
params=model.params,
train_data=train,
test_data=test,
num_runs=5
)
ValueError: Validator not satisfied.
The validator's definition is as follows:
validator=lambda x: isinstance(x, np.ndarray)
I'm running the DIIN model for a document classification task, and somehow I ran into ValueError: max() arg is an empty sequence,
which points to matchzoo/dataloader/callbacks/padding.py, line 114:
MatchZoo-py/matchzoo/dataloader/callbacks/padding.py
Lines 136 to 139 in 49548ad
The x['ngram_left'] value is an empty list. I wonder whether it should be x['text_left'], so I gave that a try, which resulted in TypeError: object of type 'numpy.int64' has no len().
Then I took a peek at the value types of both but still have no clue. MZ version: 1.1
When I run the tutorial code for DSSM, the following error appears:
RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 'mat1' in call to _th_addmm
please find the details below:
RuntimeError Traceback (most recent call last)
in
----> 1 trainer.run()
~/opt/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1.1-py3.7.egg/matchzoo/trainers/trainer.py in run(self)
225 for epoch in range(self._start_epoch, self._epochs + 1):
226 self._epoch = epoch
--> 227 self._run_epoch()
228 self._run_scheduler()
229 if self._early_stopping.should_stop_early:
~/opt/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1.1-py3.7.egg/matchzoo/trainers/trainer.py in _run_epoch(self)
250 disable=not self._verbose) as pbar:
251 for step, (inputs, target) in pbar:
--> 252 outputs = self._model(inputs)
253 # Caculate all losses and sum them up
254 loss = torch.sum(
~/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
720 result = self._slow_forward(*input, **kwargs)
721 else:
--> 722 result = self.forward(*input, **kwargs)
723 for hook in itertools.chain(
724 _global_forward_hooks.values(),
~/opt/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1.1-py3.7.egg/matchzoo/models/dssm.py in forward(self, inputs)
89
90 # print (inputs['ngram_left'])
---> 91 # print (inputs['ngram_right'])
92 # Process left & right input.
93 input_left, input_right = inputs['ngram_left'], inputs['ngram_right'].float()
~/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
720 result = self._slow_forward(*input, **kwargs)
721 else:
--> 722 result = self.forward(*input, **kwargs)
723 for hook in itertools.chain(
724 _global_forward_hooks.values(),
~/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
115 def forward(self, input):
116 for module in self:
--> 117 input = module(input)
118 return input
119
~/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
720 result = self._slow_forward(*input, **kwargs)
721 else:
--> 722 result = self.forward(*input, **kwargs)
723 for hook in itertools.chain(
724 _global_forward_hooks.values(),
~/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
115 def forward(self, input):
116 for module in self:
--> 117 input = module(input)
118 return input
119
~/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
720 result = self._slow_forward(*input, **kwargs)
721 else:
--> 722 result = self.forward(*input, **kwargs)
723 for hook in itertools.chain(
724 _global_forward_hooks.values(),
~/opt/anaconda3/lib/python3.7/site-packages/torch/nn/modules/linear.py in forward(self, input)
89
90 def forward(self, input: Tensor) -> Tensor:
---> 91 return F.linear(input, self.weight, self.bias)
92
93 def extra_repr(self) -> str:
~/opt/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py in linear(input, weight, bias)
1672 if input.dim() == 2 and bias is not None:
1673 # fused op is marginally faster
-> 1674 ret = torch.addmm(bias, input, weight.t())
1675 else:
1676 output = input.matmul(weight.t())
RuntimeError: Expected object of scalar type Float but got scalar type Double for argument #2 'mat1' in call to _th_addmm
I am trying to address this problem following this post:
https://stackoverflow.com/questions/56741087/how-to-fix-runtimeerror-expected-object-of-scalar-type-float-but-got-scalar-typ/56741419
but adding .float() does not seem to work.
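One detail worth checking in the modified dssm.py shown above: .float() is applied only to inputs['ngram_right'], so inputs['ngram_left'] still reaches the first linear layer as Double, which would keep producing exactly this error. A sketch of the line with both sides cast (still a local patch, not an upstream fix):
# In the forward() of matchzoo/models/dssm.py, cast both inputs;
# the snippet in the traceback casts only the right side.
input_left = inputs['ngram_left'].float()
input_right = inputs['ngram_right'].float()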
As far as I understand, right now text pairs are picked directly from DataPack samples. However, it is almost impossible to mine "hard" negative samples before training. Just adding random non-corresponding pairs as class 0 / rank 0.0 is very wasteful and inefficient.
Thanks for your project. I installed this package and ran the code in README.md, but I got an error. Here is the detailed log:
Processing text_left with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|████████████████████████████████████████████████████████████████████████████████| 2118/2118 [00:00<00:00, 10964.68it/s]
Processing text_right with chain_transform of Tokenize => Lowercase => PuncRemoval: 100%|██████████████████████████████████████████████████████████████████████████████| 18841/18841 [00:03<00:00, 6015.32it/s]
Processing text_right with append: 0%| | 0/18841 [00:00<?, ?it/s]Traceback (most recent call last):
File "test_matchzoo.py", line 15, in <module>
train_processed = preprocessor.fit_transform(train_pack)
File "/public/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1-py3.7.egg/matchzoo/engine/base_preprocessor.py", line 97, in fit_transform
return self.fit(data_pack, verbose=verbose) \
File "/public/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1-py3.7.egg/matchzoo/preprocessors/basic_preprocessor.py", line 110, in fit
verbose=verbose)
File "/public/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1-py3.7.egg/matchzoo/preprocessors/build_unit_from_data_pack.py", line 32, in build_unit_from_data_pack
data_pack.apply_on_text(corpus.append, mode=mode, verbose=verbose)
File "/public/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1-py3.7.egg/matchzoo/data_pack/data_pack.py", line 246, in wrapper
func(target, *args, **kwargs)
File "/public/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1-py3.7.egg/matchzoo/data_pack/data_pack.py", line 401, in apply_on_text
self._apply_on_text_right(func, rename, verbose=verbose)
File "/public/anaconda3/lib/python3.7/site-packages/matchzoo_py-1.1-py3.7.egg/matchzoo/data_pack/data_pack.py", line 410, in _apply_on_text_right
self._right[name] = self._right['text_right'].progress_apply(func)
File "/public/anaconda3/lib/python3.7/site-packages/tqdm/std.py", line 730, in inner
func = df._is_builtin_func(func)
File "/public/anaconda3/lib/python3.7/site-packages/pandas/core/base.py", line 660, in _is_builtin_func
return self._builtin_table.get(arg, arg)
TypeError: unhashable type: 'list'
Processing text_right with append: 0%| | 0/18841 [00:00<?, ?it/s]
Matchzoo version is 1.1
Hi, I have a question regarding choosing the epochs and doing hyperparameter tuning in general.
I am currently using matchzoo.trainers.trainer to train my models with the default number of epochs (=10).
Does training always end at epoch 10, or does it keep some sort of checkpoints and restore the checkpoint/model from the epoch where the validation result is best? This is not very clear to me from the documentation, and there's a lot of confusion given that there are different tutorials/docs for matchzoo and matchzoo-py.
Apart from that, my questions are:
If training always stops at the 10th epoch, how can I make it stop and restore the model that achieves the best validation result on a given metric? Ideally, I would like to do this with checkpoints, rather than using matchzoo.auto.tuner.tuner and re-training the model over and over, or some other hacky solution. I guess there should already be something in place to do that.
If the trainer does restore the checkpoint with the highest score after the 10 epochs have finished: which metric is used to determine the highest score? Is it just the first metric in the list of task.metrics?
Thank you for your help!
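For what it's worth, the Trainer does ship early-stopping machinery (a traceback elsewhere in this list references self._early_stopping.should_stop_early). A hedged sketch of wiring it up; the patience/key/save_dir parameters and the restore_model method reflect my reading of matchzoo/trainers/trainer.py and should be double-checked against your version:
trainer = mz.trainers.Trainer(
    model=model,
    optimizer=optimizer,
    trainloader=train_dl,
    validloader=valid_dl,
    epochs=10,
    patience=3,              # epochs without improvement before stopping (assumed)
    key=task.metrics[0],     # validation metric compared across epochs (assumed)
    save_dir='checkpoints'   # where the best checkpoint is written (assumed)
)
trainer.run()
trainer.restore_model('checkpoints/model.pt')  # reload the best weights (assumed)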
Hi,
In drmm model,
Should it be
x = torch.einsum('bl,bl->b', torch.flip(dense_output,(-1,)), attention_probs)
instead of
x = torch.einsum('bl,bl->b', dense_output, attention_probs)
After making this change, the training loss decreases much faster than in
https://github.com/NTMC-Community/MatchZoo-py/blob/master/tutorials/ranking/drmm.ipynb
Below are my training logs:
Epoch 1/10: 100%|██████████| 319/319 [02:36<00:00, 6.02it/s, loss=2.167][Iter-319 Loss-2.134]:
Validation: normalized_discounted_cumulative_gain@3(0.0): 0.5825 - normalized_discounted_cumulative_gain@5(0.0): 0.6421 - mean_average_precision(0.0): 0.6019
Epoch 1/10: 100%|██████████| 319/319 [02:44<00:00, 1.93it/s, loss=2.167]
Epoch 2/10: 100%|█████████▉| 318/319 [01:11<00:00, 6.27it/s, loss=1.729][Iter-638 Loss-1.776]:
Epoch 2/10: 100%|██████████| 319/319 [01:19<00:00, 4.02it/s, loss=0.877]
0%| | 0/319 [00:00<?, ?it/s] Validation: normalized_discounted_cumulative_gain@3(0.0): 0.5726 - normalized_discounted_cumulative_gain@5(0.0): 0.6363 - mean_average_precision(0.0): 0.589
In MatchZoo-py/tutorials/ranking/drmm.ipynb, I set the 'device' parameter in both trainloader = mz.dataloader.DataLoader and trainer = mz.trainers.Trainer to device = torch.device('cuda'). nvidia-smi shows an available GPU, but the run still seems to be on the CPU. Can you help me with this? Thank you!
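A quick diagnostic sketch (plain PyTorch, no MatchZoo-specific assumptions) to see where the weights actually live; if this prints cpu after the Trainer is constructed with device=torch.device('cuda'), the device argument is not reaching the model:
import torch
# Confirm CUDA is visible and the model's parameters actually moved to the GPU.
print(torch.cuda.is_available())           # expect: True
print(next(model.parameters()).device)     # expect: cuda:0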
The tutorial in /ranking/bert.ipynb is out of date!
Hi, I encountered a data processing problem. When I call the diin.py model, I use the model's default padding callback.
@classmethod
def get_default_padding_callback(
cls,
fixed_length_left: int = 10,
fixed_length_right: int = 30,
pad_word_value: typing.Union[int, str] = 0,
pad_word_mode: str = 'pre',
with_ngram: bool = True,
fixed_ngram_length: int = None,
pad_ngram_value: typing.Union[int, str] = 0,
pad_ngram_mode: str = 'pre'
) -> BaseCallback:
"""
Model default padding callback.
The padding callback's on_batch_unpacked would pad a batch of data to
a fixed length.
:return: Default padding callback.
"""
return callbacks.BasicPadding(
fixed_length_left=fixed_length_left,
fixed_length_right=fixed_length_right,
pad_word_value=pad_word_value,
pad_word_mode=pad_word_mode,
with_ngram=with_ngram,
fixed_ngram_length=fixed_ngram_length,
pad_ngram_value=pad_ngram_value,
pad_ngram_mode=pad_ngram_mode
)
Then it is processed in the padding.py file:
if self._with_ngram:
    ngram_length_left = max([len(w) for k in x['ngram_left'] for w in k])
    ngram_length_right = max([len(w) for k in x['ngram_right'] for w in k])
But here I encountered an error:
KeyError: 'ngram_left'
Do you know how this should be solved?
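In the DIIN tutorial, the 'ngram_left'/'ngram_right' batch keys are produced by an Ngram dataset callback, not by the padding callback itself; if the Dataset is built without it, BasicPadding's with_ngram branch has nothing to pad and raises exactly this KeyError. A hedged sketch based on the tutorial (callback name and arguments are taken from there; verify on your version):
preprocessor = mz.models.DIIN.get_default_preprocessor()
train_processed = preprocessor.fit_transform(train_pack)
# The Ngram callback is what inserts 'ngram_left'/'ngram_right' into each batch.
trainset = mz.dataloader.Dataset(
    data_pack=train_processed,
    mode='point',
    batch_size=32,
    callbacks=[mz.dataloader.callbacks.Ngram(preprocessor, mode='index')]
)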
Implement MatchPyramid model
The number of batches produced by a dataloader is wrong when num_workers > 0.
Colab: https://colab.research.google.com/drive/1keblQN7GL_R9r1T6_82zU2AMP-4hjnJX
Since batch_size = 1, the dataloader should produce 100 batches.
However, it produces num_workers * 100 batches.
For example, if I have a model.pt file for a DRMM model, how can I print all the parameters of each layer?
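A minimal sketch with plain PyTorch, assuming model.pt holds a state dict saved from a MatchZoo DRMM (if the file holds a fully pickled model instead, torch.load alone returns the model object):
import torch
import matchzoo as mz
model = mz.models.DRMM()
# ... set model.params exactly as at training time (task, embedding, ...)
model.build()
model.load_state_dict(torch.load('model.pt'))
# Walk every layer's parameters with names and shapes.
for name, param in model.named_parameters():
    print(name, tuple(param.shape))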
Hi,
I ran the sample code in the README, but got the following error:
: block: [80,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1573049310284/work/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [80,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
0%| | 0/32 [00:14<?, ?it/s]
Traceback (most recent call last):
File "test_matchzoo.py", line 62, in <module>
trainer.run()
File "/home/wup/MatchZoo-py/matchzoo/trainers/trainer.py", line 227, in run
self._run_epoch()
File "/home/wup/MatchZoo-py/matchzoo/trainers/trainer.py", line 252, in _run_epoch
outputs = self._model(inputs)
File "/public/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/home/wup/MatchZoo-py/matchzoo/models/arci.py", line 187, in forward
conv_left = self.conv_left(embed_left)
File "/public/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/public/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/public/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/public/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/public/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/public/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 202, in forward
self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
My os is ubuntu 18.04
pytorch 1.3.1
cuda 10.0
cudnn 7.6.4
I can run other PyTorch code fine on this server. Could you please help me?
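The srcIndex < srcSelectDimSize assertions that precede the cuDNN message are typically an embedding lookup receiving an out-of-range index; the cuDNN error is just the CUDA context dying afterwards. A hedged diagnostic sketch (the context/param keys follow the tutorials and are assumptions):
import torch
# 1) Re-run on CPU: nn.Embedding then raises a readable
#    "index out of range" error instead of killing the CUDA context.
trainer = mz.trainers.Trainer(model=model, optimizer=optimizer,
                              trainloader=trainloader, validloader=testloader,
                              epochs=1, device=torch.device('cpu'))
# 2) Check the vocabulary size against the embedding matrix size.
print(preprocessor.context['vocab_size'])        # assumed context key
print(model.params['embedding_input_dim'])       # assumed param name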
Hi,
My Python versions are 2.7 and 3.6.
When I import torch and then import matchzoo as mz in Python 2.7, I receive this error:
"AttributeError: type object 'Path' has no attribute 'expanduser'"
and in python3 I receive this error:
"ImportError: cannot import name 'ensure_object'"
even though matchzoo-py installed successfully.
I would appreciate it if you could guide me.
Many thanks in advance.
I am trying to get the ranking out of a trained model (using the trainer). However, when I do trainer.predict() I get back a numpy array of shape num_qids x 1. The number of query ids .predict returns depends on the dataloader dl passed to trainer.predict(dl).
In other words, as I understand it, I get a score (probably for the first metric I've defined in metrics?) for each query id. However, what I need is a ranked list of documents for each query id, rather than a single score.
How can I get that? I could find no solution in the tutorials.
My code looks like:
trainer.run()
# Evaluation
print('Validation results:')
print(trainer.evaluate(valid_dl))
print('Test results:')
print(trainer.evaluate(test_dl))
val_preds = trainer.predict(valid_dl)
test_preds = trainer.predict(train_dl)
val_preds.shape
>> Out[18]: (150, 1)
valid_dl.label.shape
>> Out[19]: (150,)
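predict returns one score per (id_left, id_right) row of the dataloader, not one per metric, so a per-query ranking can be assembled by joining the scores back onto the ids and sorting within each query. A hedged sketch (pulling ids from the processed DataPack's frame assumes the evaluation dataset was built with sort=False and no resampling, so row order is preserved):
import pandas as pd
preds = trainer.predict(valid_dl).squeeze(-1)
frame = valid_processed.frame()          # DataPack behind valid_dl (assumed name)
ranking = pd.DataFrame({
    'id_left': frame['id_left'].values,
    'id_right': frame['id_right'].values,
    'score': preds,
})
# Rank documents per query: highest score first, keep the top 10.
top10 = (ranking.sort_values(['id_left', 'score'], ascending=[True, False])
                .groupby('id_left')
                .head(10))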
I tried to load the 840B, 300d GloVe embeddings using mz.embedding.load_from_file. However, it uses more than 60 GB of memory, which looks abnormal.
from pathlib import Path
import matchzoo as mz
_glove_6B_embedding_url = "http://nlp.stanford.edu/data/glove.6B.zip"
_glove_840B_embedding_url = "http://nlp.stanford.edu/data/glove.840B.300d.zip"
def load_glove_embedding(dimension: int = 50, size="6B") -> mz.embedding.Embedding:
    """
    Return the pretrained GloVe embedding.
    :param dimension: the size of the embedding dimension; the value can only
        be 50, 100, or 300.
    :return: The :class:`mz.embedding.Embedding` object.
    """
    file_name = 'glove.{}.{}d.txt'.format(size, dimension)
    file_path = (Path(mz.USER_DATA_DIR) / 'glove').joinpath(file_name)
    if not file_path.exists():
        if size == "6B":
            url = _glove_6B_embedding_url
        elif size == "840B":
            url = _glove_840B_embedding_url
        else:
            raise ValueError("Incorrect size for GloVe: %s" % size)
        mz.utils.get_file('glove_embedding',
                          url,
                          extract=True,
                          cache_dir=mz.USER_DATA_DIR,
                          cache_subdir='glove')
    return mz.embedding.load_from_file(file_path=str(file_path), mode='glove')
embedding = load_glove_embedding(300, "840B")
The TF version of MatchZoo uses pandas to read the GloVe file and requires much less memory.
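For comparison, a pandas-based reader with an explicit float32 dtype keeps the 840B matrix at roughly 2.2M x 300 x 4 bytes, about 2.6 GB, which suggests the blow-up is in the line-by-line Python parsing. A hedged sketch (note that a handful of 840B tokens contain spaces, which may need extra error handling):
import csv
import numpy as np
import pandas as pd
# Read word + 300 float32 components; the token column becomes the index.
data = pd.read_csv(
    'glove.840B.300d.txt', sep=' ', index_col=0, header=None,
    quoting=csv.QUOTE_NONE,
    dtype={i: np.float32 for i in range(1, 301)}
)
print(data.values.nbytes / 1e9, 'GB')  # ~2.6 GB for the raw matrix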
The way the dataloader reads elements from the dataset and datapack is very inefficient in the PyTorch MatchZoo. The dataloader retrieves each element from the datapack separately and groups them into a batch.
However, the __getitem__ of the datapack is quite expensive. To return a batch of 1000 elements, the __getitem__ method of the datapack is called 1000 times (with a single index as the argument each time), and it usually takes 5-10 seconds to produce one batch (when the batch size = 1000).
In contrast, in the TF version of MatchZoo, the __getitem__ method of the datapack is called with a list of indices, so each batch calls it only once. As a result, the DataGenerator is much more efficient than the DataLoader.
Two possible fixes, with a sketch of the first one below: avoid the auto-batching provided by DataLoader and have the dataset pass whole batches to the DataLoader instead (https://pytorch.org/docs/stable/data.html#disable-automatic-batching), or use a DataGenerator to sample and batch data and use the DataLoader only to accelerate data transfer to the GPU.
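For illustration, a generic PyTorch sketch of the first option (plain PyTorch, not existing MatchZoo code): with automatic batching disabled, a sampler can yield whole lists of indices, so the dataset's __getitem__ runs once per batch instead of once per element.
import torch
from torch.utils.data import DataLoader, Sampler

class BatchIndexSampler(Sampler):
    """Yield whole batches of indices; dataset[indices] is hit once per batch."""

    def __init__(self, n, batch_size):
        self.n, self.batch_size = n, batch_size

    def __iter__(self):
        perm = torch.randperm(self.n).tolist()
        for i in range(0, self.n, self.batch_size):
            yield perm[i:i + self.batch_size]

    def __len__(self):
        return (self.n + self.batch_size - 1) // self.batch_size

# batch_size=None disables auto-batching: each index list from the sampler
# is passed straight to dataset.__getitem__, which must accept lists.
# loader = DataLoader(dataset, batch_size=None,
#                     sampler=BatchIndexSampler(len(dataset), 1000))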
How can I run models on the GPU?
I have the following issue, which is really odd and affects the evaluation of the neural models.
I built my data using the auto preparer, and I came to realize that when I try to make predictions on the test set, some document-query pairs are duplicated.
I am not sure why this is happening; my first guess was that it fills up missing examples to reach the batch size, but that does not seem to be the case.
Here's most of my code:
model, prpr, dsb, dlb = preparer.prepare(model_class,
train_pack
)
train_prepr = prpr.transform(train_pack)
valid_prepr = prpr.transform(valid_pack)
test_prepr = prpr.transform(test_pack)
mz.dataloader.dataset_builder.DatasetBuilder()
train_dataset = dsb.build(train_prepr)
valid_dataset = dsb.build(valid_prepr)
test_dataset = dsb.build(test_prepr)
train_dl = dlb.build(train_dataset, stage='train')
valid_dl = dlb.build(valid_dataset, stage='dev')
test_dl = dlb.build(test_dataset, stage='test')
# training the model etc....
test_preds = pd.DataFrame(trainer.predict(test_dl), columns=['pred'])
test_preds['id_left'] = test_dl.id_left
test_preds['id_right'] = test_dl._dataset[:][0]['id_right']
test_preds['length_right'] = test_dl._dataset[:][0]['length_right']
Now, it seems that the duplicates are created through the dataset builder, but I don't understand why.
test_dataset._data_pack.frame().duplicated(['id_left', 'id_right']).sum()
>> 297
test_pack.frame().duplicated(['id_left', 'id_right']).sum()
>>0
test_prepr.frame().duplicated(['id_left', 'id_right']).sum()
>> 0
Even odder is the fact that those predictions have different scores for the same document-query pairs, and they are not even always close to each other, so this can't be some rounding error. This is very weird: how is it possible that, without re-training the model, I can get such different predictions for the same query-document pairs at inference time?
print(test_preds[test_preds.duplicated(['id_right', 'id_left'],
keep=False)].sort_values(['id_left', 'id_right'])
)
>>
pred id_left id_right length_right
466 -10.889746 33-1-1 47-07395 896
499 -9.492123 33-1-1 47-07395 896
677 -6.880966 33-1-1 47-07395 896
496 -10.781660 33-1-1 98-33779 535
678 -7.954109 33-1-1 98-33779 535
1044 -11.102488 33-1-1 98-33779 535
508 -6.497414 33-1-1 95-23333 244
1326 -7.466503 33-1-1 95-23333 244
In this replicated example the model used was KNRM, but I think this happens in other models too.
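One way to rule out pair-mode resampling in the auto-prepared builder as the source of the duplicates is to build the evaluation sets with an explicit point-wise dataset. A hedged sketch (Dataset arguments follow the tutorials; whether the auto preparer's builder uses pair mode is an assumption):
# Point-wise test dataset: one row per (query, document) pair, no resampling.
test_dataset = mz.dataloader.Dataset(
    data_pack=test_prepr,
    mode='point',
    batch_size=128,
    shuffle=False
)
test_dl = dlb.build(test_dataset, stage='test')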
Add duet!
I have a task in which I need to match several texts against a huge (~1M texts) database. Even though the available models are pretty fast, this is not enough. The best way to tackle this problem is to learn embeddings of the texts and match them using a k-d tree, etc. But as far as I understand, all models accept two texts and map them to class probabilities or a rank value.
Is it possible to use MatchZoo in the metric-learning paradigm already, or will it be possible in the future?
Hi!
I encountered an error (see title) when running dssm.ipynb on my local machine. My environment is Windows 10 + PyTorch 1.2 + CUDA 9.0. Could you please tell me the solution?
Thanks.
RuntimeError Traceback (most recent call last)
in
----> 1 trainer.run()
~\AppData\Local\Continuum\anaconda3\lib\site-packages\matchzoo\trainers\trainer.py in run(self)
230 for epoch in range(self._start_epoch, self._epochs + 1):
231 self._epoch = epoch
--> 232 self._run_epoch()
233 self._run_scheduler()
234 if self._early_stopping.should_stop_early:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\matchzoo\trainers\trainer.py in _run_epoch(self)
258 # Caculate all losses and sum them up
259 loss = torch.sum(
--> 260 *[c(outputs, target) for c in self._criterions]
261 )
262 self._backward(loss)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\matchzoo\trainers\trainer.py in (.0)
258 # Caculate all losses and sum them up
259 loss = torch.sum(
--> 260 *[c(outputs, target) for c in self._criterions]
261 )
262 self._backward(loss)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
545 result = self._slow_forward(*input, **kwargs)
546 else:
--> 547 result = self.forward(*input, **kwargs)
548 for hook in self._forward_hooks.values():
549 hook_result = hook(self, input, result)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\matchzoo\losses\rank_hinge_loss.py in forward(self, y_pred, y_true)
63 y_pos, y_neg, y_true,
64 margin=self.margin,
---> 65 reduction=self.reduction
66 )
67
~\AppData\Local\Continuum\anaconda3\lib\site-packages\torch\nn\functional.py in margin_ranking_loss(input1, input2, target, margin, size_average, reduce, reduction)
2206 raise RuntimeError(("margin_ranking_loss does not support scalars, got sizes: "
2207 "input1: {}, input2: {}, target: {} ".format(input1.size(), input2.size(), target.size())))
-> 2208 return torch.margin_ranking_loss(input1, input2, target, margin, reduction_enum)
2209
2210
RuntimeError: The size of tensor a (68) must match the size of tensor b (67) at non-singleton dimension 0
Hello everyone, and thanks a lot for this amazing framework!
Auto-preparer does not seem to work with some specific models, in particular DSSM and CDSSM. I am not sure whether this is an exhaustive list, but I tried quite a few more (ArcI, DRMM, KNRM, ConvKNRM, MatchSRNN & MVLSTM) and they seemed to work as expected.
import torch
import matchzoo as mz
task = mz.tasks.Ranking(losses=mz.losses.RankCrossEntropyLoss(num_neg=4))
task.metrics = [
mz.metrics.NormalizedDiscountedCumulativeGain(k=20),
mz.metrics.MeanAveragePrecision()
]
# Prepare input data
train_pack = mz.datasets.wiki_qa.load_data('train', task=task)
valid_pack = mz.datasets.wiki_qa.load_data('dev', task=task)
# Auto prepare model etc.
preparer = mz.auto.Preparer(task)
model_class = mz.models.DSSM
# or
model_class = mz.models.CDSSM
model, prpr, dsb, dlb = preparer.prepare(model_class,
train_pack
)
train_prepr = prpr.transform(train_pack)
valid_prepr = prpr.transform(valid_pack)
train_dataset = dsb.build(train_prepr)
valid_dataset = dsb.build(valid_prepr)
train_dl = dlb.build(train_dataset)
valid_dl = dlb.build(valid_dataset)
# make it (T)rain
optimizer = torch.optim.Adam(model.parameters())
trainer = mz.trainers.Trainer(
model=model,
optimizer=optimizer,
trainloader=train_dl,
validloader=valid_dl,
epochs=2
)
trainer.run()
For CDSSM I get RuntimeError: Given groups=1, weight of size [3, 419, 3], expected input[40, 9654, 11] to have 419 channels, but got 9654 channels instead.
My matchzoo.__version__ == 1.1.1
Implement MVLSTM model
Hi, I have the following issue:
when I try to train my model using trainer.run(), I get the following error:
Traceback (most recent call last):
File "/Users/xx/.conda/envs/QL_QA/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3331, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-67-041e2033e90a>", line 1, in <module>
trainer.run()
File "/Users/xx/.conda/envs/QL_QA/lib/python3.7/site-packages/matchzoo/trainers/trainer.py", line 227, in run
self._run_epoch()
File "/Users/xx/.conda/envs/QL_QA/lib/python3.7/site-packages/matchzoo/trainers/trainer.py", line 251, in _run_epoch
for step, (inputs, target) in pbar:
File "/Users/xx/.conda/envs/QL_QA/lib/python3.7/site-packages/tqdm/std.py", line 1091, in __iter__
for obj in iterable:
File "/Users/xx/.conda/envs/QL_QA/lib/python3.7/site-packages/matchzoo/dataloader/dataloader.py", line 112, in __iter__
self._handle_callbacks_on_batch_unpacked(x, y)
File "/Users/xx/.conda/envs/QL_QA/lib/python3.7/site-packages/matchzoo/dataloader/dataloader.py", line 134, in _handle_callbacks_on_batch_unpacked
self._callback.on_batch_unpacked(x, y)
File "/Users/xx/.conda/envs/QL_QA/lib/python3.7/site-packages/matchzoo/dataloader/callbacks/padding.py", line 158, in on_batch_unpacked
self._pad_word_value, dtype=dtype)
File "/Users/xx/.conda/envs/QL_QA/lib/python3.7/site-packages/numpy/core/numeric.py", line 325, in full
a = empty(shape, dtype, order)
TypeError: 'numpy.float64' object cannot be interpreted as an integer
I am not 100% sure, but it seems to me that the error is caused by the fact that in my preprocessed datapack, length_right is a float instead of an int (it is an int in the toy datasets):
>> toy_datapack.frame()[['length_right','length_left']]
Out[13]:
length_right length_left
0 58 29
1 41 29
2 41 29
3 61 29
4 128 29
5 126 85
6 128 85
while
train_pack = mz.DataPack(relation=relation[relation.id_left.isin(qids['train'])].reset_index(drop=True),
left=left[left.index.isin(qids['train'])],
# right=right_train,
right=right_dict['train'],
)
train_pack.frame().head().dtypes
Out[78]:
id_left object
text_left object
id_right object
text_right object
length_right float64
label float64
dtype: object
It also seems weird to me that this is happening, since to my understanding the built-in Python len function should return an int.
right_train['length_right'] = right_train.text_right.apply(len)
Out[15]:
text_right length_right
id_right
clueweb09-en0007-21-42346 Welcome | Logout Log In | Sign Up The Huffingt... 4039
clueweb09-enwp03-01-16807 Ann Dunham From Wikipedia, the free encycloped... 32225
clueweb09-en0010-93-11767 Home Contact Us Bookmark Us Receive Family Tre... 5112
clueweb09-enwp01-36-17161 Maya Soetoro-Ng From Wikipedia, the free encyc... 8279
clueweb09-enwp00-34-05344 Barack Obama, Sr. From Wikipedia, the free enc... 14448
clueweb09-enwp00-34-05347 Barack Obama, Sr. From Wikipedia, the free enc... 14478
I am preparing my data using mz.autoprepare, and the models I've tried are KNRM and DRMM; the same issue occurs with both.
My matchzoo.__version__ == 1.1.1
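A hedged workaround until the root cause is found: force the length columns back to integer dtype on the DataPack's right frame before building datasets (the float64 typically appears when a join or reindex introduces NaNs and pandas upcasts the column; the .right accessor is my reading of data_pack.py):
# Cast length_right back to int; fillna(0) guards the NaN rows that likely
# caused the upcast in the first place (assumption about the cause).
train_pack.right['length_right'] = (
    train_pack.right['length_right'].fillna(0).astype(int)
)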
As the title mentions, it's a new model for sentence matching. FYI.
After using trainer.save() to save the model, how can I load the model and evaluate it? Thanks!
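A hedged sketch of one way to do it; the checkpoint file name and the restore helper reflect my reading of matchzoo/trainers/trainer.py, so treat them as assumptions:
# Rebuild the model and trainer with the same configuration as at
# training time, then load the saved weights and evaluate.
trainer.restore_model('save/model.pt')     # assumed path and method name
print(trainer.evaluate(test_loader))       # test_loader: your dev/test DataLoader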
When I use the BERT preprocessor to transform my dataset, a warning appears:
Token indices sequence length is longer than the specified maximum sequence length for this model (694 > 512). Running this sequence through the model will result in indexing errors.
But my dataset doesn't contain sequences that long! It also leads to an error during training. Do you know how to solve it? Thanks!
Matchzoo version 1.1.1
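If some raw documents really do tokenize past 512 WordPiece tokens (character counts can be misleading here), one workaround might be to cap the raw text before the preprocessor runs. A hedged sketch using DataPack.apply_on_text, whose signature appears in the tracebacks above; the 1000-character cap is arbitrary:
# Hypothetical workaround: truncate raw texts so the tokenized length
# stays under BERT's 512-token limit.
def truncate(text, max_chars=1000):
    return text[:max_chars]

train_pack.apply_on_text(truncate, mode='both', inplace=True, verbose=0)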
I know that question-question matching is a text similarity problem.
What about question-answer matching or question-document matching? They are used in information retrieval.
Question-question matching is indeed text similarity, but how do you define question-answer similarity?
Thank you!!
Running the example in the README file, I got the following error:
Traceback (most recent call last):
File "", line 1, in
File "/root/download/MatchZoo-py/matchzoo/engine/base_preprocessor.py", line 97, in fit_transform
return self.fit(data_pack, verbose=verbose)
File "/root/download/MatchZoo-py/matchzoo/preprocessors/basic_preprocessor.py", line 110, in fit
verbose=verbose)
File "/root/download/MatchZoo-py/matchzoo/preprocessors/build_unit_from_data_pack.py", line 32, in build_unit_from_data_pack
data_pack.apply_on_text(corpus.append, mode=mode, verbose=verbose)
File "/root/download/MatchZoo-py/matchzoo/data_pack/data_pack.py", line 246, in wrapper
func(target, *args, **kwargs)
File "/root/download/MatchZoo-py/matchzoo/data_pack/data_pack.py", line 401, in apply_on_text
self._apply_on_text_right(func, rename, verbose=verbose)
File "/root/download/MatchZoo-py/matchzoo/data_pack/data_pack.py", line 410, in _apply_on_text_right
self._right[name] = self._right['text_right'].progress_apply(func)
File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 733, in inner
func = df._is_builtin_func(func)
File "/opt/conda/lib/python3.7/site-packages/pandas-0.24.2-py3.7-linux-x86_64.egg/pandas/core/base.py", line 660, in _is_builtin_func
return self._builtin_table.get(arg, arg)
TypeError: unhashable type: 'list'
Hi,
I have the Robust04 data for testing DRMM, but I do not know how to load this data.
I would appreciate it if you could guide me.
Many thanks before all
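For custom corpora like Robust04, MatchZoo-py can build a DataPack from any pandas DataFrame with text_left/text_right/label columns via mz.pack; parsing Robust04 into such a frame is up to you. A hedged sketch (whether pack takes a task argument depends on the version):
import pandas as pd
import matchzoo as mz
# queries, documents, labels: parallel lists parsed from Robust04 (assumed).
df = pd.DataFrame({
    'text_left': queries,
    'text_right': documents,
    'label': labels,
})
data_pack = mz.pack(df, task='ranking')  # some versions: mz.pack(df)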
My machine has restarted several times when loading a word2vec embedding, due to memory exhaustion.
Is it a bug?
This problem also occurs in MatchZoo (TensorFlow version).
(MatchZoo #807)
I'd like to ask whether there is a discussion group for exchanging experiences.
I'd like to ask why, when running the project's ESIM demo example, the accuracy stays unchanged across multiple epochs. I also switched to Chinese data and the situation is the same: the code runs, but the validation accuracy is the same in every epoch.