
lion's Introduction

Lion

Text pair classification toolkit.

Usage

Preprocessing:

  1. Transform the dataset into the standard format. We currently support SNLI, QNLI, and Quora QP. For other datasets, please write your own conversion script (see the sketch after this list).
    python lion/data/dataset_utils/quoraqp.py convert-dataset --indir INDIR --outdir OUTDIR

  2. Preprocess the dataset.
    python lion/data/processor.py process-dataset --in_dir IN_DIR --out_dir OUT_DIR --splits ['train'|'dev'|'test'] --tokenizer_name [spacy/bert/xlnet] --vocab_file FILE_PATH --max_length SEQUENCE_LENGTH
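
For a dataset that is not supported out of the box, a small conversion script in the spirit of lion/data/dataset_utils/quoraqp.py is usually enough. The sketch below is only an illustration: the output field names ('id', 'A', 'B', 'label') and the file paths are assumptions, so compare against the files produced by the supported converters before relying on it.

    # Hypothetical converter for a custom dataset. The field names below
    # ('id', 'A', 'B', 'label') are placeholders; check the output of
    # lion/data/dataset_utils/quoraqp.py for the toolkit's actual schema.
    import csv
    import json


    def convert_dataset(in_path, out_path):
        """Convert a tab-separated file (sentence1 <TAB> sentence2 <TAB> label) to JSON lines."""
        with open(in_path, encoding='utf-8') as fin, \
                open(out_path, 'w', encoding='utf-8') as fout:
            reader = csv.reader(fin, delimiter='\t')
            for idx, (sent_a, sent_b, label) in enumerate(reader):
                record = {'id': str(idx), 'A': sent_a, 'B': sent_b, 'label': label}
                fout.write(json.dumps(record, ensure_ascii=False) + '\n')


    if __name__ == '__main__':
        convert_dataset('my_dataset/train.tsv', 'converted/train.jsonl')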

Training:

  1. Create a directory for saving the model and put the config file in it.
  2. Edit the config file, modifying the train and dev file paths (an illustrative fragment follows this list).
  3. Run lion/training/trainer.py
    For example:
    python lion/training/trainer.py --train --output_dir experiments/QQP/esim/
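
The paths are set in the config file. The fragment below is illustrative only: the key names train_file/dev_file are assumptions mirroring the --dev_file/--test_file CLI flags, and the paths are placeholders; the authoritative keys are in the config file shipped under experiments/QQP/esim/.

    # Illustrative config fragment (assumed key names and placeholder paths):
    train_file: path/to/processed/train.json
    dev_file: path/to/processed/dev.json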

Hyper-parameter searching

  1. Create a directory for saving the model and put the config file in it.
  2. Edit the config file, modifying the train and dev file paths.
  3. Edit tuned_params.yaml. For example:
hidden_size:
    - 100
    - 200
    - 300
dropout:
    - 0.1
    - 0.2
  4. Run python lion/training/search_parameter.py --parent_dir experiments/QQP/esim/hidden_dim/ (a sketch of the underlying idea follows below).
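
Conceptually this is a grid search over the values in tuned_params.yaml. The snippet below is a rough sketch of that idea, not the actual implementation of search_parameter.py; the directory naming and config handling are illustrative only.

    import itertools

    import yaml

    # Rough, illustrative grid search over tuned_params.yaml: one training run
    # per combination of values, each in its own sub-directory.
    with open('experiments/QQP/esim/hidden_dim/tuned_params.yaml') as f:
        tuned = yaml.safe_load(f)  # e.g. {'hidden_size': [100, 200, 300], 'dropout': [0.1, 0.2]}

    names = sorted(tuned)
    for values in itertools.product(*(tuned[name] for name in names)):
        combo = dict(zip(names, values))
        run_dir = 'experiments/QQP/esim/hidden_dim/' + '_'.join(
            '{}={}'.format(k, v) for k, v in sorted(combo.items()))
        # Write a config with these values into run_dir, then launch:
        print('python lion/training/trainer.py --train --output_dir ' + run_dir)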

Evaluation:

python lion/training/trainer.py --evaluate --output_dir experiments/QQP/esim/ --dev_file your_dev_path

Testing:

python lion/training/trainer.py --predict --output_dir experiments/QQP/esim/ --test_file your_test_file

Models

Model    Quora QP   SNLI   QNLI
BiMPM    86.9       86.0   80.5
ESIM     88.4       87.4   81.4
BERT     91.3       91.1   91.7
XLNet    91.5       91.6   91.9

Note: all results in the table above are measured on the dev set. The hyperparameters used for these models can be found in the corresponding experiments/DATASET/MODEL directories.

How to use ELMo

Just add use_elmo: concat or use_elmo: only to your config file, and remember to set word_dim accordingly. For example, if you use the ELMo embedding only, set word_dim: 1024; if you use ELMo and GloVe together, set word_dim: 1324.
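
For example, the two settings described above look like this in the config file (only the ELMo-related keys are shown):

    # ELMo embeddings only:
    use_elmo: only
    word_dim: 1024

    # ELMo concatenated with GloVe:
    use_elmo: concat
    word_dim: 1324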

License

Apache-2.0

lion's People

Contributors

albert-ma, lixinsu


lion's Issues

[dataset error]

Traceback (most recent call last):
  File "lion/training/trainer.py", line 33, in <module>
    train_model('lion/configs/test_bimpm_1.yaml')
  File "lion/training/trainer.py", line 25, in train_model
    model.train_epoch(train_loader)
  File "/home/fanyixing/users/mxy/lion/lion/training/model.py", line 85, in train_epoch
    for ex in tqdm(data_loader):
  File "/home/fanyixing/users/wangsu/anaconda3/envs/pytorch/lib/python3.6/site-packages/tqdm/_tqdm.py", line 979, in __iter__
    for obj in iterable:
  File "/home/fanyixing/users/wangsu/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 336, in __next__
    return self._process_next_batch(batch)
  File "/home/fanyixing/users/wangsu/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 357, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)

TypeError: Traceback (most recent call last):
  File "/home/fanyixing/users/wangsu/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 106, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/fanyixing/users/wangsu/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 106, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/fanyixing/users/mxy/lion/lion/data/dataset.py", line 26, in __getitem__
    return self.vectorize(self.examples[index])
  File "/home/fanyixing/users/mxy/lion/lion/data/dataset.py", line 56, in vectorize
    Bchar = torch.LongTensor([make_char(char_dict, w) for w in ex['Btokens']])
  File "/home/fanyixing/users/mxy/lion/lion/data/dataset.py", line 56, in <listcomp>
    Bchar = torch.LongTensor([make_char(char_dict, w) for w in ex['Btokens']])
  File "/home/fanyixing/users/mxy/lion/lion/data/dataset.py", line 48, in make_char
    return [char_dict(t_) for t_ in token[:8]] + [char_dict(t_) for t_ in token[-8:]]
  File "/home/fanyixing/users/mxy/lion/lion/data/dataset.py", line 48, in <listcomp>
    return [char_dict(t_) for t_ in token[:8]] + [char_dict(t_) for t_ in token[-8:]]
TypeError: 'Dictionary' object is not callable

Add QNLI dataset

This dataset is smaller than both QQP and SNLI, with only 10k training examples, so it is more convenient to test model performance on it first.

add model

Support ERNIE, XLNet, BERT, and RoBERTa (Chinese).

Feature/unify_mask

Currently the ESIM and BiMPM models use 1 as the mask value, while BERT uses 0.
Unify the mask mechanism to use 0 as the mask value, which is also consistent with TensorFlow.
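
A minimal sketch of the proposed unified convention (not the repository's actual code), where padded positions carry 0 and real tokens carry 1, matching BERT's attention mask:

    import torch

    def lengths_to_mask(lengths, max_len):
        """Build a mask with 1 for real tokens and 0 for padding (the unified convention)."""
        positions = torch.arange(max_len).unsqueeze(0)    # shape (1, max_len)
        return (positions < lengths.unsqueeze(1)).long()  # shape (batch, max_len)

    # Converting an old-style mask (1 marks padding) to the unified one:
    old_mask = torch.tensor([[0, 0, 1, 1]])  # 1 = padded position
    new_mask = 1 - old_mask                  # 0 = padded position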
