GithubHelp home page GithubHelp logo

namisan / mt-dnn Goto Github PK

View Code? Open in Web Editor NEW
2.2K 58.0 414.0 2.01 MB

Multi-Task Deep Neural Networks for Natural Language Understanding

License: MIT License

Python 97.34% Dockerfile 0.17% Shell 2.48%
multi-task-learning natural-language-understanding deep-learning microsoft ranking named-entity-recognition bert machine-reading-comprehension nlp pytorch

mt-dnn's Introduction

License: MIT Travis-CI

New Release
We released Adversarial training for both LM pre-training/finetuning and f-divergence.

Large-scale Adversarial training for LMs: ALUM code.
Hybrid Neural Network Model for Commonsense Reasoning: HNN code
If you want to use the old version, please use following cmd to clone the code:
git clone -b v0.1 https://github.com/namisan/mt-dnn.git

Update
We regret to inform you that due the policy changing, it no longer provides a public storage for model sharing. We are working hard to find a solution.

Multi-Task Deep Neural Networks for Natural Language Understanding

This PyTorch package implements the Multi-Task Deep Neural Networks (MT-DNN) for Natural Language Understanding, as described in:

Xiaodong Liu*, Pengcheng He*, Weizhu Chen and Jianfeng Gao
Multi-Task Deep Neural Networks for Natural Language Understanding
ACL 2019
*: Equal contribution

Xiaodong Liu, Pengcheng He, Weizhu Chen and Jianfeng Gao
Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding
arXiv version

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao and Jiawei Han
On the Variance of the Adaptive Learning Rate and Beyond
ICLR 2020

Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao and Tuo Zhao
SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization
ACL 2020

Xiaodong Liu, Yu Wang, Jianshu Ji, Hao Cheng, Xueyun Zhu, Emmanuel Awa, Pengcheng He, Weizhu Chen, Hoifung Poon, Guihong Cao, Jianfeng Gao
The Microsoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding
ACL 2020

Hao Cheng and Xiaodong Liu and Lis Pereira and Yaoliang Yu and Jianfeng Gao
Posterior Differential Regularization with f-divergence for Improving Model Robustness
NAACL 2021

Quickstart

Setup Environment

Install via pip:

  1. python3.6
    Reference to download and install : https://www.python.org/downloads/release/python-360/

  2. install requirements
    > pip install -r requirements.txt

Use docker:

  1. Pull docker
    > docker pull allenlao/pytorch-mt-dnn:v1.3

  2. Run docker
    > docker run -it --rm --runtime nvidia allenlao/pytorch-mt-dnn:v1.3 bash
    Please refer to the following link if you first use docker: https://docs.docker.com/

Train a toy MT-DNN model

  1. Download data
    > sh download.sh
    Please refer to download GLUE dataset: https://gluebenchmark.com/

  2. Preprocess data
    > sh experiments/glue/prepro.sh

  3. Training
    > python train.py

Note that we ran experiments on 4 V100 GPUs for base MT-DNN models. You may need to reduce batch size for other GPUs.

GLUE Result reproduce

  1. MTL refinement: refine MT-DNN (shared layers), initialized with the pre-trained BERT model, via MTL using all GLUE tasks excluding WNLI to learn a new shared representation.
    Note that we ran this experiment on 8 V100 GPUs (32G) with a batch size of 32.

    • Preprocess GLUE data via the aforementioned script
    • Training:
      >scripts\run_mt_dnn.sh
  2. Finetuning: finetune MT-DNN to each of the GLUE tasks to get task-specific models.
    Here, we provide two examples, STS-B and RTE. You can use similar scripts to finetune all the GLUE tasks.

    • Finetune on the STS-B task
      > scripts\run_stsb.sh
      You should get about 90.5/90.4 on STS-B dev in terms of Pearson/Spearman correlation.
    • Finetune on the RTE task
      > scripts\run_rte.sh
      You should get about 83.8 on RTE dev in terms of accuracy.

SciTail & SNIL Result reproduce (Domain Adaptation)

  1. Domain Adaptation on SciTail
    >scripts\scitail_domain_adaptation_bash.sh

  2. Domain Adaptation on SNLI
    >scripts\snli_domain_adaptation_bash.sh

Sequence Labeling Task

  1. Preprocess data
    a) Download NER data to data/ner including: {train/valid/test}.txt
    b) Convert NER data to the canonical format: > python experiments\ner\prepro.py --data data\ner --output_dir data\canonical_data
    c) Preprocess the canonical data to the MT-DNN format: > python prepro_std.py --root_dir data\canonical_data --task_def experiments\ner\ner_task_def.yml --model bert-base-uncased

  2. Training
    > python train.py --data_dir <data-path> --init_checkpoint <bert-base-uncased> --train_dataset squad,squad-v2 --test_dataset squad,squad-v2 --task_def experiments\squad\squad_task_def.yml

Question Answer Task

  1. Preprocess data
    a) Download SQuAD data to data/squad including: {train/valid}.txt and then change file name to: {squad_train/squad_dev}.json
    b) Convert data to the MT-DNN format: > python experiments\squad\squad_prepro.py --root_dir data\canonical_data --task_def experiments\squad\squad_task_def.yml --model bert-base-uncased

  2. Training
    > python train.py --data_dir <data-path> --init_checkpoint <bert-model> --train_dataset ner --test_dataset ner --task_def experiments\ner\ner_task_def.yml

SMART

Adv training at the fine-tuning stages: > python train.py --data_dir <data-path> --init_checkpoint <bert/mt-dnn-model> --train_dataset mnli --test_dataset mnli_matched,mnli_mismatched --task_def experiments\glue\glue_task_def.yml --adv_train --adv_opt 1

Extract embeddings

  1. Extracting embeddings of a pair text example
    >python extractor.py --do_lower_case --finput input_examples\pair-input.txt --foutput input_examples\pair-output.json --bert_model bert-base-uncased --checkpoint mt_dnn_models\mt_dnn_base.pt
    Note that the pair of text is split by a special token |||. You may refer input_examples\pair-output.json as example.

  2. Extracting embeddings of a single sentence example
    >python extractor.py --do_lower_case --finput input_examples\single-input.txt --foutput input_examples\single-output.json --bert_model bert-base-uncased --checkpoint mt_dnn_models\mt_dnn_base.pt

Speed up Training

  1. Gradient Accumulation
    If you have small GPUs, you may need to use the gradient accumulation to make training stable.
    For example, if you use the flag: --grad_accumulation_step 4 during the training, the actual batch size will be batch_size * 4.

  2. FP16 The current version of MT-DNN also supports FP16 training, and please install apex.
    You just need to turn on the flag during the training: --fp16
    Please refer the script: scripts\run_mt_dnn_gc_fp16.sh

Convert Tensorflow BERT model to the MT-DNN format

Here, we go through how to convert a Chinese Tensorflow BERT model into mt-dnn format.

  1. Download BERT model from the Google bert web: https://github.com/google-research/bert

  2. Run the following script for MT-DNN format
    python scripts\convert_tf_to_pt.py --tf_checkpoint_root chinese_L-12_H-768_A-12\ --pytorch_checkpoint_path chinese_L-12_H-768_A-12\bert_base_chinese.pt

TODO

  • Publish pretrained Tensorflow checkpoints.

FAQ

Did you share the pretrained mt-dnn models?

Yes, we released the pretrained shared embedings via MTL which are aligned to BERT base/large models: mt_dnn_base.pt and mt_dnn_large.pt.
To obtain the similar models:

  1. run the >sh scripts\run_mt_dnn.sh, and then pick the best checkpoint based on the average dev preformance of MNLI/RTE.
  2. strip the task-specific layers via scritps\strip_model.py.

Why SciTail/SNLI do not enable SAN?

For SciTail/SNLI tasks, the purpose is to test generalization of the learned embedding and how easy it is adapted to a new domain instead of complicated model structures for a direct comparison with BERT. Thus, we use a linear projection on the all domain adaptation settings.

What is the difference between V1 and V2

The difference is in the QNLI dataset. Please refere to the GLUE official homepage for more details. If you want to formulate QNLI as pair-wise ranking task as our paper, make sure that you use the old QNLI data.
Then run the prepro script with flags: > sh experiments/glue/prepro.sh --old_glue
If you have issues to access the old version of the data, please contact the GLUE team.

Did you fine-tune single task for your GLUE leaderboard submission?

We can use the multi-task refinement model to run the prediction and produce a reasonable result. But to achieve a better result, it requires a fine-tuneing on each task. It is worthing noting the paper in arxiv is a littled out-dated and on the old GLUE dataset. We will update the paper as we mentioned below.

Notes and Acknowledgments

BERT pytorch is from: https://github.com/huggingface/pytorch-pretrained-BERT
BERT: https://github.com/google-research/bert
We also used some code from: https://github.com/kevinduh/san_mrc

Related Projects/Codebase

  1. Pretrained UniLM: https://github.com/microsoft/unilm
  2. Pretrained Response Generation Model: https://github.com/microsoft/DialoGPT
  3. Internal MT-DNN repo: https://github.com/microsoft/mt-dnn

How do I cite MT-DNN?

@inproceedings{liu2019mt-dnn,
    title = "Multi-Task Deep Neural Networks for Natural Language Understanding",
    author = "Liu, Xiaodong and He, Pengcheng and Chen, Weizhu and Gao, Jianfeng",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1441",
    pages = "4487--4496"
}


@article{liu2019mt-dnn-kd,
  title={Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding},
  author={Liu, Xiaodong and He, Pengcheng and Chen, Weizhu and Gao, Jianfeng},
  journal={arXiv preprint arXiv:1904.09482},
  year={2019}
}


@inproceedings{liu2019radam,
  title={On the Variance of the Adaptive Learning Rate and Beyond},
  author={Liu, Liyuan and Jiang, Haoming and He, Pengcheng and Chen, Weizhu and Liu, Xiaodong and Gao, Jianfeng and Han, Jiawei},
  booktitle={International Conference on Learning Representations},
  year={2020}
}

@inproceedings{jiang2019smart,
  title={SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization},
  author={Jiang, Haoming and He, Pengcheng and Chen, Weizhu and Liu, Xiaodong and Gao, Jianfeng and Zhao, Tuo},
 booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
  year={2020}
}

@article{liu2020mtmtdnn,
  title={The Microsoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding},
  author={Liu, Xiaodong and Wang, Yu and Ji, Jianshu and Cheng, Hao and Zhu, Xueyun and Awa, Emmanuel and He, Pengcheng and Chen, Weizhu and Poon, Hoifung and Cao, Guihong and Jianfeng Gao},
  journal={arXiv preprint arXiv:2002.07972},
  year={2020}
}

@inproceedings{liu2020mtmtdnn,
    title = "The {M}icrosoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding",
    author={Liu, Xiaodong and Wang, Yu and Ji, Jianshu and Cheng, Hao and Zhu, Xueyun and Awa, Emmanuel and He, Pengcheng and Chen, Weizhu and Poon, Hoifung and Cao, Guihong and Jianfeng Gao},
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.acl-demos.16",
    year = "2020"
}

@article{cheng2020posterior,
  title={Posterior Differential Regularization with f-divergence for Improving Model Robustness},
  author={Cheng, Hao and Liu, Xiaodong and Pereira, Lis and Yu, Yaoliang and Gao, Jianfeng},
  booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
  year = "2021",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.naacl-main.85",
  doi = "10.18653/v1/2021.naacl-main.85",
}

Contact Information

For help or issues using MT-DNN, please submit a GitHub issue.

For personal communication related to this package, please contact Xiaodong Liu ([email protected]), Yu Wang ([email protected]), Pengcheng He ([email protected]), Weizhu Chen ([email protected]), Jianshu Ji ([email protected]), Hao Cheng ([email protected]) or Jianfeng Gao ([email protected]).

mt-dnn's People

Contributors

anselmwang avatar bigbird01 avatar brittonwinterrose avatar chenweizhu avatar gjamsue avatar hao-cheng avatar liyuanlucasliu avatar mariamedp avatar maydaytyh avatar mehrad0711 avatar monomagentaeggroll avatar namisan avatar pidugusundeep avatar rajratnpranesh avatar raspberryice avatar renee-wrn avatar wugh avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

mt-dnn's Issues

Performance on QNLI dataset?

On the glue leaderboard, the QNLI performance can get 96.0. But when I train the network, the dev accuracy can only get 92.2. I don't have V100 GPU. The training setting is batch size 8 on 8 TITAN X. Can you provide some details when you train the network on QNLI dataset?

Why not using SAN module during finetuning on RTE (or other pair tasks)

As per the current scripts/run_rte.sh, you are giving answer_opt=0 for rte. I am curious what was the reason of throwing away the SAN module for finetuning of the pair tasks and going back to using pooler.

  1. Is this what you use for glue submission or you use SAN for finetuning for submission?
  2. If not, do you know the perf difference?

两个关于模型的问题

你好
想请教你们两个问题。
1. multi-task带来的促进应该和task之间的联系有关,什么时候task能够带来好的提升,以至于什么时候task反而会影响对方的表现?有没有一些理论的分析呢?
2.文章中的多个task的objective function,是否考虑过给他们设置不同的权重呢,比如最终的loss = 权重1loss1+权重2loss2....

Use task specific layers after training

Is there any way to load the model after training and classify based on the task-specific dense layers (I assume for each task a specific dense layer is learned for classification) or are these not stored with the model.pt file and lost? E.g. if one wants to predict classes for samples of an unseen dataset on the, let's say, specific MNLI dense layer.

How to Improve Performance on WNLI

I just noticed that the performance on WNLI have got a huge improvement! That's really marvelous. Can't wait to know more details about it.

Can't resume training the model

When the procedure is interrupted accidentally. I want to resume the training procedure with the checkpoint in checkpoints/model_*.pt. But I don't know why it comes out some error:

Traceback (most recent call last):
File "../train.py", line 352, in
main()
File "../train.py", line 314, in main
model.update(batch_meta, batch_data)
File "/home/mt-dnn/mt_dnn/model.py", line 171, in update
self.optimizer.step()
File "/home/mt-dnn/module/bert_optim.py", line 120, in step
exp_avg.mul_(beta1).add_(1 - beta1, grad)
RuntimeError: Expected object of type torch.FloatTensor but found type torch.cuda.FloatTensor for argument #4 'other'

I didn't change any other code but just hyperparameters. How to deal with such a problem?

Please include a license

Looking forward to inclusion of an open source license, otherwise the default copyrights apply which doesn't allow reproduction.

Intro Question about model.

Hi,

Thanks for the repo and your model. Thanks for your time.

So I have an intro question about the model.

So this model takes a pretrained BERT model and then jointly fine tunes it on 4 tasks: single sentence classification, pairwise text similarity, pairwise text classification and pairwise ranking.
For pairwise text classification it has an additional stochastic answer network attached to the head but for the other three tasks it just attaches the specific loss function for the three tasks.

The whole network is trained end to end.

Is this correct?

Thanks for the help and God bless!

Tokenization

prepro.py uses the tokenizer for bert-base-uncased. Then if we use a large model and/or an uncased model, would there be inconsistencies?

Why fine-tuning in single-task setting (not as stated in the paper)?

In the arxiv paper it is stated:

In the multi-task fine-tuning stage, we use minibatch based stochastic gradient descent (SGD) to
learn the parameters of our model (i.e., the parameters of all shared layers and task-specific layers) as shown in Algorithm 1. In each epoch, a
mini-batch b_t
is selected(e.g., among all 9 GLUE
tasks), and the model is updated according to the
task-specific objective for the task t. This approximately optimizes the sum of all multi-task objectives.

If I understand it correctly, in your code this multi-task fine-tuning stage is called MTL refinement. Then why do you fine-tune for each task in single task setting in your fine-tuning stage? There is no such stage in the original paper.
Also, in run_mt_dnn.sh there are lines:
train_datasets="mnli,rte,qqp,qnli,mrpc,sst,cola,stsb" test_datasets="mnli_matched,mnli_mismatched,rte"
Why do you only test on mnli and rte and not test on all other tasks?
I would also like to ask if I can switch from BERT large to BERT base there because i only have one 1080 GTX card.

Thank you.

Changes to made for training for a specific purpose.

What are the changes we have to make in the tran.py file in order to train for specific purpose(STSB). Some of the changes I have made are following:-

  1. train_dataset=stsb
  2. test_dataset=stsb
    Apart from that is there anything else I need to change because right now I am getting all predictions as 0.
    Even the run_stsb.sh is giving all zero prediction instead of predicting similarity score between 0 and 5.
    Please help!!!!

Evaluation mode ignored for dropout list

Dear,

while a nn.ModuleList() is used for the scoring_list, this is not the case for the dropout_list:

mt-dnn/mt_dnn/matcher.py

Lines 16 to 24 in eb0aef4

self.dropout_list = []
self.bert_config = BertConfig.from_dict(opt)
self.bert = BertModel(self.bert_config)
if opt['update_bert_opt'] > 0:
for p in self.bert.parameters():
p.requires_grad = False
mem_size = self.bert_config.hidden_size
self.decoder_opt = opt['answer_opt']
self.scoring_list = nn.ModuleList()

This seems to have an impact when switching the network mode in class MTDNNModel from training to evaluation. The scoring_list appears to be correctly switched to evaluation mode while this is not happening for the dropout_list. As a consequence, dropout is active at prediction time.

Jan Luts

GPU memory usage keeps increasing during training

I tried to finetune the pretrained mt-dnn model on a specific task, but the GPU memory usage keeps growing. After training for 1000 iterations, the memory is fully occupied, causing 'run out of memory' error. The experiment was conducted on 1080TI.

What is the range of score in stsb task?

While printing the values of test_predictions in stsb task I was getting 0 for all cases. Then I printed the scores. I thought the scores were from 0 to 5 where 5 representing most similar but when I printed them I found that some of the similarity scores were negative. Can you clarify what is the range of similarity score.
Another problem I am getting is that the test results change if I run the model again( I am not training the model, I am just predicting the test set). I thought it was due to the dropout, so I made the dropout 0 but I am still getting different results every time I ran it.
I even saw that you have used self.network.eval() which makes the network in testing mode thus even if there is any dropout layer then it ignores it.

Ensembling MT-DNN models

Hello!
I'd like to know more of the details of ensembling multiple MT-DNN models.
Specifically for the STS-B task, could you describe the steps you took to get the ensemble? Given the large model sizes, did you load them to the GPUs sequentially and just saved the individual model outputs to be averaged later? I'd appreciate your response!

Apply the architecture to individual task

Hi,

I am now working on a sentiment analysis task and found that this architecture might be a good choice for my application.

However, I am quite confused on how to use this architecture on my custom data. More specifically,
if I would like to directly apply this architecture to my custom data. What I should do? Is there a more detailed tutorial for the use of the architecture?

Why using "ratio" argument?

If "ratio" argument is provided, then in each epoch only random_picks=min(len(MNLI task batches) * args.ratio, len(all other tasks batches)) from len(all other tasks batches) are used, instead of using all other tasks batches. Such behavior seems strange to me, considering the fact that MNLI has most examples throughout the GLUE dataset. Could you please explain, what is the benefit of not using some of the examples from smaller datasets on each epoch?
Thank you very much.

Multi-gpu training not working

I'm trying to finetune this model on a specific task. But when I run the example scripts of run_rte.sh, the multi-gpu training is not working when I set CUDA_VISIBLE_DEVICES="0,1,2,3", it seems that only the first gpu is working.

I have only 4 1080 ti gpus, when I run the run_rte.sh script, I can only adjust the batch size to 1 to avoid out of memory, can this be possible training on 1080Ti?

Demo tool

Hi,
Is there any online demo interface for mt-dnn? I am working on a task similar to textual entailment and I would like to see how mt-dnn performs on a very small subset of my dataset?

`--multi_gpu_on` not working

I specified --multi_gpu_on and set CUDA_VISIBLE_DEVICES to multiple devices, but it looks like only one GPU is used? Why is that?

how to feed data, task sampling or data sampling

I had tried to do multitask learning on original BERT for seven Chinese tasks. I tried to feed data with task sampling and data sampling. The results are somehow confusing. When feeding model using task sampling strategy just like MTDNN, the mtl trained model can't achieve better results than single-task-training, But when i apply data sampling, mtl trained model can achieve better results than BERT, EIRNE on lcqmc, chnsenticorp and xnli. Is there any theoritical guidance for mtl on different datasets

Why is time prolonged in multi-GPU training?

I wonder why adding the number of gpus will increase the remaining training time under the same batch_size condition.
image
The picture above shows the situation of a single GPU.When I used 4 v100s to train the network, the time increased to two hours.

Shared task-specific heads

When mtl_opt is on (by default), do all GLUE tasks with the same n_class share the same task-specific classifier? Intuitively this seems a little strange that as distinct tasks as SST and QQP use the same classifier.
Another question is, answer_opt is set to 0 in all scripts, so when does SAN apply?
Thank you.

SciTail not using SAN

Looks like by default SciTail does not use SAN, because it uses the default value for --answer_opt, which is 0. Why is that?

prepro.py fails with assert len(lines) %2 == 0

I'm on commit 25c3bd1

05/27/2019 03:38:53 Loaded 5463 QNLI test samples
Traceback (most recent call last):
File "prepro.py", line 352, in
main(args)
File "prepro.py", line 193, in main
qnnli_train_data = load_qnnli(qnli_train_path, GLOBAL_MAP['qnli'])
File "/home/markus/mt-dnn/data_utils/glue_utils.py", line 113, in load_qnnli
assert len(lines) % 2 == 0
AssertionError

How to get the reported result?

Thanks for your work and we have a problem with the reproduce. We use the released pretrained model "mt_dnn_large.pt" and directly run "run_stsb.sh", "run_rte" and so on, however, we get low accuracy. I want to know is there any problem with the running procedure or the released model? Or something else such as the number of GPU and bach size?

Running fine_tuning and domain_adaption scripts using my trained model instead of mt_dnn_base.pt

Hello,

I have trained the mt-dnn, initialized with mt_dnn_large.pt, with some new input and with the following parameters

`{'log_file': 'checkpoints/mt-dnn-NL-labels_adamax_answer_opt1_gc0_ggc1_2019-04-15T2138/log.log', 'init_checkpoint': '../mt_dnn_models/mt_dnn_large.pt', 'data_dir': '../data/mt_dnn', 'data_sort_on': False, 'name': 'farmer', 'train_datasets': ['mnli', 'rte', 'qnli'], 'test_datasets': ['mnli_matched', 'mnli_mismatched', 'rte'], 'pw_tasks': ['qnnli'], 'update_bert_opt': 0, 'multi_gpu_on': True, 'mem_cum_type': 'simple', 'answer_num_turn': 5, 'answer_mem_drop_p': 0.1, 'answer_att_hidden_size': 128, 'answer_att_type': 'bilinear', 'answer_rnn_type': 'gru', 'answer_sum_att_type': 'bilinear', 'answer_merge_opt': 1, 'answer_mem_type': 1, 'answer_dropout_p': 0.1, 'answer_weight_norm_on': False, 'dump_state_on': False, 'answer_opt': [1, 1, 1], 'label_size': '3,2,2', 'mtl_opt': 0, 'ratio': 0, 'mix_opt': 0, 'max_seq_len': 512, 'init_ratio': 1, 'cuda': True, 'log_per_updates': 500, 'epochs': 5, 'batch_size': 16, 'batch_size_eval': 8, 'optimizer': 'adamax', 'grad_clipping': 0.0, 'global_grad_clipping': 1.0, 'weight_decay': 0, 'learning_rate': 5e-05, 'momentum': 0, 'warmup': 0.1, 'warmup_schedule': 'warmup_linear', 'vb_dropout': True, 'dropout_p': 0.1, 'dropout_w': 0.0, 'bert_dropout_p': 0.1, 'ema_opt': 0, 'ema_gamma': 0.995, 'have_lr_scheduler': True, 'multi_step_lr': '10,20,30', 'freeze_layers': -1, 'embedding_opt': 0, 'lr_gamma': 0.5, 'bert_l2norm': 0.0, 'scheduler_type': 'ms', 'output_dir': 'checkpoints/mt-dnn-NL-labels_adamax_answer_opt1_gc0_ggc1_2019-04-15T2138', 'seed': 2018, 'task_config_path': 'configs/tasks_config.json', 'tasks_dropout_p': [0.1, 0.1, 0.1]}``

I train this model for e.g., 4 epochs and I want to perform the fine tuning and domain adaptation experiments based on the resulting model, "checkpoint/model_4.pt". Therefore, in run_rte.sh or snli_domain_adaptation_bash.sh I replace the initial checkpoint from ../mt_dnn_models/mt_dnn_base.pt to checkpoint/model_4.pt.

Doing so, I get the following error

export CUDA_VISIBLE_DEVICES=4,5
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
Namespace(answer_att_hidden_size=128, answer_att_type='bilinear', answer_dropout_p=0.1, answer_mem_drop_p=0.1, answer_mem_type=1, answer_merge_opt=1, answer_num_turn=5, answer_opt=0, answer_rnn_type='gru', answer_sum_att_type='bilinear', answer_weight_norm_on=False, batch_size=32, batch_size_eval=8, bert_dropout_p=0.1, bert_l2norm=0.0, cuda=True, data_dir='../data/mt_dnn/', data_sort_on=False, dropout_p=0.1, dropout_w=0.0, dump_state_on=False, ema_gamma=0.995, ema_opt=0, embedding_opt=0, epochs=5, freeze_layers=-1, global_grad_clipping=1.0, grad_clipping=0.0, have_lr_scheduler=True, init_checkpoint='checkpoints/mt-dnn-NL-labels_adamax_answer_opt1_gc0_ggc1_2019-04-15T2138/model_4.pt', init_ratio=1, label_size='3', learning_rate=2e-05, log_file='checkpoints/mt-dnn-rte_adamax_answer_opt0_gc0_ggc1_2019-04-16T1636/log.log', log_per_updates=500, lr_gamma=0.5, max_seq_len=512, mem_cum_type='simple', mix_opt=0, momentum=0, mtl_opt=0, multi_gpu_on=False, multi_step_lr='10,20,30', name='farmer', optimizer='adamax', output_dir='checkpoints/mt-dnn-rte_adamax_answer_opt0_gc0_ggc1_2019-04-16T1636', pw_tasks=['qnnli'], ratio=0, scheduler_type='ms', seed=2018, task_config_path='configs/tasks_config.json', test_datasets=['rte'], train_datasets=['rte'], update_bert_opt=0, vb_dropout=True, warmup=0.1, warmup_schedule='warmup_linear', weight_decay=0)
04/16/2019 04:36:15 0
04/16/2019 04:36:15 Launching the MT-DNN training
04/16/2019 04:36:15 Loading ../data/mt_dnn/rte_train.json as task 0
Loaded 2490 samples out of 2490
04/16/2019 04:36:15 2
Loaded 277 samples out of 277
Loaded 3000 samples out of 3000
04/16/2019 04:36:15 ####################
04/16/2019 04:36:15 {'log_file': 'checkpoints/mt-dnn-rte_adamax_answer_opt0_gc0_ggc1_2019-04-16T1636/log.log', 'init_checkpoint': 'checkpoints/mt-dnn-NL-labels_adamax_answer_opt1_gc0_ggc1_2019-04-15T2138/model_4.pt', 'data_dir': '../data/mt_dnn/', 'data_sort_on': False, 'name': 'farmer', 'train_datasets': ['rte'], 'test_datasets': ['rte'], 'pw_tasks': ['qnnli'], 'update_bert_opt': 0, 'multi_gpu_on': False, 'mem_cum_type': 'simple', 'answer_num_turn': 5, 'answer_mem_drop_p': 0.1, 'answer_att_hidden_size': 128, 'answer_att_type': 'bilinear', 'answer_rnn_type': 'gru', 'answer_sum_att_type': 'bilinear', 'answer_merge_opt': 1, 'answer_mem_type': 1, 'answer_dropout_p': 0.1, 'answer_weight_norm_on': False, 'dump_state_on': False, 'answer_opt': [0], 'label_size': '2', 'mtl_opt': 0, 'ratio': 0, 'mix_opt': 0, 'max_seq_len': 512, 'init_ratio': 1, 'cuda': True, 'log_per_updates': 500, 'epochs': 5, 'batch_size': 32, 'batch_size_eval': 8, 'optimizer': 'adamax', 'grad_clipping': 0.0, 'global_grad_clipping': 1.0, 'weight_decay': 0, 'learning_rate': 2e-05, 'momentum': 0, 'warmup': 0.1, 'warmup_schedule': 'warmup_linear', 'vb_dropout': True, 'dropout_p': 0.1, 'dropout_w': 0.0, 'bert_dropout_p': 0.1, 'ema_opt': 0, 'ema_gamma': 0.995, 'have_lr_scheduler': True, 'multi_step_lr': '10,20,30', 'freeze_layers': -1, 'embedding_opt': 0, 'lr_gamma': 0.5, 'bert_l2norm': 0.0, 'scheduler_type': 'ms', 'output_dir': 'checkpoints/mt-dnn-rte_adamax_answer_opt0_gc0_ggc1_2019-04-16T1636', 'seed': 2018, 'task_config_path': 'configs/tasks_config.json', 'tasks_dropout_p': [0.1]}
04/16/2019 04:36:15 ####################
04/16/2019 04:36:35
############# Model Arch of MT-DNN #############
SANBertNetwork(
(bert): BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(30522, 1024)
(position_embeddings): Embedding(512, 1024)
(token_type_embeddings): Embedding(2, 1024)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
(encoder): BertEncoder(
(layer): ModuleList(
(0): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(1): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(2): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(3): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(4): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(5): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(6): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(7): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(8): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(9): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(10): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(11): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(12): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(13): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(14): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(15): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(16): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(17): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(18): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(19): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(20): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(21): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(22): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(23): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=1024, out_features=1024, bias=True)
(key): Linear(in_features=1024, out_features=1024, bias=True)
(value): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=1024, out_features=4096, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=4096, out_features=1024, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
)
)
(pooler): BertPooler(
(dense): Linear(in_features=1024, out_features=1024, bias=True)
(activation): Tanh()
)
)
(scoring_list): ModuleList(
(0): SANClassifier(
(dropout): DropoutWrapper()
(query_wsum): SelfAttnWrapper(
(att): LinearSelfAttn(
(linear): Linear(in_features=1024, out_features=1, bias=True)
(dropout): DropoutWrapper()
)
)
(attn): FlatSimilarityWrapper(
(att_dropout): DropoutWrapper()
(score_func): BilinearFlatSim(
(linear): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): DropoutWrapper()
)
)
(rnn): GRUCell(1024, 1024)
(classifier): Classifier(
(dropout): DropoutWrapper()
(proj): Linear(in_features=4096, out_features=3, bias=True)
)
)
(1): SANClassifier(
(dropout): DropoutWrapper()
(query_wsum): SelfAttnWrapper(
(att): LinearSelfAttn(
(linear): Linear(in_features=1024, out_features=1, bias=True)
(dropout): DropoutWrapper()
)
)
(attn): FlatSimilarityWrapper(
(att_dropout): DropoutWrapper()
(score_func): BilinearFlatSim(
(linear): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): DropoutWrapper()
)
)
(rnn): GRUCell(1024, 1024)
(classifier): Classifier(
(dropout): DropoutWrapper()
(proj): Linear(in_features=4096, out_features=2, bias=True)
)
)
(2): SANClassifier(
(dropout): DropoutWrapper()
(query_wsum): SelfAttnWrapper(
(att): LinearSelfAttn(
(linear): Linear(in_features=1024, out_features=1, bias=True)
(dropout): DropoutWrapper()
)
)
(attn): FlatSimilarityWrapper(
(att_dropout): DropoutWrapper()
(score_func): BilinearFlatSim(
(linear): Linear(in_features=1024, out_features=1024, bias=True)
(dropout): DropoutWrapper()
)
)
(rnn): GRUCell(1024, 1024)
(classifier): Classifier(
(dropout): DropoutWrapper()
(proj): Linear(in_features=4096, out_features=2, bias=True)
)
)
)
)

04/16/2019 04:36:35 Total number of params: 357215242

04/16/2019 04:36:36 At epoch 0
Traceback (most recent call last):
File "../train.py", line 354, in
main()
File "../train.py", line 316, in main
model.update(batch_meta, batch_data)
File "../mt_dnn/model.py", line 151, in update
self.optimizer.step()
File "../module/bert_optim.py", line 119, in step
exp_avg.mul_(beta1).add_(1 - beta1, grad)
RuntimeError: expected type torch.FloatTensor but got torch.cuda.FloatTensor

May I ask you what is the problem?
I can run either of these scripts using ../mt_dnn_models/mt_dnn_large.pt but not using the model that trained myself, i.e., model_4.pt

MT-DNN tutorial and knowledge distillation

Hello,

Do you have a rough timeline for when you might release:

  1. a tutorial for using pre-trained MT-DNN models,
  2. code/models for MT-DNN with knowledge distillation?

Thank you!

mt_dnn_base.pt

Is the downloaded mt_dnn_base.pt the same as the resulting model after running scripts/run_mt_dnn.sh? If not, what is their relationship?

stochastic prediction dropout mask greater than one

Thanks for your work and we have a problem with the stochastic prediction dropout.

def generate_mask(new_data, dropout_p=0.0, is_training=False):
    if not is_training: dropout_p = 0.0
    new_data = (1-dropout_p) * (new_data.zero_() + 1)
    for i in range(new_data.size(0)):
        one = random.randint(0, new_data.size(1)-1)
        new_data[i][one] = 1
    print(new_data)
    mask = Variable(1.0/(1 - dropout_p) * torch.bernoulli(new_data), requires_grad=False)
    return mask

When call this function as above, the element in the matrix will be greater than one, I want to known why multiply 'Variable(1.0/(1 - dropout_p)'. It will cause the element greater than one.

No such file or directory: 'data/mt_dnn/mnli_train.json'

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
Namespace(answer_att_hidden_size=128, answer_att_type='bilinear', answer_dropout_p=0.1, answer_mem_drop_p=0.1, answer_mem_type=1, answer_merge_opt=1, answer_num_turn=5, answer_opt=0, answer_rnn_type='gru', answer_sum_att_type='bilinear', answer_weight_norm_on=False, batch_size=8, batch_size_eval=8, bert_dropout_p=0.1, bert_l2norm=0.0, cuda=True, data_dir='data/mt_dnn', data_sort_on=False, dropout_p=0.1, dropout_w=0.0, dump_state_on=False, ema_gamma=0.995, ema_opt=0, embedding_opt=0, epochs=5, freeze_layers=-1, global_grad_clipping=1.0, grad_clipping=0, have_lr_scheduler=True, init_checkpoint='mt_dnn_models/bert_model_base.pt', init_ratio=1, label_size='3', learning_rate=5e-05, log_file='mt-dnn-train.log', log_per_updates=500, lr_gamma=0.5, max_seq_len=512, mem_cum_type='simple', mix_opt=0, momentum=0, mtl_opt=0, multi_gpu_on=False, multi_step_lr='10,20,30', name='farmer', optimizer='adamax', output_dir='checkpoint', pw_tasks=['qnnli'], ratio=0, scheduler_type='ms', seed=2018, task_config_path='configs/tasks_config.json', test_datasets=['mnli_mismatched', 'mnli_matched'], train_datasets=['mnli'], update_bert_opt=0, vb_dropout=True, warmup=0.1, warmup_schedule='warmup_linear', weight_decay=0)
07/01/2019 01:37:37 0
07/01/2019 01:37:37 Launching the MT-DNN training
07/01/2019 01:37:37 Loading data/mt_dnn/mnli_train.json as task 0
Traceback (most recent call last):
File "train.py", line 350, in
main()
File "train.py", line 178, in main
train_data = BatchGen(BatchGen.load(train_path, True, pairwise=pw_task, maxlen=args.max_seq_len),
File "/home/ubuntu/paraphrase/mt-dnn/mt_dnn/batcher.py", line 55, in load
with open(path, 'r', encoding='utf-8') as reader:
FileNotFoundError: [Errno 2] No such file or directory: 'data/mt_dnn/mnli_train.json'

I use the tutorial to train, what does this error mean? What does the Json file contain?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.