
prajdabre / yanmtt


Yet Another Neural Machine Translation Toolkit

License: MIT License

Python 76.61% Shell 0.33% Makefile 0.01% Dockerfile 0.04% Jsonnet 0.01% CSS 9.03% JavaScript 8.85% Jupyter Notebook 4.73% HTML 0.40% Procfile 0.01%

yanmtt's Introduction

YANMTT

YANMTT is short for Yet Another Neural Machine Translation Toolkit. For the backstory of how I ended up creating this toolkit, scroll to the bottom of this README. Although the name says that it is yet another toolkit, it was written with the aim of making the whole training flow easier to understand, from data pre-processing, sharding and batching to distributed training and decoding. There is a significant emphasis on multilingualism and on cross-lingual learning.

List of features:

  1. Basic NMT pre-training, fine-tuning, decoding, visualization
    • Distributed, mixed precision, multilingual training.
    • Denoising pre-training in mBART, mT5 or UL2 style.
    • Fine-tuning your own or official BART-like models such as BART, mBART and IndicBART.
    • Joint supervised and unsupervised training using monolingual and parallel corpora.
    • Sentence representation, attention extraction, and scoring translations.
  2. User Interface
    • GUI to demo and debug models.
    • Select any official huggingface (mBART, IndicBART) or custom model and run it on any of the supported languages.
    • Visualize attention weights at each layer for each head using bertviz.
    • Visualize encoder representations of a set of sentences using tensorflow projector.
  3. Advanced features
    • Multi-billion parameter models can be trained using FSDP.
    • Mixtures-of-experts layers.
    • Tempered softmax training.
    • Softmax calibration.
    • Entropy maximization training.
    • Multi-layer softmax training.
    • 8-bit optimizers for training large models.
    • Various weight initialization strategies.
    • Various positional embedding strategies like sinusoidal, learned, RoPE, AliBi, NoPE.
  4. Light-weight fine-tuning
    • Adaptor (Houlsby, Bapna, Mixtures of Adapters) and prompt tuning.
    • Hypercomplex, IA-3 light-weight adaptors.
    • Eliminate components or layers prior to decoding or fine-tuning.
    • Fine-grained control over what parameters to fine-tune.
  5. Model compression
    • Training compact models from scratch via recurrently stacked layers (ALBART).
    • Distillation of pre-trained and fine-tuned models.
  6. Simultaneous NMT
    • Simulated wait-k NMT where we train and decode wait-k models, or decode full-sentence models using wait-k.
  7. Multi-source and Document NMT
    • Vanilla multi-source with two input sentences belonging to different languages.
    • Document level NMT where one input is the current sentence and the other one is the context.
    • Various multi-source fusion strategies.
    • Can be combined with wait-k NMT.

How to install: Follow installation_instructions.txt

Installing the GUI:

  1. Follow the README.md file in the interface folder.
  2. The GUI does not explicitly depend on YANMTT.

Scripts and their functionality:

  1. create_autotokenizer.sh and create_autotokenizer.py: These scripts govern the creation of a unigram SPM or BPE tokenizer. The shell script creates the subword segmenter using sentencepiece, which can produce both SPM and BPE models. All you need is a monolingual corpus for the languages you are interested in. The python script wraps this in an AlbertTokenizer (for SPM) or an MBartTokenizer (for BPE), adds special user-defined tokens, and saves a configuration file so the tokenizer can later be loaded via AutoTokenizer.
    Usage: see examples/create_tokenizer.sh
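    For intuition, the same flow can be sketched directly with sentencepiece and transformers. This is a hypothetical, simplified stand-in for the actual scripts; the corpus path, vocabulary size and special tokens below are made up:

    import sentencepiece as spm
    from transformers import AlbertTokenizer, AutoTokenizer

    # Train a unigram SPM model on a monolingual corpus (hypothetical path and size).
    spm.SentencePieceTrainer.train(
        input="data/mono.all.txt", model_prefix="my_spm", vocab_size=32000, model_type="unigram"
    )

    # Wrap the SPM model in an AlbertTokenizer, add user-defined tokens and save its config.
    tok = AlbertTokenizer("my_spm.model", do_lower_case=False, keep_accents=True)
    tok.add_special_tokens({"additional_special_tokens": ["<2en>", "<2hi>"]})
    tok.save_pretrained("my_tokenizer")

    # The saved directory can now be loaded generically.
    tok = AutoTokenizer.from_pretrained("my_tokenizer")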

  2. pretrain_nmt.py: This is used to pre-train a model. At the very least you need a monolingual corpus for the languages you are interested in and a tokenizer trained for those languages. This script can also be used to do denoising-style training jointly with regular NMT training, although the NMT training is rather basic because there is no evaluation during training; if you want to do advanced NMT training then you should use the "train_nmt.py" script. Ultimately, you should not use the outcome of this script to perform final translations. Additional advanced usages include: simulated wait-k simultaneous NMT, knowledge distillation, fine-tuning pre-existing MBART models with fine-grained control over what should be initialized, frozen or tuned, etc. Read the code and the command line arguments for a better understanding of the advanced features.
    Usage: see examples/train_mbart_model.sh
    Note 1: If M is your model name then a folder "M_deploy" is created which you can directly use with AutoTokenizer and AutoModel.
    Note 2: If you plan to use this "M_deploy" model with the GUI then remember to use the --supported_languages flag.
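    As a rough illustration of Note 1 (the folder name below is a placeholder for whatever "M_deploy" directory the script produced), the deployed model loads like any other Hugging Face checkpoint:

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # "my_mbart_deploy" stands in for the "M_deploy" folder created by pretrain_nmt.py.
    tok = AutoTokenizer.from_pretrained("my_mbart_deploy")
    model = AutoModelForSeq2SeqLM.from_pretrained("my_mbart_deploy")

    batch = tok("A sentence to denoise or translate.", return_tensors="pt")
    print(tok.batch_decode(model.generate(**batch, max_length=32), skip_special_tokens=True))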

  3. train_nmt.py: This is used to either train an NMT model from scratch or fine-tune a pre-existing MBART model or an NMT model created via YANMTT. At the very least you need a parallel corpus (preferably split into train, dev and test sets, although we can make do with only a train set) for the language pairs you are interested in. There are several advanced features such as: simulated wait-k simultaneous NMT, knowledge distillation, fine-grained control over what should be initialized, frozen or tuned, document NMT, multi-source NMT, adaptor tuning, prompt tuning, mixtures-of-experts layers and multilingual NMT training.
    Usage: see examples/train_or_fine_tune_model.sh
    Note: The notes applying to the "pretrain_nmt.py" script also apply to this script.

  4. decode_model.py: This is used to decode sentences using a trained model. Additionally you can do translation pair scoring, forced decoding, forced alignment (experimental), encoder/decoder representation extraction and alignment visualization.
    Usage: see examples/decode_or_probe_model.sh

  5. common_utils.py: This contains all housekeeping functions such as corpora splitting, batch generation, loss computation etc. Do take a look at all the methods since you may need to modify them.

  6. average_checkpoints.py: You can average the specified checkpoints using arithmetic averaging.
    Usage: see examples/avergage_model_checkpoints.sh
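    Conceptually this is just an element-wise mean over state dicts; a minimal stand-alone sketch (not the actual script; the file names are made up and plain state-dict checkpoints are assumed):

    import torch

    paths = ["model.ckpt.1", "model.ckpt.2", "model.ckpt.3"]  # hypothetical checkpoint files
    avg = None
    for p in paths:
        sd = torch.load(p, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in sd.items()}
        else:
            for k, v in sd.items():
                avg[k] += v.float()
    avg = {k: v / len(paths) for k, v in avg.items()}
    torch.save(avg, "model.ckpt.avg")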

  7. gpu_blocker.py: This is used to temporarily occupy a GPU in case you work in a shared GPU environment. Run it in the background before launching the training processes so that, while the training scripts are busy with preprocessing such as sharding or model loading, the GPU you are aiming for is not grabbed by someone else. Usage is shown in the example scripts for training.
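    The underlying idea is simply to hold some memory on the target device until the real job claims it; a hypothetical stand-alone version (not the actual gpu_blocker.py) could look like:

    import sys
    import time
    import torch

    gpu_id = int(sys.argv[1]) if len(sys.argv) > 1 else 0
    # Reserve ~2 GB of float32 on the chosen GPU so it appears busy to others.
    blocker = torch.zeros(1024, 1024, 512, device=f"cuda:{gpu_id}")
    print(f"Holding cuda:{gpu_id}; stop this process once training has claimed the GPU.")
    while True:
        time.sleep(60)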

Note:

  1. Whenever running the example usage scripts, simply run them as examples/scriptname.sh from the root directory of the toolkit.
  2. The data under examples/data is taken from https://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/ and is part of the ALT Parallel Corpus, released under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

License and copyright:

  1. MIT license for code that I wrote.
  2. Apache license for modifications or additions to the huggingface code in the transformers folder.

Copyright 2021 National Institute of Information and Communications Technology (Raj Dabre)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Contact:
Contact me (Raj Dabre) at [email protected] or [email protected] for general queries. For queries about the user interface, please contact Diptesh Kanojia at [email protected] or Chinmay Sawant at [email protected]

Backstory: Why I made this toolkit
Despite the fact that I enjoy coding, I never really pushed myself during my Masters and Ph.D. towards writing a self-contained toolkit. I had always known that coding is an important part of research, and although I had made plenty of meaningful changes to several code bases, I never felt like I owned any of those changes.

Fast forward to 2020, when I wanted to play with MBART/BART/MASS. It would have been easy to use fairseq or tensor2tensor, but then again the feeling of lack of ownership would remain. Huggingface provides a lot of implementations but (at the time) had no actual script to easily do MBART pre-training. All I had was this single comment and this guide for distributed training using pytorch (thanks yangkky).

After a bit of hesitation I decided to get my hands dirty and make a quick notebook for MBART pretraining. That snowballed into me writing my own pipeline for data sharding, preprocessing and training. Since I was at it, I wrote a pipeline for fine-tuning. Why not go further and write a pipeline for decoding and analysis? Fine-grained control over fine-tuning? Distillation? Multi-source NMT? Document NMT? Simultaneous wait-k NMT? Three months later I ended up with this toolkit, which I wanted to share with everyone. Since I have worked in low-resource MT and efficient MT, this toolkit mostly contains implementations that somehow involve transfer learning, compression/distillation and simultaneous NMT.

I am pretty sure it's not as fast or as polished as the ones written by the awesome people at GAFA, but I will be more than happy if a few people use my toolkit.

yanmtt's People

Contributors

dipteshkanojia · jaygala24 · prajdabre


yanmtt's Issues

Some problem with loss

Hi, after training for 1.5 million steps with the settings from issue #39, I checked the loss curve in TensorBoard. [Attached: TensorBoard image of the training loss curve.]

According to run_train.log, the printed loss has been growing, from around 2 to around 6 over these 1.5 million steps. I'm wondering whether this behaviour of the training loss is reasonable.

Tokenization issue with pretrained model

I am trying to pretrain BART further from the huggingface checkpoint with the below command, and it seems like there is an issue with a mismatched number of arguments for _tokenize.

The command is below:
python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model facebook/bart-large --tokenizer_name_or_path facebook/bart-large --langs en --mono_src examples/data/train.en --batch_size 8

The error is:
Using softmax temperature of 1.0
Masking ratio: 0.3
Training for: ['en']
Shuffling corpus!
Traceback (most recent call last):
File "pretrain_nmt.py", line 628, in
run_demo()
File "pretrain_nmt.py", line 625, in run_demo
mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/root/yanmtt/pretrain_nmt.py", line 221, in model_create_load_run_save
for input_ids, input_masks, decoder_input_ids, labels in generate_batches_monolingual_masked_or_bilingual(tok, args, rank, files, train_files, ctr): #Batches are generated from here. The argument (0.30, 0.40) is a range which indicates the percentage of the source sentence to be masked in case we want masking during training just like we did during BART pretraining. The argument 3.5 is the lambda to the poisson length sampler which indicates the average length of a word sequence that will be masked. Since this is pretraining we do not do any evaluations even if we train on parallel corpora.
File "/root/yanmtt/common_utils.py", line 482, in generate_batches_monolingual_masked
iids = tok(lang + " " + masked_sentence + " ", add_special_tokens=False, return_tensors="pt").input_ids
File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils_base.py", line 2377, in call
**kwargs,
File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils_base.py", line 2447, in encode_plus
**kwargs,
File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 441, in _encode_plus
first_ids = get_input_ids(text)
File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 410, in get_input_ids
tokens = self.tokenize(text, **kwargs)
File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 342, in tokenize
tokenized_text = split_on_tokens(no_split_token, text)
File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 336, in split_on_tokens
for token in tokenized_text
File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 336, in
for token in tokenized_text
TypeError: _tokenize() takes 2 positional arguments but 5 were given

Upon some further inspection, it seems like in a commit a few days ago, this line was changed to have 4 arguments: https://github.com/prajdabre/yanmtt/blob/main/transformers/src/transformers/tokenization_utils.py#L319

However, the _tokenize function for the BART tokenizer (which I believe inherits all the way down from GPT2) takes fewer arguments:
https://github.com/prajdabre/yanmtt/blob/main/transformers/src/transformers/models/gpt2/tokenization_gpt2.py#L241

Trying to pretrain mBART model.

Hi,
I'm trying to pre-train an mBART model using the following parameters:
!python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --fp16 --pretrained_model "facebook/mbart-large-50" --model_path "facebook/mbart-large-50" --tokenizer_name_or_path "facebook/mbart-large-50" --mono_src "/content/yanmtt/cleaned_Sanskrit_text_for_LM.txt" --shard_files --batch_size 16

I'm getting this error.

Using label smoothing of 0.1
Using gradient clipping norm of 1.0
Using softmax temperature of 1.0
Masking ratio: 0.3
Training for: ['']
Shuffling corpus!
Zero size batch due to an abnormal example. Skipping empty batch.
Zero size batch due to an abnormal example. Skipping empty batch.
Zero size batch due to an abnormal example. Skipping empty batch.
Saving the model
Loading from checkpoint
Traceback (most recent call last):
File "pretrain_nmt.py", line 888, in
run_demo()
File "pretrain_nmt.py", line 885, in run_demo
mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/content/yanmtt/pretrain_nmt.py", line 488, in model_create_load_run_save
lprobs, labels, args.label_smoothing, ignore_index=tok.pad_token_id
File "/content/yanmtt/common_utils.py", line 130, in label_smoothed_nll_loss
nll_loss = -lprobs.gather(dim=-1, index=target)
RuntimeError: Size does not match at dimension 1 expected index [1, 13, 1] to be smaller than src [1, 12, 250054] apart from dimension 2

Nan loss after 70K steps on the pre-training.

Hi again!

I was able to start training my BART model thanks to your help with the tokenizer, but during pre-training, at around 70K steps, the training loss became NaN:

...
76930 2.6496577
76940 2.59109
76950 2.596339
76960 2.735384
76970 2.6265104
76980 nan
76990 nan
Saving the model
Loading from checkpoint
Loading from checkpoint
77000 nan
77010 nan
...

I used the following training arguments:

python3 pretrain_nmt.py -n 1 -nr 0 -g 2 --model_path models/bart_base \
--tokenizer_name_or_path tokenizers/mbart-bpe50k \
--langs xx --mono_src data/train.xx \
--batch_size 4096 \
--multistep_optimizer_steps 4 \
--num_batches 500000 \
--warmup_steps 16000 \
--encoder_layers 6 \
--decoder_layers 6 \
--max_length 128 \
--encoder_attention_heads 12 \
--decoder_attention_heads 12 \
--decoder_ffn_dim 3072 \
--encoder_ffn_dim 3072 \
--d_model 768 \
--fp16

Do you have any idea what could have caused this?
The only hypotheses that come to mind are that --fp16 could have caused this, or that I should use a smaller lr than the default one (lr 1e-3).

Thanks for your time,
Gorka

Add post-norm to the model

Currently the mbart backbone code I use has pre-norm, which is layer(norm(input)) + input, whereas some people seem to say that post-norm, which is norm(layer(input) + input), might be better for zero-shot. Lord alone knows what's going to be useful when.

Having a flag to control pre- and post-norm in the encoder and decoder would be perfect.
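For clarity, here is a minimal PyTorch sketch (hypothetical, not the actual mbart modelling code) of the two residual orderings such a flag would switch between:

    import torch.nn as nn

    class SubLayer(nn.Module):
        def __init__(self, layer, d_model, post_norm=False):
            super().__init__()
            self.layer = layer                  # e.g. a self-attention or FFN block
            self.norm = nn.LayerNorm(d_model)
            self.post_norm = post_norm

        def forward(self, x):
            if self.post_norm:                  # post-norm: norm(layer(input) + input)
                return self.norm(self.layer(x) + x)
            return self.layer(self.norm(x)) + x  # pre-norm: layer(norm(input)) + input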

Getting error when trying to pre-train for three languages

using the below command:

python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs hi,kn,bn --mono_src /home/aniruddha/all_data/train.hi,/home/aniruddha/all_data/train.kn,/home/aniruddha/all_data/train.bn --batch_size 8 --batch_size_indicates_lines --shard_files --model_path aibharat/IndicBART/model --port 7878


Using label smoothing of 0.1
Using gradient clipping norm of 1.0
Using softmax temperature of 1.0
Masking ratio: 0.3
Training for: ['hi', 'kn', 'bn']
Shuffling corpus!
Shuffling corpus!
Shuffling corpus!
Saving the model
Loading from checkpoint
Traceback (most recent call last):
File "pretrain_nmt.py", line 968, in
run_demo()
File "pretrain_nmt.py", line 965, in run_demo
mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in wrap
fn(i, *args)
File "/home/aniruddha/yanmtt/pretrain_nmt.py", line 521, in model_create_load_run_save
lprobs, labels, args.label_smoothing, ignore_index=tok.pad_token_id
File "/home/aniruddha/yanmtt/common_utils.py", line 147, in label_smoothed_nll_loss
smooth_loss.masked_fill_(pad_mask, 0.0)
RuntimeError: The expanded size of the tensor (316) must match the existing size (315) at non-singleton dimension 1. Target sizes: [8, 316, 1]. Tensor sizes: [8, 315, 1]

Some confusion about <CUDA out of memory>

Hi, when I use train_mbart_model.sh to continue pre-training mBART-50, after 300k batches the following error occurred: RuntimeError: CUDA out of memory. Tried to allocate 1.90 GiB (GPU 0; 39.44 GiB total capacity; 19.93 GiB already allocated; 1.31 GiB free; 36.14 GiB reserved in total by PyTorch). At that point one epoch had already been completed, so I restarted the run from the checkpoint that had just been generated, with the batch size reduced from 2048 to 1024. I'm confused about why this problem suddenly appeared after the task had run for so long, or whether there is some hidden problem in the program.
I don't know why this happens, so I'm worried that the problem will occur again after the task restarts, and I don't know whether only changing the batch size will help.

I use 4 GPUS: A100, NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7;
and total memory of per GPU is : 40G;

and during this time, there were probably no other tasks preempting resources

and my script setting is:
export CUDA_VISIBLE_DEVICES=0,1,2,3 # Change to the GPU ID corresponding to a GPU that is free.
export PYTHONWARNINGS='ignore:semaphore_tracker:UserWarning'
nohup python pretrain_nmt_new.py -n 1 -nr 0 -g 4 --model_path gen_model/mbart_v1/mbart-50-v1 --tokenizer_name_or_path pretrain_model/mbart-50 --langs en_XX,es_XX,vi_VN,id_ID,th_TH,pt_XX,zh_CN,ko_KR --mono_src hyb_train_v1/train.en,hyb_train_v1/train.es,hyb_train_v1/train.vi,hyb_train_v1/train.id,hyb_train_v1/train.th,hyb_train_v1/train.pt,hyb_train_v1/train.zh,hyb_train_v1/train.ko --encoder_layers 12 --decoder_layers 12 --encoder_attention_heads=12 --decoder_attention_heads=12 --encoder_ffn_dim=128 --decoder_ffn_dim=4096 --d_model=1024 --batch_size 2048 --use_official_pretrained --pretrained_model pretrain_model/mbart-50 --no_reload_optimizer_ctr_and_scheduler --long_save_every 50000 --num_batches 10000000 --save_intermediate_checkpoints --data_sampling_temperature 1.0 --hard_truncate_length 512 --max_length 512 --shard_files > gen_model/mbart_v1/run_train.log 2>&1 &

Binary executables for all python scripts

Currently, if you want to run a command it has to be "python [script] [arguments]".
Someone told me that it would be cooler if people could do the same via a binary, like fairseq does (think fairseq-train or fairseq-preprocess).
So if any of you brave souls want to contribute to this then feel free.
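For anyone picking this up, console entry points are the usual mechanism; below is a hypothetical packaging sketch (YANMTT does not currently ship anything like this, and the binary names are made up):

    # setup.py (hypothetical)
    from setuptools import setup

    setup(
        name="yanmtt",
        version="0.1",
        py_modules=["pretrain_nmt", "train_nmt", "decode_model"],
        entry_points={
            "console_scripts": [
                # "binary-name = module:function"; each target function must parse
                # sys.argv itself, roughly like the existing run_demo() entry points.
                "yanmtt-pretrain = pretrain_nmt:run_demo",
                "yanmtt-train = train_nmt:run_demo",
                "yanmtt-decode = decode_model:run_demo",
            ]
        },
    )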

RuntimeError: [enforce fail at inline_container.cc:145] . PytorchStreamReader failed reading file data/94862161006912: file read failed

Hi, when I use train_mbart_model.sh to further pre-train starting from mBART-large-50 from https://huggingface.co/facebook/mbart-large-50:

When I run on a single GPU there is no problem, but when I use the following setup and want to run on 2 GPUs,

export CUDA_VISIBLE_DEVICES=0,1 # Change to the GPU ID corresponding to a GPU that is free.
nohup python pretrain_nmt.py -n 1 -nr 0 -g 2 --model_path gen_model/mbart/mbart-50-v1 --tokenizer_name_or_path pretrain_model/mbart-50 --langs en --mono_src examples/test_data/test_mbart_train.en --encoder_layers 12 --decoder_layers 12 --encoder_attention_heads=12 --decoder_attention_heads=12 --encoder_ffn_dim=128 --decoder_ffn_dim=4096 --d_model=1024 --batch_size 128 --use_official_pretrained --pretrained_model pretrain_model/mbart-50 --no_reload_optimizer_ctr_and_scheduler --shard_files > gen_model/mbart/run_train.log 2>&1 &
I get the following error:

Number of model parameters: 610879488
Total number of params to be optimized are: 610879488
Percentage of parameters to be optimized: 100.0
Initial LR is: 1.25e-07
Training from official pretrained model
/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:247: UserWarning: To get the last learning rate computed by the scheduler, please use get_last_lr().
warnings.warn("To get the last learning rate computed by the scheduler, "
/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Traceback (most recent call last):
File "pretrain_nmt.py", line 968, in <module>
run_demo()
File "pretrain_nmt.py", line 965, in run_demo
mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
File "****/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "****/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "****/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "****/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "****/mt_mbart/yanmtt/pretrain_nmt.py", line 313, in model_create_load_run_save
checkpoint_dict = torch.load(CHECKPOINT_PATH, map_location=map_location)
File "****/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/serialization.py", line 594, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "****/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/serialization.py", line 853, in _load
result = unpickler.load()
File "****/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/serialization.py", line 845, in persistent_load
load_tensor(data_type, size, key, _maybe_decode_ascii(location))
File "****/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/serialization.py", line 833, in load_tensor
storage = zip_file.get_storage_from_record(name, size, dtype).storage()
RuntimeError: [enforce fail at inline_container.cc:145] . PytorchStreamReader failed reading file data/94862161006912: file read failed

Can you help me with this problem? I spent a long time on it and didn't manage to solve it.

Unexpected Keyword arguments prompt_params, adaptor_layers, deep_adaptor_tuning, deep_adaptor_tuning_ffn_only, parallel_adaptors

Hi @prajdabre ,

Thanks for doing great work with this library. I was trying to use it and ran into this issue where the prompt_params, adaptor_layers, deep_adaptor_tuning, deep_adaptor_tuning_ffn_only and parallel_adaptors params are being passed here to forward, but the MBartForConditionalGeneration class's forward function doesn't expect them.

I wanted to understand from you whether the fix is as simple as adding these params to the forward function signature with a default value of None (in which case I'm guessing we would need to make changes in the forward function's implementation itself to actually use these params).

Let me know if you think I might be missing something here. Thanks!

Getting error when pretraining with a new language (Sanskrit)

We are trying to pre-train a model initialized with IndicBART. We use the below command:
python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs sa --mono_src examples/data/train.sa --batch_size 8 --batch_size_indicates_lines --shard_files --model_path ai4bharat/IndicBART

We are getting the below error.

Calling AlbertTokenizer.from_pretrained() with the path to a single file or url is deprecated
Traceback (most recent call last):
File "pretrain_nmt.py", line 968, in
run_demo()
File "pretrain_nmt.py", line 965, in run_demo
mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/aniruddha/machine_translation/yanmtt/pretrain_nmt.py", line 85, in model_create_load_run_save
tok = AlbertTokenizer.from_pretrained(args.tokenizer_name_or_path, do_lower_case=False, use_fast=False, keep_accents=True)
File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/tokenization_utils_base.py", line 1789, in from_pretrained
resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/tokenization_utils_base.py", line 1860, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/models/albert/tokenization_albert.py", line 153, in init
self.sp_model.Load(vocab_file)
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/sentencepiece/init.py", line 367, in Load
return self.LoadFromFile(model_file)
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/sentencepiece/init.py", line 171, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)

GPU Consumption keeps on increasing

Hi,
I started training the model with the following parameters:
python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --langs hi_IN --batch_size_indicates_lines --pretrained_model "facebook/mbart-large-50" --model_path "facebook/mbart-large-50" --tokenizer_name_or_path "facebook/mbart-large-50" --mono_src "sans_seq2seq/cleaned_Sanskrit_text_for_LM.txt" --shard_files --batch_size 2

It starts training; however, after a few hours it crashes due to OOM.
Monitoring the GPU, I found that the GPU memory consumption keeps increasing.

GPU Memory is 48GB.

Can you please tell me what could cause this?
Thanks

Add xProphetNet model into this toolkit

Hi, very helpful toolkit, I have learned a lot from it.

Recently, I have been focused on multilingual title generation tasks, and I found that the xProphetNet model has good performance, especially on the XGLUE benchmark.
I wanted to distill a small xProphetNet model and pre-train it on my own dataset. However, I did not find the relevant pre-training code, so I would like to ask whether you would consider adding pre-training for the xProphetNet model. I can provide the code for the model architecture and the fine-tuning process (which reproduces the results).

To address possible doubts: I considered using the mBART model, but the tokenizer of the pre-trained mBART model is language-specific, and my own dataset cannot be split by language for training.
I considered putting all the data in the same file to generate a unified tokenizer, but I was concerned that a relative reduction in the vocab_size might affect the model's effectiveness. Do you have any suggestions?

Thanks

mBART embedding matrix pruning while finetuning on a single language

Finetuning mBART-large on my machines is possible with gradient accumulation, but the training could be faster if I were able to decrease the size of the loaded model.

Is there any easy way to reduce the vocabulary size of mBART, pruning embedding parameters we won't use, when finetuning the LM on a monolingual task using your tool?
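Not something the toolkit advertises, but as a rough illustration of the idea, one could slice the shared embedding (and the tied lm_head) down to the token ids that actually occur in the monolingual corpus. The corpus path below is hypothetical, and the tokenizer's own vocabulary would still need remapping, which is not shown:

    import torch
    from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

    tok = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")
    model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")

    # Collect the token ids actually used by the monolingual data, plus special tokens.
    kept = set(tok.all_special_ids)
    with open("train.mono.txt") as f:  # hypothetical corpus path
        for line in f:
            kept.update(tok(line.strip()).input_ids)
    kept = sorted(kept)

    # Slice the shared embedding and keep encoder/decoder/lm_head tied to it.
    old = model.model.shared.weight.data
    new_emb = torch.nn.Embedding(len(kept), old.size(1))
    new_emb.weight.data.copy_(old[kept])
    model.model.shared = new_emb
    model.model.encoder.embed_tokens = new_emb
    model.model.decoder.embed_tokens = new_emb
    model.lm_head = torch.nn.Linear(old.size(1), len(kept), bias=False)
    model.lm_head.weight = new_emb.weight
    model.final_logits_bias = model.final_logits_bias[:, kept]
    model.config.vocab_size = len(kept)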

Continue training on pre-trained BART model

Hi,

First thanks for the work on this repo !

Now, I have some quite specific requirements on training BART model and I see several of your comments (on Fairseq and/or huggingface) pointing to here.

Before deep diving into your code, I'm curious how easily I might use it for my needs.
I am trying to:

  • use a pre-trained mBART (available on fairseq/Hugging Face, specifically the Barthez model)
  • continue monolingual training with the BART objective, i.e. denoising.

Most code/script examples are aimed at finetuning the model, which usually excludes the denoising part.

Thanks for any insights!

Using masked inputs at inference time

I am considering using YANMTT to train my own BART model. However, instead of using it as the initial model for a subsequent fine-tuning process, I am interested in using the BART model itself to generate alternative versions of an input sentence. To do this, I would like to mask a percentage of the words in a sentence at inference time and let the model generate a variation of it via beam search decoding:

  • Original sentence: Mike goes to the bookstore on Thursday
  • Possible masked input sentence: <mask> goes to the bookstore <mask>
  • Possible model output: Jerry happily goes to the bookstore with his friends

Can this be easily done with YANMTT? I am trying to have my own model for the generation of synthetic samples discussed in the paper "Detecting Hallucinated Content in Conditional Neural Sequence Generation" (section 3.1).
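For reference, this kind of mask infilling works with a plain Hugging Face BART checkpoint (the snippet below uses the official facebook/bart-large rather than a YANMTT-trained model), and a model pre-trained with YANMTT should in principle be usable the same way once exported to a Hugging Face-compatible folder:

    from transformers import BartForConditionalGeneration, BartTokenizer

    tok = BartTokenizer.from_pretrained("facebook/bart-large")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

    masked = "<mask> goes to the bookstore <mask>"
    batch = tok(masked, return_tensors="pt")
    # Beam search over the denoised reconstruction; several variants are returned.
    outputs = model.generate(**batch, num_beams=5, num_return_sequences=3, max_length=32)
    print(tok.batch_decode(outputs, skip_special_tokens=True))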

Three custom languages and two tasks — is this a good place to start?

I have aligned datasets for three different custom languages. Each corpus is a flat text file where each line is a sentence, and documents are separated by empty lines. All sentences and documents match between the datasets. There are two tasks I'd like to be able to perform: 1) translate between the languages, and 2) infill sentences from any single language. For the translation task, given languages A, B, and C, it's actually not likely I'll ever go from C -> A or B -> A, but I definitely want to translate A -> B and A -> C. Other translations that would be helpful would be B -> C and C -> B.

From the MBART examples at HuggingFace it looks like MBartForConditionalGeneration could perhaps do task 1 (though maybe not in all directions listed above?), and BartForConditionalGeneration could do task 2. But is there any reason why MBartForConditionalGeneration couldn't do both? That is, if I pass an input with a <mask> token to MBART, will it perform the infilling, just as BART would? If so, then does your toolkit make sense as a place to start?

Any thoughts very much appreciated.

Error when trying to pretrain with other language extensions apart from hi

The command we are using:
python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs kn --mono_src /home/aniruddha/all_data/train.kn --batch_size 8 --batch_size_indicates_lines --shard_files --model_path aibharat/IndicBART/model --port 7878


Traceback (most recent call last):
File "pretrain_nmt.py", line 970, in
run_demo()
File "pretrain_nmt.py", line 967, in run_demo
mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in wrap
fn(i, *args)
File "/home/aniruddha/yanmtt/pretrain_nmt.py", line 521, in model_create_load_run_save
lprobs, labels, args.label_smoothing, ignore_index=tok.pad_token_id
File "/home/aniruddha/yanmtt/common_utils.py", line 147, in label_smoothed_nll_loss
smooth_loss.masked_fill_(pad_mask, 0.0)
RuntimeError: The expanded size of the tensor (333) must match the existing size (332) at non-singleton dimension 1. Target sizes: [8, 333, 1]. Tensor sizes: [8, 332, 1]


But when the data file has the ".hi" language extension, the code works fine.

finetuning BART for text classification tasks as seq2seq

Hi,

I trained a monolingual BART using your toolkit, and now I want to evaluate the model in NLU (natural language understanding), as we don't have any proper seq2seq dataset to evaluate it's generative capacities yet.

The idea is to evaluate the model on sequence labeling and text classification tasks, including sentence-pair classification, but to get started, I would like to evaluate it on a single text classification task, in the form of text-label pairs, like topic classification or NLI.

I think your finetuning script train_nmt.py should be enough for that, as the labels could be predicted as target sequences. Otherwise, I thought of finetuning the BART model using huggingface tools, but I don't know if any changes are needed to the model, vocab/tokenizer and config files, so I want to try your toolkit's finetuning options first, which worked fine for a paraphrasing task using my BART model.

I would like to know if using --is_summarization makes sense for this type of tasks, and if you see any other limitation or any option I should use during finetuning.

I had something like this in mind:

python3 train_nmt.py -n 1 -nr 0 -g 1 \
--is_summarization \
--model_path models/bart_topic \
--pretrained_model models/bart_base_512 \
--tokenizer_name_or_path tokenizers/mbart-bpe50k \
--train_slang xx \
--train_tlang xx \
--dev_slang xx \
--dev_tlang xx \
--train_src train.src.xx \
--train_tgt train.trg.xx \
--dev_src dev.src.xx \
--dev_tgt dev.trg.xx \
--max_src 512 \
--max_tgt 512 \
--batch_size_indicates_lines \
--batch_size 32 \
--num_batches 1000 \
--warmup_steps 100 \
--no_reload_optimizer_ctr_and_scheduler \
--lr 3e-5 \
--hard_truncate_length 1024 \
--shard_files

Where source files will have a text per line, while target files will include the corresponding labels as text.

But something weird happens during training, and I get this printed nonstop from the beginning:

Shuffling corpus: xx-xx
Finished epoch 999 for language: xx-xx
Shuffling corpus: xx-xx
Finished epoch 1000 for language: xx-xx

Here the epoch count increases by about 100 per second, which doesn't make sense for a dataset of thousands of examples. Maybe I'm not reading the files in the correct way, but the same way of reading files worked fine for a paraphrasing task before.

Thanks for your time,
Gorka

Adding WandB support

Currently, model information such as losses and gradients is saved every N steps to a local file, which is then visualized using tensorboard. WandB is the cool new kid on the block, but we must respect our elders.

Adding a flag and then placing an if-else to switch between wandb and tensorboard would be nice. If someone wants to use wandb then remember to print a message so that users specify their username and wandb workspace url via flags. Also, in case of a wandb error, ask users to check whether the wandb initialization was done properly.

This is fairly easy.
As usual, please document.
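A rough sketch of the kind of switch being asked for (the flag names are made up, and this is not wired into the actual training scripts):

    import argparse
    from torch.utils.tensorboard import SummaryWriter

    parser = argparse.ArgumentParser()
    parser.add_argument("--use_wandb", action="store_true")
    parser.add_argument("--wandb_project", default="yanmtt")  # hypothetical flags
    args, _ = parser.parse_known_args()

    if args.use_wandb:
        import wandb
        wandb.init(project=args.wandb_project)  # users should pass project/entity via flags
        log = lambda name, value, step: wandb.log({name: value}, step=step)
    else:
        writer = SummaryWriter()
        log = lambda name, value, step: writer.add_scalar(name, value, step)

    log("train/loss", 2.71, 100)  # same call site regardless of backend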

Tokenization issue when trying to pretrain a monolingual model from scratch

Hi and thanks for your amazing work @prajdabre !

I got the same error as in #2 while trying to train a monolingual BART from scratch for Basque language (eu) using an already existing BERT tokenizer.

python3 pretrain_nmt.py -n 1 -nr 0 -g 1 --model_path models/bart_model --tokenizer_name_or_path ixa-ehu/berteus-base-cased --langs eu --mono_src train.eu --encoder_layers 1 --decoder_layers 1 --encoder_attention_heads=1 --decoder_attention_heads=1 --encoder_ffn_dim=128 --decoder_ffn_dim=128 --d_model=64 --shard_files

.
.
.
 File "/home/BART/yanmtt/bartenv/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 336, in <genexpr>
    for token in tokenized_text
TypeError: _tokenize() takes 2 positional arguments but 5 were given

you can get a small monolingual text file from opus if you wanna give it a try.

I also tried to run it using my own output files (_.vocab and _.model) from the sentencepiece model I trained on my corpus. But I need to provide a config.json file for the training, and it is not clear what it should look like when you want to pretrain a new model from scratch.

Could you help me understand what I am missing to pretrain a monolingual model from scratch properly?

Thanks,

Gorka

Add support for latest version of transformers repo

Currently I have provided my own modded fork of transformers, but if someone doesn't care about the features I have added and only wants to work with the official mbart code then this should be enabled.

What this would mean is that all those other arguments I pass to the mbart config class to instantiate the object will be sent to kwargs. The main change will be minimal and most likely related to the tokenizer. In the batch creation logic, I pass some extra arguments to the tokenizer to support stochastic tokenization. The way I see it, we have a flag called --is_official_repo which, if passed, means that the official transformers repo is used. This argument will then be passed to the batching function, which won't pass the flags relevant for stochastic tokenization.

How to perform Inference using the fine-tuned model ?

Hi @prajdabre, I pre-trained a very small BART model on a new language and the pre-training is almost done.
I'm going to fine-tune the model on a downstream task and would want to perform inference using that fine-tuned model.
I can see the fine-tuning code in your repository but not the inference code.
Can you please tell me how to perform inference using the fine-tuned model?

Also, is it possible to use Huggingface Pipeline with the trained model to do the inference like the rest of the transformer models?

Thank you

Issues at decode after finetuning mBART on a monolingual seq2seq task

Hi!

I finetuned the mBART-large-cc25 model on a paraphrasing dataset for Spanish, with the following command, without any issues:

python3 train_nmt.py -n 1 -nr 0 -g 2 \
--model_path models/bartes_paraphrasES \
--is_summarization \
--pretrained_model facebook/mbart-large-cc25 \
--use_official_pretrained \
--tokenizer_name_or_path facebook/mbart-large-cc25 \
--train_slang es_XX \
--train_tlang es_XX \
--dev_slang es_XX \
--dev_tlang es_XX \
--train_src train.src.es_XX \
--train_tgt train.trg.es_XX \
--dev_src test.src.es_XX \
--dev_tgt test.trg.es_XX \
--max_src 128 \
--max_tgt 128 \
--batch_size_indicates_lines \
--batch_size 8 \
--multistep_optimizer_steps 2 \
--num_batches 40000 \
--warmup_steps 20000 \
--no_reload_optimizer_ctr_and_scheduler \
--lr 3e-5 \
--hard_truncate_length 1024 \
--eval_every 555555 \
--no_eval_save_every 555555 \
--shard_files

And then I tried to decode some examples:

python3 decode_nmt.py -n 1  -nr 0 -g 1 \
--model_path models/bartes_paraphrasES \
--slang es_XX \
--tlang es_XX \
--test_src test.src.es_XX \
--test_tgt decode_es.txt \
--tokenizer_name_or_path facebook/mbart-large-cc25 \

But I'm getting the following error:

IP address is localhost                                                                                                                                                                                    
Tokenizer is: PreTrainedTokenizer(name_or_path='facebook/mbart-large-cc25', vocab_size=250027, model_max_len=1024, is_fast=False, padding_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=True), 'additional_special_tokens': ['ar_AR', 'cs_CZ', 'de_DE', 'en_XX', 'es_XX', 'et_EE', 'fi_FI', 'fr_XX', 'gu_IN', 'hi_IN', 'it_IT', 'ja_XX', 'kk_KZ', 'ko_KR', 'lt_LT', 'lv_LV', 'my_MM', 'ne_NP', 'nl_XX', 'ro_RO', 'ru_RU', 'si_LK', 'tr_TR', 'vi_VN', 'zh_CN']})
Running DDP checkpoint example on rank 0.
Tokenizer is: PreTrainedTokenizer(name_or_path='facebook/mbart-large-cc25', vocab_size=250027, model_max_len=1024, is_fast=False, padding_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=True), 'additional_special_tokens': ['ar_AR', 'cs_CZ', 'de_DE', 'en_XX', 'es_XX', 'et_EE', 'fi_FI', 'fr_XX', 'gu_IN', 'hi_IN', 'it_IT', 'ja_XX', 'kk_KZ', 'ko_KR', 'lt_LT', 'lv_LV', 'my_MM', 'ne_NP', 'nl_XX', 'ro_RO', 'ru_RU', 'si_LK', 'tr_TR', 'vi_VN', 'zh_CN']})
Running DDP checkpoint example on rank 1.
Using positional embeddings
Using positional embeddings
Using positional embeddings
Using positional embeddings
Remapping layers from parent to child.
Final model dictionary after remapping is: odict_keys(['module.final_logits_bias', 'module.model.shared.weight', 'module.model.encoder.embed_tokens.weight', 'module.model.encoder.embed_positions.weight', ...
.
.
.
'module.model.decoder.layers.11.final_layer_norm.bias', 'module.model.decoder.layernorm_embedding.weight', 'module.model.decoder.layernorm_embedding.bias', 'module.model.decoder.layer_norm.weight', 'module.model.decoder.layer_norm.bias', 'module.lm_head.weight'])
Remapping embeddings.
Eliminating matched params with mismatched sizes from the initial model.
Eliminating module.model.shared.weight
Eliminating module.model.encoder.embed_tokens.weight
Eliminating module.model.encoder.embed_positions.weight
Eliminating module.model.encoder.embed_positions.weight
Eliminating module.model.encoder.layers.0.self_attn.k_proj.weight
Eliminating module.model.encoder.layers.0.self_attn.k_proj.bias
Eliminating module.model.encoder.layers.0.self_attn.v_proj.weight
.
.
.
Eliminating module.model.decoder.layers.5.final_layer_norm.weight
Eliminating module.model.decoder.layers.5.final_layer_norm.bias
Eliminating module.model.decoder.layernorm_embedding.weight
Eliminating module.model.decoder.layernorm_embedding.bias
Eliminating module.model.decoder.layer_norm.weight
Eliminating module.model.decoder.layer_norm.bias
Eliminating module.lm_head.weight
Remapping layers from parent to child.
Final model dictionary after remapping is: odict_keys(['module.final_logits_bias', 'module.model.shared.weight', 'module.model.encoder.embed_tokens.weight', 'module.model.encoder.embed_positions.weight', 'module.model.encoder.layers.0.self_attn.k_proj.weight'
.
.
.
'module.model.decoder.layer_norm.weight', 'module.model.decoder.layer_norm.bias', 'module.lm_head.weight'])
Remapping embeddings.
Eliminating matched params with mismatched sizes from the initial model.
Eliminating module.model.shared.weight
Eliminating module.model.encoder.embed_tokens.weight
Eliminating module.model.encoder.embed_positions.weight
Eliminating module.model.encoder.layers.0.self_attn.k_proj.weight
Eliminating module.model.encoder.layers.0.self_attn.k_proj.bias
.
.
.
Eliminating module.model.decoder.layernorm_embedding.bias
Eliminating module.model.decoder.layer_norm.weight
Eliminating module.model.decoder.layer_norm.bias
Eliminating module.lm_head.weight
Traceback (most recent call last):
  File "decode_nmt.py", line 455, in <module>
    run_demo()
  File "decode_nmt.py", line 451, in run_demo
    mp.spawn(model_create_load_decode, nprocs=args.gpus, args=(args,))         #
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/gurbizu/BART/yanmtt/decode_nmt.py", line 124, in model_create_load_decode
    model.load_state_dict(remap_embeddings_eliminate_components_and_eliminate_mismatches(model.state_dict(), remap_layers(checkpoint_dict['model'], 4, args), args), strict=True if (args.remap_encoder == "" and args.remap_decoder == "" and not args.eliminate_encoder_before_initialization and not args.eliminate_decoder_before_initialization and not args.eliminate_embeddings_before_initialization) else False) ## Modification needed if we want to load a partial model trained using multilayer softmaxing.
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1483, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for DistributedDataParallel:
        Missing key(s) in state_dict: "module.model.shared.weight", "module.model.encoder.embed_tokens.weight", "module.model.encoder.embed_positions.weight",
.
.
.
"module.model.decoder.layers.5.final_layer_norm.bias", "module.model.decoder.layernorm_embedding.weight", "module.model.decoder.layernorm_embedding.bias", "module.model.decoder.layer_norm.weight", "module.model.decoder.layer_norm.bias", "module.lm_head.weight".
Unexpected key(s) in state_dict: "module.model.encoder.layers.6.self_attn.k_proj.weight", "module.model.encoder.layers.6.self_attn.k_proj.bias", "module.model.encoder.layers.6.self_attn.v_proj.weight", "module.model.encoder.layers.6.self_attn.v_proj.bias",
.
.
.
"module.model.decoder.layers.11.final_layer_norm.weight", "module.model.decoder.layers.11.final_layer_norm.bias"

Any idea what could have caused this?
Am I doing something wrong in the finetune phase?
Or do I need to load the tokenizer in a different way in the decode phase?

Thanks,
Gorka

AttributeError: 'Seq2SeqLMOutput' object has no attribute 'additional_lm_logits'

After I follow the installation and run examples/train_mbart_model.sh, I get the below error.

Loading from checkpoint
Traceback (most recent call last):
File "pretrain_nmt.py", line 630, in
run_demo()
File "pretrain_nmt.py", line 627, in run_demo
mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/root/yanmtt/pretrain_nmt.py", line 359, in model_create_load_run_save
if mod_compute.additional_lm_logits is not None:
AttributeError: 'Seq2SeqLMOutput' object has no attribute 'additional_lm_logits'

What may be going wrong? The version of transformers I have is 4.3.2.

CUDA error while pre-training BART & how to use --hard_truncate_length

Hi again,

After getting the NAN loss error from the previews issue, I launched another training during the weekend:

python3 pretrain_nmt.py -n 1 -nr 0 -g 2 --model_path models/bart_base_512 \
--tokenizer_name_or_path tokenizers/mbart-bpe50k \
--langs xx --mono_src data/train.xx \
--batch_size 4096 \
--multistep_optimizer_steps 16 \
--num_batches 500000 \
--warmup_steps 16000 \
--encoder_layers 6 \
--decoder_layers 6 \
--max_length 512 \
--encoder_attention_heads 12 \
--decoder_attention_heads 12 \
--decoder_ffn_dim 3072 \
--encoder_ffn_dim 3072 \
--d_model 768 \
--fp16

With which I got the following error after 11K steps:

11920 2.9054422
11930 2.8778658
11940 2.9062994
11950 2.906765
11960 2.8594751
11970 2.8594935
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stack trace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

(+ many error lines)

I don't know what caused this, so I will run next trainings with CUDA_LAUNCH_BLOCKING=1 activated.

But I also want to use the --hard_truncate_length argument in case the problem is caused by the length of the sequences.

I'm not sure I understand exactly what the --hard_truncate_length argument does.
Let's say that I want to train a model with --max_length=128 and --batch_size=4096... if I understood correctly, I should set --hard_truncate_length to 4096 too, right?

Thanks for your time.
Regards,
Gorka

at train_nmt.py I get RuntimeError: The expanded size of the tensor (30) must match the existing size (25) at non-singleton dimension 1.

Hi!

I trained a monolingual BART using the following command:

python3 pretrain_nmt.py -n 1 -nr 0 -g 2 --model_path models/bart_base \
--tokenizer_name_or_path tokenizers/mbart-bpe50k \
--langs xx --mono_src data/train.xx \
--batch_size 4096 \
--multistep_optimizer_steps 4 \
--num_batches 1800000 \
--warmup_steps 16000 \
--encoder_layers 6 \
--decoder_layers 6 \
--max_length 128 \
--encoder_attention_heads 12 \
--decoder_attention_heads 12 \
--decoder_ffn_dim 3072 \
--encoder_ffn_dim 3072 \
--d_model 768 \
--lr 1e-4 \
--hard_truncate_length 1024 \
--shard_files

and now I would like to finetune it on a seq2seq task (paraphrasing) with a small dataset, to see if the model learns something in the pretraining:

python3 train_nmt.py -n 1 -nr 0 -g 1 \
--model_path models/bart__base_ft \
--pretrained_model models/bart_base \
--tokenizer_name_or_path tokenizers/mbart-bpe50k \
--train_slang src \
--train_tlang trg \
--dev_slang src \
--dev_tlang trg \
--train_src data/train.src \
--train_tgt data/train.trg \
--dev_src data/test.src \
--dev_tgt data/test.trg \
--max_src 128 \
--max_tgt 128 \
--batch_size_indicates_lines \
--batch_size 32 \
--num_batches 1000 \
--encoder_layers 6 \
--decoder_layers 6 \
--encoder_attention_heads 12 \
--decoder_attention_heads 12 \
--decoder_ffn_dim 3072 \
--encoder_ffn_dim 3072 \
--d_model 768 \
--lr 3e-5 \
--hard_truncate_length 1024 \
--shard_files

and I get the following error:

...
Using label smoothing of 0.1
Using gradient clipping norm of 1.0
Using softmax temperature of 1.0
Masking ratio: 0.3
Training for: ['src-trg']
Corpora stats: {'src-trg': 568}
Shuffling corpus: src-trg
Running eval on dev set(s)
BLEU score using sacrebleu after 450000 iterations is 33.4095177159796 for language pair src-trg
New peak reached for src-trg . Saving.
Global BLEU score using sacrebleu after 450000 iterations is: 33.4095177159796
New peak reached. Saving.
Saving the model
Loading from checkpoint
Traceback (most recent call last):
  File "train_nmt.py", line 884, in <module>
    run_demo()
  File "train_nmt.py", line 881, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,train_files, dev_files, quit_condition))         #
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/gurbizu/BART/yanmtt/train_nmt.py", line 513, in model_create_load_run_save
    lprobs, labels, args.label_smoothing, ignore_index=tok.pad_token_id
  File "/home/gurbizu/BART/yanmtt/common_utils.py", line 82, in label_smoothed_nll_loss
    smooth_loss.masked_fill_(pad_mask, 0.0)
RuntimeError: The expanded size of the tensor (27) must match the existing size (22) at non-singleton dimension 1.  Target sizes: [32, 27, 1].  Tensor sizes: [32, 22, 1]

Any idea what could cause this?

Can it be used to pre-train a BART model?

Hi~
I want to find a way to pre-train a BART model on my own corpus. The pre-training method should add noise to the input and reconstruct it, just as the original BART pre-training does. I couldn't find a specific way to pre-train a plain BART model, but I came across this mBART pre-training method. Can it be used to pre-train a BART model?

Looking forward to your reply.

Questions about BART pretraining hyperparameters.

Hei!

I want to train a BART base model for my language. I'm trying to use the same hyperparameters as BART base for English, and I have some doubts about a few of the arguments I need to set in your toolkit.

--batch_size
In your toolkit this is measured in tokens. Do you know whether the 8K batch size reported in the original BART paper refers to sequences or tokens?
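
Just to check that I'm reasoning about this the right way, here is the back-of-the-envelope calculation I have in mind (the average length is a made-up number):

batch_size_tokens = 4096      # --batch_size, measured in subword tokens
avg_sentence_length = 32      # hypothetical average sentence length in tokens
sentences_per_batch = batch_size_tokens // avg_sentence_length
print(sentences_per_batch)    # -> 128, so a 4096-token batch would hold ~128 sentences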

I'm also not sure what exactly the difference is between --max_length and --max_src/tgt_length, and how I should use them.

Thanks

Improve examples

Currently, the examples are rather underspecified and under-documented. Many a soul has reached out to me through the ether (also called the internet) asking for clarification. What you would need to do is think of a use case, a hack, or anything else you feel can be done with YANMTT, and add the command to the examples.

Document pls.

CPU support

Currently, YANMTT assumes that only GPUs are to be used, but some people might want to use a massive CPU cluster. Add a flag and then modify the relevant parts of the code.
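
A minimal sketch of the kind of change that is needed, assuming a new --cpu flag (none of these names exist in the code yet):

import torch

def get_device(args, local_rank):
    # Hypothetical helper: fall back to CPU when --cpu is passed or no GPU is visible.
    if getattr(args, "cpu", False) or not torch.cuda.is_available():
        return torch.device("cpu")
    return torch.device("cuda", local_rank)

# The setup code would then use model.to(get_device(args, rank)) instead of hard-coded
# .cuda() calls, and distributed training on CPUs would need the "gloo" backend
# in place of "nccl".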

Pretraining Bart on Single corpus

Hi,

First thanks for the work on this repo !

I'm continuing to pre-train BART on my own English corpus "train_fineshed.txt", but the Python arguments don't seem to work:
"file not found error: ***/train_fineshed.txt.01"

My Python command is as follows:

python pretrain_nmt.py -n 1 -nr 0 -g 2 --pretrained_model facebook/bart-base --use_official_pretrained --tokenizer_name_or_path facebook/bart-base --is_summarization --warmup_steps 500 --save_intermediate_checkpoints --mono_src /home/WwhStuGrp/yyfwwhstu16/yanmtt/dataset/pubmed/pubmed-dataset/train_fineshed.txt --monolingual_domains 1 --train_domains 1

Can you point out my mistake in using your toolkit?

Thank you for your kind help!

ImportError: cannot import name 'AutoTokenizer' from 'transformers' (unknown location)

Hi,
I followed the installation steps, but I get the following error when trying to run the pre-training command:
ImportError: cannot import name 'AutoTokenizer' from 'transformers' (unknown location)
P.S.: I was able to install this library and do the pre-training a few weeks back, but when I tried to do it again I got the above error.

Thanks for your work.

dependency conflicts in requirements.txt

Steps to reproduce the error

conda create --name indicbart python=3.6
conda activate indicbart
pip install -r requirements.txt

The error

The conflict is caused by: 
The user requested scipy==1.5.4 
imagehash 4.2.1 depends on scipy 
missingno 0.5.0 depends on scipy 
pandas-profiling 3.1.0 depends on scipy>=1.4.1 
phik 0.12.0 depends on scipy>=1.5.2 
seaborn 0.11.1 depends on scipy>=1.0 
tensor2tensor 1.14.0 depends on scipy 
tensorflow-gpu 2.3.0 depends on scipy==1.4.1

Can you please advise on how to resolve these dependency issues?

Pre-training hangs

I ran bash examples/create_tokenizer.sh and then the example pre-training script (which calls pretrain_nmt.py), but the latter shows

IP address is localhost
Monolingual training files are: {'hi': 'examples/data/train.hi', 'en': 'examples/data/train.en', 'vi': 'examples/data/train.vi'}
Sharding files into 1 parts
For language: hi  the total number of lines are: 18088 and number of lines per shard are: 18088
File for language hi has been sharded.
For language: en  the total number of lines are: 18088 and number of lines per shard are: 18088
File for language en has been sharded.
For language: vi  the total number of lines are: 18088 and number of lines per shard are: 18088
File for language vi has been sharded.
Sharding files into 1 parts

and then it hangs without showing anything else. If I press ^C to cancel, the following traceback is shown:

  File "pretrain_nmt.py", line 888, in <module>
    run_demo()
  File "pretrain_nmt.py", line 885, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,))         #
  File "/home/user/.conda/envs/yanmtt/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/user/.conda/envs/yanmtt/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/user/.conda/envs/yanmtt/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 101, in join
    timeout=timeout,
  File "/home/user/.conda/envs/yanmtt/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/home/user/.conda/envs/yanmtt/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)

I am running YANMTT in a Docker container on a machine with an A100 40 GB GPU. The only dependency for which I am using a newer version is torch, as the version in requirements.txt is too old for my GPU.

Some question of pretrain_nmt.py

I have some confusion about pretrain_nmt.py.

I just saw the following among the first few lines of your pretrain_nmt.py:

from transformers import AutoTokenizer, MBartTokenizer, MBart50Tokenizer, BartTokenizer, AlbertTokenizer
from transformers import MBartForConditionalGeneration, BartForConditionalGeneration, MBartConfig, get_linear_schedule_with_warmup

As I understand it, you have rewritten some code, such as some classes in modeling_mbart.py under https://github.com/prajdabre/yanmtt/tree/main/transformers/src/transformers/models/mbart, in order to support further pre-training based on mBART.

So why don't you use the functions in your new modeling_mbart.py? I mean, why don't you import the classes directly from that modeling_mbart.py (https://github.com/prajdabre/yanmtt/tree/main/transformers/src/transformers/models/mbart)?
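
For context, this is how I checked which transformers installation my imports actually resolve to (I assume the repository's modified copy is supposed to be the one that gets installed):

import transformers

print(transformers.__file__)     # path of the transformers package that gets imported
print(transformers.__version__)  # its version, e.g. the one pinned by the repository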

Mixtures of denoisers

Currently, I have implemented the mBART (span denoising) and mT5 (span prediction) pre-training approaches, but according to the UL2 paper (https://arxiv.org/pdf/2205.05131.pdf) a more comprehensive mixture of denoisers would help a lot.

Currently, you may use either the mT5 or the mBART style, but I would like to enable the user to specify a comma-separated list of denoising objectives and a comma-separated list of the probabilities of using these objectives, along with the requisite hyperparameters for each objective. If this is done we can play with some cool stuff.
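
As a rough starting point, the objective selection could look something like the sketch below; the argument format mirrors the comma-separated lists described above, but nothing here exists in the code yet:

import random

def pick_denoising_objective(objectives="mbart,mt5", probabilities="0.5,0.5"):
    # Hypothetical: parse the comma-separated specs and sample one objective per batch.
    names = objectives.split(",")
    probs = [float(p) for p in probabilities.split(",")]
    assert len(names) == len(probs), "need one probability per objective"
    return random.choices(names, weights=probs, k=1)[0]

# Each objective would then carry its own noising hyperparameters (masking ratio,
# mean span length, etc.), roughly in the spirit of the UL2 mixture of denoisers.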

Improve documentation

Exactly what the title says. Find an undocumented part of the code and document it. I will give 1 potato per pull request.

RuntimeError: The expanded size of the tensor (22) must match the existing size (21) at non-singleton dimension 1. Target sizes: [178, 22, 1]. Tensor sizes: [178, 21, 1]

error log:

Training from scratch
/home/bhandari1/.local/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:216: UserWarning: Please also save or load the state of the optimizer when saving or loading the scheduler.
warnings.warn(SAVE_STATE_WARNING, UserWarning)
/home/bhandari1/.local/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:234: UserWarning: Please also save or load the state of the optimizer when saving or loading the scheduler.
warnings.warn(SAVE_STATE_WARNING, UserWarning)
Using label smoothing of 0.1
Using gradient clipping norm of 1.0
Using softmax temperature of 1.0
Masking ratio: 0.3
Training for: ['hi']
Shuffling corpus!
Finished epoch 1 for language: hi
Shuffling corpus!
Finished epoch 2 for language: hi
Shuffling corpus!
Finished epoch 3 for language: hi
Shuffling corpus!
Finished epoch 4 for language: hi
Shuffling corpus!
Finished epoch 5 for language: hi
Shuffling corpus!
Saving the model
Loading from checkpoint
Traceback (most recent call last):
  File "pretrain_nmt.py", line 990, in <module>
    run_demo()
  File "pretrain_nmt.py", line 987, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files)) #
  File "/home/bhandari1/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/bhandari1/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/bhandari1/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/bhandari1/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/bhandari1/yanmtt/pretrain_nmt.py", line 535, in model_create_load_run_save
    lprobs, labels, args.label_smoothing, ignore_index=tok.pad_token_id
  File "/home/bhandari1/yanmtt/common_utils.py", line 147, in label_smoothed_nll_loss
    smooth_loss.masked_fill_(pad_mask, 0.0)
RuntimeError: The expanded size of the tensor (22) must match the existing size (21) at non-singleton dimension 1. Target sizes: [178, 22, 1]. Tensor sizes: [178, 21, 1]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

Hello,

I am trying to further pre-train the official BARThez model (a French BART) checkpoint available at moussaKam/barthez with the denoising task.

The command used was the following:

export CUDA_VISIBLE_DEVICES=0
time python pretrain_nmt.py -n 1 -nr 0  -g 1 --use_official_pretrained --pretrained_model moussaKam/barthez --tokenizer_name_or_path moussaKam/barthez  --model_path moussaKam/barthez  --pretrained_tokenizer_name_or_path moussaKam/barthez  --langs fr  --mono_src /data/rali6/Tmp/salaunol/_NEXT/a21/fpt/input/fpt_input_toy_train.fr   --fp16  --shard_files    --num_batches 16

My environment:

Package                 Version
----------------------- -----------
absl-py                 1.0.0
astunparse              1.6.3
backcall                0.2.0
bleach                  1.5.0
cachetools              4.2.4
certifi                 2021.10.8
chardet                 3.0.4
charset-normalizer      2.0.12
click                   8.0.4
colorama                0.4.4
cycler                  0.11.0
dataclasses             0.6
decorator               5.1.1
filelock                3.0.12
Flask                   2.0.3
Flask-Cors              3.0.10
flask-swagger-ui        3.20.9
gast                    0.3.3
google-auth             1.35.0
google-auth-oauthlib    0.4.6
google-pasta            0.2.0
grpcio                  1.44.0
gunicorn                19.9.0
h5py                    2.10.0
html5lib                0.9999999
idna                    2.8
importlib-metadata      4.8.3
ipython                 7.16.1
ipython-genutils        0.2.0
itsdangerous            2.0.1
jedi                    0.18.1
Jinja2                  3.0.3
joblib                  1.1.0
Keras-Preprocessing     1.1.2
kiwisolver              1.3.1
Markdown                3.3.6
MarkupSafe              2.0.1
matplotlib              3.3.4
mixture-of-experts      0.2.1
nltk                    3.6.7
nose                    1.3.7
numpy                   1.18.5
oauthlib                3.2.0
opt-einsum              3.3.0
packaging               20.9
pandas                  1.1.5
parso                   0.8.3
pexpect                 4.8.0
pickleshare             0.7.5
Pillow                  8.4.0
pip                     22.0.4
portalocker             2.0.0
prefetch-generator      1.0.1
prompt-toolkit          3.0.29
protobuf                3.19.4
ptyprocess              0.7.0
pyasn1                  0.4.8
pyasn1-modules          0.2.8
Pygments                2.11.2
pyparsing               3.0.8
python-dateutil         2.8.2
pytz                    2022.1
regex                   2022.3.15
requests                2.21.0
requests-oauthlib       1.3.0
rouge-score             0.0.4
rsa                     4.8
sacrebleu               1.5.1
sacremoses              0.0.43
scipy                   1.4.1
sentencepiece           0.1.95
setuptools              58.3.0
six                     1.16.0
tensorboard             2.3.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit  1.8.0
tensorflow-estimator    2.3.0
tensorflow-gpu          2.3.0
termcolor               1.1.0
tokenizers              0.10.1
torch                   1.7.1+cu110
torchaudio              0.7.2
torchvision             0.8.2+cu110
tqdm                    4.57.0
traitlets               4.3.3
transformers            4.3.2
typing_extensions       4.1.1
urllib3                 1.24.3
uuid                    1.30
validate-email          1.3
wcwidth                 0.2.5
Werkzeug                2.0.3
wheel                   0.37.0
wrapt                   1.14.0
zipp                    3.6.0

I also made some changes in pretrain_nmt.py so that the barthez checkpoint is loaded properly with the classes suggested at https://huggingface.co/moussaKam/barthez (top-right button "Use in Transformers").
The following error occurred, but the cause is unclear. Any ideas?

pretrain_nmt.py:273: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if not args.no_reload_optimizer_ctr_and_scheduler and args.remap_encoder is '' and args.remap_decoder is '' and not args.eliminate_encoder_before_initialization and not args.eliminate_decoder_before_initialization and not args.eliminate_embeddings_before_initialization: ## Do not load optimizers, ctr and schedulers when remapping or resuming training.
IP address is localhost
Monolingual training files are: {'fr': '/data/rali6/Tmp/salaunol/_NEXT/a21/fpt/input/fpt_input_toy_train.fr'}
(the same SyntaxWarning and source line are printed a few more times)
Sharding files into 1 parts
For language: fr  the total number of lines are: 8452 and number of lines per shard are: 8452
File for language fr has been sharded.
Sharding files into 1 parts
Traceback (most recent call last):
  File "pretrain_nmt.py", line 919, in <module>
    run_demo()
  File "pretrain_nmt.py", line 916, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,))         #
  File "/u/salaunol/Documents/_2022_hiver/yanmtt/py38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/u/salaunol/Documents/_2022_hiver/yanmtt/py38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/u/salaunol/Documents/_2022_hiver/yanmtt/py38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/u/salaunol/Documents/_2022_hiver/yanmtt/py38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/u/salaunol/Documents/_2022_hiver/yanmtt/pretrain_nmt.py", line 89, in model_create_load_run_save
    tok = AutoTokenizer.from_pretrained(args.tokenizer_name_or_path, use_fast=False)
  File "/u/salaunol/Documents/_2022_hiver/yanmtt/py38/lib/python3.8/site-packages/transformers-4.3.2-py3.8.egg/transformers/models/auto/tokenization_auto.py", line 362, in from_pretrained
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/u/salaunol/Documents/_2022_hiver/yanmtt/py38/lib/python3.8/site-packages/transformers-4.3.2-py3.8.egg/transformers/models/auto/configuration_auto.py", line 368, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/u/salaunol/Documents/_2022_hiver/yanmtt/py38/lib/python3.8/site-packages/transformers-4.3.2-py3.8.egg/transformers/configuration_utils.py", line 427, in get_config_dict
    config_dict = cls._dict_from_json_file(resolved_config_file)
  File "/u/salaunol/Documents/_2022_hiver/yanmtt/py38/lib/python3.8/site-packages/transformers-4.3.2-py3.8.egg/transformers/configuration_utils.py", line 510, in _dict_from_json_file
    text = reader.read()
  File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

Error in BART Monolingual Pre-training.

I am getting the following error while training on the monolingual (Hindi) corpus. I successfully trained the tokenizer on the same corpus using create_autotokenizer.sh.

Error Logs:
Shuffling corpus!
Keyword arguments {'sample': False, 'nbest': 64, 'alpha_or_dropout': 0.1} not recognized.
(the line above is repeated many more times)
Saving the model
Loading from checkpoint
Traceback (most recent call last):
  File "pretrain_nmt.py", line 989, in <module>
    run_demo()
  File "pretrain_nmt.py", line 986, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/workspace/data/yanmtt/pretrain_nmt.py", line 530, in model_create_load_run_save
    mod_compute = model(input_ids=input_ids, attention_mask=input_masks, decoder_input_ids=decoder_input_ids, output_hidden_states=args.distillation, output_attentions=args.distillation, label_mask=label_mask if args.num_domains_for_domain_classifier > 1 else None) ## Run the model and get logits.
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
TypeError: forward() got an unexpected keyword argument 'label_mask'

Exception: process 0 terminated with signal SIGSEGV

Hi, I ran into a tricky problem with pretrain_nmt.py.

My command:

CUDA_VISIBLE_DEVICES=3 python pretrain_nmt.py -n 1 -nr 0 -g 1 --pretrained_model facebook/bart-base --use_official_pretrained --tokenizer_name_or_path facebook/bart-base --is_summarization --warmup_steps 500 --save_intermediate_checkpoints --mono_src /home/WwhStuGrp/yyfwwhstu16/yanmtt/dataset/pubmed/pubmed-dataset/train_fineshed.txt --monolingual_domains 1 --train_domains 1 --shard_files --batch_size 1024

Here is the traceback:

 File "pretrain_nmt.py", line 968, in <module>
    run_demo()
  File "pretrain_nmt.py", line 965, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,))         #
  File "/home/WwhStuGrp/yyfwwhstu16/anaconda3/envs/scibart/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/WwhStuGrp/yyfwwhstu16/anaconda3/envs/scibart/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/WwhStuGrp/yyfwwhstu16/anaconda3/envs/scibart/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGSEGV

I found some suggested fixes, but they didn't seem to work, including this one:
facebookresearch/fairseq#1720 (comment)

Any advice or solution?
Thanks again for your work on this repo!

Cleaning the code for calling the model and computing loss

The two core files, pretrain_nmt.py and train_nmt.py, contain a lot of repeated, monolithic code. For example, the loss-computation code is mostly duplicated between them.

Desired cleanup:

  1. Identify the common code and move it under functions with appropriate arguments. These functions can go under common_utils.py
  2. Restructure the code so that the code becomes more modular.
  3. Document all the functions so that users have an easier time following it.

Bonus meme: Clean up the if-else structures to make them look better if possible.
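
For instance, the duplicated model-call plus loss-computation block could be pulled into a single helper in common_utils.py, roughly along these lines (an illustrative sketch with made-up names, not the current code):

import torch.nn.functional as F

def compute_batch_loss(model, tok, batch, label_smoothing=0.1):
    # Run the model on one prepared batch and return a label-smoothed loss, so that
    # pretrain_nmt.py and train_nmt.py no longer carry their own copies of this logic.
    outputs = model(input_ids=batch["input_ids"],
                    attention_mask=batch["input_mask"],
                    decoder_input_ids=batch["decoder_input_ids"])
    lprobs = F.log_softmax(outputs.logits, dim=-1)
    labels = batch["labels"].unsqueeze(-1)
    nll_loss = -lprobs.gather(dim=-1, index=labels)      # per-token negative log-likelihood
    smooth_loss = -lprobs.sum(dim=-1, keepdim=True)      # uniform smoothing term
    pad_mask = labels.eq(tok.pad_token_id)
    nll_loss.masked_fill_(pad_mask, 0.0)
    smooth_loss.masked_fill_(pad_mask, 0.0)
    eps_i = label_smoothing / lprobs.size(-1)
    return (1.0 - label_smoothing) * nll_loss.sum() + eps_i * smooth_loss.sum()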
