
prajdabre / yanmtt


Yet Another Neural Machine Translation Toolkit

License: MIT License

Python 76.61% Shell 0.33% Makefile 0.01% Dockerfile 0.04% Jsonnet 0.01% CSS 9.03% JavaScript 8.85% Jupyter Notebook 4.73% HTML 0.40% Procfile 0.01%

yanmtt's People

Contributors

dipteshkanojia, jaygala24, prajdabre


yanmtt's Issues

How to perform inference using the fine-tuned model?

Hi @prajdabre, I pre-trained a very small BART model on a new language and the pre-training is almost done.
I'm going to fine-tune the model on a downstream task and would want to perform inference using that fine-tuned model.
I can see the fine-tuning code in your repository but not the inference code.
Can you please tell me how to perform inference using the fine-tuned model?

Also, is it possible to use the Hugging Face pipeline with the trained model to do inference, as with the rest of the transformer models?
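For reference, a minimal inference sketch in plain Hugging Face code, assuming the fine-tuned weights have been exported to a Hugging Face-compatible directory (the path my_finetuned_model below is hypothetical, and mBART-style checkpoints may additionally need forced_bos_token_id set to the target-language token):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_dir = "my_finetuned_model"  # hypothetical path to an HF-format export of the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)
model.eval()

inputs = tokenizer("An example source sentence.", return_tensors="pt")
# For mBART-style checkpoints, forced_bos_token_id may need to be set to the target-language id.
generated = model.generate(**inputs, num_beams=4, max_length=128, early_stopping=True)
print(tokenizer.decode(generated[0], skip_special_tokens=True))

A model exported this way should also work with pipeline("text2text-generation", model=model, tokenizer=tokenizer); whether a YANMTT checkpoint can be loaded directly like this depends on how it was saved.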

Thank you

RuntimeError: The expanded size of the tensor (22) must match the existing size (21) at non-singleton dimension 1. Target sizes: [178, 22, 1]. Tensor sizes: [178, 21, 1]

error log:

Training from scratch
/home/bhandari1/.local/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:216: UserWarning: Please also save or load the state of the optimizer when saving or loading the scheduler.
warnings.warn(SAVE_STATE_WARNING, UserWarning)
/home/bhandari1/.local/lib/python3.6/site-packages/torch/optim/lr_scheduler.py:234: UserWarning: Please also save or load the state of the optimizer when saving or loading the scheduler.
warnings.warn(SAVE_STATE_WARNING, UserWarning)
Using label smoothing of 0.1
Using gradient clipping norm of 1.0
Using softmax temperature of 1.0
Masking ratio: 0.3
Training for: ['hi']
Shuffling corpus!
Finished epoch 1 for language: hi
Shuffling corpus!
Finished epoch 2 for language: hi
Shuffling corpus!
Finished epoch 3 for language: hi
Shuffling corpus!
Finished epoch 4 for language: hi
Shuffling corpus!
Finished epoch 5 for language: hi
Shuffling corpus!
Saving the model
Loading from checkpoint
Traceback (most recent call last):
File "pretrain_nmt.py", line 990, in
run_demo()
File "pretrain_nmt.py", line 987, in run_demo
mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files)) #
File "/home/bhandari1/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/bhandari1/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/home/bhandari1/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/bhandari1/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in wrap
fn(i, *args)
File "/home/bhandari1/yanmtt/pretrain_nmt.py", line 535, in model_create_load_run_save
lprobs, labels, args.label_smoothing, ignore_index=tok.pad_token_id
File "/home/bhandari1/yanmtt/common_utils.py", line 147, in label_smoothed_nll_loss
smooth_loss.masked_fill_(pad_mask, 0.0)
RuntimeError: The expanded size of the tensor (22) must match the existing size (21) at non-singleton dimension 1. Target sizes: [178, 22, 1]. Tensor sizes: [178, 21, 1]
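For context, the failing call is in common_utils.py's label_smoothed_nll_loss. A minimal sketch of the standard fairseq-style implementation (an approximation, not necessarily this repo's exact code) shows where the shapes must agree: pad_mask is built from the labels and broadcast against the log-probabilities, so the error above indicates the labels have sequence length 22 while the model output has length 21.

def label_smoothed_nll_loss(lprobs, target, epsilon, ignore_index=None):
    # lprobs: (batch, seq_len, vocab) log-probabilities; target: (batch, seq_len) gold token ids.
    if target.dim() == lprobs.dim() - 1:
        target = target.unsqueeze(-1)
    nll_loss = -lprobs.gather(dim=-1, index=target)    # log-prob of each gold token
    smooth_loss = -lprobs.sum(dim=-1, keepdim=True)    # uniform smoothing term over the vocabulary
    if ignore_index is not None:
        pad_mask = target.eq(ignore_index)             # target and lprobs must have the same seq_len here
        nll_loss.masked_fill_(pad_mask, 0.0)
        smooth_loss.masked_fill_(pad_mask, 0.0)
    nll_loss, smooth_loss = nll_loss.sum(), smooth_loss.sum()
    eps_i = epsilon / lprobs.size(-1)
    return (1.0 - epsilon) * nll_loss + eps_i * smooth_loss, nll_loss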

Continue training on pre-trained BART model

Hi,

First, thanks for the work on this repo!

Now, I have some quite specific requirements for training a BART model, and I have seen several of your comments (on Fairseq and/or Hugging Face) pointing here.

Before diving deep into your code, I'm curious how easily I could use it for my needs. I want to:

  • use a pre-trained mBART (available on fairseq/Hugging Face, specifically the BARThez model)
  • continue monolingual training with the BART objective, i.e. denoising.

Most code/script examples are aimed at fine-tuning the model, which usually excludes the denoising part.
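For what it's worth, a very rough sketch of continued denoising (text-infilling) training in plain Hugging Face code, independent of YANMTT. The masking below replaces one contiguous word span with a single <mask> token and is only an approximation of the full mBART noising (which masks several spans and permutes sentences); the model name and masking ratio are placeholders.

import random
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "moussaKam/barthez"  # placeholder: any BART/mBART-style seq2seq checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def infill_noise(sentence, mask_ratio=0.3):
    # Replace one random contiguous span of words with a single <mask> token (rough BART-style infilling).
    words = sentence.split()
    span_len = max(1, int(len(words) * mask_ratio))
    start = random.randrange(0, len(words) - span_len + 1)
    return " ".join(words[:start] + [tok.mask_token] + words[start + span_len:])

sentence = "Une phrase d'exemple pour l'entrainement par debruitage."
batch = tok(infill_noise(sentence), return_tensors="pt", truncation=True, max_length=128)
labels = tok(sentence, return_tensors="pt", truncation=True, max_length=128).input_ids

out = model(**batch, labels=labels)  # cross-entropy of reconstructing the clean sentence
out.loss.backward()
optimizer.step()
optimizer.zero_grad()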

Thanks for any insights!

GPU Consumption keeps on increasing

Hi,
I started training the model with the following parameters:
python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --langs hi_IN --batch_size_indicates_lines --pretrained_model "facebook/mbart-large-50" --model_path "facebook/mbart-large-50" --tokenizer_name_or_path "facebook/mbart-large-50" --mono_src "sans_seq2seq/cleaned_Sanskrit_text_for_LM.txt" --shard_files --batch_size 2

It starts training; however, after a few hours it crashes due to OOM.
Monitoring the GPU, I found that GPU memory consumption keeps increasing.

GPU Memory is 48GB.

Can you please tell me what could cause this?
Thanks

Error when trying to pretrain with language extensions other than hi

The command we are using:
python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs kn --mono_src /home/aniruddha/all_data/train.kn --batch_size 8 --batch_size_indicates_lines --shard_files --model_path aibharat/IndicBART/model --port 7878


Traceback (most recent call last):
File "pretrain_nmt.py", line 970, in
run_demo()
File "pretrain_nmt.py", line 967, in run_demo
mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in wrap
fn(i, *args)
File "/home/aniruddha/yanmtt/pretrain_nmt.py", line 521, in model_create_load_run_save
lprobs, labels, args.label_smoothing, ignore_index=tok.pad_token_id
File "/home/aniruddha/yanmtt/common_utils.py", line 147, in label_smoothed_nll_loss
smooth_loss.masked_fill_(pad_mask, 0.0)
RuntimeError: The expanded size of the tensor (333) must match the existing size (332) at non-singleton dimension 1. Target sizes: [8, 333, 1]. Tensor sizes: [8, 332, 1]


But when the data file has the ".hi" language extension, the code works fine.

CUDA error while pre-training BART & how to use --hard_truncate_length

Hi again,

After getting the NaN loss error from the previous issue, I launched another training run over the weekend:

python3 pretrain_nmt.py -n 1 -nr 0 -g 2 --model_path models/bart_base_512 \
--tokenizer_name_or_path tokenizers/mbart-bpe50k \
--langs xx --mono_src data/train.xx \
--batch_size 4096 \
--multistep_optimizer_steps 16 \
--num_batches 500000 \
--warmup_steps 16000 \
--encoder_layers 6 \
--decoder_layers 6 \
--max_length 512 \
--encoder_attention_heads 12 \
--decoder_attention_heads 12 \
--decoder_ffn_dim 3072 \
--encoder_ffn_dim 3072 \
--d_model 768 \
--fp16

I got the following error after 11K steps:

11920 2.9054422
11930 2.8778658
11940 2.9062994
11950 2.906765
11960 2.8594751
11970 2.8594935
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stack trace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

(+ many error lines)

I don't know what caused this, so I will run the next training runs with CUDA_LAUNCH_BLOCKING=1 enabled.

I also want to use the --hard_truncate_length argument in case the problem is caused by the length of the sequences, but I'm not sure I understand exactly what --hard_truncate_length does.
Let's say I want to train a model with --max_length=128 and --batch_size=4096: if I understood correctly, I should set --hard_truncate_length to 4096 too, right?

Thanks for your time.
Regards,
Gorka

Some problem with loss

Hi, after training for 1.5 million steps with the settings from issue #39,
I checked the loss in TensorBoard, and the curve looks as follows:
[TensorBoard loss-curve screenshot]

Referring to run_train.log, the printed loss has been growing, from around 2 to around 6 over these 1.5 million steps.
I'm wondering whether this is a reasonable training loss?

Binary executables for all python scripts

Currently, if you want to run a command it has to be "python [script] [arguments]".
Someone told me that it would be cooler if people could do the same via a binary, like fairseq does (think fairseq-train or fairseq-preprocess).
So if any of you brave souls want to contribute to this, feel free.
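A minimal setuptools sketch of how such console scripts could be wired up (the entry-point targets reuse the existing run_demo() functions; the package metadata and command names are suggestions, not existing project configuration):

# setup.py -- sketch only; package metadata and command names are placeholders.
from setuptools import setup, find_packages

setup(
    name="yanmtt",
    version="0.1",
    packages=find_packages(),
    py_modules=["pretrain_nmt", "train_nmt", "decode_nmt"],
    entry_points={
        "console_scripts": [
            # Each entry maps a shell command to a module-level function.
            "yanmtt-pretrain=pretrain_nmt:run_demo",
            "yanmtt-train=train_nmt:run_demo",
            "yanmtt-decode=decode_nmt:run_demo",
        ]
    },
)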

ImportError: cannot import name 'AutoTokenizer' from 'transformers' (unknown location)

Hi,
I followed the installation steps, but I get the following error when trying to run the pre-training command.
ImportError: cannot import name 'AutoTokenizer' from 'transformers' (unknown location)
P.S.: I was able to install this library and do the pre-training a few weeks back, but when I tried to do it again I got the above error.

Thanks for your work.

Improve examples

Currently, the examples are rather underspecified and under-documented. Many a soul has reached out to me through the ether (also called the internet) asking me for clarification. What you would need to do is think of a use-case, a hack or anything that you feel can be done with YANMTT and add the command to the examples.

Document pls.

Trying to pretrain an mBART model.

Hi,
I'm trying to pre-train an mBART model using the following parameters:
!python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --fp16 --pretrained_model "facebook/mbart-large-50" --model_path "facebook/mbart-large-50" --tokenizer_name_or_path "facebook/mbart-large-50" --mono_src "/content/yanmtt/cleaned_Sanskrit_text_for_LM.txt" --shard_files --batch_size 16

I'm getting this error.

Using label smoothing of 0.1
Using gradient clipping norm of 1.0
Using softmax temperature of 1.0
Masking ratio: 0.3
Training for: ['']
Shuffling corpus!
Zero size batch due to an abnormal example. Skipping empty batch.
Zero size batch due to an abnormal example. Skipping empty batch.
Zero size batch due to an abnormal example. Skipping empty batch.
Saving the model
Loading from checkpoint
Traceback (most recent call last):
File "pretrain_nmt.py", line 888, in
run_demo()
File "pretrain_nmt.py", line 885, in run_demo
mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/content/yanmtt/pretrain_nmt.py", line 488, in model_create_load_run_save
lprobs, labels, args.label_smoothing, ignore_index=tok.pad_token_id
File "/content/yanmtt/common_utils.py", line 130, in label_smoothed_nll_loss
nll_loss = -lprobs.gather(dim=-1, index=target)
RuntimeError: Size does not match at dimension 1 expected index [1, 13, 1] to be smaller than src [1, 12, 250054] apart from dimension 2

Using masked inputs at inference time

I am considering using YANMTT to train my own BART model. However, instead of using it as the initial model for a subsequent fine-tuning process, I am interested in using the BART model itself to generate alternative versions of the input sentence. To do this, I would like to mask a percentage of the words in a sentence at inference time and let the model generate a variation of it via beam search decoding:

  • Original sentence: Mike goes to the bookstore on Thursday
  • Possible masked input sentence: <mask> goes to the bookstore <mask>
  • Possible model output: Jerry happily goes to the bookstore with his friends

Can this be easily done with YANMTT? I am trying to have my own model for the generation of synthetic samples discussed in the paper "Detecting Hallucinated Content in Conditional Neural Sequence Generation" (section 3.1).
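For reference, this kind of mask infilling already works with a plain Hugging Face BART checkpoint via generate(), so a custom BART/mBART model exported in Hugging Face format could presumably be used the same way (whether a YANMTT checkpoint loads directly like this depends on how it is saved):

from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

masked = "<mask> goes to the bookstore <mask>"
inputs = tok(masked, return_tensors="pt")
# Beam search fills the masked spans; several candidates can be returned.
outputs = model.generate(**inputs, num_beams=5, num_return_sequences=3, max_length=32)
for o in outputs:
    print(tok.decode(o, skip_special_tokens=True))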

Add post-norm to the model

Currently the mBART backbone code I use has pre-norm, which is layer(norm(input)) + input, whereas some people seem to say that post-norm, which is norm(layer(input) + input), might be better for zero-shot. Lord alone knows what's going to be useful when.

Having a flag to control pre- and post-norm in the encoder and decoder would be perfect.
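For clarity, an illustrative PyTorch sketch of the two residual arrangements described above (the flag wiring into the actual mBART encoder/decoder layers is left out):

import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm residual: layer(norm(x)) + x."""
    def __init__(self, d_model, layer):
        super().__init__()
        self.norm, self.layer = nn.LayerNorm(d_model), layer

    def forward(self, x):
        return self.layer(self.norm(x)) + x

class PostNormBlock(nn.Module):
    """Post-norm residual: norm(layer(x) + x)."""
    def __init__(self, d_model, layer):
        super().__init__()
        self.norm, self.layer = nn.LayerNorm(d_model), layer

    def forward(self, x):
        return self.norm(self.layer(x) + x)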

Mixtures of denoisers

Currently, I have implemented the mBART (span denoising) and mT5 (span prediction) pre-training approaches, but according to the UL2 paper (https://arxiv.org/pdf/2205.05131.pdf) a more comprehensive mixture of denoisers would help a lot.

Currently, you may use either the mT5 or the mBART style, but I would like to enable the user to specify a comma-separated list of denoising objectives and a comma-separated list of the probabilities of using these objectives, along with the requisite hyperparameters for each objective. If this is done, we can play with some cool stuff.
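A rough sketch of what the objective-sampling logic could look like (the flag values and objective names below are hypothetical):

import random

def parse_denoiser_spec(objectives_csv, probs_csv):
    # e.g. objectives_csv="mbart,mt5" and probs_csv="0.7,0.3" from hypothetical command-line flags.
    objectives = objectives_csv.split(",")
    probs = [float(p) for p in probs_csv.split(",")]
    assert len(objectives) == len(probs) and abs(sum(probs) - 1.0) < 1e-6
    return objectives, probs

def sample_denoiser(objectives, probs):
    # Pick one denoising objective per batch according to the given probabilities.
    return random.choices(objectives, weights=probs, k=1)[0]

objectives, probs = parse_denoiser_spec("mbart,mt5", "0.7,0.3")
print(sample_denoiser(objectives, probs))  # e.g. 'mbart'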

Three custom languages and two tasks — is this a good place to start?

I have aligned datasets for three different custom languages. Each corpus is a flat text file where each line is a sentence, and documents are separated by empty lines. All sentences and documents match between the datasets. There are two tasks I'd like to be able to perform: 1) translate between the languages, and 2) infill sentences from any single language. For the translation task, given languages A, B, and C, it's actually not likely I'll ever go from C -> A or B -> A, but I definitely want to translate A -> B and A -> C. Other translations that would be helpful would be B -> C and C -> B.

From the MBART examples at HuggingFace it looks like MBartForConditionalGeneration could perhaps do task 1 (though maybe not in all directions listed above?), and BartForConditionalGeneration could do task 2. But is there any reason why MBartForConditionalGeneration couldn't do both? That is, if I pass an input with a <mask> token to MBART, will it perform the infilling, just as BART would? If so, then does your toolkit make sense as a place to start?

Any thoughts very much appreciated.

Pre-training hangs

I run bash examples/create_tokenizer.sh and then the pre-training example script, but the latter shows

IP address is localhost
Monolingual training files are: {'hi': 'examples/data/train.hi', 'en': 'examples/data/train.en', 'vi': 'examples/data/train.vi'}
Sharding files into 1 parts
For language: hi  the total number of lines are: 18088 and number of lines per shard are: 18088
File for language hi has been sharded.
For language: en  the total number of lines are: 18088 and number of lines per shard are: 18088
File for language en has been sharded.
For language: vi  the total number of lines are: 18088 and number of lines per shard are: 18088
File for language vi has been sharded.
Sharding files into 1 parts

and then hangs without showing anything else. If I press ^C to cancel, the following traceback is shown:

  File "pretrain_nmt.py", line 888, in <module>
    run_demo()
  File "pretrain_nmt.py", line 885, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,))         #
  File "/home/user/.conda/envs/yanmtt/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/user/.conda/envs/yanmtt/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/user/.conda/envs/yanmtt/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 101, in join
    timeout=timeout,
  File "/home/user/.conda/envs/yanmtt/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/home/user/.conda/envs/yanmtt/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)

I am running YANMTT in a Docker container on a machine with a GPU A100 40GB. The only dependency for which I am using a newer version is torch, as the version in requirements.txt is too old for my GPU.

Error in BART Monolingual Pre-training.

I am getting the following error while training on the monolingual (Hindi) corpus. I successfully trained the tokenizer on the same corpus using create_autotokenizer.sh.

Error Logs:
Shuffling corpus!
Keyword arguments {'sample': False, 'nbest': 64, 'alpha_or_dropout': 0.1} not recognized.
(the line above is repeated many times)
Saving the model
Loading from checkpoint
Traceback (most recent call last):
File "pretrain_nmt.py", line 989, in
run_demo()
File "pretrain_nmt.py", line 986, in run_demo
mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/workspace/data/yanmtt/pretrain_nmt.py", line 530, in model_create_load_run_save
mod_compute = model(input_ids=input_ids, attention_mask=input_masks, decoder_input_ids=decoder_input_ids, output_hidden_states=args.distillation, output_attentions=args.distillation, label_mask=label_mask if args.num_domains_for_domain_classifier > 1 else None) ## Run the model and get logits.
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
TypeError: forward() got an unexpected keyword argument 'label_mask'

Questions about BART pretraining hyperparameters.

Hi!

I want to train a BART-base model for my language, and I'm trying to use the same hyperparameters as the English BART-base, but I have some doubts about some of the arguments I need to choose in your toolkit.

--batch_size
This one is measured in tokens in your toolkit. Do you know whether the 8K batch size reported in the original BART paper refers to sequences or tokens?

And I'm not sure exactly what the difference is between --max_length and --max_src/tgt_length, and how I should use them.

Thanks

Getting an error when pretraining with a new language (Sanskrit)

We are trying to pre-train a model initialized from IndicBART. We use the command below:
python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs sa --mono_src examples/data/train.sa --batch_size 8 --batch_size_indicates_lines --shard_files --model_path ai4bharat/IndicBART

We are getting the error below.

Calling AlbertTokenizer.from_pretrained() with the path to a single file or url is deprecated
Traceback (most recent call last):
File "pretrain_nmt.py", line 968, in
run_demo()
File "pretrain_nmt.py", line 965, in run_demo
mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/aniruddha/machine_translation/yanmtt/pretrain_nmt.py", line 85, in model_create_load_run_save
tok = AlbertTokenizer.from_pretrained(args.tokenizer_name_or_path, do_lower_case=False, use_fast=False, keep_accents=True)
File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/tokenization_utils_base.py", line 1789, in from_pretrained
resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/tokenization_utils_base.py", line 1860, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/aniruddha/machine_translation/yanmtt/transformers/src/transformers/models/albert/tokenization_albert.py", line 153, in init
self.sp_model.Load(vocab_file)
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/sentencepiece/init.py", line 367, in Load
return self.LoadFromFile(model_file)
File "/home/aniruddha/anaconda3/envs/torch1.7/lib/python3.6/site-packages/sentencepiece/init.py", line 171, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)

AttributeError: 'Seq2SeqLMOutput' object has no attribute 'additional_lm_logits'

After following the installation instructions and running examples/train_mbart_model.sh, I get the error below.

Loading from checkpoint
Traceback (most recent call last):
File "pretrain_nmt.py", line 630, in
run_demo()
File "pretrain_nmt.py", line 627, in run_demo
mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/root/yanmtt/pretrain_nmt.py", line 359, in model_create_load_run_save
if mod_compute.additional_lm_logits is not None:
AttributeError: 'Seq2SeqLMOutput' object has no attribute 'additional_lm_logits'

What may be going wrong? The version of transformers I have is 4.3.2.

Issues at decode after finetuning mBART on a monolingual seq2seq task

Hi!

I finetuned the mBART-large-cc25 model on a paraphrasing dataset for Spanish, with the following command, without any issues:

python3 train_nmt.py -n 1 -nr 0 -g 2 \
--model_path models/bartes_paraphrasES \
--is_summarization \
--pretrained_model facebook/mbart-large-cc25 \
--use_official_pretrained \
--tokenizer_name_or_path facebook/mbart-large-cc25 \
--train_slang es_XX \
--train_tlang es_XX \
--dev_slang es_XX \
--dev_tlang es_XX \
--train_src train.src.es_XX \
--train_tgt train.trg.es_XX \
--dev_src test.src.es_XX \
--dev_tgt test.trg.es_XX \
--max_src 128 \
--max_tgt 128 \
--batch_size_indicates_lines \
--batch_size 8 \
--multistep_optimizer_steps 2 \
--num_batches 40000 \
--warmup_steps 20000 \
--no_reload_optimizer_ctr_and_scheduler \
--lr 3e-5 \
--hard_truncate_length 1024 \
--eval_every 555555 \
--no_eval_save_every 555555 \
--shard_files

And then I tried to decode some examples:

python3 decode_nmt.py -n 1  -nr 0 -g 1 \
--model_path models/bartes_paraphrasES \
--slang es_XX \
--tlang es_XX \
--test_src test.src.es_XX \
--test_tgt decode_es.txt \
--tokenizer_name_or_path facebook/mbart-large-cc25 \

But I'm getting the following error:

IP address is localhost                                                                                                                                                                                    
Tokenizer is: PreTrainedTokenizer(name_or_path='facebook/mbart-large-cc25', vocab_size=250027, model_max_len=1024, is_fast=False, padding_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=True), 'additional_special_tokens': ['ar_AR', 'cs_CZ', 'de_DE', 'en_XX', 'es_XX', 'et_EE', 'fi_FI', 'fr_XX', 'gu_IN', 'hi_IN', 'it_IT', 'ja_XX', 'kk_KZ', 'ko_KR', 'lt_LT', 'lv_LV', 'my_MM', 'ne_NP', 'nl_XX', 'ro_RO', 'ru_RU', 'si_LK', 'tr_TR', 'vi_VN', 'zh_CN']})
Running DDP checkpoint example on rank 0.
Tokenizer is: PreTrainedTokenizer(name_or_path='facebook/mbart-large-cc25', vocab_size=250027, model_max_len=1024, is_fast=False, padding_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=True), 'additional_special_tokens': ['ar_AR', 'cs_CZ', 'de_DE', 'en_XX', 'es_XX', 'et_EE', 'fi_FI', 'fr_XX', 'gu_IN', 'hi_IN', 'it_IT', 'ja_XX', 'kk_KZ', 'ko_KR', 'lt_LT', 'lv_LV', 'my_MM', 'ne_NP', 'nl_XX', 'ro_RO', 'ru_RU', 'si_LK', 'tr_TR', 'vi_VN', 'zh_CN']})
Running DDP checkpoint example on rank 1.
Using positional embeddings
Using positional embeddings
Using positional embeddings
Using positional embeddings
Remapping layers from parent to child.
Final model dictionary after remapping is: odict_keys(['module.final_logits_bias', 'module.model.shared.weight', 'module.model.encoder.embed_tokens.weight', 'module.model.encoder.embed_positions.weight', ...
.
.
.
'module.model.decoder.layers.11.final_layer_norm.bias', 'module.model.decoder.layernorm_embedding.weight', 'module.model.decoder.layernorm_embedding.bias', 'module.model.decoder.layer_norm.weight', 'module.model.decoder.layer_norm.bias', 'module.lm_head.weight'])
Remapping embeddings.
Eliminating matched params with mismatched sizes from the initial model.
Eliminating module.model.shared.weight
Eliminating module.model.encoder.embed_tokens.weight
Eliminating module.model.encoder.embed_positions.weight
Eliminating module.model.encoder.embed_positions.weight
Eliminating module.model.encoder.layers.0.self_attn.k_proj.weight
Eliminating module.model.encoder.layers.0.self_attn.k_proj.bias
Eliminating module.model.encoder.layers.0.self_attn.v_proj.weight
.
.
.
Eliminating module.model.decoder.layers.5.final_layer_norm.weight
Eliminating module.model.decoder.layers.5.final_layer_norm.bias
Eliminating module.model.decoder.layernorm_embedding.weight
Eliminating module.model.decoder.layernorm_embedding.bias
Eliminating module.model.decoder.layer_norm.weight
Eliminating module.model.decoder.layer_norm.bias
Eliminating module.lm_head.weight
Remapping layers from parent to child.
Final model dictionary after remapping is: odict_keys(['module.final_logits_bias', 'module.model.shared.weight', 'module.model.encoder.embed_tokens.weight', 'module.model.encoder.embed_positions.weight', 'module.model.encoder.layers.0.self_attn.k_proj.weight'
.
.
.
'module.model.decoder.layer_norm.weight', 'module.model.decoder.layer_norm.bias', 'module.lm_head.weight'])
Remapping embeddings.
Eliminating matched params with mismatched sizes from the initial model.
Eliminating module.model.shared.weight
Eliminating module.model.encoder.embed_tokens.weight
Eliminating module.model.encoder.embed_positions.weight
Eliminating module.model.encoder.layers.0.self_attn.k_proj.weight
Eliminating module.model.encoder.layers.0.self_attn.k_proj.bias
.
.
.
Eliminating module.model.decoder.layernorm_embedding.bias
Eliminating module.model.decoder.layer_norm.weight
Eliminating module.model.decoder.layer_norm.bias
Eliminating module.lm_head.weight
Traceback (most recent call last):
  File "decode_nmt.py", line 455, in <module>
    run_demo()
  File "decode_nmt.py", line 451, in run_demo
    mp.spawn(model_create_load_decode, nprocs=args.gpus, args=(args,))         #
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/gurbizu/BART/yanmtt/decode_nmt.py", line 124, in model_create_load_decode
    model.load_state_dict(remap_embeddings_eliminate_components_and_eliminate_mismatches(model.state_dict(), remap_layers(checkpoint_dict['model'], 4, args), args), strict=True if (args.remap_encoder == "" and args.remap_decoder == "" and not args.eliminate_encoder_before_initialization and not args.eliminate_decoder_before_initialization and not args.eliminate_embeddings_before_initialization) else False) ## Modification needed if we want to load a partial model trained using multilayer softmaxing.
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1483, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for DistributedDataParallel:
        Missing key(s) in state_dict: "module.model.shared.weight", "module.model.encoder.embed_tokens.weight", "module.model.encoder.embed_positions.weight",
.
.
.
"module.model.decoder.layers.5.final_layer_norm.bias", "module.model.decoder.layernorm_embedding.weight", "module.model.decoder.layernorm_embedding.bias", "module.model.decoder.layer_norm.weight", "module.model.decoder.layer_norm.bias", "module.lm_head.weight".
Unexpected key(s) in state_dict: "module.model.encoder.layers.6.self_attn.k_proj.weight", "module.model.encoder.layers.6.self_attn.k_proj.bias", "module.model.encoder.layers.6.self_attn.v_proj.weight", "module.model.encoder.layers.6.self_attn.v_proj.bias",
.
.
.
"module.model.decoder.layers.11.final_layer_norm.weight", "module.model.decoder.layers.11.final_layer_norm.bias"

Any idea what could have caused this?
Am I doing something wrong in the fine-tuning phase?
Or do I need to load the tokenizer differently in the decoding phase?

Thanks,
Gorka

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

Hello,

I am trying to further-pretrain the official BARThez model (French BART) checkpoint available at moussaKam/barthez with the denoising task.

The command used was the following:

export CUDA_VISIBLE_DEVICES=0
time python pretrain_nmt.py -n 1 -nr 0  -g 1 --use_official_pretrained --pretrained_model moussaKam/barthez --tokenizer_name_or_path moussaKam/barthez  --model_path moussaKam/barthez  --pretrained_tokenizer_name_or_path moussaKam/barthez  --langs fr  --mono_src /data/rali6/Tmp/salaunol/_NEXT/a21/fpt/input/fpt_input_toy_train.fr   --fp16  --shard_files    --num_batches 16

My environment:

Package                 Version
----------------------- -----------
absl-py                 1.0.0
astunparse              1.6.3
backcall                0.2.0
bleach                  1.5.0
cachetools              4.2.4
certifi                 2021.10.8
chardet                 3.0.4
charset-normalizer      2.0.12
click                   8.0.4
colorama                0.4.4
cycler                  0.11.0
dataclasses             0.6
decorator               5.1.1
filelock                3.0.12
Flask                   2.0.3
Flask-Cors              3.0.10
flask-swagger-ui        3.20.9
gast                    0.3.3
google-auth             1.35.0
google-auth-oauthlib    0.4.6
google-pasta            0.2.0
grpcio                  1.44.0
gunicorn                19.9.0
h5py                    2.10.0
html5lib                0.9999999
idna                    2.8
importlib-metadata      4.8.3
ipython                 7.16.1
ipython-genutils        0.2.0
itsdangerous            2.0.1
jedi                    0.18.1
Jinja2                  3.0.3
joblib                  1.1.0
Keras-Preprocessing     1.1.2
kiwisolver              1.3.1
Markdown                3.3.6
MarkupSafe              2.0.1
matplotlib              3.3.4
mixture-of-experts      0.2.1
nltk                    3.6.7
nose                    1.3.7
numpy                   1.18.5
oauthlib                3.2.0
opt-einsum              3.3.0
packaging               20.9
pandas                  1.1.5
parso                   0.8.3
pexpect                 4.8.0
pickleshare             0.7.5
Pillow                  8.4.0
pip                     22.0.4
portalocker             2.0.0
prefetch-generator      1.0.1
prompt-toolkit          3.0.29
protobuf                3.19.4
ptyprocess              0.7.0
pyasn1                  0.4.8
pyasn1-modules          0.2.8
Pygments                2.11.2
pyparsing               3.0.8
python-dateutil         2.8.2
pytz                    2022.1
regex                   2022.3.15
requests                2.21.0
requests-oauthlib       1.3.0
rouge-score             0.0.4
rsa                     4.8
sacrebleu               1.5.1
sacremoses              0.0.43
scipy                   1.4.1
sentencepiece           0.1.95
setuptools              58.3.0
six                     1.16.0
tensorboard             2.3.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit  1.8.0
tensorflow-estimator    2.3.0
tensorflow-gpu          2.3.0
termcolor               1.1.0
tokenizers              0.10.1
torch                   1.7.1+cu110
torchaudio              0.7.2
torchvision             0.8.2+cu110
tqdm                    4.57.0
traitlets               4.3.3
transformers            4.3.2
typing_extensions       4.1.1
urllib3                 1.24.3
uuid                    1.30
validate-email          1.3
wcwidth                 0.2.5
Werkzeug                2.0.3
wheel                   0.37.0
wrapt                   1.14.0
zipp                    3.6.0

I also made some changes in pretrain_nmt.py so that the barthez checkpoint is loaded properly with the classes suggested at https://huggingface.co/moussaKam/barthez (top-right button "Use in Transformers").
The following error occurred, but the cause is unclear to me. Any ideas?

pretrain_nmt.py:273: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if not args.no_reload_optimizer_ctr_and_scheduler and args.remap_encoder is '' and args.remap_decoder is '' and not args.eliminate_encoder_before_initialization and not args.eliminate_decoder_before_initialization and not args.eliminate_embeddings_before_initialization: ## Do not load optimizers, ctr and schedulers when remapping or resuming training.
IP address is localhost                                                                                                                                                              
Monolingual training files are: {'fr': '/data/rali6/Tmp/salaunol/_NEXT/a21/fpt/input/fpt_input_toy_train.fr'}
/u/salaunol/Documents/_2022_hiver/yanmtt/pretrain_nmt.py:273: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if not args.no_reload_optimizer_ctr_and_scheduler and args.remap_encoder is '' and args.remap_decoder is '' and not args.eliminate_encoder_before_initialization and not args.eliminate_decoder_before_initialization and not args.eliminate_embeddings_before_initialization: ## Do not load optimizers, ctr and schedulers when remapping or resuming training.
/u/salaunol/Documents/_2022_hiver/yanmtt/pretrain_nmt.py:273: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if not args.no_reload_optimizer_ctr_and_scheduler and args.remap_encoder is '' and args.remap_decoder is '' and not args.eliminate_encoder_before_initialization and not args.eliminate_decoder_before_initialization and not args.eliminate_embeddings_before_initialization: ## Do not load optimizers, ctr and schedulers when remapping or resuming training.
Sharding files into 1 parts
For language: fr  the total number of lines are: 8452 and number of lines per shard are: 8452
File for language fr has been sharded.
Sharding files into 1 parts
Traceback (most recent call last):
  File "pretrain_nmt.py", line 919, in <module>
    run_demo()
  File "pretrain_nmt.py", line 916, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,))         #
  File "/u/salaunol/Documents/_2022_hiver/yanmtt/py38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/u/salaunol/Documents/_2022_hiver/yanmtt/py38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/u/salaunol/Documents/_2022_hiver/yanmtt/py38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/u/salaunol/Documents/_2022_hiver/yanmtt/py38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/u/salaunol/Documents/_2022_hiver/yanmtt/pretrain_nmt.py", line 89, in model_create_load_run_save
    tok = AutoTokenizer.from_pretrained(args.tokenizer_name_or_path, use_fast=False)
  File "/u/salaunol/Documents/_2022_hiver/yanmtt/py38/lib/python3.8/site-packages/transformers-4.3.2-py3.8.egg/transformers/models/auto/tokenization_auto.py", line 362, in from_pretrained
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/u/salaunol/Documents/_2022_hiver/yanmtt/py38/lib/python3.8/site-packages/transformers-4.3.2-py3.8.egg/transformers/models/auto/configuration_auto.py", line 368, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/u/salaunol/Documents/_2022_hiver/yanmtt/py38/lib/python3.8/site-packages/transformers-4.3.2-py3.8.egg/transformers/configuration_utils.py", line 427, in get_config_dict
    config_dict = cls._dict_from_json_file(resolved_config_file)
  File "/u/salaunol/Documents/_2022_hiver/yanmtt/py38/lib/python3.8/site-packages/transformers-4.3.2-py3.8.egg/transformers/configuration_utils.py", line 510, in _dict_from_json_file
    text = reader.read()
  File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

Improve documentation

Exactly what the title says. Find an undocumented part of the code and document it. I will give 1 potato per pull request.

mBART embedding matrix pruning while finetuning on a single language

Finetuning mBART-large on my machines is possible with gradient accumulation, but the training could be faster if I were able to decrease the size of the loaded model.

Is there any easy way to reduce the size of the mBART vocabulary, pruning the embedding parameters we won't use when finetuning the model on a monolingual task using your tool?
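Not an answer from the toolkit itself, but a rough sketch of how vocabulary pruning is usually done with plain Hugging Face objects: collect the token ids seen in the target-language corpus (plus special tokens) and slice the embedding matrix accordingly. The corpus path is a placeholder, and the tokenizer/lm_head remapping is only hinted at in the comments.

import torch
from transformers import AutoTokenizer, MBartForConditionalGeneration

tok = AutoTokenizer.from_pretrained("facebook/mbart-large-cc25", use_fast=False)
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

keep = set(tok.all_special_ids)
with open("corpus.txt", encoding="utf-8") as f:   # placeholder corpus path
    for line in f:
        keep.update(tok(line.strip())["input_ids"])
keep = sorted(keep)

old_embed = model.get_input_embeddings().weight.data      # (old_vocab_size, d_model)
new_embed = torch.nn.Embedding(len(keep), old_embed.size(1))
new_embed.weight.data.copy_(old_embed[keep])              # keep only the selected rows
model.set_input_embeddings(new_embed)
model.config.vocab_size = len(keep)
# NOTE: the tokenizer must be remapped so that new id i corresponds to old id keep[i],
# and the tied lm_head / final_logits_bias need the same slicing (omitted here).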

Unexpected Keyword arguments prompt_params, adaptor_layers, deep_adaptor_tuning, deep_adaptor_tuning_ffn_only, parallel_adaptors

Hi @prajdabre ,

Thanks for doing great work with this library. I was trying to use it and ran into an issue where the prompt_params, adaptor_layers, deep_adaptor_tuning, deep_adaptor_tuning_ffn_only and parallel_adaptors params are being passed to forward here, but the MBartForConditionalGeneration class's forward function doesn't expect them.

I wanted to understand from you whether the fix is as simple as adding these params to the forward signature with a default value of None (in which case I'm guessing we would also need to change the forward function's implementation itself to actually use these params).

Let me know if you think I might be missing something here. Thanks!

Adding WandB support

Currently, model information such as losses and gradients is saved every N steps to a local file, which is then visualized using TensorBoard. WandB is the cool new kid on the block, but we must respect our elders.

Adding a flag and then an if/else to switch between WandB and TensorBoard would be nice. If someone wants to use WandB, then remember to print a message so that users specify their username and WandB workspace URL via flags. Also, in case of a WandB error, ask users to check whether the WandB initialization was done properly.

This is fairly easy.
As usual, please document.
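A minimal sketch of the flag-controlled switch (the flag and argument names use_wandb, wandb_project and wandb_entity are hypothetical):

from torch.utils.tensorboard import SummaryWriter

def make_logger(args):
    if getattr(args, "use_wandb", False):
        import wandb
        try:
            # Users must supply their username/workspace via flags for this to work.
            wandb.init(project=args.wandb_project, entity=args.wandb_entity)
        except Exception as e:
            raise RuntimeError("wandb initialization failed; check the wandb username/workspace flags") from e
        return lambda tag, value, step: wandb.log({tag: value}, step=step)
    writer = SummaryWriter()
    return lambda tag, value, step: writer.add_scalar(tag, value, step)

# Inside the training loop, for example:
# log = make_logger(args); log("train/loss", loss.item(), step)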

Cleaning the code for calling the model and computing loss

The two core files, pretrain_nmt.py and train_nmt.py, contain a lot of repeated monolithic code. For example, the loss-computation part of the code is mostly duplicated.

Desired cleanup:

  1. Identify the common code and move it under functions with appropriate arguments. These functions can go under common_utils.py
  2. Restructure the code so that the code becomes more modular.
  3. Document all the functions so that users have an easier time following it.

Bonus meme: Clean up the if-else structures to make them look better if possible.
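As a starting point, a sketch of what item 1 could look like for the loss computation (the helper name and signature are suggestions; it assumes the existing label_smoothed_nll_loss in common_utils.py, whose exact return value may differ):

# common_utils.py (sketch)
import torch

def compute_label_smoothed_loss(model_output, labels, label_smoothing, pad_token_id, softmax_temperature=1.0):
    """Shared between pretrain_nmt.py and train_nmt.py: turn model logits into a label-smoothed NLL loss."""
    logits = model_output.logits / softmax_temperature
    lprobs = torch.nn.functional.log_softmax(logits, dim=-1)
    # label_smoothed_nll_loss is the existing helper in common_utils.py; adapt if its return signature differs.
    loss, _nll = label_smoothed_nll_loss(lprobs, labels, label_smoothing, ignore_index=pad_token_id)
    return loss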

RuntimeError: [enforce fail at inline_container.cc:145] . PytorchStreamReader failed reading file data/94862161006912: file read failed

Hi, I use train_mbart_model.sh to do further pre-training based on mBART-large-50 from https://huggingface.co/facebook/mbart-large-50.

When I run on a single GPU there is no problem, but when I use the following command to run on 2 GPUs,

export CUDA_VISIBLE_DEVICES=0,1 # Change to the GPU ID corresponding to a GPU that is free. nohup python pretrain_nmt.py -n 1 -nr 0 -g 2 --model_path gen_model/mbart/mbart-50-v1 --tokenizer_name_or_path pretrain_model/mbart-50 --langs en --mono_src examples/test_data/test_mbart_train.en --encoder_layers 12 --decoder_layers 12 --encoder_attention_heads=12 --decoder_attention_heads=12 --encoder_ffn_dim=128 --decoder_ffn_dim=4096 --d_model=1024 --batch_size 128 --use_official_pretrained --pretrained_model pretrain_model/mbart-50 --no_reload_optimizer_ctr_and_scheduler --shard_files > gen_model/mbart/run_train.log 2>&1 &
I get the following error:

Number of model parameters: 610879488
Total number of params to be optimized are: 610879488
Percentage of parameters to be optimized: 100.0
Initial LR is: 1.25e-07
Training from official pretrained model
/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:247: UserWarning: To get the last learning rate computed by the scheduler, please use get_last_lr().
warnings.warn("To get the last learning rate computed by the scheduler, "
/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Traceback (most recent call last):
File "pretrain_nmt.py", line 968, in
run_demo()
File "pretrain_nmt.py", line 965, in run_demo
mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
File "/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "
/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "****/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, args)
File "
/mt_mbart/yanmtt/pretrain_nmt.py", line 313, in model_create_load_run_save
checkpoint_dict = torch.load(CHECKPOINT_PATH, map_location=map_location)
File "
/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/serialization.py", line 594, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "
/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/serialization.py", line 853, in _load
result = unpickler.load()
File "
/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/serialization.py", line 845, in persistent_load
load_tensor(data_type, size, key, _maybe_decode_ascii(location))
File "
*/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/serialization.py", line 833, in load_tensor
storage = zip_file.get_storage_from_record(name, size, dtype).storage()
RuntimeError: [enforce fail at inline_container.cc:145] . PytorchStreamReader failed reading file data/94862161006912: file read failed

Can you help me with this problem? I have spent a long time on it and could not solve it.

Add support for latest version of transformers repo

Currently I have provided my own modded fork of transformers, but if someone doesn't care about the features I have added and only wants to work with the vanilla mBART code, then this should be enabled.

What this would mean is that all those other arguments I pass to the mBART config class to instantiate the object will be sent to kwargs. The main change will be minimal and most likely related to the tokenizer. In the batch-creation logic, I pass some extra arguments to the tokenizer to support stochastic tokenization. The way I see it, we have a flag called --is_official_repo which, if passed, means that the official transformers repo is used. This argument will then be passed to the batching function, which won't pass the flags relevant to stochastic tokenization.
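A sketch of the batching-side switch described above (args.is_official_repo comes from this issue, while args.stochastic_tokenization and args.max_length are stand-ins for whatever flags actually control this; the extra tokenizer kwargs mirror the sample/nbest/alpha_or_dropout arguments the modded fork uses):

def encode_batch(tok, sentences, args):
    common = dict(padding=True, truncation=True, max_length=args.max_length, return_tensors="pt")
    if args.is_official_repo:
        # Vanilla transformers tokenizer: no stochastic-tokenization kwargs.
        return tok(sentences, **common)
    # Modded fork: extra kwargs enable stochastic segmentation / subword regularization.
    return tok(sentences, sample=args.stochastic_tokenization, nbest=64, alpha_or_dropout=0.1, **common)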

at train_nmt.py I get RuntimeError: The expanded size of the tensor (30) must match the existing size (25) at non-singleton dimension 1.

Hi!

I trained a monolingual BART using the following command:

python3 pretrain_nmt.py -n 1 -nr 0 -g 2 --model_path models/bart_base \
--tokenizer_name_or_path tokenizers/mbart-bpe50k \
--langs xx --mono_src data/train.xx \
--batch_size 4096 \
--multistep_optimizer_steps 4 \
--num_batches 1800000 \
--warmup_steps 16000 \
--encoder_layers 6 \
--decoder_layers 6 \
--max_length 128 \
--encoder_attention_heads 12 \
--decoder_attention_heads 12 \
--decoder_ffn_dim 3072 \
--encoder_ffn_dim 3072 \
--d_model 768 \
--lr 1e-4 \
--hard_truncate_length 1024 \
--shard_files

and now I would like to finetune it on a seq2seq task (paraphrasing) with a small dataset, to see if the model learns something in the pretraining:

python3 train_nmt.py -n 1 -nr 0 -g 1 \
--model_path models/bart__base_ft \
--pretrained_model models/bart_base \
--tokenizer_name_or_path tokenizers/mbart-bpe50k \
--train_slang src \
--train_tlang trg \
--dev_slang src \
--dev_tlang trg \
--train_src data/train.src \
--train_tgt data/train.trg \
--dev_src data/test.src \
--dev_tgt data/test.trg \
--max_src 128 \
--max_tgt 128 \
--batch_size_indicates_lines \
--batch_size 32 \
--num_batches 1000 \
--encoder_layers 6 \
--decoder_layers 6 \
--encoder_attention_heads 12 \
--decoder_attention_heads 12 \
--decoder_ffn_dim 3072 \
--encoder_ffn_dim 3072 \
--d_model 768 \
--lr 3e-5 \
--hard_truncate_length 1024 \
--shard_files

and I get the following error:

...
Using label smoothing of 0.1
Using gradient clipping norm of 1.0
Using softmax temperature of 1.0
Masking ratio: 0.3
Training for: ['src-trg']
Corpora stats: {'src-trg': 568}
Shuffling corpus: src-trg
Running eval on dev set(s)
BLEU score using sacrebleu after 450000 iterations is 33.4095177159796 for language pair src-trg
New peak reached for src-trg . Saving.
Global BLEU score using sacrebleu after 450000 iterations is: 33.4095177159796
New peak reached. Saving.
Saving the model
Loading from checkpoint
Traceback (most recent call last):
  File "train_nmt.py", line 884, in <module>
    run_demo()
  File "train_nmt.py", line 881, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,train_files, dev_files, quit_condition))         #
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/gurbizu/BART/yanmtt/bartenv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/gurbizu/BART/yanmtt/train_nmt.py", line 513, in model_create_load_run_save
    lprobs, labels, args.label_smoothing, ignore_index=tok.pad_token_id
  File "/home/gurbizu/BART/yanmtt/common_utils.py", line 82, in label_smoothed_nll_loss
    smooth_loss.masked_fill_(pad_mask, 0.0)
RuntimeError: The expanded size of the tensor (27) must match the existing size (22) at non-singleton dimension 1.  Target sizes: [32, 27, 1].  Tensor sizes: [32, 22, 1]

Any idea what could cause this?
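For context, the failing line lives inside the label-smoothed loss. Below is a sketch of the usual fairseq-style implementation (yanmtt's common_utils.py may differ in details): the padding mask comes from the labels, so the labels and the decoder log-probabilities must agree on the sequence dimension, which is exactly the dimension reported as mismatched above (27 vs 22).

import torch

def label_smoothed_nll_loss(lprobs, target, epsilon, ignore_index=None):
    # lprobs: (batch, seq_len, vocab) log-probabilities; target: (batch, seq_len) gold token ids.
    if target.dim() == lprobs.dim() - 1:
        target = target.unsqueeze(-1)
    nll_loss = -lprobs.gather(dim=-1, index=target)    # requires matching seq_len
    smooth_loss = -lprobs.sum(dim=-1, keepdim=True)
    if ignore_index is not None:
        pad_mask = target.eq(ignore_index)             # mask shape follows the labels
        nll_loss.masked_fill_(pad_mask, 0.0)
        smooth_loss.masked_fill_(pad_mask, 0.0)        # the call that fails when shapes diverge
    nll_loss, smooth_loss = nll_loss.sum(), smooth_loss.sum()
    eps_i = epsilon / lprobs.size(-1)
    return (1.0 - epsilon) * nll_loss + eps_i * smooth_loss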

CPU support

Currently, YANMTT assumes that only GPUs are to be used, but some people might want to use a massive CPU cluster. Add a flag and then modify the relevant parts of the code.
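A minimal sketch of how such a flag could be wired in (the --cpu option and the gloo/nccl switch are assumptions about one possible implementation, not existing YANMTT behaviour):

import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--cpu", action="store_true", help="Hypothetical flag: run training on CPU.")
args = parser.parse_args()

# Pick the device and the matching torch.distributed backend.
device = torch.device("cpu") if args.cpu or not torch.cuda.is_available() else torch.device("cuda")
backend = "gloo" if device.type == "cpu" else "nccl"
# torch.distributed.init_process_group(backend=backend, ...)  # rest of the setup unchanged
# model.to(device)  # and replace the hard-coded .cuda()/.to(gpu) calls with .to(device)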

Some question of pretrain_nmt.py

I have some confusion about pretrain_nmt.py.

I just saw the following in the first few lines of your pretrain_nmt.py:

from transformers import AutoTokenizer, MBartTokenizer, MBart50Tokenizer, BartTokenizer, AlbertTokenizer
from transformers import MBartForConditionalGeneration, BartForConditionalGeneration, MBartConfig, get_linear_schedule_with_warmup

According to my understanding, you have rewritten some scripts, such as some classes in (https://github.com/prajdabre/yanmtt/tree/main/transformers/src/transformers/models/mbart)/modeling_mbart.py,
in order to support further pretraining based on mBART.

So why don't you use the functions in your new modeling_mbart.py? I mean, why don't you import the classes from (https://github.com/prajdabre/yanmtt/tree/main/transformers/src/transformers/models/mbart)/modeling_mbart.py directly?
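For what it's worth, one quick way to see where those imports actually resolve (assuming the bundled fork is the transformers package installed in the environment, e.g. via an editable install) is:

import transformers
from transformers import MBartForConditionalGeneration

print(transformers.__file__)                     # should point inside yanmtt/transformers if the fork is installed
print(MBartForConditionalGeneration.__module__)  # transformers.models.mbart.modeling_mbart

If that is the case, the plain from transformers import ... lines already pick up the modded classes, which would explain why no direct path import is needed.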

Exception: process 0 terminated with signal SIGSEGV

Hi, I ran into a tricky problem with pretrain_nmt.py.

my command:

CUDA_VISIBLE_DEVICES=3 python pretrain_nmt.py -n 1 -nr 0 -g 1 --pretrained_model facebook/bart-base --use_official_pretrained --tokenizer_name_or_path facebook/bart-base --is_summarization --warmup_steps 500 --save_intermediate_checkpoints --mono_src /home/WwhStuGrp/yyfwwhstu16/yanmtt/dataset/pubmed/pubmed-dataset/train_fineshed.txt --monolingual_domains 1 --train_domains 1 --shard_files --batch_size 1024

here is the traceback:

 File "pretrain_nmt.py", line 968, in <module>
    run_demo()
  File "pretrain_nmt.py", line 965, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,))         #
  File "/home/WwhStuGrp/yyfwwhstu16/anaconda3/envs/scibart/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/WwhStuGrp/yyfwwhstu16/anaconda3/envs/scibart/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/home/WwhStuGrp/yyfwwhstu16/anaconda3/envs/scibart/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGSEGV

I found some suggested fixes, including the one below, but they didn't seem to work:
facebookresearch/fairseq#1720 (comment)

Any advice or solution?
Thanks again for your work on this repo!

Tokenization issue with pretrained model

I am trying to pretrain BART further from the huggingface checkpoint with the below command, and it seems like there is an issue with a mismatched number of arguments for _tokenize.

The command is below:
python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model facebook/bart-large --tokenizer_name_or_path facebook/bart-large --langs en --mono_src examples/data/train.en --batch_size 8

The error is:
Using softmax temperature of 1.0
Masking ratio: 0.3
Training for: ['en']
Shuffling corpus!
Traceback (most recent call last):
File "pretrain_nmt.py", line 628, in
run_demo()
File "pretrain_nmt.py", line 625, in run_demo
mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/root/yanmtt/pretrain_nmt.py", line 221, in model_create_load_run_save
for input_ids, input_masks, decoder_input_ids, labels in generate_batches_monolingual_masked_or_bilingual(tok, args, rank, files, train_files, ctr): #Batches are generated from here. The argument (0.30, 0.40) is a range which indicates the percentage of the source sentence to be masked in case we want masking during training just like we did during BART pretraining. The argument 3.5 is the lambda to the poisson length sampler which indicates the average length of a word sequence that will be masked. Since this is pretraining we do not do any evaluations even if we train on parallel corpora.
File "/root/yanmtt/common_utils.py", line 482, in generate_batches_monolingual_masked
iids = tok(lang + " " + masked_sentence + " ", add_special_tokens=False, return_tensors="pt").input_ids
File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils_base.py", line 2377, in call
**kwargs,
File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils_base.py", line 2447, in encode_plus
**kwargs,
File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 441, in _encode_plus
first_ids = get_input_ids(text)
File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 410, in get_input_ids
tokens = self.tokenize(text, **kwargs)
File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 342, in tokenize
tokenized_text = split_on_tokens(no_split_token, text)
File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 336, in split_on_tokens
for token in tokenized_text
File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 336, in
for token in tokenized_text
TypeError: _tokenize() takes 2 positional arguments but 5 were given

Upon some further inspection, it seems that in a commit a few days ago this line was changed to pass 4 arguments: https://github.com/prajdabre/yanmtt/blob/main/transformers/src/transformers/tokenization_utils.py#L319

However, the _tokenize function of the BART tokenizer (which I believe inherits all the way down from GPT2) takes fewer arguments:
https://github.com/prajdabre/yanmtt/blob/main/transformers/src/transformers/models/gpt2/tokenization_gpt2.py#L241
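If the goal is only to get the official BART tokenizer through the modded call sites, one hypothetical stopgap is a thin subclass that swallows the extra positional arguments (a sketch only; the proper fix is aligning the fork's tokenization_utils.py with the tokenizers it is used with):

from transformers import BartTokenizer

class LenientBartTokenizer(BartTokenizer):
    # Accept and ignore the extra arguments passed down by the modded split_on_tokens.
    def _tokenize(self, text, *args, **kwargs):
        return super()._tokenize(text)

tok = LenientBartTokenizer.from_pretrained("facebook/bart-large")
print(tok.tokenize("a quick test"))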

dependency conflicts in requirements.txt

Steps to reproduce the error

conda create --name indicbart python=3.6
conda activate indicbart
pip install -r requirements.txt

The error

The conflict is caused by: 
The user requested scipy==1.5.4 
imagehash 4.2.1 depends on scipy 
missingno 0.5.0 depends on scipy 
pandas-profiling 3.1.0 depends on scipy>=1.4.1 
phik 0.12.0 depends on scipy>=1.5.2 
seaborn 0.11.1 depends on scipy>=1.0 
tensor2tensor 1.14.0 depends on scipy 
tensorflow-gpu 2.3.0 depends on scipy==1.4.1

Can you please give some guidance on how to resolve these dependency issues?

Some confusion about <CUDA out of memory>

Hi, when I use train_mbart_model.sh to continue pretraining mBART-50, the following error occurred after 300k batches: RuntimeError: CUDA out of memory. Tried to allocate 1.90 GiB (GPU 0; 39.44 GiB total capacity; 19.93 GiB already allocated; 1.31 GiB free; 36.14 GiB reserved in total by PyTorch). At that point one epoch had already been completed. I then restarted the run from the checkpoint that had just been generated, with the batch size reduced from 2048 to 1024. I'm confused: why did this problem suddenly appear after the task had run for so long, or is there some hidden problem in the program?
I don't know why this happens, so I'm worried the problem will occur again after the task restarts, and I don't know whether only changing the batch size will be enough.

I use 4 GPUs: A100, NVIDIA-SMI 515.48.07, Driver Version: 515.48.07, CUDA Version: 11.7,
and the total memory per GPU is 40 GB.

During this time there were probably no other tasks preempting resources.

and my script setting is:
export CUDA_VISIBLE_DEVICES=0,1,2,3 # Change to the GPU ID corresponding to a GPU that is free.
export PYTHONWARNINGS='ignore:semaphore_tracker:UserWarning'
nohup python pretrain_nmt_new.py -n 1 -nr 0 -g 4 --model_path gen_model/mbart_v1/mbart-50-v1 \
--tokenizer_name_or_path pretrain_model/mbart-50 \
--langs en_XX,es_XX,vi_VN,id_ID,th_TH,pt_XX,zh_CN,ko_KR \
--mono_src hyb_train_v1/train.en,hyb_train_v1/train.es,hyb_train_v1/train.vi,hyb_train_v1/train.id,hyb_train_v1/train.th,hyb_train_v1/train.pt,hyb_train_v1/train.zh,hyb_train_v1/train.ko \
--encoder_layers 12 --decoder_layers 12 --encoder_attention_heads=12 --decoder_attention_heads=12 \
--encoder_ffn_dim=128 --decoder_ffn_dim=4096 --d_model=1024 \
--batch_size 2048 --use_official_pretrained --pretrained_model pretrain_model/mbart-50 \
--no_reload_optimizer_ctr_and_scheduler --long_save_every 50000 --num_batches 10000000 \
--save_intermediate_checkpoints --data_sampling_temperature 1.0 \
--hard_truncate_length 512 --max_length 512 --shard_files > gen_model/mbart_v1/run_train.log 2>&1 &
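As a side note, a small sketch for watching whether GPU memory keeps growing over such a long run (standard PyTorch memory counters; where exactly to call it inside the training loop is up to you):

import torch

def log_gpu_memory(step, device=0):
    allocated = torch.cuda.memory_allocated(device) / 1024**3
    peak = torch.cuda.max_memory_allocated(device) / 1024**3   # peak since the last reset
    print(f"step {step}: allocated {allocated:.2f} GiB, peak {peak:.2f} GiB")
    torch.cuda.reset_peak_memory_stats(device)

# e.g. call log_gpu_memory(step) every few thousand batches to spot a slow upward drift.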

Tokenization issue when trying pretraining a monolingual model from scratch

Hi and thanks for your amazing work @prajdabre !

I got the same error as in #2 while trying to train a monolingual BART from scratch for the Basque language (eu) using an already existing BERT tokenizer.

python3 pretrain_nmt.py -n 1 -nr 0 -g 1 --model_path models/bart_model --tokenizer_name_or_path ixa-ehu/berteus-base-cased --langs eu --mono_src train.eu --encoder_layers 1 --decoder_layers 1 --encoder_attention_heads=1 --decoder_attention_heads=1 --encoder_ffn_dim=128 --decoder_ffn_dim=128 --d_model=64 --shard_files

.
.
.
 File "/home/BART/yanmtt/bartenv/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 336, in <genexpr>
    for token in tokenized_text
TypeError: _tokenize() takes 2 positional arguments but 5 were given

You can get a small monolingual text file from OPUS if you want to give it a try.

I also tried to run it using my own output files (_.vocab and _.model) from the sentencepiece model I trained on my corpus. But I need to provide a config.json file for the training, and it is not clear what it should look like when you want to pretrain a new model from scratch.
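In case it is useful, a config.json for a from-scratch model can be written out from an MBartConfig; the values below are placeholders mirroring the tiny test sizes in the command above, and vocab_size has to match the sentencepiece vocabulary (whether yanmtt consumes this file directly is something I haven't verified):

from transformers import MBartConfig

config = MBartConfig(
    vocab_size=32000,            # placeholder; must equal the sentencepiece vocab size
    d_model=64,
    encoder_layers=1,
    decoder_layers=1,
    encoder_attention_heads=1,
    decoder_attention_heads=1,
    encoder_ffn_dim=128,
    decoder_ffn_dim=128,
)
config.save_pretrained("models/bart_model")  # writes models/bart_model/config.json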

Could you help me understand what I am missing to pretrain a monolingual model from scratch properly?

Thanks,

Gorka

finetuning BART for text classification tasks as seq2seq

Hi,

I trained a monolingual BART using your toolkit, and now I want to evaluate the model on NLU (natural language understanding), as we don't have any proper seq2seq dataset to evaluate its generative capacities yet.

The idea is to evaluate the model on sequence labeling and text classification tasks, including sentence-pair classification, but to get started, I would like to evaluate it on a single text classification task, in the form of text-label pairs, like topic classification or NLI.

I think your finetuning script train_nmt.py should be enough for that, as the labels could be predicted as target sequences. Otherwise, I thought of finetuning the BART model using Huggingface tools, but I don't know if any changes are needed for the model, vocab/tokenizer and config files, so I want to try your toolkit's finetuning options first, which worked fine for a paraphrasing task using my BART model.

I would like to know if using --is_summarization makes sense for this type of task, and if you see any other limitations or options I should use during finetuning.

I had something like this in mind:

python3 train_nmt.py -n 1 -nr 0 -g 1 \
--is_summarization \
--model_path models/bart_topic \
--pretrained_model models/bart_base_512 \
--tokenizer_name_or_path tokenizers/mbart-bpe50k \
--train_slang xx \
--train_tlang xx \
--dev_slang xx \
--dev_tlang xx \
--train_src train.src.xx \
--train_tgt train.trg.xx \
--dev_src dev.src.xx \
--dev_tgt dev.trg.xx \
--max_src 512 \
--max_tgt 512 \
--batch_size_indicates_lines \
--batch_size 32 \
--num_batches 1000 \
--warmup_steps 100 \
--no_reload_optimizer_ctr_and_scheduler \
--lr 3e-5 \
--hard_truncate_length 1024 \
--shard_files

Where the source files have one text per line, while the target files contain the corresponding labels written out as text, for example:
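(A hypothetical data-preparation sketch; the texts, labels and file names are made up, only the one-example-per-line pairing matters.)

# Write a classification dataset in the seq2seq format described above:
# one input text per line in the source file, and its label, spelled out as text, in the target file.
examples = [
    ("the match ended with a late winner", "sports"),
    ("parliament approved the new budget", "politics"),
]
with open("train.src.xx", "w", encoding="utf-8") as src, open("train.trg.xx", "w", encoding="utf-8") as trg:
    for text, label in examples:
        src.write(text.replace("\n", " ") + "\n")
        trg.write(label + "\n")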

But something weird happens during the training, and I get this printed nonstop from the beginning:

Shuffling corpus: xx-xx
Finished epoch 999 for language: xx-xx
Shuffling corpus: xx-xx
Finished epoch 1000 for language: xx-xx

Where the epoch count increases by about 100 per second, which doesn't make sense for a dataset of thousands of examples. Maybe I'm not reading the files in the correct way, but the same approach of reading files worked fine for a paraphrasing task before.

Thanks for your time,
Gorka

Can it be used to pre-train a BART model?

Hi~
I want to find a way to pretrain a BART model on my own corpus, where the pretraining adds noise to the input and reconstructs it, just as the original BART pretraining did. I couldn't find a specific way to pretrain a BART model, but I came across this mBART pretraining method. Can it be used to pretrain a BART model?

Looking forward to your reply.

Getting error when try to pre-train for three languages

using the below command:

python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs hi,kn,bn --mono_src /home/aniruddha/all_data/train.hi,/home/aniruddha/all_data/train.kn,/home/aniruddha/all_data/train.bn --batch_size 8 --batch_size_indicates_lines --shard_files --model_path aibharat/IndicBART/model --port 7878


Using label smoothing of 0.1
Using gradient clipping norm of 1.0
Using softmax temperature of 1.0
Masking ratio: 0.3
Training for: ['hi', 'kn', 'bn']
Shuffling corpus!
Shuffling corpus!
Shuffling corpus!
Saving the model
Loading from checkpoint
Traceback (most recent call last):
File "pretrain_nmt.py", line 968, in
run_demo()
File "pretrain_nmt.py", line 965, in run_demo
mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in wrap
fn(i, *args)
File "/home/aniruddha/yanmtt/pretrain_nmt.py", line 521, in model_create_load_run_save
lprobs, labels, args.label_smoothing, ignore_index=tok.pad_token_id
File "/home/aniruddha/yanmtt/common_utils.py", line 147, in label_smoothed_nll_loss
smooth_loss.masked_fill_(pad_mask, 0.0)
RuntimeError: The expanded size of the tensor (316) must match the existing size (315) at non-singleton dimension 1. Target sizes: [8, 316, 1]. Tensor sizes: [8, 315, 1]

Nan loss after 70K steps on the pre-training.

Hi again!

I was able to start training my BART model thanks to your help with the tokenizer, but during the pretraining process, at around 70K steps, I got the following NaN loss:

...
76930 2.6496577
76940 2.59109
76950 2.596339
76960 2.735384
76970 2.6265104
76980 nan
76990 nan
Saving the model
Loading from checkpoint
Loading from checkpoint
77000 nan
77010 nan
...

I used the following training arguments:

python3 pretrain_nmt.py -n 1 -nr 0 -g 2 --model_path models/bart_base \
--tokenizer_name_or_path tokenizers/mbart-bpe50k \
--langs xx --mono_src data/train.xx \
--batch_size 4096 \
--multistep_optimizer_steps 4 \
--num_batches 500000 \
--warmup_steps 16000 \
--encoder_layers 6 \
--decoder_layers 6 \
--max_length 128 \
--encoder_attention_heads 12 \
--decoder_attention_heads 12 \
--decoder_ffn_dim 3072 \
--encoder_ffn_dim 3072 \
--d_model 768 \
--fp16

Do you have any idea what could have caused this?
The only hypotheses that come to mind are that --fp16 could have caused it, or that I should use a smaller learning rate than the default one (lr 1e-3).
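If --fp16 is the suspect, a quick diagnostic (a sketch, not something I know to be built into the toolkit) is to test the loss before the backward pass and skip the offending updates rather than letting NaNs propagate into the weights:

import torch

def safe_backward(loss, optimizer, step):
    # Skip the update if the loss has already overflowed to NaN/Inf (a common fp16 failure mode).
    if not torch.isfinite(loss):
        print(f"step {step}: non-finite loss, skipping update")
        optimizer.zero_grad()
        return False
    loss.backward()
    return True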

Thanks for your time,
Gorka

Add xProphetNet model into this toolkit

Hi, very helpful toolkit, I have learned a lot from it.

Recently I have been focused on multilingual title generation tasks, and found that the xProphetNet model performs well, especially on the XGLUE benchmarks.
I wanted to distill a small xProphetNet model and pretrain it on my own dataset. However, I did not find the relevant pretraining code, so I would like to ask whether you would consider adding pretraining support for the xProphetNet model. I can provide the code for the model architecture and for the fine-tuning process (which reproduces the reported results).

To address a possible doubt: I considered using the mBART model, but the tokenizer of the pretrained mBART model is language-specific, and my own dataset cannot be separated by language for training.
I considered putting all the data in one file to train a unified tokenizer, but I was concerned that the effective reduction in per-language vocabulary size might hurt the model's performance. Do you have any suggestions?
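For what it's worth, training one shared subword model over the concatenated data is straightforward with sentencepiece (a sketch; the file name, vocab size and coverage are placeholders, and balancing the languages before concatenation is left to you):

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/all_languages.txt",  # concatenation of all monolingual files (ideally temperature-sampled)
    model_prefix="unified_spm",
    vocab_size=64000,                # placeholder; trade-off between per-language coverage and model size
    model_type="unigram",
    character_coverage=0.9995,       # keep coverage high for multi-script data
)
sp = spm.SentencePieceProcessor(model_file="unified_spm.model")
print(sp.encode("a quick multilingual test", out_type=str))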

Thanks

Pretraining Bart on Single corpus

Hi,

First thanks for the work on this repo !

I'm continuing to pretrain BART on my own English corpus "train_fineshed.txt", but the python arguments don't seem to work: "file not found error: ***/train_fineshed.txt.01"

my python command as follow:

python pretrain_nmt.py -n 1 -nr 0 -g 2 --pretrained_model facebook/bart-base --use_official_pretrained --tokenizer_name_or_path facebook/bart-base --is_summarization --warmup_steps 500 --save_intermediate_checkpoints --mono_src /home/WwhStuGrp/yyfwwhstu16/yanmtt/dataset/pubmed/pubmed-dataset/train_fineshed.txt --monolingual_domains 1 --train_domains 1

Can you point out my mistake with your toolkit?

Thank you for your kind help!
