ymoslem / opennmt-tutorial

Neural Machine Translation (NMT) tutorial. Data preprocessing, model training, evaluation, and deployment.

License: MIT License

Jupyter Notebook 100.00%
neural-machine-translation nmt opennmt

opennmt-tutorial's Introduction

OpenNMT-py Tutorial

Neural Machine Translation (NMT) tutorial with OpenNMT-py. Data preprocessing, model training, evaluation, and deployment.

Fundamentals

Advanced Topics

  • Running TensorBoard with OpenNMT (tutorial)
  • Low-Resource Neural Machine Translation (tutorial)
  • Domain Adaptation with Mixed Fine-tuning (tutorial)
  • Overview of Domain Adaptation Techniques (tutorial)
  • Multilingual Machine Translation (tutorial)
  • Using Pre-trained NMT models with CTranslate2 (M2M-100 | NLLB-200)
  • Domain-Specific Text Generation for Machine Translation (paper | article | code)
  • Adaptive Machine Translation with Large Language Models (paper | code)
  • Fine-tuning Large Language Models for Adaptive Machine Translation (paper | code)

opennmt-tutorial's People

Contributors

ajeebkp23, pj-finlay, ymoslem


opennmt-tutorial's Issues

Training the transformer with pre-trained subword embeddings

Hi Yasmin,
I'm struggling to get decent results when training with pretrained subword embeddings (BPEmb).
Although my dataset is small, I believe it is of very good quality, so I suspect I am doing something wrong along the way.
My config file is below.
On the source side, I am using the BPEmb subword embeddings truncated to 256 dimensions, together with the original vocab file built with the same tokenizer model; I also used that tokenizer model to tokenize the entire src-train.fr training corpus and the src-val.fr validation corpus. On the target side, I built a SentencePiece Unigram subword tokenization model and vocab file.
I also tried freezing the embeddings on the encoder side, but to no avail. Can you please provide some guidance?
```yaml
# Configuration file for OpenNMT-py training and translation

# Model architecture configuration
encoder_type: transformer
decoder_type: transformer
position_encoding: true
layers: 6
hidden_size: 256
heads: 8
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.3]
attention_dropout: [0.3]
lora_dropout: 0.3

# Pretrained embeddings configuration for the source language
src_embeddings: data/fr.wiki.bpe.vs200000.d300.w2v-256.txt  # Ensure this path is correct
embeddings_type: word2vec  # Ensure this matches the format of your embeddings
word_vec_size: 256

# Optimization
optim: adam
adam_beta1: 0.9
adam_beta2: 0.998
decay_method: noam
learning_rate: 2.0
max_grad_norm: 0.0
normalization: tokens
param_init: 0.0
param_init_glorot: true
#position_encoding: false
#max_relative_positions: 20
model_dtype: "fp16"

# Batching
batch_size: 2048
batch_type: tokens
accum_count: 8
max_generator_batches: 2

# Tokenization options
#src_subword_type: bpe  # Specify the tokenization method for the source side
#tgt_subword_type: sentencepiece  # Specify the tokenization method for the target side
src_subword_model: data/fr.wiki.bpe.vs200000.model  # Path to the BPEmb model
tgt_subword_model: data/tgt_spm.model  # Path to the SentencePiece model
src_vocab: data/fr.wiki.bpe.vs200000.onmt_vocab  # Path to the source vocabulary
tgt_vocab: data/tgt_spm.onmt_vocab  # Path to the target vocabulary

# Training hyperparameters
save_model: run/model
keep_checkpoint: 20
save_checkpoint_steps: 1000
seed: -1
train_steps: 100000
valid_steps: 500
warmup_steps: 8000
report_every: 500
early_stopping: 5
early_stopping_criteria: accuracy

# TensorBoard configuration
tensorboard: true
tensorboard_log_dir: run/logs

# Error handling
on_error: raise

# Train on a single GPU
world_size: 1
gpu_ranks: [0]

# Path for saving data required by pretrained embeddings
save_data: data/processed

# Corpus opts
data:
  corpus_1:
    path_src: src-train.fr
    path_tgt: tgt-train.ty
    transforms: [normalize, sentencepiece, filtertoolong]
    weight: 1
    src_lang: fr
    tgt_lang: ty
    norm_quote_commas: true
    norm_numbers: true
  valid:
    path_src: src-val.fr
    path_tgt: tgt-val.ty
    transforms: [normalize, sentencepiece, filtertoolong]
    src_lang: fr
    tgt_lang: ty
    norm_quote_commas: true
    norm_numbers: true
```
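For reference, here is the source-side pretrained-embedding setup isolated into a minimal sketch, using the option names from the OpenNMT-py pretrained-embeddings FAQ and the paths from the config above; `freeze_word_vecs_enc` is the switch for the encoder-embedding freeze mentioned in the question.

```yaml
# Minimal sketch: source-side pretrained embeddings, isolated from the full config above.
src_embeddings: data/fr.wiki.bpe.vs200000.d300.w2v-256.txt
embeddings_type: word2vec      # must match the embedding file format (word2vec or GloVe)
word_vec_size: 256             # must equal the (truncated) embedding dimension
save_data: data/processed      # prefix under which the converted embedding tensors are stored
freeze_word_vecs_enc: true     # keep the pretrained encoder embeddings fixed during training
```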

AssertionError when training the model

Hi, I'm trying to train a model on my data. The dataset is pretty small, less than 6,000 sentences. I used the first tutorial for preprocessing, and everything worked just fine.
Now, when I try model training, I get an error:

```
[2022-11-19 10:42:07,824 WARNING] Corpus corpus_1's weight should be given. We default it to 1 for you.
[2022-11-19 10:42:07,825 INFO] Parsed 2 corpora from -data.
[2022-11-19 10:42:07,826 INFO] Get special vocabs from Transforms: {'src': set(), 'tgt': set()}.
[2022-11-19 10:42:07,887 INFO] Building model...
Traceback (most recent call last):
  File "/usr/local/bin/onmt_train", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/onmt/bin/train.py", line 65, in main
    train(opt)
  File "/usr/local/lib/python3.7/dist-packages/onmt/bin/train.py", line 50, in train
    train_process(opt, device_id=0)
  File "/usr/local/lib/python3.7/dist-packages/onmt/train_single.py", line 136, in main
    model = build_model(model_opt, opt, vocabs, checkpoint)
  File "/usr/local/lib/python3.7/dist-packages/onmt/model_builder.py", line 327, in build_model
    model = build_base_model(model_opt, vocabs, use_gpu(opt), checkpoint)
  File "/usr/local/lib/python3.7/dist-packages/onmt/model_builder.py", line 242, in build_base_model
    model = build_task_specific_model(model_opt, vocabs)
  File "/usr/local/lib/python3.7/dist-packages/onmt/model_builder.py", line 158, in build_task_specific_model
    encoder, src_emb = build_encoder_with_embeddings(model_opt, vocabs)
  File "/usr/local/lib/python3.7/dist-packages/onmt/model_builder.py", line 131, in build_encoder_with_embeddings
    encoder = build_encoder(model_opt, src_emb)
  File "/usr/local/lib/python3.7/dist-packages/onmt/model_builder.py", line 73, in build_encoder
    return str2enc[enc_type].from_opt(opt, embeddings)
  File "/usr/local/lib/python3.7/dist-packages/onmt/encoders/transformer.py", line 120, in from_opt
    add_qkvbias=opt.add_qkvbias
  File "/usr/local/lib/python3.7/dist-packages/onmt/encoders/transformer.py", line 103, in __init__
    for i in range(num_layers)])
  File "/usr/local/lib/python3.7/dist-packages/onmt/encoders/transformer.py", line 103, in <listcomp>
    for i in range(num_layers)])
  File "/usr/local/lib/python3.7/dist-packages/onmt/encoders/transformer.py", line 38, in __init__
    attn_type="self", add_qkvbias=add_qkvbias)
  File "/usr/local/lib/python3.7/dist-packages/onmt/modules/multi_headed_attn.py", line 118, in __init__
    assert model_dim % head_count == 0
AssertionError
```

I found an existing issue with the same error; the solution is supposed to involve the hyperparameters, but I checked them and can't find the problem. Could you give a hint on how to solve this? Thank you!
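For context, the assertion that fails here is the check in OpenNMT-py's multi-headed attention that the model dimension divides evenly among the attention heads. A minimal sketch of the same test, with hypothetical values:

```python
# Minimal sketch of the divisibility check behind the AssertionError.
# hidden_size and heads are hypothetical values standing in for the config.
hidden_size = 256   # model dimension (hidden_size in the config)
heads = 8           # number of attention heads

# Each head receives hidden_size // heads dimensions, so the division must be exact;
# e.g. hidden_size: 500 with heads: 8 would trip the assertion.
assert hidden_size % heads == 0, \
    f"hidden_size ({hidden_size}) must be divisible by heads ({heads})"
print("dimensions per head:", hidden_size // heads)
```

In practice this means checking which `hidden_size` and `heads` values actually end up in the parsed options, including any defaults that override the YAML, and making sure they satisfy this relation.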

ValueError: The model you are trying to convert is not supported by CTranslate2. We identified the following reasons: - Option --self_attn_type scaled-dot-flash is not supported (supported values are: scaled-dot)

I have trained a model for 1,000 steps, but when I tried to convert it to CTranslate2, the command produced this error. I checked my configuration file and found that self_attn_type was set to scaled-dot, yet the model I got is of the scaled-dot-flash type. What caused this error? If you need my configuration file, I will provide it immediately; overall, it is only slightly modified from yours. Is it possible that a code issue caused the command to fail?
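For reference, a hedged sketch of one way to inspect, and if necessary override, the attention type stored in the checkpoint before converting. It assumes the OpenNMT-py checkpoint is a torch-saved dict whose "opt" entry holds the training options that the converter reads, and the file paths are hypothetical.

```python
import torch

# Hedged sketch: inspect and override the stored self_attn_type before conversion.
# Assumes the checkpoint is a torch-saved dict with an "opt" options object,
# which is how OpenNMT-py checkpoints are typically structured; paths are hypothetical.
ckpt = torch.load("run/model_step_1000.pt", map_location="cpu")
print(ckpt["opt"].self_attn_type)            # e.g. "scaled-dot-flash"

ckpt["opt"].self_attn_type = "scaled-dot"    # the value the converter accepts
torch.save(ckpt, "run/model_step_1000_scaled_dot.pt")
```

Alternatively, upgrading CTranslate2 may help, since newer converter versions recognize the flash variant; treat both as suggestions to verify rather than confirmed fixes.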

RuntimeError: DataLoader worker (pid n) is killed by signal: Killed

On Google Colab (free version), the training stops after some time with an error like:

RuntimeError: DataLoader worker (pid 629) is killed by signal: Killed.

As verified by running `dmesg -T`, this is a RAM out-of-memory error:

Memory cgroup out of memory: Killed process 629 (onmt_train) total-vm:14119556kB, anon-rss:6538204kB, file-rss:80652kB, shmem-rss:16kB, UID:0 pgtables:13432kB oom_score_adj:0
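For reference, a minimal sketch (assuming OpenNMT-py 3.x option names) of config settings that reduce host-RAM pressure on a small Colab instance; the values are illustrative only, not recommendations from this repository.

```yaml
# Illustrative settings for limiting dataloader RAM usage (OpenNMT-py 3.x names).
bucket_size: 32768   # fewer examples buffered in memory before batching (default is much larger)
num_workers: 0       # avoid extra dataloading worker processes, each holding its own buffers
batch_size: 2048
batch_type: tokens
```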

Weighted corpora loaded so far?

(screenshot of the training log attached)
I tested this code on my own dataset of 4k parallel sentences, but during training it keeps showing "Weighted corpora loaded so far". How can I solve this problem? Thank you.

Please help me with these problems

Your output shows 59,719 rows, but mine has only 3,308. How can I solve this, or which previous step went wrong and caused this problem? Also, is it possible to complete this project on Windows?

(screenshots attached)

Error in config file

When I run the tutorial in my Colab, it shows:

Model dimension must be divisible by the number of heads

I didn't change any parameters in the config file either.

Can you please help me with it?

ValueError: invalid literal for int() with base 10: '-2.34575'

I've followed all the instructions with a corpus of around 300,000 sentences (vocab 25,000) and keep running into this issue (I have tried multiple times, same problem). I've completed all the pre-processing, model training, etc. successfully, but the library errors on a specific entry in source.vocab (below).
(screenshot of the offending vocab entry attached)

Do you have any idea how I can resolve my issue?

(screenshot attached)
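For reference, a hedged sketch for locating the offending entry. It assumes the vocab file follows the tab-separated "token<TAB>count" layout produced by `onmt_build_vocab`, where the count column must be an integer; a float such as -2.34575 usually means a line from an embeddings file or a malformed entry slipped into the vocab.

```python
# Hedged sketch: scan a vocab file for lines whose count column is not an integer.
# Assumes the "token<TAB>count" format written by onmt_build_vocab; adjust the path.
vocab_path = "source.vocab"

with open(vocab_path, encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            print(f"line {line_no}: expected 'token<TAB>count', got {line!r}")
            continue
        try:
            int(parts[1])
        except ValueError:
            print(f"line {line_no}: non-integer count {parts[1]!r} for token {parts[0]!r}")
```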

How should I train the transformer model appropriately?

My dataset uses the default sample size of 150 that you provided, and I need to train an en-zh model. What size of dataset do you think is suitable for this model? If you need more information from me, I will reply quickly. Thank you!

Fine-tuning OpenNMT models

Hi,

I want to fine-tune an OpenNMT model on an in-domain Arabic-English corpus I have. From what I understand about fine-tuning, I should take a pretrained model and resume training from a checkpoint (i.e., not train a model from scratch). So my question is: does this tutorial assume that we are training a model from scratch? And if so, how can I load an OpenNMT checkpoint and resume training from there instead?

All help is much appreciated.
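For reference, a minimal sketch (using OpenNMT-py option names, with hypothetical paths) of the part of a config that resumes training from an existing checkpoint rather than from scratch:

```yaml
# Hypothetical fragment for continuing training from a pretrained checkpoint.
train_from: pretrained/ar-en_step_100000.pt  # checkpoint to resume from
reset_optim: all        # restart the optimizer/LR schedule for the new domain
update_vocab: true      # only needed if the in-domain vocab differs from the checkpoint's
train_steps: 120000     # must exceed the checkpoint's step count, or training stops immediately
data:
  in_domain:
    path_src: data/in-domain.ar
    path_tgt: data/in-domain.en
    transforms: [sentencepiece, filtertoolong]
    weight: 1
```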
