I found that the Transformer usually gets a BLEU score of around 27-28 on WMT14 EN-DE. However,

Is the AR model for NMT tasks transformer? about dl4mt-nonauto HOT 3 CLOSED

MichaelZhouwang commented on July 20, 2024

Is the AR model for NMT tasks transformer?

Comments (3)

mansimov commented on July 20, 2024

Hi,
There are several important hyperparameters to take into account with Transformers in order to bridge the 3-4 BLEU gap:

- Larger batch sizes, obtained by training on multiple GPUs. Transformers are very sensitive to batch size, and the final BLEU score increases as you increase the batch size. We used only 1 GPU for the autoregressive models in this paper.
- Larger vocabulary size. The original paper used a 60k vocabulary, whereas we used a 40k vocabulary.
- Tricks like label smoothing and averaging multiple checkpoints. These two tricks will squeeze out an additional 1 BLEU point.
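
To make the last two tricks concrete, here is a minimal pure-Python sketch. It is not the code from this repository: the function names `smooth_targets` and `average_checkpoints`, the default `epsilon=0.1`, and the toy dictionary-of-lists checkpoint format are assumptions chosen for illustration. The smoothing shown is one common variant, which spreads `epsilon` uniformly over the non-target classes.

```python
from typing import Dict, List


def smooth_targets(target_index: int, vocab_size: int,
                   epsilon: float = 0.1) -> List[float]:
    """Return a label-smoothed target distribution over the vocabulary.

    The true token keeps probability 1 - epsilon; the remaining epsilon
    is shared equally among the other vocab_size - 1 tokens.
    """
    if vocab_size < 2:
        raise ValueError("vocab_size must be at least 2")
    off_value = epsilon / (vocab_size - 1)
    dist = [off_value] * vocab_size
    dist[target_index] = 1.0 - epsilon
    return dist


def average_checkpoints(
        checkpoints: List[Dict[str, List[float]]]) -> Dict[str, List[float]]:
    """Element-wise average of parameters that share the same name.

    Each checkpoint is a hypothetical mapping from a parameter name to a
    flat list of weights; real checkpoints would hold tensors instead.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th weight of every checkpoint together, then average.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / n for col in columns]
    return averaged
```

In practice the averaged weights would be loaded back into the model before decoding, and the smoothed distribution would replace the one-hot target in the cross-entropy loss.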

I am sure that if you take those details into account and implement them, you will bridge the gap (I did it myself in a different codebase). You can then distill this new model into our non-autoregressive model with iterative refinement and see additional gains as well.

Hope this helps

MichaelZhouwang commented on July 20, 2024

Thanks! In addition, I am wondering how much GPU memory is required to train the model. I tried on my server with one P100 and got a CUDA out-of-memory error after training for 12 iterations (12/1000 [00:09<12:48, 1.28it/s]T). Is this normal, or have I done something wrong? (I trained with -num_gpus=2 because the code does not work when I set it to one. I actually have only one GPU; is that the problem?)
Thanks again~

jaseleephd commented on July 20, 2024

Yes, you set the number of iterations during training too high (note that we used 4 in our experiments). Because the gradients are propagated across iterations (we pass the pre-softmax hidden states to the next iteration), setting train_iters too high will likely give you an out-of-memory error, even with batch parallelism over multiple GPUs.
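
As a rough back-of-the-envelope illustration of why this happens: if gradients flow through every refinement iteration, the activations of each iteration have to be kept until the backward pass, so activation memory grows roughly linearly with train_iters. The sketch below only encodes that arithmetic. The function names and all the numbers in the usage note are hypothetical, not measurements from this codebase.

```python
def estimate_training_memory_gb(train_iters: int,
                                activations_per_iter_gb: float,
                                fixed_gb: float) -> float:
    """Linear memory model: fixed cost plus one set of activations per iteration.

    fixed_gb stands for parameters, optimizer state and other overhead that
    does not depend on the number of refinement iterations (an assumption).
    """
    if train_iters < 1:
        raise ValueError("train_iters must be at least 1")
    return fixed_gb + train_iters * activations_per_iter_gb


def max_train_iters(budget_gb: float,
                    activations_per_iter_gb: float,
                    fixed_gb: float) -> int:
    """Largest train_iters that fits the memory budget under the same model."""
    if activations_per_iter_gb <= 0:
        raise ValueError("activations_per_iter_gb must be positive")
    return max(0, int((budget_gb - fixed_gb) // activations_per_iter_gb))
```

For example, with a made-up fixed cost of 3 GB and 2 GB of activations per iteration, `estimate_training_memory_gb(4, 2.0, 3.0)` gives 11 GB, and a hypothetical 16 GB budget allows at most `max_train_iters(16.0, 2.0, 3.0)` iterations under this model. Reducing the batch size lowers the per-iteration term, which is the other knob available on a single GPU.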
