Comments (3)
Hi
There are several important hyperparameters that you need to take into consideration with Transformers in order to bridge the 3-4 BLEU score gap:
- Larger batch sizes, by training on multiple GPUs. Transformers are very sensitive to batch size, and as you increase the batch size the final BLEU score increases as well. We only used 1 GPU in this paper for the autoregressive models.
- Larger vocabulary size. The original paper used a 60k vocabulary, whereas we used a 40k vocabulary.
- Tricks like label smoothing and averaging multiple checkpoints. These two tricks will squeeze out an additional 1 BLEU point.
I am sure that if you take those details into account and implement them, you will bridge the gap (I did it myself in a different codebase). You can then use this new model to distill into our non-autoregressive model with iterative refinement and see additional gains as well.
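For readers unfamiliar with the two tricks in the last bullet, here is a minimal pure-Python sketch of what they compute. It is not the code from this repository: the function names, the smoothing value epsilon=0.1, the choice to spread the smoothing mass uniformly over the non-gold tokens, and the representation of a checkpoint as a dict of flat float lists are all assumptions made for illustration.

```python
import math


def smoothed_targets(gold_index, vocab_size, epsilon=0.1):
    """Label-smoothed target distribution for one token.

    Assumption: the gold token gets 1 - epsilon and the remaining
    epsilon is spread uniformly over the other vocab_size - 1 tokens.
    """
    off_value = epsilon / (vocab_size - 1)
    targets = [off_value] * vocab_size
    targets[gold_index] = 1.0 - epsilon
    return targets


def label_smoothed_nll(log_probs, gold_index, epsilon=0.1):
    """Cross-entropy between the smoothed targets and the model log-probs."""
    targets = smoothed_targets(gold_index, len(log_probs), epsilon)
    return -sum(t * lp for t, lp in zip(targets, log_probs))


def average_checkpoints(checkpoints):
    """Element-wise average of parameters across several checkpoints.

    Hypothetical format: each checkpoint is a dict mapping a parameter
    name to a flat list of floats, with identical keys and lengths.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / n for col in columns]
    return averaged


if __name__ == "__main__":
    uniform = [math.log(0.25)] * 4
    print(label_smoothed_nll(uniform, gold_index=0))
    print(average_checkpoints([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]))
```

In a real training setup the same idea would be applied to the framework's tensors and saved state dicts, typically averaging the last few checkpoints of a single run.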
Hope this helps
from dl4mt-nonauto.
Thanks! In addition, I am wondering how much GPU memory is required to train the model. I tried on my server with one P100 and got a CUDA out-of-memory error after training for 12 iterations (12/1000 [00:09<12:48, 1.28it/s]). Is this normal, or have I done something wrong? (I trained with -num_gpus=2 because the code does not work when I set it to one. I actually have only one GPU; is that the problem?)
Thanks again~
Yes, you set the number of iterations during training too high (note that we used 4 in our experiments). Because the gradients are propagated across iterations (we pass the pre-softmax hidden states to the next iteration), setting train_iters too high will likely give you an out-of-memory error, even with batch parallelism over multiple GPUs.
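As a rough illustration of why this happens: when gradients flow across iterations, the activations of every iteration have to be kept for the backward pass, so the retained memory grows about linearly with train_iters. The sketch below is only a back-of-envelope estimate. The function name, the cost model, and all example sizes are hypothetical and are not measured from this codebase.

```python
def retained_activation_bytes(tokens_per_batch, d_model, num_layers,
                              train_iters, saved_tensors_per_layer=1,
                              bytes_per_value=4):
    """Crude estimate of activation memory kept alive for backprop.

    Assumption: each iteration stores `saved_tensors_per_layer` tensors
    of shape (tokens_per_batch, d_model) per layer, and nothing is freed
    until the backward pass because gradients cross iteration boundaries.
    """
    per_iteration = (tokens_per_batch * d_model * num_layers
                     * saved_tensors_per_layer * bytes_per_value)
    return per_iteration * train_iters


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the scaling.
    for iters in (4, 12):
        size = retained_activation_bytes(2048, 512, 6, iters)
        print(iters, "iterations:", size / 2**20, "MiB")
```

Under this model, going from 4 to 12 iterations triples the retained activations, and splitting the batch across GPUs only divides tokens_per_batch without removing the factor of train_iters.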