
Comments (3)

mansimov commented on July 20, 2024

Hi,
There are several important hyperparameters you need to take into account with Transformers in order to bridge the 3-4 BLEU gap:

  1. Larger batch sizes, by training on multiple GPUs. Transformers are very sensitive to batch size: as you increase it, the final BLEU score increases as well. In this paper we only used 1 GPU for the autoregressive models.
  2. A larger vocabulary. The original paper used a 60k vocabulary, whereas we used 40k.
  3. Tricks like label smoothing and averaging multiple checkpoints. Together these squeeze out roughly 1 additional BLEU point.
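For concreteness, the label smoothing in point 3 replaces the one-hot training target with a softened distribution: the gold token keeps probability 1 - eps and the rest of the vocabulary shares eps. A minimal plain-Python sketch (the epsilon and the toy vocabulary are illustrative, not this paper's settings):

```python
import math

def label_smoothed_nll(log_probs, target, eps=0.1):
    """Cross-entropy against a smoothed target distribution:
    the gold token gets probability 1 - eps, and eps is spread
    uniformly over the remaining vocabulary entries."""
    k = len(log_probs)
    smooth = eps / (k - 1) if k > 1 else 0.0
    loss = 0.0
    for i, lp in enumerate(log_probs):
        q = (1.0 - eps) if i == target else smooth
        loss -= q * lp
    return loss

# Toy example: vocabulary of 4, model fairly confident in token 2.
probs = [0.05, 0.05, 0.8, 0.1]
log_probs = [math.log(p) for p in probs]
print(label_smoothed_nll(log_probs, target=2, eps=0.1))
```

With eps=0 this reduces to ordinary negative log-likelihood; with eps>0 the loss penalizes over-confident predictions, which is where the small BLEU gain typically comes from.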

I am sure that if you implement and apply these details you will bridge the gap (I did it myself in a different codebase). You can then use this stronger model to distill into our non-autoregressive model with iterative refinement and see additional gains as well.

Hope this helps
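The checkpoint averaging from point 3 is just an element-wise mean over the parameters of the last few saved checkpoints. A rough sketch, where the checkpoint structure (dicts of float lists rather than real tensors) is a simplifying assumption and not this repo's format:

```python
def average_checkpoints(state_dicts):
    """Element-wise mean of several saved parameter dicts.
    Assumes each dict maps a parameter name to a list of floats;
    real checkpoints hold tensors, but the averaging is the same."""
    n = len(state_dicts)
    avg = {}
    for name in state_dicts[0]:
        avg[name] = [sum(sd[name][i] for sd in state_dicts) / n
                     for i in range(len(state_dicts[0][name]))]
    return avg

# Averaging two toy checkpoints of a single parameter "w".
ckpts = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
print(average_checkpoints(ckpts))  # {'w': [2.0, 3.0]}
```

In practice one averages the last N checkpoints saved at fixed intervals near the end of training, which smooths out optimizer noise at evaluation time.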

from dl4mt-nonauto.

MichaelZhouwang commented on July 20, 2024

Thanks! In addition, I am wondering how much GPU memory training the model requires. I tried it on my server with one P100 and got a CUDA out-of-memory error after 12 training iterations (12/1000 [00:09<12:48, 1.28it/s]). Is that normal, or have I done something wrong? (I trained with -num_gpus=2 because the code does not work when I set it to one, but I actually only have one GPU — could that be the problem?)
Thanks again~


jaseleephd commented on July 20, 2024

Yes, you set the number of iterations during training too high (note that we used 4 in our experiments). Because gradients propagate across iterations (we pass the pre-softmax hidden states to the next iteration), setting train_iters too high will likely give you an out-of-memory error, even with batch parallelism over multiple GPUs.
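The memory growth can be seen in a toy model of the unrolled loop: backprop must retain every iteration's activations unless the gradient path between iterations is cut. This is only an illustration of the effect, not the repo's actual code; `saved` stands in for what autograd would keep alive:

```python
def refine(hidden, steps, truncate_grads=False):
    """Toy unrolled refinement loop. `saved` stands in for the
    activation tensors autograd must keep for the backward pass.
    Cutting the gradient path between iterations (truncate_grads,
    i.e. detaching the hidden state) keeps the retained set at a
    constant size instead of growing with every step."""
    saved = []
    for _ in range(steps):
        if truncate_grads:
            saved = []  # gradient path cut: nothing carried over
        hidden = [h * 0.5 + 1.0 for h in hidden]  # stand-in refinement op
        saved.append(list(hidden))  # activations kept for backprop
    return hidden, len(saved)

_, kept_full = refine([0.0] * 4, steps=1000)
_, kept_cut = refine([0.0] * 4, steps=1000, truncate_grads=True)
print(kept_full, kept_cut)  # 1000 1
```

So with train_iters=1000 the retained state is roughly 250x what it is at the 4 iterations used in the paper, which is why a single P100 runs out of memory almost immediately.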

from dl4mt-nonauto.
