This project forked from tnq177/transformers_without_tears

Transformers without Tears: Improving the Normalization of Self-Attention

Ace: An implementation of the Transformer in PyTorch

Toan Q. Nguyen, University of Notre Dame

This is a re-implementation of the paper Transformers without Tears: Improving the Normalization of Self-Attention.

While the code was initially developed to try out multilingual NMT, all experiments mentioned in the paper and in this guide are bilingual only. For the multilingual parts of the code, I initially followed XLM and added the following:

  • language embedding: each language has an embedding vector which is added to the input word embeddings, similar to positional embeddings
  • oversampling data before BPE: we sample sentences from each language based on some weighted probability to ensure that rarer languages are well represented, so their text won't be broken into very short BPE segments. See their paper for more information. My own implementation is in preprocessing.py

If we train a bilingual model only, adding language embeddings and oversampling data won't make any difference (according to my early experiments). However, I keep them in the code since they might be useful later.
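As a rough illustration of the oversampling idea, the sketch below computes exponent-smoothed sampling probabilities in the style of the XLM paper: each language's share of the data is raised to a power alpha < 1 and renormalized, which boosts rarer languages. The function names, the alpha value, and sampling with replacement are assumptions for illustration; the weighting actually used in preprocessing.py may differ.

```python
import random


def sampling_probs(counts, alpha=0.5):
    """Map {lang: sentence count} to smoothed sampling probabilities.

    alpha=1.0 keeps the raw proportions; smaller alpha favors rare languages.
    """
    total = sum(counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


def sample_sentences(sentences_by_lang, num_samples, alpha=0.5, seed=0):
    """Draw sentences (with replacement, an assumption) for learning joint BPE."""
    rng = random.Random(seed)
    counts = {lang: len(sents) for lang, sents in sentences_by_lang.items()}
    probs = sampling_probs(counts, alpha)
    langs = list(probs)
    picks = rng.choices(langs, weights=[probs[l] for l in langs], k=num_samples)
    return [rng.choice(sentences_by_lang[lang]) for lang in picks]
```

With 900 English and 100 Vietnamese sentences and alpha=0.5, the Vietnamese share of the sample rises from 10% to 25%.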

This code has been tested with only Python 3.6 and PyTorch 1.1.

Input and Preprocessing

Under a data directory, for each language pair of src_lang and tgt_lang, create a folder named src_lang2tgt_lang which has the following files:

train.src_lang   train.tgt_lang
dev.src_lang     dev.tgt_lang
test.src_lang    test.tgt_lang

The files should be tokenized. My rule of thumb for data preprocessing:

  • tokenize data
  • filter out sentences longer than 200-250 words
  • learn BPE
  • apply BPE
  • don't filter again
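The length filter in the list above can be sketched as follows, assuming the data is already tokenized so that whitespace splitting gives the word count. The function name and the default limit of 250 are illustrative choices, not part of this repository.

```python
def filter_by_length(src_lines, tgt_lines, max_len=250):
    """Keep sentence pairs where both sides have between 1 and max_len tokens."""
    kept_src, kept_tgt = [], []
    for src, tgt in zip(src_lines, tgt_lines):
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Dropping empty lines is an extra assumption, not part of the rule above.
        if 0 < n_src <= max_len and 0 < n_tgt <= max_len:
            kept_src.append(src)
            kept_tgt.append(tgt)
    return kept_src, kept_tgt
```

Filtering both sides together keeps the source and target files aligned line by line.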

Transformer is known to not generalize well to sentences longer than what it's seen (see this), so we need long sentences during training. We don't have to worry about OOM because we always batch by number of tokens. Even a really long sentence of 250 words won't have more than 2048/4096 BPE tokens.
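To make the point about batching by number of tokens concrete, here is a minimal sketch of a greedy token-budget batcher. It is not the batching code in this repository: the function name, sorting by length, and counting only real (unpadded) tokens are assumptions.

```python
def batch_by_tokens(lengths, batch_size):
    """Group sentence pairs so each batch holds at most batch_size src+tgt tokens.

    lengths: list of (src_len, tgt_len) tuples, one per sentence pair.
    Returns a list of batches, each a list of indices into lengths.
    """
    # Sort by the longer side so that similar lengths end up together.
    order = sorted(range(len(lengths)), key=lambda i: max(lengths[i]))
    batches, current, tokens = [], [], 0
    for i in order:
        n = sum(lengths[i])
        # Close the current batch if adding this pair would exceed the budget.
        if current and tokens + n > batch_size:
            batches.append(current)
            current, tokens = [], 0
        current.append(i)
        tokens += n
    if current:
        batches.append(current)
    return batches
```

A pair that alone exceeds the budget still gets its own batch, so a long sentence lowers the number of sentences per batch rather than raising memory use without bound.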

After that, install fastBPE. Then run:

python3 preprocessing.py --data-dir data --num-ops number_of_bpe_ops --pairs src_lang2tgt_lang --fast path_to_fastbpe_binary

This will:

  • sample sentences from train.{src_lang, tgt_lang} to a joint text file
  • learn BPE from that
  • BPE-encode the rest of the files
  • create vocabularies
  • convert data into ids and save as .npy files
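The last two steps (creating vocabularies and converting data into ids) might look roughly like the sketch below. The special tokens, their order, and the function names are hypothetical and are not taken from preprocessing.py; writing the ids to .npy files is left out to keep the example dependency-free.

```python
from collections import Counter

# Hypothetical special tokens; the repository may use different ones.
SPECIALS = ["<pad>", "<unk>", "<bos>", "<eos>"]


def build_vocab(bpe_lines):
    """Map each BPE token to an integer id, most frequent tokens first."""
    counts = Counter(tok for line in bpe_lines for tok in line.split())
    # Ties are broken alphabetically so the mapping is deterministic.
    tokens = sorted(counts, key=lambda t: (-counts[t], t))
    return {tok: idx for idx, tok in enumerate(SPECIALS + tokens)}


def encode(line, vocab):
    """Convert one BPE-encoded line to ids, wrapped in <bos> ... <eos>."""
    unk = vocab["<unk>"]
    ids = [vocab.get(tok, unk) for tok in line.split()]
    return [vocab["<bos>"]] + ids + [vocab["<eos>"]]
```

Tokens that never appeared when the vocabulary was built fall back to the unknown id.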

For example, if we're training an English-Vietnamese model (en2vi) using 8000 BPE operations, then the resultant directory looks like this:

data
├── en2vi
│   ├── dev.en
│   ├── dev.en.bpe
│   ├── dev.en.npy
│   ├── dev.vi
│   ├── dev.vi.bpe
│   ├── dev.vi.npy
│   ├── test.en
│   ├── test.en.bpe
│   ├── test.vi
│   ├── test.vi.bpe
│   ├── train.en
│   ├── train.en.bpe
│   ├── train.en.npy
│   ├── train.vi
│   ├── train.vi.bpe
│   └── train.vi.npy
├── joint_all.txt
├── joint.bpe
├── lang.vocab
├── mask.en.npy
├── mask.vi.npy
├── subjoint_en.txt
├── subjoint_en.txt.8000
├── subjoint_en.txt.8000.vocab
├── subjoint_vi.txt
├── subjoint_vi.txt.8000
├── subjoint_vi.txt.8000.vocab
└── vocab.joint

Usage

To train a new model:

  • write a new configuration function in configurations.py
  • run python3 main.py --mode train --data-dir ./data --dump-dir ./dump --pairs src_lang2tgt_lang --config config_name

Note that I separate the two configs:

  • hyperparameters/training options: in configurations.py
  • which pairs we are training on, whether we are training or translating, etc.: see main.py

Training is logged in dump/DEBUG.log. During training, the model is validated on the dev set, and the best checkpoint is saved to dump/model-SCORE.pth (also dump/src_lang2tgt_lang-SCORE.pth; they are the same). All best/beam translations, final training stats (train/dev perplexities), etc. are stored in dump as well.

To translate using a checkpoint, run:

python3 main.py --mode translate --data-dir ./data --dump-dir ./dump --pairs src_lang2tgt_lang --files-langs data/src_lang2tgt_lang/temp,src_lang,tgt_lang --config src_lang2tgt_lang --model-file dump/src_lang2tgt_lang-SCORE.pth

Options

Many options in configurations.py are pretty important:

  • use_bias: if set to False, no linear layer uses a bias term. Defaults to True, which uses bias.
  • fix_norm: fix the word embedding norm to 1 by dividing each word embedding vector by its l2 norm (Improving Lexical Choice in Neural Machine Translation)
  • scnorm: the ScaleNorm in our paper. This replaces Layer Normalization with a scaled l2-normalization layer which first normalizes the input to norm 1 (divides the vector by its l2 norm) and then scales it up by a single, learnable parameter. See ScaleNorm in layers.py
  • mask_logit: if set to True, for each target language, we set the logits of types that are not in that language's vocabulary to -inf (so after softmax, those probs become 0). The idea is, say src and tgt each have 8000 types in their vocabs, but only 1000 are shared; then we should not predict the other 7000 types that appear only in the source.
  • pre_act: if True, do PreNorm (normalization->sublayer->residual-add), else do PostNorm (sublayer->residual-add->normalization). See the paper for more discussion and related works.
  • clip_grad: gradient clipping value. 1.0 works well.
  • warmup_steps: number of warmup steps if we do warmup
  • lr_scheduler: if ORG_WU (see all_constants.py), we follow the warmup-cooldown schedule in the original paper. If NO_WU, we use a constant learning rate lr which is then decayed whenever development BLEU has not improved over patience evaluations. If UPFLAT_WU then we do warmup, but then stay at the peak learning rate and decay like NO_WU.
  • lr_scale: multiply learning rate by this value
  • lr_decay: decay factor (new_lr <-- old_lr * lr_decay)
  • stop_lr: stop training when learning rate reaches this value
  • label_smoothing: defaults to 0.1 as in the original paper
  • batch_size: number of src+tgt tokens in a batch
  • epoch_size: number of iterations we consider one epoch
  • max_epochs: maximum number of epochs we train for
  • dropout: sublayer output's dropout rate
  • {att,ff,word}_dropout: dropout rate for attention layer, feedforward and word-dropout. For word-dropout, we replace with UNK instead of zero-ing embeddings. I find word-dropout useful for training low-resource, bilingual models.
  • beam_{size, alpha}: Beam size and length normalization using Wu et al.'s magic formula
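Several of the options above reduce to short formulas. The following framework-free sketch uses plain Python lists instead of PyTorch tensors, so it is not the code in layers.py; the function names and the eps floor are assumptions made for illustration. The length penalty is the formula from Wu et al. (2016), which is assumed here to be what beam_alpha controls.

```python
import math


def l2_norm(vec, eps=1e-5):
    """Euclidean norm, floored at eps to avoid division by zero."""
    return max(math.sqrt(sum(v * v for v in vec)), eps)


def fix_norm(embedding):
    """fix_norm: rescale a word embedding to unit l2 norm."""
    n = l2_norm(embedding)
    return [v / n for v in embedding]


def scale_norm(vec, g):
    """scnorm: normalize to unit l2 norm, then multiply by a single scalar g."""
    n = l2_norm(vec)
    return [g * v / n for v in vec]


def pre_norm(x, sublayer, norm):
    """pre_act=True: normalization -> sublayer -> residual add."""
    return [a + b for a, b in zip(x, sublayer(norm(x)))]


def post_norm(x, sublayer, norm):
    """pre_act=False: sublayer -> residual add -> normalization."""
    return norm([a + b for a, b in zip(x, sublayer(x))])


def masked_softmax(logits, allowed):
    """mask_logit: disallowed types get -inf, hence probability 0.

    At least one entry of allowed must be True.
    """
    masked = [l if ok else float("-inf") for l, ok in zip(logits, allowed)]
    top = max(masked)
    exps = [math.exp(l - top) for l in masked]
    total = sum(exps)
    return [e / total for e in exps]


def length_penalty(length, alpha):
    """Wu et al. (2016) length normalization: ((5 + |Y|) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha
```

In the paper's setting g is a learned parameter; here it is passed in as a plain number so the arithmetic is easy to follow.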

Some other notes

Because this is my re-implementation from memory, there are many details I have forgotten. I just want to clarify the following:

  • The IWSLT English-Vietnamese dataset is from paper, data. The other IWSLT datasets are from paper, data. I don't remember what length limit I used to filter those datasets, but it must be approx. 200-250.
  • This code doesn't implement the fixed g=sqrt(d) experiments in table 7. One can try those experiments by simply editing the ScaleNorm class to take a trainable bool param which determines whether g is learned or fixed. Then for all normalization layers, set that to False (so we always use g=sqrt(d)), except for the final output from the decoder, because we want it to scale up to widen the logit range (and sharpen the softmax).
  • In the paper, we use early stopping (stop training if dev BLEU has not improved over 20-50 epochs). This code doesn't do that since all kinds of early stopping heuristics can sometimes hurt performance. I suggest just training until the learning rate gets too small or max_epochs is reached.
  • The original implementation shuffles the whole training dataset every epoch, then re-generates batches. After reading fairseq, I changed it to generate batches once at the start and then reuse them (while still shuffling their order).
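The suggestion above to train until the learning rate gets too small ties together the lr_decay, patience and stop_lr options. Below is a minimal sketch of that logic, plus the warmup schedule from Vaswani et al. (2017), which is assumed to be what ORG_WU refers to. The class and function names are hypothetical, and details such as whether a tie counts as an improvement are guesses rather than the behavior of this code.

```python
def org_wu_lr(step, d_model, warmup_steps):
    """Linear warmup followed by inverse-square-root decay (Vaswani et al., 2017)."""
    step = max(step, 1)  # step 0 would divide by zero
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


class PlateauDecay:
    """Constant learning rate, decayed when dev BLEU stalls (NO_WU-like)."""

    def __init__(self, lr, lr_decay, patience, stop_lr):
        self.lr = lr
        self.lr_decay = lr_decay
        self.patience = patience
        self.stop_lr = stop_lr
        self.best = float("-inf")
        self.bad_evals = 0

    def step(self, dev_bleu):
        """Record one evaluation; return True when training should stop."""
        if dev_bleu > self.best:
            self.best = dev_bleu
            self.bad_evals = 0
        else:
            self.bad_evals += 1
            if self.bad_evals >= self.patience:
                # new_lr <-- old_lr * lr_decay
                self.lr *= self.lr_decay
                self.bad_evals = 0
        return self.lr <= self.stop_lr
```

A training loop would call step once per evaluation and stop when it returns True or when max_epochs is reached, whichever comes first.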

If there are any questions, feel free to send me an email (email address in the paper).

References

Parts of code/scripts are inspired/borrowed from:

The art is from here
