This project is forked from lkfo415579/norm-nmt

Norm-Based Curriculum Learning for Neural Machine Translation (ACL 2020)

License: Other

This repo contains the source code and instructions to reproduce the results of our paper.

Reference:

@inproceedings{NORMCL20,
  title={Norm-Based Curriculum Learning for Neural Machine Translation},
  author={Liu, Xuebo and Lai, Houtim and Wong, Derek F. and Chao, Lidia S.},
  booktitle={ACL 2020},
  year={2020}
}

INSTALLATION

Requirements

* [Boost] 1.64.0
* [CMake] 3.13.2
* [CUDA] 8.0
* [fastText] https://github.com/facebookresearch/fastText
* See the Marian documentation for more installation details: https://marian-nmt.github.io/

Clone this repository and build

git clone https://github.com/NLP2CT/norm-nmt
cd norm-nmt
mkdir build
cd build
cmake ..
make -j

Training and Testing

Train fastText to obtain the norm-based sentence difficulty

# Install Fasttext
mkdir ~/fast && cd ~/fast
wget https://github.com/facebookresearch/fastText/archive/v0.9.1.zip
unzip v0.9.1.zip
cd fastText-0.9.1
make -j
cp fasttext ../
# Return to the directory that contains CL_tools before running the steps below
# Use tokenized data for fastText
mkdir -p cl/mod
cat $TRAIN.$SRCL | $MARIAN_VOCAB > cl/vocab.$SRCL.yml
python CL_tools/process_fasttext.py -i $TRAIN.$SRCL -o $SRCL.emb -v cl/vocab.$SRCL.yml -w ~/fast/fasttext
python CL_tools/build_cdf_mod.py --emb_vector $SRCL.emb.orig.vec $TRAIN.$SRCL cl/mod/$SRCL-mod
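
To make the idea behind this step concrete, here is a minimal Python sketch of norm-based sentence difficulty. It assumes, following the paper's description, that a sentence's difficulty is the norm of the sum of its word vectors, mapped to [0, 1] by an empirical CDF over the training corpus. The toy vectors and the helper names `sentence_norm` and `empirical_cdf` are hypothetical; this is not the code in `CL_tools/build_cdf_mod.py`, which may differ in detail.

```python
import math
from bisect import bisect_right

# Toy 2-dimensional word vectors, made up for illustration.
# In the real pipeline the vectors come from the trained fastText model.
WORD_VECS = {
    "the": (0.1, 0.0),
    "cat": (0.6, 0.8),
    "sat": (0.3, 0.4),
}


def sentence_norm(tokens, vecs=WORD_VECS):
    """L2 norm of the sum of the word vectors; unknown words contribute zero."""
    dim = len(next(iter(vecs.values())))
    total = [0.0] * dim
    for tok in tokens:
        for i, x in enumerate(vecs.get(tok, (0.0,) * dim)):
            total[i] += x
    return math.sqrt(sum(x * x for x in total))


def empirical_cdf(norms):
    """Map each norm to the fraction of sentences whose norm is <= it."""
    ordered = sorted(norms)
    n = len(ordered)
    return [bisect_right(ordered, v) / n for v in norms]


corpus = [["the"], ["the", "sat"], ["the", "cat", "sat"]]
norms = [sentence_norm(s) for s in corpus]
difficulty = empirical_cdf(norms)  # values in (0, 1], larger means harder
```

Under these assumptions, a sentence with a small embedding norm gets a CDF value near 0 and would be presented early in the curriculum, while sentences with large norms are held back until later.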

Translation

# Training (8 GPUs)
$MARIAN_TRAIN \
        --model $MODEL_DIR/model_revo.npz --type transformer \
        --train-sets $TRAIN.$SRCL $TRAIN.$TGTL \
        --max-length 140 \
        --vocabs $MODEL_DIR/vocab.$SRCL.yml $MODEL_DIR/vocab.$TGTL.yml \
        --mini-batch-fit -w 9250 --maxi-batch 5000 \
        --early-stopping 10 --cost-type=ce-mean-words \
        --valid-freq 2500 --save-freq 2500 --disp-freq 1 \
        --valid-metrics ce-mean-words perplexity translation \
        --valid-sets $CORPUS_DIR/$VALID.$SRCL $CORPUS_DIR/$VALID.$TGTL \
        --valid-script-path "bash ./validate-$SRCL-$TGTL.sh" \
        --valid-translation-output $OUTPUT_DIR/$ID.tf.$SRCL$TGTL.single --quiet-translation \
        --valid-mini-batch 64 \
        --beam-size 6 --normalize 0.6 \
        --log $MODEL_DIR/train.log --valid-log $MODEL_DIR/valid.log \
        --enc-depth 6 --dec-depth 6 \
        --transformer-heads 8 \
        --transformer-postprocess-emb d \
        --transformer-postprocess dan \
        --transformer-dropout 0.1 --label-smoothing 0.1 \
        --learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report \
        --optimizer-params 0.9 0.98 1e-09 --clip-norm 5 \
        --devices $GPUS --sync-sgd --seed $ID$ID$ID$ID --keep-best --overwrite \
        --exponential-smoothing --no-nccl --filter_corpus 0.85 \
        --sr-freq-file cl/mod/en-mod.txt cl/mod/en-mod-cdf_base.npz 2.5 0.01 mod d 0.5 \
        --after-batches 100000

# Evaluation
./decode_validate.sh $MODEL_DIR > $MODEL_DIR/result

# For more details, see runner/run.sh

# Competence Parameters
# NBCL : --sr-freq-file cl/mod/en-mod.txt cl/mod/en-mod-cdf_base.npz 2.5 0.01 mod [d] [0.5]
# params : word_stat_file CDF_file [MOD: ratio] percentage_of_starting_corpus(c0) mode(mod) [dynamic_weight] [dynamic_ratio]
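
As a reading aid for the positional arguments above, the following Python sketch pairs the values from the example command with the names given in the `params` line. The `CompetenceArgs` class and the `parse_sr_freq_file` helper are hypothetical and exist only for illustration; the actual parsing happens in the C++ code (see `src/data/competence.h`), and the field types shown here are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class CompetenceArgs:
    """Named view of the positional values passed to --sr-freq-file (illustrative only)."""
    word_stat_file: str
    cdf_file: str
    ratio: float                          # "[MOD: ratio]" in the params line
    c0: float                             # percentage_of_starting_corpus
    mode: str                             # "mod" in the example command
    dynamic_weight: Optional[str] = None  # optional, "d" in the example
    dynamic_ratio: Optional[float] = None # optional, 0.5 in the example


def parse_sr_freq_file(values: Sequence[str]) -> CompetenceArgs:
    """Assign names to 5 to 7 positional values, in the order of the params line."""
    if not 5 <= len(values) <= 7:
        raise ValueError("expected 5 to 7 values, got %d" % len(values))
    args = CompetenceArgs(
        word_stat_file=values[0],
        cdf_file=values[1],
        ratio=float(values[2]),
        c0=float(values[3]),
        mode=values[4],
    )
    if len(values) > 5:
        args.dynamic_weight = values[5]
    if len(values) > 6:
        args.dynamic_ratio = float(values[6])
    return args


example = "cl/mod/en-mod.txt cl/mod/en-mod-cdf_base.npz 2.5 0.01 mod d 0.5".split()
args = parse_sr_freq_file(example)
```

Read this way, the example command starts training with `c0 = 0.01` of the corpus and uses a ratio of 2.5 in `mod` mode, with the two optional dynamic settings `d` and `0.5` enabled.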

Procedure

All tools can be found in the runner folder.

  1. Compile the Marian source code (see the Marian repository for details).
  2. Install fastText.
  3. Prepare your corpus data.
  4. Use the NBCL tools to build the word_stat and CDF files.
  5. Train the model.

Main modified source files:

src/data/competence.h
src/data/gap_training.h
src/data/batch_generator.h

Many thanks to the Marian community

This project is based on a codebase forked from Marian (version of 13 Dec 2018).

Contributors

sunbowliu
