# My own fork of the Latent Morphology Model for Open-Vocabulary Neural Machine Translation by D. Ataman
- Install Moses
- Install Python dependencies
- Run `scripts/download-and-prepare-data.sh`
- On AWS, `pytorch==1.4.0` and `torchtext==0.2.1` get `train.py` to line 240
- Using `subword-nmt` from Sennrich's group for the BPE learning
- Going to use `moses` for tokenization/truecasing
- Always make sure `$MOSES_SCRIPTS` is set to point to the folder containing Moses' Perl scripts
- BPE for English can be learned either:
  - a) Separately for each EN-TGT pair
  - b) Jointly from all EN training data
- Make sure to set the `$LMM_REPO` environment variable to point to the repository; there is now a check for this in `preprocess.sh`
- Make sure to use Python 3.6 or earlier, since 3.7 gives an odd error message about `StopIteration`
  - Alternatively you can go dig in the source code, but it's probably easier to just use an earlier Python version
- There seem to be two versions of `Samplers.py`: `onmt.modules.Samplers` and `onmt.Samplers`
  - The former seems to be commented out
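The Python 3.7 issue noted above is PEP 479: from 3.7 onward, a `StopIteration` that escapes a generator body is converted into a `RuntimeError` instead of silently ending iteration. A minimal reproduction (the generator here is an illustration, not code from this repo):

```python
def broken_gen():
    it = iter([1, 2])
    while True:
        yield next(it)  # raises StopIteration once `it` is exhausted

# On Python <= 3.6 the stray StopIteration silently ended the generator,
# so list(broken_gen()) returned [1, 2].  From 3.7 on (PEP 479) it is
# converted to RuntimeError("generator raised StopIteration"), which is
# the "odd error message" mentioned above.
try:
    result = list(broken_gen())
except RuntimeError as exc:
    result = f"RuntimeError: {exc}"

print(result)
```

The proper fix is to end such generators with an explicit `return` instead of letting `next()`'s `StopIteration` leak out.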
- host dataset somewhere
- write `download-data.sh` to download and extract TED XML
- write tokenization/lowercasing/truecasing/BPE scripts
  - tokenization of src/tgt
  - truecasing model for each lang pair
  - BPE of English target side
- preprocess corpus into correct format
  - TED dataset
  - IWSLT dataset
- [ ] get `examples/train.sh` working
- [ ] get `examples/translate.sh` working
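The BPE items above rely on `subword-nmt`'s `learn-bpe`. Its core algorithm is to repeatedly merge the most frequent adjacent symbol pair; a toy reimplementation for illustration (this is not the script the repo calls, and the example vocabulary and merge count are made up):

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge operations from a {word: count} dict."""
    # Represent each word as a tuple of symbols, with an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        merged_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_vocab[tuple(out)] = count
        vocab = merged_vocab
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "lowest": 2}, num_merges=3)
# First merges build "low" up from single characters: ('l','o'), then ('lo','w')
```

In practice the real tool is driven from the command line (`subword-nmt learn-bpe` / `apply-bpe`), and the learned merge table is what gets shared or kept separate in the two English-BPE options above.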
This software implements the Neural Machine Translation model based on Hierarchical Character-based Decoding using Variational Inference.
## Hierarchical Decoder with Compositional Word Embeddings and Character-level Generation with Variational Inference
To activate the character-level decoder, select `-tgt_data_type characters` in the settings of `preprocess.py` and `translate.py`, and both `-decoder_type charrnn` and `-tgt_data_type characters` in `train.py`.
The feature dimensions are hardcoded to 100 for the lemma and 10 for the inflectional feature vectors; you can change this depending on your language or data size.
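The variational part of the decoder samples those latent feature vectors rather than looking them up. A minimal sketch of the reparameterization idea behind such sampling, assuming Gaussian latents purely for illustration (the actual model's distributions and layer names differ; only the 100/10 dimensions come from the text above):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    # Sample z = mu + sigma * eps with eps ~ N(0, I); in a real framework
    # this keeps the sample differentiable w.r.t. mu and log_var.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Hypothetical per-word parameters; in the model these would be predicted
# by the decoder, not fixed to zeros.
lemma_mu, lemma_log_var = np.zeros(100), np.zeros(100)  # 100-dim lemma vector
infl_mu, infl_log_var = np.zeros(10), np.zeros(10)      # 10-dim inflection vector

lemma_z = reparameterize(lemma_mu, lemma_log_var, rng)
infl_z = reparameterize(infl_mu, infl_log_var, rng)
print(lemma_z.shape, infl_z.shape)  # (100,) (10,)
```

Changing the hardcoded dimensions in the model corresponds to changing the sizes of these two vectors.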
For information about how to install and use OpenNMT-py, see its full documentation.
If you use this software, please cite:

```
@article{lmm,
  author    = {Duygu Ataman and Wilker Aziz and Alexandra Birch},
  title     = {A Latent Morphology Model for Open-Vocabulary Neural Machine Translation},
  booktitle = {Under Review as Conference Paper at ICLR},
  year      = {2019},
}
```