GithubHelp home page GithubHelp logo

markuskhoa / mtet Goto Github PK

View Code? Open in Web Editor NEW

This project forked from vietai/mtet

0.0 1.0 0.0 1.04 MB

MTet: Multi-domain Translation for English and Vietnamese

Python 98.33% Dockerfile 0.50% Makefile 0.34% CSS 0.83%

mtet's Introduction

MTet: Multi-domain Translation for English-Vietnamese

PWC

Release

Introduction

We are excited to introduce a new larger and better quality Machine Translation dataset, MTet, which stands for Multi-domain Translation for English and VieTnamese. In our new release, we extend our previous dataset (v1.0) by adding more high-quality English-Vietnamese sentence pairs on various domains. In addition, we also show our new larger Transformer models can achieve state-of-the-art results on multiple test sets.

Get data and model at Google Cloud Storage

Visit our blog post for more details.


Using the code

This code is build on top of vietai/dab:

To prepare for training, generate tfrecords from raw text files:

python t2t_datagen.py \
--data_dir=$path_to_folder_contains_vocab_file \
--tmp_dir=$path_to_folder_that_contains_training_data \
--problem=$problem

To train a Transformer model on the generated tfrecords

python t2t_trainer.py \
--data_dir=$path_to_folder_contains_vocab_file_and_tf_records \
--problem=$problem \
--hparams_set=$hparams_set \
--model=transformer \
--output_dir=$path_to_folder_to_save_checkpoints

To run inference on the trained model:

python t2t_decoder.py \
--data_dir=$path_to_folde_contains_vocab_file_and_tf_records \
--problem=$problem \
--hparams_set=$hparams_set \
--model=transformer \
--output_dir=$path_to_folder_contains_checkpoints \
--checkpoint_path=$path_to_checkpoint

In this colab, we demonstrated how to run these three phases in the context of hosting data/model on Google Cloud Storage.


Dataset

Our data contains roughly 4.2 million pairs of texts, ranging across multiple different domains such as medical publications, religious texts, engineering articles, literature, news, and poems. A more detail breakdown of our data is shown in the table below.

v1 v2 (MTet)
Fictional Books 333,189 473,306
Legal Document 1,150,266 1,134,813
Medical Publication 5,861 13,410
Movies Subtitles 250,000 721,174
Software 79,912 79,132
TED Talk 352,652 303,131
Wikipedia 645,326 1,094,248
News 18,449 18,389
Religious texts 124,389 48,927
Educational content 397,008 213,284
No tag 5,517 63,896
Total 3,362,569 4,163,710

Data sources is described in more details here.

Acknowledgment

We would like to thank Google for the support of Cloud credits and TPU quota!

mtet's People

Contributors

heraclex12 avatar justinphan3110 avatar lmthang avatar ntkchinh avatar thtrieu avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.