GithubHelp home page GithubHelp logo

yiyiwang515 / styleap Goto Github PK

View Code? Open in Web Editor NEW

This project forked from ivanwang0730/styleap

0.0 0.0 0.0 405 KB

Code and Data for Paper "Controlling Styles in Neural Machine Translation with Activation Prompt" (ACL 2023 Findings)

Shell 0.06% Python 99.94%

styleap's Introduction

Controlling Styles in Neural Machine Translation with Activation Prompt

Code and Data for Paper Controlling Styles in Neural Machine Translation with Activation Prompt, this paper proposes 1) a dataset multiway stylized machine translation (MSMT) benchmark, including four language directions with diverse language styles. 2) a method named style activation prompt (StyleAP) method to avoid re-tuning time after time. Through automatic evaluation and human evaluation, our method achieves a re-markable improvement over baselines and other methods. A series of analysis also show the advantages of our method.

Requirements

NOTE: At the very beginning, install NeurST from source:

git clone https://github.com/IvanWang0730/StyleAP.git
cd neurst/
pip3 install -e .

If there exists ImportError during running, manually install the required packages at that time.

Quick Start

Datasets

Our experiments are implemented on MSMT with four language directions, i.e., en-zh, zh-en, en-ko, and en-pt. You can download the raw dataset used in our paper on Google Drive. It includes training, evaluation and innovated test sets as is depicted in our paper.

Multi-way Stylized Machine Translation(MSMT) Benchmark

en-zh zh-en en-ko en-pt
Styles Modern / Classical Modern / Early Honorific / Non-hono Eurpean / Brazilian
Monolingual 22M / 967K 22M / 83.2K 20.5K / 20.5K 168K / 234K
Parallel 9.12M 9.12M 271K 412K
Development 1,997 2,000 879 890
Test 1,200 1,182 1,191 857
  • Classical and Modern. Classical Chinese originated from thousands of years ago and was used in ancient China. Modern Chinese is the normal Chinese that is commonly used currently.
  • Early Modern and Modern. Early Modern English in this paper refers to English used in the Renaissance such as Shakespearean plays. Modern English is the normal English that is commonly used currently.
  • Honorific and Non-honorific. There are seven verb paradigms or levels of verbs in Korean, each with its own unique set of verb endings used to denote the formality of a situation. We simplify the classification and roughly divide them into two groups.
  • European and Brazilian. European Portuguese is mostly used in Portugal. Brazilian Portuguese is mostly used in Brazil.

Create Prompt-based Data

We take en2zh task as an example to show how it works.

# generate faiss index
bash scripts/generate_index.sh 0 wmt2021_en_zh.en trained_en_zh.index
# search nearest samples via index
bash scripts/search_index.sh 0 wmt2021_en_zh.en trained_en_zh.index 

In this instance, we use wmt2021_en_zh.en in the default file directory ./MSMT/ to train a faiss index on a single GPU 0 and the same file to search the nearest monolingual sentences via the above trained index. note: You may use the scripts/split_parallel_sentence.sh to obtain monolingual sentence files.

You can quickly prepossess the training data like this. Besides, check sacremoses and subword-nmt for other setting details.

# tokenize source and target sentences
sacremoses -l {src_lang} -j 4 tokenize  < {src_text} > {src_text}.tok
sacremoses -l {trg_lang} -j 4 tokenize  < {trg_text} > {trg_text}.tok
# learn bpe subword
subword-nmt learn-joint-bpe-and-vocab --input {train_file}.L1 {train_file}.L2 -s {num_operations} -o {codes_file} --write-vocabulary {vocab_file}.L1 {vocab_file}.L2

Training & Validating

We can directly use the yaml-style configuration files to train and evaluate a transformer model on neurst.

python3 -m neurst.cli.run_exp \
    --config_paths configs/training_args.yml,configs/translation_bpe.yml,configs/validation_args.yml \
    --hparams_set transformer_base \
    --model_dir /models/benchmark_base

where /models/benchmark_base is the root path for checkpoints. Here we use --hparams_set transformer_base to train a transformer model including 6 encoder layers and 6 decoder layers with dmodel=512.

Evaluation on Testset

By running with

python3 -m neurst.cli.run_exp \
    --config_paths configs/prediction_args.yml \
    --model_dir configs/benchmark_base/best_avg

BLEU scores will be reported on MSMT testset.

Citation

@article{2022-styleAP,
    title = "Controlling Styles in Neural Machine Translation with Activation Prompt",
    author = "Yifan Wang and Zewei Sun and Shanbo Cheng and Weiguo Zheng and Mingxuan Wang",
    year = "2022",
    journal = "arXiv preprint arXiv:2212.08909",
    url = "https://arxiv.org/abs/2212.08909"
}

Please kindly cite our paper if this paper and the codes are helpful.

Thanks

Many thanks to the GitHub repositories of Transformers, Neurst and Faiss. Part of our codes are modified based on their codes.

styleap's People

Contributors

ivanwang0730 avatar sunzewei2715 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.