This project forked from kenantang/petci

PETCI: A Parallel English Translation Dataset of Chinese Idioms

License: Apache License 2.0

PETCI is a Parallel English Translation dataset of Chinese Idioms, collected from an idiom dictionary and from Google and DeepL machine translation. PETCI contains 4,310 Chinese idioms with 29,936 English translations. These translations capture diverse translation errors and paraphrase strategies.

We provide several baseline models to facilitate future research on this dataset.

Data

The Chinese idioms and their translations are in the ./data/json/raw.json file. Here is one example:

{
    "id": 0,
    "chinese": "一波未平,一波又起",
    "book": [
        "suffer a string of reverses",
        "hardly has one wave subsided when another rises",
        "one trouble follows another"
    ],
    "google": [
        "One wave is not flat, another wave is rising"
    ],
    "deepl": [
        "before the first wave subsides, a new wave rises"
    ]
}
  • id is the index of the idiom in the dictionary
  • chinese is the Chinese idiom
  • book lists the translations from the dictionary
  • google lists the translations from Google
  • deepl lists the translations from DeepL
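
As a minimal sketch of reading this format, the Python snippet below counts translations per source. It assumes that raw.json is a JSON array of records shaped like the example above; the helper names load_idioms and count_translations are illustrative and not part of the repository.

```python
import json

SOURCES = ("book", "google", "deepl")


def load_idioms(path="./data/json/raw.json"):
    """Read the idiom records (assumed to be a JSON array) from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_translations(records):
    """Count how many translations each source contributes."""
    counts = {source: 0 for source in SOURCES}
    for record in records:
        for source in SOURCES:
            # .get() tolerates a record that lacks one of the source fields.
            counts[source] += len(record.get(source, []))
    return counts


# The example record from above, used here instead of the real file.
sample = {
    "id": 0,
    "chinese": "一波未平,一波又起",
    "book": [
        "suffer a string of reverses",
        "hardly has one wave subsided when another rises",
        "one trouble follows another",
    ],
    "google": ["One wave is not flat, another wave is rising"],
    "deepl": ["before the first wave subsides, a new wave rises"],
}

print(count_translations([sample]))  # {'book': 3, 'google': 1, 'deepl': 1}
```

To run the same count on the full dataset, pass the result of load_idioms() to count_translations.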

The ./data/json/filtered.json file removes the machine translations that are identical to a dictionary translation, and splits the dictionary translations into gold and human translations.
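
The deduplication step can be sketched in Python as follows. This is an illustration only: it assumes exact string comparison, whereas the repository's own filtering may normalize text differently, and the function name is hypothetical. The gold/human split is not shown because its criterion is not described here.

```python
def drop_duplicate_machine_translations(record):
    """Return a copy of the record without machine translations
    that exactly repeat one of the dictionary translations."""
    book = set(record["book"])
    filtered = dict(record)  # shallow copy; the input record is left unchanged
    for source in ("google", "deepl"):
        filtered[source] = [t for t in record.get(source, []) if t not in book]
    return filtered


# Hypothetical record in which Google reproduces a dictionary translation.
record = {
    "id": 0,
    "chinese": "一波未平,一波又起",
    "book": ["one trouble follows another"],
    "google": ["one trouble follows another"],
    "deepl": ["before the first wave subsides, a new wave rises"],
}

cleaned = drop_duplicate_machine_translations(record)
print(cleaned["google"])  # []
print(cleaned["deepl"])   # ['before the first wave subsides, a new wave rises']
```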

Training and Testing

Prerequisites

Run pip install -r ./models/requirements.txt to install the required packages. Download glove.840B.300d.txt and put it in ./data/embedding. Download CoreNLP.
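
To check that the embedding file is in place and readable, a small Python sketch like the one below can help. It assumes the usual GloVe text format (a token followed by space-separated floats on each line) and is not the loader used by the models in this repository; the function names are illustrative. The vector is taken from the end of the line because some entries in the 840B file are reported to contain spaces inside the token.

```python
def parse_glove_line(line, dim=300):
    """Split one line of a GloVe text file into (token, vector)."""
    parts = line.rstrip("\n").split(" ")
    # Take the last `dim` fields as the vector so that tokens
    # containing spaces are kept whole.
    token = " ".join(parts[:-dim])
    vector = [float(x) for x in parts[-dim:]]
    return token, vector


def load_glove(path, vocab, dim=300):
    """Load vectors only for the tokens in `vocab` to limit memory use."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            token, vector = parse_glove_line(line, dim)
            if token in vocab:
                vectors[token] = vector
    return vectors


# Tiny 3-dimensional lines stand in for real 300-dimensional entries.
print(parse_glove_line("wave 0.1 0.2 0.3\n", dim=3))
print(parse_glove_line(". . . 1 2 3\n", dim=3))
```

With the real file, a call such as load_glove("./data/embedding/glove.840B.300d.txt", {"wave", "trouble"}) would return only the two requested vectors.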

Create Datasets

Before training, start the CoreNLP server by running java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -parse.binaryTrees from the CoreNLP directory, so that -cp "*" picks up its jar files. Then run the following commands in the ./data folder to create the necessary datasets.

mkdir label simplify tree
python dataset.py

LSTM

In the enclosing folder, run

./auto_train.sh
./auto_test.sh

Tree-LSTM

In the enclosing folder, run

./auto_train.sh
./auto_test.sh

BERT

In the enclosing folder, run

SEED=45
HM=ghm
PART=5
python train.py --seed $SEED --train-set train-$HM-$PART --dev-set dev-$HM

MODEL=checkpoint-5000
python test.py --model $MODEL --test-set dev-$HM --seed $SEED --hm $HM --part $PART

NTS

In the enclosing folder, run

onmt_build_vocab -config vocab.yaml -n_sample -1 

onmt_train -config nts.yaml

BEST=checkpoints/checkpoint_step_300.pt
SRC=../../data/simplify/test-src.txt
OUTPUT=../test-output.txt
onmt_translate -model $BEST -src $SRC -output $OUTPUT -verbose -beam_size 5

Figures

In the figs folder, run python plot.py --model lstm, where the model name can be replaced by tree_lstm or bert.
