OTEDAMA - Preordering for Machine Translation

Welcome to OTEDAMA, a system for extracting preordering rules for machine translation. Otedama currently works for English and Chinese (using the Stanford Parser) as source languages and any target language for which word alignments can be extracted. External source-language parsers using the CoNLL output format are also supported.

Prerequisites

OTEDAMA comes pre-packaged with the Stanford Parser. You just need a working installation of Java (version 8.0 or higher, including a compiler).
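
To check that both a Java runtime and a compiler are available, you can run:

java -version
javac -version

Both commands should report version 1.8 (Java 8) or higher.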

Installation

  1. Download Otedama by running: git clone https://github.com/StatNLP/otedama
  2. Download the Stanford Parser models by running:
  • cd otedama/lib/parser/
  • wget http://www.cl.uni-heidelberg.de/statnlpgroup/otedama/stanford-parser-3.5.2-models.jar
  • cd ../..
  3. Run ./build.sh in the installation directory. This should create the files parser.jar, zhparser.jar, conll2tree.jar, learner.jar and reorder.jar in the bin subdirectory.
  4. If all went well, you should be able to run the example script (see example/README.txt for instructions). The complete command sequence is summarized below.
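
Putting the steps together, the installation boils down to:

git clone https://github.com/StatNLP/otedama
cd otedama/lib/parser/
wget http://www.cl.uni-heidelberg.de/statnlpgroup/otedama/stanford-parser-3.5.2-models.jar
cd ../..
./build.sh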

Preprocessing/Postprocessing

Note that because output sentences are reconstructed from parse trees, the parser's tokenization carries over into the output (if your parser performs automatic tokenization). For a meaningful comparison, it is recommended that you use tokenized text in your experiments or apply de-tokenization after reordering, for example as sketched below.
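
If you need a tokenizer, one option is the tokenizer and detokenizer scripts from the Moses toolkit; the following sketch assumes a checkout of the mosesdecoder repository (not part of Otedama) and English text:

perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l en < raw.en > train.en
perl mosesdecoder/scripts/tokenizer/detokenizer.perl -l en < test.en.reordered > test.en.reordered.detok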

Parsing

Parser Models:

In order to use the Stanford parser, you need to download the models which are not included in this repository.

  1. Download the parser models from http://www.cl.uni-heidelberg.de/statnlpgroup/otedama/stanford-parser-3.5.2-models.jar (282 MB)
  2. Copy stanford-parser-3.5.2-models.jar to the lib/parser/ directory in this repo.

You can also try using a different version of the Stanford parser and parser models (which will probably work just fine), but this has not been tested.

For training:

java -jar bin/parser.jar <english training corpus> <alignments english-foreign> <output file> <verbose_mode: v(erbose) or q(uiet)>

Example:
java -jar bin/parser.jar train.en train.en-fr.align train.en-fr.trees v

For testing (without alignments):

Specifying "no_align" instead of an alignment file will create an unaligned treebank (such as for reordering test data).

java -jar bin/parser.jar <english_training_corpus> no_align <output_file> <verbose_mode: v(erbose) or q(uiet)>

Example:
java -jar bin/parser.jar test.en no_align test.en.trees q

Languages other than English:

  • For Chinese, use bin/zhparser.jar instead of bin/parser.jar.
  • For external dependency parsers that produce output in CoNLL format (http://ilk.uvt.nl/conll/#dataformat), run java -jar bin/conll2tree.jar <source input file in CoNLL format> <source-target alignment file> <output file> to convert labelled dependencies to trees in Otedama's own format (see the example below).
  • Support for other output formats from external parsers is planned.
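
For instance, assuming the source side of the training data has been parsed into a (hypothetical) file train.en.conll, the conversion could look like this:

java -jar bin/conll2tree.jar train.en.conll train.en-fr.align train.en-fr.trees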

Learning rules

java -jar bin/learner.jar <config file> 

Example:
java -jar bin/learner.jar example/example.config

Training Parameters:

Parameters for the learner are specified in a config file. See example/example.config for a template; a sketch of a complete config file follows the parameter list below. The parameters are as follows:

  • TRAINING_TREEBANK_FILE: Parsed training data, created by parser.jar or conll2tree.jar
  • INITIAL_SUBSAMPLE_SIZE: A new random subset of the specified size is chosen as a learning treebank for extracting rule candidates in each iteration. Bigger training subsets lead to longer training times, but increase the probability of finding a near-optimal new rule in each iteration. Should be at least 4N, where N is the number of available CPUs, for adequate load balancing.
  • MAX_RULE_CROSSING_SCORE: Bound on the effect on alignment monotonicity of the whole training corpus for candidate rules. Should be 0 or a negative number.
  • MIN_MATCHING_FEATURES: For window size 4: 12 for exact matching, <12 for fuzzy matching. For window size 3: 10 for exact matching, <10 for fuzzy matching. For window size 2: 8 for exact matching, <8 for fuzzy matching. The given value is adjusted automatically for nodes that have too few children to fulfill the given matching criterion. Small values yield very poor results.
  • PARALLEL_THREADS: Number of parallel threads for learning.
  • RULE_OUTPUT_FILE: Output model file where learned rules are written.
  • MIN_REDUCTION_FACTOR: Variance constraint for rules: each new rule must reduce the crossing score on at least n times as many sentences as the number of sentences on which it increases the crossing score. For noisy data sets, a minimum reduction factor of 2.0 is recommended.
  • WINDOW_SIZE: Size of the sliding window during rule extraction. Values between 2 and 4 are supported.
  • USE_FEATURE_SUBSETS: If 'y', subsets of the matching context are also extracted as feature sets for candidate rules.
  • LOGGING: v[erbose] or q[uiet]
  • MAX_WAITING_TIME_MINS: Maximum time in minutes that may elapse without any new rules being learned before training is stopped. Longer time limits will result in more rules being learned. For large datasets, a minimum of 10 minutes waiting time is recommended.
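
As a rough illustration, a complete config file might look like the following sketch, which assumes a KEY=VALUE line format (check example/example.config for the exact syntax); the values are placeholders consistent with the notes above (40 = 4N for N=10 threads, 12 = exact matching for window size 4), not tuned recommendations:

TRAINING_TREEBANK_FILE=train.en-fr.trees
INITIAL_SUBSAMPLE_SIZE=40
MAX_RULE_CROSSING_SCORE=0
MIN_MATCHING_FEATURES=12
PARALLEL_THREADS=10
RULE_OUTPUT_FILE=en-fr.rules
MIN_REDUCTION_FACTOR=2.0
WINDOW_SIZE=4
USE_FEATURE_SUBSETS=y
LOGGING=q
MAX_WAITING_TIME_MINS=10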

Reordering

java -jar bin/reorder.jar <english treebank> <rule file> <config file> <english output treebank> <english surface output> <batch-size> <number of parallel threads>

Example:
java -jar bin/reorder.jar test.en.trees en-fr.rules en-fr.config test.en.trees.reordered test.en.reordered 1000 10

Larger batch sizes can lead to higher memory consumption, but also to higher throughput. It is highly recommended to use the same config file as in the learning step.

Good luck and have fun!

Example Script

For instructions on how to run an example script of the entire Otedama pipeline, see example/README.txt
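
For orientation, the full pipeline from parsing to reordering corresponds to the following sequence, using the file names from the examples above (example/README.txt remains the authoritative walk-through):

java -jar bin/parser.jar train.en train.en-fr.align train.en-fr.trees v
java -jar bin/learner.jar en-fr.config
java -jar bin/parser.jar test.en no_align test.en.trees q
java -jar bin/reorder.jar test.en.trees en-fr.rules en-fr.config test.en.trees.reordered test.en.reordered 1000 10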

Computing Alignment Monotonicity

A python script for computing the monotonicity of a given word alignment can be found at scripts/crossing_score.py. For example, you can run

python scripts/crossing_score.py <alignment_file> none

to obtain the total number of alignment crossings in <alignment_file>. Optionally, a file with alignment transformations can be specified as the second parameter.
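
For example, to obtain the number of crossings in the training alignments from the parsing example above, without applying any transformations, run:

python scripts/crossing_score.py train.en-fr.align none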

References

Publications:

Julian Hitschler, Laura Jehl, Sariya Karimova, Mayumi Ohta, Benjamin Körner, Stefan Riezler. Otedama: Fast Rule-Based Pre-Ordering for Machine Translation. The Prague Bulletin of Mathematical Linguistics, vol. 106, pp. 159-168. Charles University, Prague, Czech Republic, 2016: http://ufal.mff.cuni.cz/pbml/106/art-hitschler-et-al.pdf

D. Genzel. Automatically Learning Source-side Reordering Rules for Large Scale Machine Translation. Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pp. 376-384, Beijing, China, August 2010.

M. Collins, P. Koehn, and I. Kučerová. Clause Restructuring for Statistical Machine Translation. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 531-540, 2005.

External Libraries:

stanford-parser.jar

Support

For questions and comments, please contact Julian Hitschler: [email protected]. We are happy to help with any issue you may encounter while using Otedama, no matter how big or small.

License

OTEDAMA was created 2013-2016 by Julian Hitschler, Laura Jehl, Sariya Karimova, Mayumi Ohta, Benjamin Körner and Stefan Riezler at the Institute for Computational Linguistics, University of Heidelberg, Germany (www.cl.uni-heidelberg.de). OTEDAMA is licensed under the BSD 2-Clause License: https://opensource.org/licenses/bsd-license.php OTEDAMA is named after a Japanese juggling game for children: https://en.wikipedia.org/wiki/Otedama

Citing Otedama

If you use Otedama, please cite the following publication:

Julian Hitschler, Laura Jehl, Sariya Karimova, Mayumi Ohta, Benjamin Körner, Stefan Riezler. Otedama: Fast Rule-Based Pre-Ordering for Machine Translation. The Prague Bulletin of Mathematical Linguistics, vol. 106, pp. 159-168. Charles University, Prague, Czech Republic, 2016: http://ufal.mff.cuni.cz/pbml/106/art-hitschler-et-al.pdf
