
yaozusun / thrax2


This project forked from jweese/thrax2


scfg extractor for machine translation

License: MIT License


thrax2's Introduction

thrax2

scfg extractor for machine translation

This is intended to replace jweese/thrax.[1]

Building

First, in your build directory, use ccmake or cmake -D to set CMAKE_BUILD_TYPE=Release. Then type make.

Note: thrax2 requires a C++17-compliant compiler.
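The build steps above can be sketched as a small shell function. This is only a sketch under assumptions: build_thrax2 is a hypothetical helper (not part of thrax2), and the default build directory name "build" inside the checkout is a guess, since the README does not say where the build directory should live.

```shell
# Hypothetical helper, not shipped with thrax2.
# $1: path to the thrax2 checkout (assumed to default to the current directory)
# $2: build directory (assumed to default to <checkout>/build)
build_thrax2() {
  src_dir="$(cd "${1:-.}" && pwd)" || return 1
  build_dir="${2:-$src_dir/build}"
  mkdir -p "$build_dir" || return 1
  (
    # Work inside the build directory, as the README describes.
    cd "$build_dir" || exit 1
    # Select an optimized Release build, then compile.
    cmake -DCMAKE_BUILD_TYPE=Release "$src_dir" || exit 1
    make
  )
}
```

Calling build_thrax2 with no arguments from the top of the checkout would configure and build in ./build, assuming cmake, make, and a C++17 compiler are on your PATH.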

Running

The binaries src/{hiero,samt} generate Hiero and SAMT grammars, respectively. They read an aligned parallel corpus from stdin and write rules to stdout. The extracted rules are not unique, so the same rule may appear more than once. Typically we use the many scripts/filter_* scripts to narrow the stream down to the rules of interest. The filtered stream of rules should then be piped to scripts/score to produce feature scores for a unique set of rules.

  • scripts/default_{hiero,samt} will run extraction, filtering, and scoring for the typical Hiero or SAMT setup. Those scripts are easy to modify for your purposes.
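The extract, filter, score flow described above could be wired together roughly as follows. This is a sketch, not the contents of scripts/default_{hiero,samt}: run_pipeline is a hypothetical helper, and it assumes each stage needs no command-line arguments and communicates only through stdin and stdout.

```shell
# Hypothetical helper, not shipped with thrax2.
# Assumes every stage is a plain stdin-to-stdout filter with no arguments.
run_pipeline() {
  extractor="$1"   # e.g. src/hiero or src/samt
  filter="$2"      # whichever scripts/filter_* script fits your needs
  scorer="$3"      # e.g. scripts/score
  # Rules stream from one stage to the next without intermediate files.
  "$extractor" | "$filter" | "$scorer"
}
```

You would feed the aligned parallel corpus on stdin and redirect stdout to wherever the scored grammar should go. If you need several filters, extend the pipeline by hand; the provided default scripts remain the reference for the typical setup.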

Motivation

The original Thrax, in the name of speed, used Apache Hadoop for both rule extraction and rule scoring. The rule collections that were intermediate results were too big to be held in memory, so Hadoop's map-reduce-like behavior was used to sort and stream the rules in order to calculate their scores.

In 2018, that kind of chicanery is not necessary. In fact, it's probably openly harmful, given the significant overhead involved in writing intermediate files to disk and shuffling data across the network to different Hadoop nodes. Machines are big and fast enough to do things locally now.[2]

When you factor in the human costs of setting up and maintaining a Hadoop cluster, tuning the JVM of various child processes, and so on, the advantage of Thrax (reproducibility) is not worth its expense.


  1. Weese et al., "Joshua 3.0: Syntax-based Machine Translation with the Thrax Grammar Extractor." WMT 2011.
  2. See, for example, Drake's "Command-line Tools can be 235x Faster than your Hadoop Cluster" (https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html)
