ufal / mtmonkey

Distributed infrastructure for Machine Translation web services (using Moses, Python, JSON-RPC/web interface)

License: Other


mtmonkey's Introduction

MTMonkey – an infrastructure for Machine Translation web services

Description

MTMonkey is a simple and easily adaptable infrastructure for Machine Translation web services, written in Python. It allows clients' JSON-encoded requests for different translation directions to be distributed among multiple MT servers.

This system consists of:

the main application server that handles the clients' requests and distributes them to the machines that perform the translation,
the worker that handles one translation direction (including segmentation, tokenization, recasing, and detokenization),
all text processing tools used by the workers,
a simple demonstration web client written in PHP,
and support scripts for self-checks, auto-starting and easy model distribution.

The communication between the main application server and workers proceeds via XML-RPC requests, but workers accepting JSON requests are also supported on the application server side, allowing alternative worker implementations.

There may be multiple workers for the same language pair. Workers may run on the same physical machine or on several different machines. For a more detailed description of the overall architecture of MTMonkey, see our paper presented at MT Marathon 2013 in Prague or the accompanying poster.

License

Authors: Aleš Tamchyna, Ondřej Dušek, Rudolf Rosa, Pavel Pecina

Licensed under the Apache License, Version 2.0.

When using this software in your scientific work, please cite the following paper:

Aleš Tamchyna, Ondřej Dušek, Rudolf Rosa, and Pavel Pecina: MTMonkey: A scalable infrastructure for a Machine Translation web service. In Prague Bulletin of Mathematical Linguistics 100, 2013, pp. 31-40.

Contents of this package

appserver – source code of the application server
cmdline-client – command-line clients for MTMonkey
config-example – example configuration files
images – just logos and images
install – installation scripts and instructions
scripts – application server, worker and Moses servers startup scripts
web-client – two different web clients for the service
worker – source code of the worker, incl. text pre- and post-processing tools.

Usage

Installation

For installation notes for both workers and the application server, see install/README.md.

API description

For a detailed description of the API used by MTMonkey, see API.md and the paper referenced above.

MTMonkey clients

The package includes command-line and web-based clients that can connect to MTMonkey servers. Please see the respective directories for documentation.

In addition, you can easily send requests to MTMonkey from the command line using the curl tool, or from your browser by typing the correct URL. See the API description for more information.
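As a rough illustration of what such a request could look like, the following Python sketch builds both a GET URL and a JSON body using only the standard library. The parameter names (action, sourceLang, targetLang, text, nBestSize) are the ones that appear in the issues quoted further down this page; the base URL is a placeholder and not a real MTMonkey endpoint, so check API.md for the actual address and the full parameter list.

```python
import json
from urllib.parse import urlencode


def build_get_url(base_url, source_lang, target_lang, text, **extra):
    """Return a GET URL carrying the translation request as query parameters."""
    params = {
        "action": "translate",
        "sourceLang": source_lang,
        "targetLang": target_lang,
        "text": text,
    }
    params.update(extra)  # e.g. nBestSize=3
    return base_url + "?" + urlencode(params)


def build_json_body(source_lang, target_lang, text, **extra):
    """Return the same request as a JSON string, suitable for a POST body."""
    body = {
        "action": "translate",
        "sourceLang": source_lang,
        "targetLang": target_lang,
        "text": text,
    }
    body.update(extra)
    return json.dumps(body)


# Placeholder address (assumption): replace with your own application server.
BASE_URL = "http://mt.example.org:8080/translate"

url = build_get_url(BASE_URL, "en", "de", "This is a test.", nBestSize=3)
body = build_json_body("en", "de", "This is a test.", nBestSize=3)
```

The URL can be pasted into a browser or passed to curl; the JSON string can be sent as the body of a POST request.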

Acknowledgements

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 257528 (KHRESMOI). This work has been using language resources developed and/or stored and/or distributed by the LINDAT-Clarin project of the Ministry of Education of the Czech Republic (project LM2010013). This work has been supported by the AMALACH grant (DF12P01OVV02) of the Ministry of Culture of the Czech Republic.

mtmonkey's Issues

API: nBestSize: unique?

From the API:

nBestSize: integer -- maximum number of translation options

Should these be unique translations? You can specify the option nbest-distinct to mosesserver. If you don't, you may get the same translation ten times, but with different scores (and perhaps different alignments?).
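For comparison, duplicates could also be filtered on the client side. The sketch below assumes each n-best entry is a dictionary with a "text" field and a "score" field; both field names and the sample data are made up for illustration. Note that filtering after the fact can leave fewer than nBestSize entries, which is why asking the decoder for distinct hypotheses may be preferable.

```python
def distinct_nbest(hypotheses, key="text"):
    """Keep the first (highest-ranked) occurrence of each distinct translation."""
    seen = set()
    unique = []
    for hyp in hypotheses:
        if hyp[key] not in seen:
            seen.add(hyp[key])
            unique.append(hyp)
    return unique


# Hypothetical n-best list: same surface string twice, with different scores.
nbest = [
    {"text": "Dit is een test.", "score": -1.2},
    {"text": "Dit is een test.", "score": -1.5},
    {"text": "Dit is een toets.", "score": -1.9},
]

unique = distinct_nbest(nbest)
```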

Request for no sentence segmentation

We should add an option that will force the system to treat the request as a single sentence. I suggest segmentSentences=False.
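A minimal sketch of how such a switch might behave, assuming a flag named after the suggestion above. The splitter here is a deliberately naive stand-in written for this example; it is not the segmenter the workers actually use.

```python
import re


def split_sentences(text):
    """Naive stand-in splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def prepare_segments(text, segment_sentences=True):
    """Return the units to translate: one per sentence, or the whole text as one unit."""
    if not segment_sentences:
        return [text]
    return split_sentences(text)


text = "Hello there. How are you?"
segmented = prepare_segments(text)
unsegmented = prepare_segments(text, segment_sentences=False)
```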

Resolve discrepancy between API and actual behavior for n-best lists

The API description puts the elements of the n-best list inside translated, but the implementation puts them inside translation. translated then contains the individual sentences of a single n-best list member.

We should unify the behavior of the code with the API description or the other way round, whichever is more reasonable.
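Until this is resolved, a client could read n-best output in the shape the implementation is described as producing: translation holds the n-best members, and each member's translated holds its sentences. The sketch below assumes exactly that shape and assumes each sentence carries its string in a "text" field; the sample response is invented for illustration.

```python
def nbest_texts(response):
    """Join the sentences of each n-best member into one string per member."""
    return [
        " ".join(sentence["text"] for sentence in member["translated"])
        for member in response["translation"]
    ]


# Hypothetical response in the shape described in this issue.
response = {
    "translation": [
        {"translated": [{"text": "Hallo."}, {"text": "Wie geht es dir?"}]},
        {"translated": [{"text": "Hallo."}, {"text": "Wie geht's?"}]},
    ]
}

texts = nbest_texts(response)
```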

API: status messages

I think the API should allow more reporting than just errors, such as warnings, debug info, and other meta info. But how? Perhaps a single mechanism in combination with errors?

msgLevel: 0 = none, 1 = information, 2 = warning, 3 = error
msgText: "OK"  (for msgLevel 0)

I use a simpler way, just warnings here, but this is not ideal:
http://zardoz.service.rug.nl:9070/rpc?action=translate&sourceLang=en&targetLang=nl&text=This+is+a+test.&nBestSize=3
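One way to picture the msgLevel/msgText proposal is the sketch below. It uses the four levels listed above; the rule of reporting only the most severe message is an assumption made for this example, not something the proposal specifies.

```python
from enum import IntEnum


class MsgLevel(IntEnum):
    """Severity levels as numbered in the proposal."""
    NONE = 0
    INFORMATION = 1
    WARNING = 2
    ERROR = 3


def make_status(messages):
    """Fold (level, text) pairs into one msgLevel/msgText pair.

    Assumption for this sketch: only the most severe message is reported.
    """
    if not messages:
        return {"msgLevel": int(MsgLevel.NONE), "msgText": "OK"}
    level, text = max(messages, key=lambda m: m[0])
    return {"msgLevel": int(level), "msgText": text}


ok_status = make_status([])
warn_status = make_status([
    (MsgLevel.INFORMATION, "model loaded"),
    (MsgLevel.WARNING, "input was truncated"),
])
```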

JSON string supported is much more cool

I am not sure about the use of the schema in the screenshot below:

Sometimes XML and JSON files need to be translated. I am not sure whether this function is meant to allow different types of input files to be translated; if so, very cool. What I mean is that the toolkit could let us perform multiple translations within a single call by passing a JSON object: exactly the same JSON object would be returned, but with translated values.

Type of alignmentInfo

In the API, it says:

alignmentInfo: string -- request alignment information (optional, default = "false")

Shouldn't this be type boolean, just like detokenize?
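If both forms had to be accepted during a transition, the server side could normalize the value along these lines. This is a sketch only; the set of accepted spellings is an assumption.

```python
def parse_flag(value, default=False):
    """Accept a real boolean or a string such as "true"/"false" and return a bool."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in ("true", "1", "yes"):
            return True
        if lowered in ("false", "0", "no"):
            return False
    raise ValueError("not a boolean flag: %r" % (value,))


alignment_from_string = parse_flag("false")
alignment_from_bool = parse_flag(True)
```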

Translation scores

Output actual translation scores.

Nice to have: get scores using some baseline quality estimation approach instead of the rather uninformative translation probability output by Moses.

Add an option to use no recaser

Workers should have an option to use no recasers (in case the "main" translation Moses model also handles recasing).

GET vs POST

Currently, GET requests handle a small subset of arguments accepted by POST. We should unify this (should we?).
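One possible way to unify the two is to reduce both kinds of request to the same argument dictionary before any further processing. The sketch below is independent of any web framework and makes no claim about how the application server is actually structured. Note that GET values always arrive as strings, so typed parameters would still need converting.

```python
import json
from urllib.parse import parse_qs


def parse_request(method, query_string="", body=""):
    """Return the request arguments as one dict, whatever the HTTP method."""
    if method == "GET":
        # parse_qs returns a list per key; keep the first value of each.
        return {key: values[0] for key, values in parse_qs(query_string).items()}
    if method == "POST":
        return json.loads(body)
    raise ValueError("unsupported method: " + method)


get_args = parse_request("GET", query_string="sourceLang=en&targetLang=de&text=This+is+a+test.")
post_args = parse_request("POST", body='{"sourceLang": "en", "targetLang": "de", "text": "This is a test."}')
```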

API: response for detokenize=false

If detokenize=false, should the result (still tokenized) be in "text" or in "tgt-tokenized"?

n-best outputs

Implement outputting of n-best translations.

Require src-tokenized for multiple sentences

In the translation output, src-tokenized should always be included if multiple sentences are translated.

install_virtualenv.sh breaks

Hi, when trying to run this script, I get the following error:

mkdir: cannot create directory `/virtualenv': Permission denied
ln: failed to create symbolic link `./virtualenv': File exists
New python executable in /home/elav01/virtualenv/bin/python
Installing setuptools............done.
Installing pip...............done.
install_virtualenv.sh: line 32: virtualenv/bin/activate: No such file or directory

The problem does not occur if I run the script commands one by one from the command line.

API: timing info

It might be useful to have timing info in the result. See, for example:
http://zardoz.service.rug.nl:9070/rpc?action=translate&sourceLang=nl&targetLang=en&text=Dit+is+een+test.
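A sketch of how timing could be attached to a result, assuming the translation step is available as a function that returns a dictionary. The field name timeTotal and the stand-in translate function are invented for this example.

```python
import time


def with_timing(translate, request):
    """Call translate(request) and add the elapsed wall-clock seconds to a copy of the result."""
    start = time.perf_counter()
    result = dict(translate(request))
    result["timeTotal"] = time.perf_counter() - start  # hypothetical field name
    return result


def fake_translate(request):
    """Stand-in for the real translation call, used only for this illustration."""
    return {"translation": request["text"].upper()}


timed = with_timing(fake_translate, {"text": "dit is een test."})
```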