mozilla / sumo-mt

INACTIVE - http://mzl.la/ghe-archive - Machine translation tests for SUMO

License: Mozilla Public License 2.0

Python 99.82% Shell 0.18%
inactive unmaintained

sumo-mt's Introduction

Description

This script uses the Google Cloud Translate API to translate MediaWiki markup files.

Requirements

Python version

$ python --version
Python 2.7.5

Dependencies

$ pip install --upgrade google-cloud-translate
$ pip install configparser

Google Cloud Translate set-up

Set up Google Cloud Translate:

  • Create a project.
  • Enable Translation API.
  • Create a service account with Translate permissions.
  • Download the Google Cloud private key as credentials.json, for example into this same folder.

Settings

Rename settings_example to settings.

Usage

Usage examples:

You don't need to specify a language; the script defaults to the language specified in your settings file.

$ ./mediawiki.py --input orig_input.txt

Specify a single language.

$ ./mediawiki.py --input orig_input.txt --lang es
$ ./mediawiki.py --input orig_input.txt --lang ru

Specify multiple languages for a single file.

$ ./mediawiki.py --input orig_input.txt --lang 'es,ru'

Or specify an unsupported language, and the error message will tell you which languages are supported.

$ ./mediawiki.py --input orig_input.txt --lang unknown
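The handling of a comma-separated --lang value with an unsupported code could look roughly like the sketch below. The function name parse_languages and the SUPPORTED set are hypothetical; the script's real validation code and its real list of supported languages are not shown in this README.

```python
# Hypothetical helper; the real script's validation logic is not shown here.
SUPPORTED = {"af", "es", "nl", "ru"}  # illustrative subset only


def parse_languages(value, supported=SUPPORTED):
    """Split a comma-separated --lang value and reject unknown codes."""
    codes = [code.strip() for code in value.split(",") if code.strip()]
    unknown = [code for code in codes if code not in supported]
    if unknown:
        # Mirror the README: the error message lists what is supported.
        raise ValueError(
            "Unsupported language(s): %s. Supported: %s"
            % (", ".join(unknown), ", ".join(sorted(supported)))
        )
    return codes
```

With this sketch, parse_languages('es,ru') returns a two-item list, while parse_languages('unknown') raises an error that names the supported codes.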

To translate multiple files in a single directory into multiple output languages:

$ ./mediawiki.py --indir myinputdirectory/ --lang 'ru,es'

If you want to send the translated files to a different directory, you can specify an outdir (note that you don't need the slash at the end of the directory names).

$ ./mediawiki.py --indir my-input-directory/ --lang 'ru,es' --outdir my-output-directory/

Note that if the output directory does not already exist, then it is created.
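Creating a missing output directory can be sketched as follows. The helper name ensure_outdir is an assumption for illustration, not a function taken from the script.

```python
import os


def ensure_outdir(path):
    """Create the output directory if it is missing, then return its path."""
    # Hypothetical helper: checks first so an existing directory is left alone.
    if not os.path.isdir(path):
        os.makedirs(path)
    return path
```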

If your settings file is named something other than the default, settings, you can pass its name with --settings.

$ ./mediawiki.py --indir my-input-directory/ --lang 'ru,es' --outdir my-output-directory/ --settings custom-settings-filename.txt

Other notes

Overview

Translate an input MediaWiki file in Spanish and generate an output MediaWiki file in English. orig_input.txt -> script -> orig_output.txt

Inputs

  • Input input-filename - Input mediawiki file.
  • Language of output file (default language is specified in settings file)
  • Language of output files can be a list of languages, e.g. 'ru,es' for Russian and Spanish.
  • Input directory - every file in the directory is listed and parsed in turn.
  • Output directory - all the translated files are placed into the output directory.
  • Input settings file - provides custom default languages and the location of the Google Application Credentials JSON file.

Outputs

  • output file named "myfile-es.txt" if the input is "myfile.txt" and the target language is es.
  • status
    • success (zero) or
    • failure (non-zero)
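The output naming rule above can be sketched as a small function. The name output_path and its outdir parameter are assumptions made for this example; only the "myfile.txt" to "myfile-es.txt" pattern comes from the README.

```python
import os


def output_path(input_path, lang, outdir=None):
    """Build the translated file name, e.g. myfile.txt + es -> myfile-es.txt."""
    base, ext = os.path.splitext(os.path.basename(input_path))
    name = "%s-%s%s" % (base, lang, ext)
    # Assumption: without an outdir, the output sits next to the input file.
    directory = outdir if outdir is not None else os.path.dirname(input_path)
    return os.path.join(directory, name)
```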

Design

Control Flow

  1. Open and read input file.
  2. Parse input file into a data structure.
  3. Process each line one at a time.
  4. For each line, replace special text sequences with a symbol, as we may want to translate these separately.
  5. Send requests to Cloud Translation to perform the language conversion.
  6. Create and write to output file.

Error conditions

  • Cannot find input file.
  • Empty input file.
  • Format of input file not valid according to mediawiki.
  • Unable to send requests to Cloud Translation.
  • Unable to create output file.

Data structure(s)

List of objects
  • Object - for each line in the input file.

    • Original Line - Original line of text from the input file, in English, say.
    • Translated Line - The final translated line of text into the requested output language.
    • Line number - Line number of the input file.
    • Sequence Line - The original line after the special sequences of interest have been replaced with placeholder sequences, so that they are not translated.
    • Sequences - List of the unique sequences in the current line, which we may or may not want to translate individually.
    • Empty Line - Boolean true or false so that we don't ask Google to translate an empty string.
  • Object - for each unique sequence for a given line.

    • sequence - This is a special sequence that looks like 123-456, say.
    • original - This is the original string before any translations.
    • translate - Boolean true or false, depending on whether we would like to translate the string or leave it in the original language.
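The two objects described above might be modelled as in the sketch below. The class and attribute names are illustrative choices that follow the field list; they are not taken from WikiParser.py.

```python
class Sequence(object):
    """One special sequence found in a line (hypothetical class name)."""

    def __init__(self, sequence, original, translate=False):
        self.sequence = sequence    # placeholder tag such as "123-456"
        self.original = original    # original string before any translation
        self.translate = translate  # True if the content should be translated


class Line(object):
    """One line of the input file (hypothetical class name)."""

    def __init__(self, line_number, original_line):
        self.line_number = line_number
        self.original_line = original_line
        self.sequence_line = original_line  # later holds the tagged line
        self.sequences = []                 # list of Sequence objects
        self.translated_line = None
        # Empty lines are flagged so they are never sent for translation.
        self.empty_line = original_line.strip() == ""


def load_lines(text):
    """Build the list of Line objects, numbering lines from 1."""
    return [Line(n, raw) for n, raw in enumerate(text.splitlines(), 1)]
```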

Detailed Design

Control Flow

  1. Parse and verify the input arguments.
  2. Parse the input file.
  3. Look at one line at a time.
  4. Look for specific patterns of interest in the line; if a pattern is special, remove it from the line and replace it with a unique tag.
  5. Send the remaining line to the Google Cloud Translate API.
  6. Replace each special unique tag with its original content.
  7. Alternatively, for some special unique tags, translate part of their content before restoring it.
  8. Write the line to the output file.
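Steps 4 to 6 above can be sketched as a protect, translate, restore cycle. Everything here is an assumption for illustration: the SPECIAL pattern only covers [[wikilinks]] and {directives}, the tag format is invented, and the translate argument is any callable standing in for the real API request.

```python
import re

# Assumed patterns: wiki links [[...]] and template directives {...}.
SPECIAL = re.compile(r"\[\[[^\]]*\]\]|\{[^{}]*\}")


def protect(line):
    """Replace each special sequence with a numbered tag."""
    mapping = {}

    def swap(match):
        tag = "100-%03d" % len(mapping)  # invented tag format
        mapping[tag] = match.group(0)
        return tag

    return SPECIAL.sub(swap, line), mapping


def restore(line, mapping):
    """Put the original sequences back in place of their tags."""
    for tag, original in mapping.items():
        line = line.replace(tag, original)
    return line


def translate_line(line, translate):
    """Translate one line while keeping special sequences untouched."""
    if not line.strip():
        return line  # never send an empty line for translation
    protected, mapping = protect(line)
    return restore(translate(protected), mapping)
```

In this sketch, a translator that alters a tag (for example by turning the hyphen into a comma) leaves that tag unrestored in the output, so the tag format needs to survive translation unchanged.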

Data Flow

  • Need to add more details here.

Test-cases

Run the local script called runTests.sh.

$ ./runTests.sh

This local script uses the package pytest in order to run a suite of unit tests.

Run a test suite

To run a single test suite in verbose mode:

$ pytest -v test_wikiparser.py

Run a single test-case

To run just a single test case called test_filepath_directive, from the test suite TestWikiParser, in verbose mode:

$ pytest -v test_wikiparser.py::TestWikiParser::test_filepath_directive

sumo-mt's People

Contributors

soccerjustinh1

Forkers

soccerjustinh1

sumo-mt's Issues

Command line input parameters incorrect

Problem:

This Python script should accept its command line options in any order.

Version:

On the master branch for the hash 0992b5b6a1a8b36b12b4533803dd694136cc89dd

Steps to reproduce:

./mediawiki.py --lang af --input orig_input.txt

OR

./mediawiki.py --input orig_input.txt --lang af 

Impact:

As long as you pass the options in the exact order given in the documented usage, the script works.

Analysis:

Changing the order of the input options does not work. This is caused by the use of the Python library "docopt".

Suggest changing the usage to:

Usage:
  mediawiki.py (--input <input-filename>|--indir <input-directory-name>) [--settings <settings-filename>] [--lang <output-languages> ] [--outdir <output-directory-name> ]
  mediawiki.py -h | --help
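As a point of comparison only, the same interface can be expressed with the standard library's argparse, which accepts options in any order. This swaps in a different library from the docopt approach the issue discusses, and the metavar names are illustrative.

```python
import argparse


def build_parser():
    """Order-independent parser mirroring the suggested usage string."""
    parser = argparse.ArgumentParser(prog="mediawiki.py")
    # Exactly one of --input / --indir must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--input", metavar="INPUT_FILENAME")
    source.add_argument("--indir", metavar="INPUT_DIRECTORY_NAME")
    parser.add_argument("--settings", default="settings",
                        metavar="SETTINGS_FILENAME")
    parser.add_argument("--lang", metavar="OUTPUT_LANGUAGES")
    parser.add_argument("--outdir", metavar="OUTPUT_DIRECTORY_NAME")
    return parser
```

With this parser, `--lang af --input orig_input.txt` and `--input orig_input.txt --lang af` produce the same parsed values.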

Wrong code replacements

''HTTP'' is recognized as a code but never replaced back, so we get numbers in the translated version. ''HTTP'' should not be treated as a code to replace.

I've also seen that the content of {key Ctrl} is translated, and it shouldn't be.

Error if you specify a dir with ending /

When using --dir, if your folder name ends with /, the script fails with a DNS error.

 ./mediawiki.py --dir ../exports/desktop/ --lang 'nl'
Traceback (most recent call last):
  File "./mediawiki.py", line 265, in <module>
    main()
  File "./mediawiki.py", line 251, in main
    outputLanguage = currentLang)
  File "/home/nuke/bin/sumo-mt/script/WikiParser.py", line 78, in __init__
    self.parseMediaWikiFile()
  File "/home/nuke/bin/sumo-mt/script/WikiParser.py", line 218, in parseMediaWikiFile
    target_language_code=self.mOutputLanguage)
  File "/home/nuke/.local/lib/python2.7/site-packages/google/cloud/translate_v3beta1/gapic/translation_service_client.py", line 300, in translate_text
    request, retry=retry, timeout=timeout, metadata=metadata
  File "/home/nuke/.local/lib/python2.7/site-packages/google/api_core/gapic_v1/method.py", line 143, in __call__
    return wrapped_func(*args, **kwargs)
  File "/home/nuke/.local/lib/python2.7/site-packages/google/api_core/retry.py", line 270, in retry_wrapped_func
    on_error=on_error,
  File "/home/nuke/.local/lib/python2.7/site-packages/google/api_core/retry.py", line 179, in retry_target
    return target()
  File "/home/nuke/.local/lib/python2.7/site-packages/google/api_core/timeout.py", line 214, in func_with_timeout
    return func(*args, **kwargs)
  File "/home/nuke/.local/lib/python2.7/site-packages/google/api_core/grpc_helpers.py", line 59, in error_remapped_callable
    six.raise_from(exceptions.from_grpc_error(exc), exc)
  File "/home/nuke/.local/lib/python2.7/site-packages/six.py", line 737, in raise_from
    raise value
google.api_core.exceptions.ServiceUnavailable: 503 DNS resolution failed
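One defensive step is to normalize the directory argument before listing its files, so that a path with and without the trailing slash behaves identically. This is only a sketch: the helper name list_input_files is hypothetical, and whether the trailing slash is really what triggers the 503 above is not established by this traceback.

```python
import os


def list_input_files(indir):
    """List regular files in a directory, ignoring any trailing slash."""
    root = os.path.normpath(indir)  # "../exports/desktop/" -> "../exports/desktop"
    names = sorted(os.listdir(root))
    return [os.path.join(root, name) for name in names
            if os.path.isfile(os.path.join(root, name))]
```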

Support for additional providers

We have found that Google Cloud Translate is not providing high quality translations for some locales, like Chinese (Taiwan).

We should probably think about supporting multiple providers and be able to define the defaults per language once we understand which ones are better.

Extract Project ID from Google credentials

As an enhancement we should extract the Google Cloud Project ID from the Google Application Credentials file (credentials.json) rather than specifying the Google Cloud Project ID in the settings file.

The advantage of this is that we have the project_id specified in only one place.

This means we can remove the following line from the settings file:
project_id = my-project-1234

The project_id can instead be read from credentials.json:
{
"type": "service_account",
"project_id": "my-project-1234",
"private_key_id": "abcd",
"private_key": "-----BEGIN PRIVATE KEY-----\n1234\n-----END PRIVATE KEY-----\n",
"client_email": "account@my-project-1234.iam.gserviceaccount.com",
"client_id": "5678",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/account%40my-project-1234.iam.gserviceaccount.com"
}
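Reading the project ID from the credentials file could be as small as the sketch below. The function name read_project_id is an assumption; the "project_id" key is the one shown in the JSON above.

```python
import json


def read_project_id(credentials_path):
    """Return the project_id stored in a service account credentials file."""
    with open(credentials_path) as handle:
        return json.load(handle)["project_id"]
```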

Handle formal vs informal translations

We need to figure out if there is a way to tell the API to use formal or informal translations for languages that make that distinction.

The feedback we got was that the output generally mixes both. In Spanish, for example, it uses the formal register most of the time, while we use the informal one for web documents, so many manual changes are needed.

Use existing terminology

We need to figure out if there is a way to provide the API with a glossary or translation memory so it uses consistent terminology with the rest of the project we localize.

Translations into Dutch generate random numbers for codes

It seems random numbers are generated when TOC, {/note}, {/warning}, or wikilinks without a description, such as [[Access Mozilla Services with Firefox Accounts]], appear in the original and we translate into Dutch (nl).

Example output:

=Gebruik de Firefox-startpagina om snel te vinden wat u zoekt=
353,437
Open Firefox en u hebt toegang tot al uw topsites. Vanaf hier kunt u ook naar rechts vegen om iets in uw browsegeschiedenis te openen of naar links vegen om bij uw bladwijzers te komen. Tik op de adresbalk om op internet te zoeken. Zie voor meer informatie [[Use the Awesome screen to search in Firefox for Android]].
244,459
697,403
157,199
Open Firefox en u hebt toegang tot de beste sites op internet, onlangs bezochte of bladwijzeringsites en populaire artikelen op [https://getpocket.com/ Pocket (nu onderdeel van Mozilla)]. Vanaf hier kunt u ook rechtstreeks naar het bovenste paneel vegen om toegang te krijgen tot uw bladwijzers en geschiedenis (inclusief sites die u op andere apparaten hebt bezocht). Tik op de adresbalk om op internet te zoeken. Zie voor meer informatie [[Use the Awesome screen to search in Firefox for Android]].
805,463
289,959

For some reason this only goes wrong for Dutch.

https://github.com/mozilla/sumo-mt/blob/master/WikiParser.py#L199

Define output folder

Ideally we should be able to define an output folder, to avoid cluttering the source folder if we run the translation for another language later.

Incorrect translation for corner-cases

Problem

A few corner cases are not handled correctly: for a given input, the output is not what we would expect.

  • Found (UPC)
  • Expected (CPU)
  • Found &#39;&#39; &#39;Desconectar <!- -> .me estricta protección &#39;&#39; &#39;
  • Expected ''' Desconectar <!- -> .me estricta protección '''
  • Found preferencias {for win} {/for} {for mac,linux} preferencias {/for}
  • Expected {for win} preferencias {/for} {for mac,linux} preferencias {/for}
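The second pair above shows apostrophes coming back as the HTML entity &#39;. One possible post-processing step, sketched here as an assumption rather than the project's fix, is to unescape entities in the translated text. It restores the apostrophes but does not remove the extra spaces inserted between them, so the pattern matching still needs its own update.

```python
try:
    from html import unescape  # Python 3
except ImportError:  # Python 2 fallback
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape


def clean_entities(translated_line):
    """Convert HTML entities such as &#39; back to plain characters."""
    return unescape(translated_line)
```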

Version

On the master branch for the hash 0992b5b6a1a8b36b12b4533803dd694136cc89dd

Steps

Run the wikiparser with the attached input file; it generates the attached output.

$ ./mediawiki.py --input input_broken.txt

This generated the attached input_broken-es.txt.

input_broken-es.txt
input_broken.txt

Impact

The output file is not correct and does not display the MediaWiki file correctly in the translated language.

Analysis

We need to update the pattern matching to deal with these corner cases, and update the regression suite to include parsing of these scenarios.
