msquarme / masakhane

This project forked from masakhane-io/masakhane-mt

Let's put Africa on the Machine Translation Map!

License: MIT License

Jupyter Notebook 16.56% Python 0.26% Ada 9.74% Visual Basic 8.13% Modula-3 10.42% Lua 9.47% Roff 8.59% Common Lisp 9.00% Scheme 8.88% Smalltalk 8.96% TypeScript 9.98%

masakhane's Introduction

Masakhane - A living machine translation project for Africans, by Africans

We need African researchers from ACROSS the continent to join our effort in building translation models for African languages. Masakhane means "We Build Together" in isiZulu. Phase 1 consists of getting baseline translation results for as many languages as possible. Phase 2 consists of combining all our powers and doing some transfer learning to get significant improvement on the models.

First, we encourage you to read up on the project on our website: masakhane.io

Outcomes

  1. Put African NLP on the map by publishing a paper at a top-tier NLP conference featuring as many languages and countries and researchers as possible.

  2. Facilitate the development of an NLP community in Africa

  3. Spur research and focus on African languages, by providing a starting point for other researchers to begin

To officially join the project

Joining officially lets us feature you on our webpage (masakhane.io), add you to our Google group, and invite you to our weekly meetings and Slack, all of which makes it easier for you to participate in the movement. Please email the following to [email protected]:

  • Your Full Name
  • A preferred social media link
  • The language(s) you'll be working on (or your general relevant specialty, if you're an expert at machine translation and would like to boost the community through that)
  • A picture
  • Your affiliation and role.

What can I contribute?

There are many ways to contribute to Masakhane:

  1. Contribute a trained model and related code for your language
  2. Contribute analysis of data/models for any African languages. You do not need any technical experience for this! If you're a linguist, we can pair you up with a machine translation practitioner and you can help contribute analysis
  3. Contribute to documentation or the base "notebook" that will improve the experience of others
  4. Provide advice or help tune models for their languages and datasets
  5. Help administrate! Working with so many researchers can be quite a challenge!
  6. Help with infrastructure and compute! Do you have spare compute to donate? Let us know! We're always looking for more!
  7. Help document our discussions and progress. This is VERY much needed

Getting Started

1. Have a look at the example code

We have an example Colab notebook which trains a model for English-to-Zulu translation. Open it in Google Colab: you can select it by going to the GitHub section when opening a new project.

2. Finding data for my language?!

This is a huge challenge, but luckily we have a place to start! At ACL 2019, this paper was published. The short story? Turns out the Jehovah's Witness community has been translating many many documents and not all of them are religious. And their language representation is DIVERSE.

Check out this spreadsheet HERE to see if your language is featured, then go to Opus to find the links to the data: http://opus.nlpl.eu/JW300.php

We also provide a script for easy downloading and BPE-preprocessing of JW300 data from OPUS: jw300_utils/get_jw300.py. It requires the opustools-pkg Python package to be installed. For example, to download and pre-process the Acholi (ach) and the Nyaneka (nyk) portions of JW300, call the script like this: python get_jw300.py ach nyk --output_dir jw300
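
When several languages are needed, the call above can be assembled from a list of language codes. The following is only a minimal sketch: it assumes the script is run from the jw300_utils directory and takes language codes as positional arguments followed by --output_dir, exactly as in the example above. The helper name build_jw300_command is hypothetical and is not part of the repository.

```python
import shlex
import sys


def build_jw300_command(lang_codes, output_dir="jw300", script="get_jw300.py"):
    """Build the argument list for the download call shown above.

    Hypothetical helper: it only mirrors the documented invocation
    `python get_jw300.py <lang> [<lang> ...] --output_dir <dir>`.
    """
    if not lang_codes:
        raise ValueError("at least one language code is required")
    return [sys.executable, script, *lang_codes, "--output_dir", output_dir]


if __name__ == "__main__":
    cmd = build_jw300_command(["ach", "nyk"])
    # Print the command instead of running it, so nothing is downloaded by accident.
    print(shlex.join(cmd))
    # To actually run it (needs opustools-pkg and network access):
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Printing the command first makes it easy to check the language codes against the spreadsheet before starting a long download.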

Can't find your language in the JW300 dataset?

Then we still have some options! Our community has been searching far and wide! Drop us a mail and we'll get you onboarded on our Slack: [email protected]. We will be updating our wiki with other resources as we find them!

If nothing comes of that, we're also working with CocoHub.cc.

3. Run the notebook!

Your next step is to use the JW300 dataset in the Colab notebook and run it. Most of the advice you need is within the notebook itself. We are constantly improving that notebook and are open to any recommendations. Struggled to get going? Then let's work together to build a notebook that's easier to use! Create a GitHub issue or email us!

4. It's done! I have results! Now what?

Amazing! You've created your first baseline. Now we need to get the code, data, and results into this GitHub repository.

In order for us to consider your result submission official, we need a couple of things:

  1. The notebook that will run the code. The notebook MUST run on someone else's account, and the data that it uses should be publicly accessible (i.e. if I download the notebook and run it, it must work, so it shouldn't be using any private files). If you're wondering how to do this, don't fear! Drop us a line and we will work together to make sure the submission is all good! :)

  2. The test sets - in order to replicate this and test against your results, we need saved test sets uploaded separately.

  3. A README.md that describes (a) the data used, which is especially important if it's a combination of sources, (b) any interesting changes to the model, and (c) maybe some analysis of some sentences from the final model

  4. The model itself. This can be in the form of a Google Drive or Dropbox link. We will be finding a home for our trained models soon

  5. The results: the train, dev, and test set BLEU scores

We will be further expanding our analysis techniques, so it's super important that we have a copy of the model and test sets now. That way we don't need to rerun the training just to do the analysis.
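
One way to produce test sets that can be uploaded separately is to hold them out once, with a fixed seed, and write them to their own files. The following is a minimal sketch and not the method used in the example notebook: the function names, the seed, and the split sizes are all assumptions made for illustration, and real data would usually also need deduplication before splitting.

```python
import random
from pathlib import Path


def split_parallel(src_lines, tgt_lines, dev_size, test_size, seed=42):
    """Shuffle aligned sentence pairs and split them into train/dev/test."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    pairs = list(zip(src_lines, tgt_lines))
    if dev_size + test_size >= len(pairs):
        raise ValueError("dev and test sets leave no training data")
    # A fixed seed keeps the split reproducible across runs.
    random.Random(seed).shuffle(pairs)
    test = pairs[:test_size]
    dev = pairs[test_size:test_size + dev_size]
    train = pairs[test_size + dev_size:]
    return train, dev, test


def write_pairs(pairs, out_dir, prefix):
    """Write pairs to <prefix>.src and <prefix>.tgt, one sentence per line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    src_text = "".join(src + "\n" for src, _ in pairs)
    tgt_text = "".join(tgt + "\n" for _, tgt in pairs)
    (out / f"{prefix}.src").write_text(src_text, encoding="utf-8")
    (out / f"{prefix}.tgt").write_text(tgt_text, encoding="utf-8")
```

Calling write_pairs(test, "en-xh/my-baseline", "test") would then produce test.src and test.tgt next to the notebook; the folder name here is only a placeholder.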

Once you have all of the above, please create a pull request into the repository. See guidelines here.

Structure of my PR:

Also see this as an example for the structure of your contribution

Structure:

/<src-lang>-<tgt-lang>
   /<technique> -- this could be "jw300-baseline" or "fine-tuned-baseline" or "nig-newspaper-dataset"
     - notebook.ipynb
     - README.md
     - test.src
     - test.tgt
     - results.txt
     - any other files, if you have any

Example:

/en-xh
  /xhnavy-data-baseline 
    - notebook.ipynb
    - README.md
    - test.xh
    - test.en
    - results.txt
    - preprocessing.py

Here is a link to a pull request that has the relevant things.
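
Before opening the pull request, the folder can be checked against the structure above. The following is a minimal sketch under stated assumptions: the helper name missing_files and the exact set of required files are taken from the listing above for illustration, and the repository itself does not ship such a check.

```python
from pathlib import Path

# File names taken from the structure listing above.
REQUIRED_FILES = ("notebook.ipynb", "README.md", "results.txt")


def missing_files(technique_dir):
    """Return the required entries that are absent from a contribution folder."""
    folder = Path(technique_dir)
    missing = [name for name in REQUIRED_FILES if not (folder / name).is_file()]
    # Test sets are named test.<something>; expect one for each side of the pair.
    test_sets = [p for p in folder.glob("test.*") if p.is_file()]
    if len(test_sets) < 2:
        missing.append("test.<src> and test.<tgt>")
    return missing


if __name__ == "__main__":
    # Placeholder path matching the example above.
    problems = missing_files("en-xh/xhnavy-data-baseline")
    print("ready for a PR" if not problems else f"missing: {problems}")
```

An empty list means every file from the listing is present; it says nothing about whether the notebook actually runs on someone else's account.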

Feeling nervous about contributing your first pull request or unsure how to proceed? Please don't feel discouraged! Drop us an email or a slack message and we will work together to get your contribution in ship shape!

5. I've got a baseline. What do I do to improve it?

Cool! So there are many ways to improve results. We're busy working on a wiki to aid that! Got ideas for this? Drop us a line!

Code of Conduct

See Code of Conduct
