GithubHelp home page GithubHelp logo

hlaingtinhtun / myanmar-tools Goto Github PK

View Code? Open in Web Editor NEW

This project forked from google/myanmar-tools

0.0 2.0 0.0 413 KB

Detect the Zawgyi-One font encoding in C++, Java, JavaScript, PHP, and Ruby

License: Other

Makefile 1.87% CMake 0.88% C++ 5.80% Java 54.26% JavaScript 19.80% TypeScript 4.84% PHP 6.31% Ruby 6.20% Shell 0.05%

myanmar-tools's Introduction

Build Status

Myanmar Tools (Zawgyi detection)

This project includes tools for processing font encodings used in Myanmar, currently with support for the widespread Zawgyi-One font encoding. For more information on font encodings in Myanmar, read the Unicode Myanmar FAQ.

Features:

  • Detect whether a string is Zawgyi or Unicode.
  • Support for C++, Java, JavaScript (both Node.js and browser), PHP, and Ruby.

Conversion from Zawgyi to Unicode is provided with the help of the CLDR Zawgyi conversion rules (see "Zawgyi-to-Unicode Conversion" below).

This is not an official Google product, but we hope that you’ll find Myanmar Tools useful to better support the languages of Myanmar.

Why Myanmar Tools?

Myanmar Tools uses a machine learning model to give very accurate results when detecting Zawgyi versus Unicode. Detectors that use hand-coded rules for detection are susceptible to flagging content in other languages like Shan and Mon as Zawgyi when it is actually Unicode.

Myanmar Tools and the CLDR Zawgyi conversion rules are used by Google, Facebook, and others to provide great experiences to users in Myanmar.

Using the Zawgyi Detector

See language-specific documentation:

Depending on your programming language, a typical use case should look something like this:

if (zawgyiDetector.getZawgyiProbability(input) > THRESHOLD) {
    // Convert to Unicode, etc.
}

The method getZawgyiProbability returns a number between 0 and 1 to reflect the probability that a string is Zawgyi, given that it is either Zawgyi or Unicode. For strings that are sufficiently long, the detector should return a number very close to 0 or 1, but for strings with only a few characters, the number may be closer to the middle. With this in mind, use the following heuristics to set THRESHOLD:

  • If under-predicting Zawgyi is bad (e.g., when a human gets to evaluate the result), set a low threshold like 0.05. This threshold guarantees that fewer than 1% of Zawgyi strings will go undetected.
  • If over-predicting Zawgyi is bad (e.g., when conversion will take place automatically), set a high threshold like 0.95. This threshold guarantees that fewer than 1% of Unicode strings will be wrongly flagged.

Additionally, keep in mind that you may want to tune your thresholds to the distribution of strings in your input data. For example, if your input data is biased toward Unicode, in order to reduce false positives, you may want to set a higher Zawgyi threshold than if your input data is biased toward Zawgyi. Ultimately, the best way to pick thresholds is to obtain a set of labeled strings representative of the data the detector will be processing, compute their scores, and tune the thresholds to your desired ratio of precision and recall.

If a string contains a non-Burmese affix, it will get the same Zawgyi probability as if the affix were removed. That is, getZawgyiProbability("hello <burmese> world") == getZawgyiProbability("<burmese>").

Some strings are identical in both U and Z; this can happen if the string consists of mostly consonants with few diacritic vowels. The detector may return any value for such strings. If the user is concerned with this case, they can simply run the string through a converter and check whether or not the converter's output is equal to the converter's input.

Training the Model

The model used by the Zawgyi detector has been trained on several megabytes of data from web sites across the internet. The data was obtained using the Corpus Crawler tool.

To re-train the model, first run Corpus Crawler locally. For example:

$ mkdir ~/corpuscrawler_output && cd ~/corpuscrawler_output
$ corpuscrawler --language my --output . &
$ corpuscrawler --language my-t-d0-zawgyi --output . &
$ corpuscrawler --language shn --output . &

This will take a long time, as in several days. The longer you let it run, the better your model will be. Note that at a minimum, you must ensure that you have obtained data for both Unicode and Zawgyi; the directory should contain my.txt, my-t-d0-zawgyi.txt, and shn.txt.

Once you have data available, train the model by running the following command in this directory:

$ make train CORPUS=$HOME/corpuscrawler_output

Zawgyi-to-Unicode Conversion

Once determining that a piece of text is Unicode or Zawgyi, it's often useful to convert from one encoding to the other.

Although this package does not support conversion on its own, the Common Locale Data Repository (CLDR) publishes a Zawgyi-to-Unicode converter in the form of transliteration rules. This functionality is available in ICU 58+ with the transform ID "Zawgyi-my":

Many other languages, including Python, Ruby, and PHP, have wrapper libraries over ICU4C, which means you can use the Zawgyi converter in those languages, too. See the samples directory for examples on using the ICU Transliterator.

In languages without full ICU support, like JavaScript in the browser, there are other open-source Zawgyi converters available. The Rabbit converter is available in several different languages, including JavaScript. See the JavaScript README for more details.

Contributing New Programming Language Support

We will hapilly consider pull requests that add clients in other programming languages. To add support for a new programming language, here are some tips:

  • Add a new directory underneath clients. This will be the root of your new package.
  • Use a build system customary to your language.
  • Add your language to the copy-resources and test rules in the top-level Makefile.
  • At a minimum, your package should automatically consume zawgyiUnicodeModel.dat and test against compatibility.tsv.
  • See the other clients for examples. Most clients are only a couple hundred lines of code.
  • When finished, add your client to the .travis.yml file.

myanmar-tools's People

Contributors

sffc avatar stevenay avatar sven-oly avatar kirankarki avatar brawer avatar

Watchers

James Cloos avatar Hlaing Tin Htun avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.