
guesslanguage's Introduction

LibIndic GuessLanguage


LibIndic's guesslanguage module may be used to detect the primary language of a text. It works even with text containing mixed languages.

Installation

  1. Clone the repository: git clone https://github.com/libindic/guesslanguage.git
  2. Change to the cloned directory: cd guesslanguage
  3. Run setup.py to create an installable source distribution: python setup.py sdist
  4. Install using pip: pip install dist/libindic-guesslanguage*.tar.gz

Usage

>>> from libindic.guesslanguage import LangGuess
>>> instance = LangGuess()
>>> text = u"കേരളത്തിലെ എറണാകുളം ജില്ലയിൽ പെരിയാറിന്റെ തീരത്തുള്ള ഒരു ഗ്രാമമാണ് കാലടി காலடி"
>>> instance.guessLanguage(text)
'Malayalam'
>>> for key, value in instance.getScriptName(text).items():
...     print("%s : %s" % (key, value))
... 
ജില്ലയിൽ : ml_IN
എറണാകുളം : ml_IN
ഒരു : ml_IN
പെരിയാറിന്റെ : ml_IN
காலடி : ta_IN
കേരളത്തിലെ : ml_IN
ഗ്രാമമാണ് : ml_IN
കാലടി : ml_IN
തീരത്തുള്ള : ml_IN
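
The per-word output above suggests one way a primary language can be picked from mixed text: label each word by its script and take the majority. What follows is a minimal sketch of that idea using only Unicode block ranges. It is not LibIndic's implementation, and SCRIPT_RANGES, script_of_word, and primary_script are hypothetical names introduced here for illustration.

```python
from collections import Counter

# Unicode block ranges for two Indic scripts (from the Unicode standard).
# The locale-style labels mirror the ones printed in the session above.
SCRIPT_RANGES = {
    "ml_IN": (0x0D00, 0x0D7F),  # Malayalam block
    "ta_IN": (0x0B80, 0x0BFF),  # Tamil block
}


def script_of_word(word):
    """Return the label whose block covers most characters of the word, or None."""
    counts = Counter()
    for ch in word:
        cp = ord(ch)
        for label, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[label] += 1
    return counts.most_common(1)[0][0] if counts else None


def primary_script(text):
    """Label every whitespace-separated word, then pick the majority label."""
    per_word = {word: script_of_word(word) for word in text.split()}
    tally = Counter(label for label in per_word.values() if label)
    primary = tally.most_common(1)[0][0] if tally else None
    return primary, per_word


text = "കേരളത്തിലെ എറണാകുളം ജില്ലയിൽ പെരിയാറിന്റെ തീരത്തുള്ള ഒരു ഗ്രാമമാണ് കാലടി காலடி"
primary, per_word = primary_script(text)
print(primary)
for word, label in per_word.items():
    print("%s : %s" % (word, label))
```

A script-range check like this can only separate languages that use different scripts; languages that share a script need a statistical model on top of it.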

For more details, read the docs.

guesslanguage's People

Contributors

balasankarc, copyninja, diadara, jishnu7, santhoshtr

guesslanguage's Issues

Question: About romanized languages detection

First, thanks a lot for this amazing work! I'm trying to address the task of language detection for Indian languages. Currently my neural network can detect most of the languages I wanted with good accuracy. The problem is that I need to detect the romanized versions of these languages, but I'm not sure that training that network (fastText) on a romanized-language corpus would lead to anything meaningful. I can see that this tool uses n-grams (trigrams) and a frequency-based model.
My question is whether this approach could be used for the romanized version of a language like "hin" or "urd", or whether this cannot be done in any case.
Thanks a lot.
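
For reference, the kind of trigram frequency model the question refers to can be sketched in a few lines. This is a generic illustration, not the code used by guesslanguage: trigram_profile, similarity, and guess are hypothetical names, and the two training strings are made-up toy samples, far too small to say anything about real accuracy on romanized text.

```python
from collections import Counter


def trigram_profile(text):
    """Count overlapping character trigrams, with one space of padding at each end."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def similarity(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def guess(text, profiles):
    """Return the label whose profile is most similar to the text's profile."""
    target = trigram_profile(text)
    return max(profiles, key=lambda label: similarity(target, profiles[label]))


# Toy training samples (hypothetical, for illustration only).
profiles = {
    "hin": trigram_profile("mera naam kya hai aap kaise hain main theek hoon"),
    "eng": trigram_profile("what is your name how are you i am fine thank you"),
}

print(guess("aap ka naam kya hai", profiles))
print(guess("how are you", profiles))
```

Nothing in the technique depends on the script, so it can be trained on romanized corpora in principle; whether it separates closely related romanized languages well enough is something only an experiment on real data can show.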

guesslanguage fails on European languages

I haven't tested with many texts, but my observations are below:

  • "Аҧсны Жәлар Реизара – Апарламент" - This is Abkhaz, but according to guessLanguage(text) it is Kazakh
  • "Народное Собрание — Парламент Республики Абхазия" - This is Russian, but according to guessLanguage(text) it is Bulgarian
  • "Banat, Buchenland, Siebenbürgen" - This is German, but according to guessLanguage(text) it is Turkish
