Strange classifications about langid.py (closed)

corpulent commented on August 29, 2024
Strange classifications

from langid.py.

Comments (9)

saffsd commented on August 29, 2024

@detrop That's hard to say for sure. Short texts are one of the areas where langid.py is weaker, due to the way it works. Internally, langid.py models the relative distribution across 97 languages of a closed set of ~8000 byte sequences, each 1 to 4 bytes long, learned from training data drawn from a variety of sources. Short strings such as "feeling good" will contain only a very small number of these features (if any). In fact, I suspect that "feeling" and "good" individually contain none of the features, and that what you are getting is the back-off probability in the language model. Is your aim to perform language identification on individual words?
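To make the point about short strings concrete, here is a minimal sketch of the idea. It is not langid.py's actual implementation: it only enumerates every 1- to 4-byte sequence in the UTF-8 form of a string and counts how many fall inside a feature set. TOY_FEATURES and both function names are hypothetical, made up for illustration; the real feature set is learned from training data.

```python
def byte_ngrams(text, min_n=1, max_n=4):
    """Yield every byte sequence of length min_n..max_n in the UTF-8 form of text."""
    data = text.encode("utf-8")
    for n in range(min_n, max_n + 1):
        for start in range(len(data) - n + 1):
            yield data[start:start + n]


# Hypothetical feature set for illustration only; not taken from langid.py.
TOY_FEATURES = {b"the ", b" and", b" in ", b"sch", b" de "}


def count_known_features(text, features=TOY_FEATURES):
    """Count the n-gram occurrences in text that belong to the feature set."""
    return sum(1 for gram in byte_ngrams(text) if gram in features)


short = "feeling good"
longer = "the cat and the dog are feeling good and sleeping in the sun"
print(count_known_features(short), count_known_features(longer))
```

The short string matches none of the toy features, while the longer sentence matches several, which is the situation where a model has to fall back on its default probabilities.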

corpulent commented on August 29, 2024

@saffsd Yes, in some cases it would be one or two words. I also tried the stopwords.words('english') technique, but as you can imagine, it was very inaccurate.

saffsd commented on August 29, 2024

@detrop langid.py isn't really suited for langid of individual words. There's a tool by Shuyo Nakatani, LDIG, which I think might perform better. That tool only covers 17 languages, but if the languages you want are in that set it may work for you.

corpulent commented on August 29, 2024

@saffsd, thanks for the feedback. It seems fine to me that it gives a very low probability on single words, and I guess I can assume that everything is English unless langid tells me with a high enough probability that it's not. I tried to play around with it this way, but then I got this:

text input...湁渮
(u'da', 0.9958943887804865)

Why such a high probability in this case for Danish ('da') with Chinese characters?
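The fallback idea described above (treat text as English unless the classifier is confident otherwise) can be sketched as follows. This is only an illustration: pick_language, the default language, and the 0.95 threshold are assumptions, not part of langid.py. The classify argument is any callable that returns a (language, probability) pair, and toy_classify is a stand-in with made-up scores.

```python
def pick_language(text, classify, default="en", threshold=0.95):
    """Return default unless classify reports another language with high confidence."""
    lang, prob = classify(text)
    if lang != default and prob >= threshold:
        return lang
    return default


def toy_classify(text):
    """Stand-in classifier with made-up scores, used only to exercise the sketch."""
    return ("de", 0.99) if "ü" in text else ("fr", 0.40)


print(pick_language("über", toy_classify))
print(pick_language("feeling good", toy_classify))
```

As the 'da' result above shows, a high score alone does not guarantee a correct label, so a threshold like this only helps when the scores are trustworthy for the input at hand.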

saffsd commented on August 29, 2024

@corpulent I'm afraid I can't replicate your output:

$ langid -n
>>> text input...湁渮
('zh', 0.9922766868226115)

Such a mismatch is usually due to an encoding issue. The intention was to make langid.py encoding-agnostic, but in practice it is not clear that this is even possible. Most of the training data we used is utf8-encoded, so accuracy is highest with utf8-encoded text. Internally, langid.py encodes unicode objects to utf8 and operates directly on the byte representation of strings. This means that if you pass either a unicode object or a utf8-encoded string, you should get the same result that I did.
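Because the model works on bytes, the same characters look entirely different to it under different encodings. The sketch below illustrates this, assuming Python 3 (the thread itself uses Python 2). The helper name encoded_forms and the choice of encodings are hypothetical, picked only for the comparison.

```python
def encoded_forms(text, encodings=("utf-8", "utf-16-le", "gb18030")):
    """Map each encoding name to the bytes it produces for text."""
    return {name: text.encode(name) for name in encodings}


forms = encoded_forms("湁渮")
for name, data in forms.items():
    # Print the byte length and a hex dump so the differences are visible.
    print(name, len(data), data.hex())
```

A classifier trained mostly on utf8 bytes has no reason to recognize the other byte sequences as the same text.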

corpulent commented on August 29, 2024

@saffsd you are right, it seems like an encoding issue. A quick fix was to decode the string as UTF-8: text = unicode(text, 'utf-8'). I think the problem is how I am grabbing text from the terminal; I am using raw_input like this: text = raw_input('text input...').strip().
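On Python 3, input() already returns text (str), so the unicode(...) step is only needed on Python 2. When bytes do arrive from somewhere else, such as a file or a pipe, a helper along these lines can normalize them before classification. This is a sketch under assumptions: to_text is a hypothetical name, and falling back to the terminal's reported encoding and then UTF-8 is one reasonable choice, not something the thread prescribes.

```python
import sys


def to_text(raw, encoding=None):
    """Return raw as a stripped str, decoding bytes with the given or detected encoding."""
    if isinstance(raw, str):
        return raw.strip()
    # Assumption: use the terminal's reported encoding, else fall back to UTF-8.
    encoding = encoding or getattr(sys.stdin, "encoding", None) or "utf-8"
    return raw.decode(encoding).strip()


# UTF-8 bytes decode to the expected text; str input passes through unchanged.
print(to_text(b"  caf\xc3\xa9 \n", "utf-8") == "caf\u00e9")
print(to_text("  feeling good  "))
```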

utilitarianexe commented on August 29, 2024

I am also getting some wrong results with high claimed confidence on short text, even without special characters. For example, ('de', 0.9886530789011537) for '2009 Pet Comedy Challenge: Ben Kronberg'.

utilitarianexe commented on August 29, 2024

but will give https://github.com/shuyo/ldig a try

saffsd commented on August 29, 2024

Closed as it has been inactive for some time.
