I get these results for English text,

Strange classifications about langid.py HOT 9 CLOSED

corpulent commented on August 29, 2024

Strange classifications

from langid.py.

Comments (9)

saffsd commented on August 29, 2024

@detrop That's hard to say for sure. Short texts are one of the areas where langid.py is weaker, due to the way it works. Internally, langid.py models how a closed set of ~8000 byte sequences, each 1 to 4 bytes long, is distributed across 97 languages; these sequences were learned from training data drawn from a variety of sources. Short strings such as "feeling good" will only have a very small number of features present (if any). In fact, I suspect that "feeling" and "good" individually don't have any of the features present, and what you are getting is the back-off probability in the language model. Is your aim to perform language identification on individual words?
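To illustrate why short strings carry so little evidence, here is a minimal sketch of extracting 1 to 4 byte sequences from a string and checking them against a feature set. This is not langid.py's actual implementation: the helper names `byte_ngrams` and `known_features` and the tiny feature set are hypothetical, made up for the example.

```python
def byte_ngrams(text, min_n=1, max_n=4):
    """Return every byte sequence of length min_n..max_n in the UTF-8 form of text."""
    data = text.encode("utf-8")
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n bytes across the encoded string.
        for i in range(len(data) - n + 1):
            grams.append(data[i:i + n])
    return grams


def known_features(text, feature_set):
    """Keep only the byte sequences that belong to the (closed) feature set."""
    return [g for g in byte_ngrams(text) if g in feature_set]


# Hypothetical feature set; the real model's ~8000 sequences are not listed here.
toy_features = {b"oo", b"ing ", b"the"}

print(len(byte_ngrams("good")))             # number of candidate sequences
print(known_features("good", toy_features))  # the ones the toy model knows
```

A four-byte word yields only ten candidate sequences, so if few or none of them are in the feature set, the classifier has almost nothing to go on.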

corpulent commented on August 29, 2024

@saffsd Yes, in some cases it would be one or two words. I also tried the stopwords.words('english') technique, but as you can imagine, it was very inaccurate.

saffsd commented on August 29, 2024

@detrop langid.py isn't really suited for langid of individual words. There's a tool by Shuyo Nakatani, LDIG, which I think might perform better. That tool only covers 17 languages, but if the languages you want are in that set it may work for you.

corpulent commented on August 29, 2024

@saffsd, thanks for the feedback. It seems fine to me that it gives a very low probability on single words, and I guess I can assume that everything is English unless langid tells me with a high enough probability that it's not. I tried to play around with it this way, but then I got this.

text input...湁渮
(u'da', 0.9958943887804865)

Why such a high probability for Danish ('da') in this case, when the input is Chinese characters?
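The fallback described in this comment (treat text as English unless the classifier is confident it is something else) might be sketched as follows. The function name `guess_language`, the `default` and `threshold` parameters, and the stub classifier are all hypothetical; `classify` is assumed to be any callable returning a `(language, probability)` pair with the probability normalized to the range 0 to 1, like the outputs quoted in this thread.

```python
def guess_language(text, classify, default="en", threshold=0.95):
    """Return the classifier's language only when it is confident; else the default."""
    lang, prob = classify(text)
    if lang != default and prob >= threshold:
        return lang
    return default


# Stub classifier standing in for a real one (assumption, for illustration only).
def stub_classify(text):
    canned = {
        "feeling good": ("fr", 0.30),   # low confidence: fall back to default
        "guten morgen": ("de", 0.99),   # high confidence: accept
    }
    return canned.get(text, ("en", 0.50))


print(guess_language("feeling good", stub_classify))
print(guess_language("guten morgen", stub_classify))
```

A threshold like this only filters low-confidence guesses. As the outputs in this thread show, a wrong answer can also come with a very high probability, and the threshold does not catch those.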

saffsd commented on August 29, 2024

@corpulent I'm afraid I can't replicate your output:

$ langid -n
>>> text input...湁渮
('zh', 0.9922766868226115)

Such a mismatch is usually due to some encoding issue. The intention was to make langid.py encoding-agnostic, but in practice it is not clear that this is even possible. Most of the training data that we used is utf8-encoded, so accuracy is highest with utf8-encoded text. Internally, langid.py will encode unicode objects to utf8, and will operate directly on the byte representation of strings. This means that if you ensure that you pass either a unicode object or a utf8-encoded string, you should get the same result that I did.
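A minimal sketch of the normalization step suggested here, written for Python 3: decode incoming bytes using the encoding they actually arrived in, then re-encode as utf8 so the classifier sees the byte sequences it was trained on. The helper name `to_utf8_bytes` and its `source_encoding` parameter are hypothetical, and the caller is assumed to know the real source encoding.

```python
def to_utf8_bytes(text, source_encoding="utf-8"):
    """Return the UTF-8 bytes for text, decoding first if raw bytes were given."""
    if isinstance(text, bytes):
        # Raw bytes: interpret them with the encoding they were produced in.
        text = text.decode(source_encoding)
    return text.encode("utf-8")


sample = "湁渮"

# The same characters produce different byte sequences under different encodings,
# so a byte-level model sees entirely different features.
print(sample.encode("utf-8"))
print(sample.encode("utf-16-le"))

# After normalization both inputs yield identical bytes.
print(to_utf8_bytes(sample) == to_utf8_bytes(sample.encode("utf-16-le"), "utf-16-le"))
```

Because the model works on bytes rather than characters, feeding it the same text in a different encoding is effectively feeding it a different input.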

corpulent commented on August 29, 2024

@saffsd you are right, it seems like an encoding issue. A quick fix was to decode the string as utf-8: text = unicode(text, 'utf-8'). I think the problem is how I am grabbing text from the terminal; I am using raw_input like this: text = raw_input('text input...').strip().

utilitarianexe commented on August 29, 2024

I am also getting some wrong results with high claimed confidence on short text, even without special characters. For example, ('de', 0.9886530789011537) for '2009 Pet Comedy Challenge: Ben Kronberg'.

utilitarianexe commented on August 29, 2024

but will give https://github.com/shuyo/ldig a try

saffsd commented on August 29, 2024

Closed as it has been inactive for some time.
