
wrong detection (langid.py) · 7 comments · CLOSED

saffsd commented on August 29, 2024
wrong detection


Comments (7)

tripleee commented on August 29, 2024

This is going to be a problem with any short input. If you need high precision for short inputs, maybe you could preprocess it so that proper names are normalized or neutralized one way or another? Then your sample becomes e.g. "A: B dynamic was why I left C". This is a hard topic unto itself, though. (Named Entity Recognition could help but state of the art is nothing like 100% accuracy.)
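For instance, a very crude version of that normalization, with no NER at all, might just mask capitalized tokens that don't start a sentence. The sample string and names below are made-up stand-ins for your input:

import re

import langid

def mask_proper_names(text):
    # Crude heuristic: replace capitalized words that are not sentence-initial
    # with a neutral placeholder, so proper names carry less weight.
    tokens = text.split()
    masked = []
    for i, tok in enumerate(tokens):
        sentence_initial = (i == 0) or masked[-1].endswith(('.', '!', '?'))
        if not sentence_initial and re.match(r"^[A-Z][a-z]+\W?$", tok):
            masked.append("X")
        else:
            masked.append(tok)
    return " ".join(masked)

print(langid.classify(mask_proper_names("Alice: Bob dynamic was why I left Initech")))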


canhduong28 commented on August 29, 2024

@tripleee thanks for your comment.


canhduong28 commented on August 29, 2024

In [3]: text = """Kojima truly has a variety of friends. The game designer took to Twitter today to show off images of him hanging out with Star Wars: The Force Awakens director J.J. Abrams and the astromech droid BB-8. "JJ, who has been supporting the project for a long time, has also been told that MGS V is complete," he said, pictured in a photo with J.J. and a copy of Metal Gear Solid V: The Phantom Pain. ずっと応援してくれていたJJにも「MGSV TPP」完成を報告。 pic.twitter.com/qajeAizxYx — 小島秀夫 (@Kojima_Hideo) August 28, 2015. Known best for his work on the Metal Gear series, Kojima is a veteran game designer who has spent much of his career working on games for Konami. Recent evidence suggests, however, that he may no longer be working for Konami. Read IGN's Metal Gear Solid V review to learn why it earned a score of 10. Cassidee is a freelance writer and the co-host of a podcast about freelancing. You can chat with her about that and all other things geeky on Twitter."""

In [4]: langid.classify(text)
Out[4]: ('la', 0.9999999125266582)

Is it true that non-ASCII (Unicode) characters carry a greater weight than ASCII characters? The text above should be detected as English; is there any solution for this?
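One workaround I'm considering is constraining the candidate set; a minimal sketch using langid.set_languages, assuming we only ever expect English or Japanese in this feed:

import langid

# Restrict the model to the languages we actually expect in this data.
langid.set_languages(['en', 'ja'])
print(langid.classify(text))  # hoping for ('en', ...) rather than ('la', ...)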

Thanks in advance,
Canh


bittlingmayer commented on August 29, 2024

I've done a quick evaluation and would suggest a few basic improvements:

  1. add 'und' (undetermined) for input that is really not in any language
  2. give weight to known stop words
  3. give weight to punctuation
  4. consider that certain characters only occur in certain languages
  5. train on transliteration too, to identify languages like Arabic, Russian, Hindi and Chinese when they are written in the Latin script

langid.classify("haha")
('en', 0.16946150595865334)
langid.classify("!!!")
('en', 0.16946150595865334)
langid.classify("no")
('en', 0.16946150595865334)
langid.classify("no!")
('en', 0.16946150595865334)
langid.classify("¡No!")
('zh', 0.2249412262395412)

The last should of course be Spanish, not Chinese.

langid.classify("yeah haha")
('id', 0.4730470342933074)
langid.classify("ты я меня так что да нет не же")
('bg', 0.6202337036529055)

Note that this is unambiguously Russian, not Bulgarian. Most of the words, and even one of the characters ('ы'), do not occur in Bulgarian.

langid.classify("jajaj")
('en', 0.16946150595865334)
langid.classify("jaja")
('en', 0.16946150595865334)

I think Spanish or German would be more reasonable guesses here, whether using a word-based or a character-n-gram-based approach.

langid.classify("ty kuda edish' seychas?")
('es', 0.48318636730355763)

This is actually Russian translit.

langid.classify("asdfk94jlskdle")
('en', 0.16946150595865334)

In my opinion, this should return ('und', 0.99).


saffsd commented on August 29, 2024

@bittlingmayer thanks for the suggestions and the excellent examples. The suggestions are all very sensible, but not very easy to implement in practice. In designing and training langid.py we tried to avoid introducing any hand-crafted features. We did this by using collections of documents in known languages and detecting essentially byte patterns that were characteristic of specific languages. This naturally selects some patterns and stopwords (see list of features). Your suggestions identify several weaknesses in the approach, but unfortunately I don't know any way to integrate them into the existing method in a way that I am confident would improve performance across the board:

  1. There is no easy way to represent or train for the "und" class. It may be possible to learn a per-language threshold, but our early experiments in this led nowhere (a rough sketch of such a wrapper appears at the end of this comment).
  2. It's not easy to determine stopwords across 97 languages.
  3. Punctuation is not always easy to detect, and many languages share it. Any punctuation that is particularly characteristic of a language should have been detected in the generation of our feature set.
  4. Again, if there is evidence in our training data for certain characters being language-specific, they should have been detected. It may well be the case that our training data contains mislabeled documents. However, in the Russian/Bulgarian example you give, I suspect the character you mention isn't a feature at all, so no information from it is being used in the classification.
  5. That's an idea we discussed but never found the time to implement. I also think it is possible, but I don't have a good source of training and test data for it.

Please don't take this as a rejection of your suggestions; I think they are all sensible, and I hope to have provided some context as to why we didn't do some of them. Unfortunately I'm not able to dedicate the time and effort required to develop any of them with the thoroughness that would be needed to make them work. There is clearly some demand for better performance on short input, and that has probably been the biggest shortcoming of langid.py, so perhaps someone else in the research community will take note and further develop tools in this area.
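For concreteness, the kind of per-language threshold wrapper point 1 alludes to might look roughly like the sketch below; the threshold values are illustrative placeholders, not learned ones.

import langid

# Hypothetical per-language confidence thresholds; real values would have to
# be learned per language.
UND_THRESHOLDS = {'en': 0.5, 'la': 0.9}
DEFAULT_THRESHOLD = 0.7

def classify_with_threshold(text):
    lang, score = langid.classify(text)
    if score < UND_THRESHOLDS.get(lang, DEFAULT_THRESHOLD):
        return ('und', score)
    return (lang, score)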


bittlingmayer commented on August 29, 2024

I understand the healthy impulse to avoid too much hand-crafting, but given that language identification is a relatively static problem, I see value in some sort of definitive dataset against which a system can do a lookup for the top 10,000 (i.e. the top 90%) of queries.

  1. I would perhaps just make some rules. There are good libs and gists out there to match URLs, email addresses, numbers, emoji, mixed alphanumerics and other types of non-words. Even if it doesn't catch all non-language input, it would still be better than the current behaviour (see the sketch after this list). Another approach is to just return 'und' if no language is sufficiently probable nor clearly more probable than the others. (I mean, what does it mean if a decently long sentence has no probability higher than 10%, split roughly evenly between 5 languages that are not related to each other?)

  2. It can be done in a way where we get a performance boost even with just a list of stopwords for the top few languages. For such lists, many have used NLTK (eg https://gist.github.com/bittlingmayer/ba17969070c2749b478f). It has its strengths but it's far from perfect, so I would avoid relying on this approach as more than a confirmation, or otherwise penalising languages for which there are no stopword lists. (Eg, we can use it as a tiebreaker between two languages for which we have a list - say en/de/es.)

  3. I think punct is one of those things worth handcoding, although I agree the model should handle punct and special chars like any alpha char. Regarding "¡No!" returning 'zh', I think we need to dig deeper. It's likely a fundamental bug.

  4. The issue (not unique to this lib of course) with summative models is that there are points awarded to a lang for having a char, but no concept of taking points away. So let's say we have some long string "Xxxx Xxxxxxx xxxxx xx xxxxxx xx Xxxxx." (where X/x are real chars). And we must decide between Swedish and Danish, for some set of words and characters that are very similar in both. For the presence of 'ä', we could know it's not Danish, but under the current regime, in a long string, there is hardly any boost from that - it can easily get lost in the noise.

  5. This I understand is a bit more of an undertaking. The safest and easiest way I have seen it done is to just programmatically produce translit from proper text. (Same as for producing Latin-alphabet languages like Spanish or German without accent marks.)
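To illustrate point 1, here is a rough sketch of such a pre-filter; the patterns and the fallback threshold are illustrative, not tuned:

import re

import langid

# Illustrative patterns for obvious non-language input: URLs, email
# addresses, and strings with no alphabetic characters at all.
NON_LANGUAGE_PATTERNS = [
    re.compile(r"^https?://\S+$"),
    re.compile(r"^\S+@\S+\.\S+$"),
    re.compile(r"^[\d\W_]+$"),
]

def classify_or_und(text):
    stripped = text.strip()
    if not stripped or any(p.match(stripped) for p in NON_LANGUAGE_PATTERNS):
        return ('und', 1.0)
    lang, score = langid.classify(stripped)
    if score < 0.2:  # no language is sufficiently probable
        return ('und', score)
    return (lang, score)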

Re features, I guess the implicit goal is to compress the feature set a bit to keep the library size small? Because I would expect a string like "Jedenfalls empfehlenswert!" to be classified correctly every time (two long words, both unique to one major language), and we would only back off to character n-grams if there is no match.
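A toy version of that lookup-first idea, where the word list is obviously a tiny stand-in for a real lexicon:

import langid

# Tiny stand-in lexicon of words assumed to be unique to one language.
DISTINCTIVE_WORDS = {
    'jedenfalls': 'de',
    'empfehlenswert': 'de',
    'nevertheless': 'en',
}

def classify_with_lexicon(text):
    words = [w.strip('.,!?¡¿";:').lower() for w in text.split()]
    hits = {DISTINCTIVE_WORDS[w] for w in words if w in DISTINCTIVE_WORDS}
    if len(hits) == 1:
        return (hits.pop(), 1.0)
    # Back off to the character-based model if the lexicon is silent or ambiguous.
    return langid.classify(text)

print(classify_with_lexicon("Jedenfalls empfehlenswert!"))  # ('de', 1.0)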

I have certainly found that training data are frequently contaminated. This often occurs with false equations of country with language (eg .ba -> bs, although I must say langid.py performs well on that normally problematic set of languages).

Overall I hope you see that I am not suggesting to solve any of these issues 100%, but to incorporate a good-enough solution in such a way as to get some upside with no downside, cheaply.


bittlingmayer commented on August 29, 2024

As a note, we essentially already have 'und', as far as I can tell.

langid.classify("haha")
('en', 0.16946150595865334)
langid.classify("!!!")
('en', 0.16946150595865334)
langid.classify("no")
('en', 0.16946150595865334)
langid.classify("no!")
('en', 0.16946150595865334)
langid.classify("asdfk94jlskdle")
('en', 0.16946150595865334)

So

result = langid.classify(x)
if result == ('en', 0.16946150595865334):
    result = ('und', 0.5)

:-)

