Comments (3)

saffsd commented on August 29, 2024

Hello Sina!

The model is trained on Hindi and Korean, and my testing shows that
langid.py works correctly for both the strings you provided.

यह एक स्ट्रिंग है कि मैं हिंदी में लिखना है
('hi', 0.0044859052608894794)
이것은 아랍어 문자열입니다
('ko', 0.0047176192477553112)

My testing above was done with langid.py used interactively in a
terminal session. I suspect that you are facing an encoding issue.
langid.py does not perform encoding detection; it assumes that the text
provided is UTF-8 encoded. Could you please give me a bit more
detail about how you are using langid.py?

Cheers
Marco

from langid.py.

sinjax commented on August 29, 2024

Hey Marco,

Thanks for the quick reply! It would seem that I, like the reporter of the only other issue on this project, was having encoding problems!

If I did the following:
s = u"\u092F\u0939 \u090F\u0915 \u0938\u094D\u091F\u094D\u0930\u093F\u0902\u0917 \u0939\u0948 \u0915\u093F \u092E\u0948\u0902 \u0939\u093F\u0902\u0926\u0940 \u092E\u0947\u0902 \u0932\u093F\u0916\u0928\u093E \u0939\u0948"
classify(s)

the ordinal values (Unicode code points) of the string s are:
[2351, 2361, 32, 2319, 2325, 32, 2360, 2381, 2335, 2381, 2352, 2367, 2306, 2327, 32, 2361, 2376, 32, 2325, 2367, 32, 2350, 2376, 2306, 32, 2361, 2367, 2306, 2342, 2368, 32, 2350, 2375, 2306, 32, 2354, 2367, 2326, 2344, 2366, 32, 2361, 2376]

Because you trained your naive Bayes classifier on byte features, these code points, most of which are larger than 255, resulted in all-zero feature vectors. One way to solve this problem is to define s as a plain byte string containing the original characters, such as:

s = "यह एक स्ट्रिंग है कि मैं हिंदी में लिखना है"
for which the ordinal values are:
[224, 164, 175, 224, 164, 185, 32, 224, 164, 143, 224, 164, 149, 32, 224, 164, 184, 224, 165, 141, 224, 164, 159, 224, 165, 141, 224, 164, 176, 224, 164, 191, 224, 164, 130, 224, 164, 151, 32, 224, 164, 185, 224, 165, 136, 32, 224, 164, 149, 224, 164, 191, 32, 224, 164, 174, 224, 165, 136, 224, 164, 130, 32, 224, 164, 185, 224, 164, 191, 224, 164, 130, 224, 164, 166, 224, 165, 128, 32, 224, 164, 174, 224, 165, 135, 224, 164, 130, 32, 224, 164, 178, 224, 164, 191, 224, 164, 150, 224, 164, 168, 224, 164, 190, 32, 224, 164, 185, 224, 165, 136]

and the features are correct.
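
To make the difference concrete, here is a minimal sketch assuming Python 3, where str holds code points and bytes holds encoded bytes (the snippets in this thread are Python 2). The helper names code_points and utf8_bytes are hypothetical and are not part of langid.py.

```python
def code_points(text):
    """Return the Unicode code point of each character in a str."""
    return [ord(ch) for ch in text]


def utf8_bytes(text):
    """Return the individual byte values of the UTF-8 encoding of a str."""
    # Iterating over a bytes object in Python 3 already yields integers.
    return list(text.encode("utf-8"))


# First two characters of the Hindi string quoted above.
sample = "\u092f\u0939"

print(code_points(sample))  # [2351, 2361]
print(utf8_bytes(sample))   # [224, 164, 175, 224, 164, 185]
```

Every value in the second list fits in a single byte, which is what a model trained on byte features can match.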

I think there is a reasonable fix for this: if the incoming string is a unicode object, you can encode it in the tokenize function like so:
map(ord, s.encode("utf-8"))

Knowing in advance whether a string has already been encoded is difficult, but you can use the technique outlined here: http://code.activestate.com/recipes/466341-guaranteed-conversion-to-unicode-or-byte-string/, which should let you convert any string to bytes safely.
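
The idea can be sketched roughly as follows, assuming Python 3. The function to_utf8_bytes and its fallback_encoding parameter are hypothetical; this is neither the code from the linked recipe nor langid.py's own behavior, and the Latin-1 fallback is only an assumption about what the non-UTF-8 input might be.

```python
def to_utf8_bytes(text, fallback_encoding="latin-1"):
    """Return UTF-8 encoded bytes for either a str or a bytes-like input."""
    if isinstance(text, str):
        # Text given as code points: encode it to UTF-8 bytes.
        return text.encode("utf-8")
    if isinstance(text, (bytes, bytearray)):
        raw = bytes(text)
        try:
            # Already valid UTF-8: pass it through unchanged.
            raw.decode("utf-8")
            return raw
        except UnicodeDecodeError:
            # Assumption: anything else is in the fallback encoding.
            return raw.decode(fallback_encoding).encode("utf-8")
    raise TypeError("expected str or bytes, got %s" % type(text).__name__)


# Byte features for a classifier trained on bytes.
features_input = list(to_utf8_bytes("\u092f\u0939"))
print(features_input)
```

The fallback cannot tell one legacy encoding from another, so it only guarantees that the result is valid UTF-8, not that the text was interpreted correctly.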

Thanks very much for your help.

On a side note, I am taking your great project and porting it to Java for my work. I'd really love to contribute back to your project. Shall I do this by branching and adding the Java classes? I haven't ported the feature selector and trainer yet, so I have been converting the language model to a format readable in Java. There are some kinks to iron out, but I will let you know what I come up with if you are interested :-). Of course, my Java classes attribute you as the original author.

Thanks again!

Sina

saffsd commented on August 29, 2024

Hello Sina

Thank you for your input. It does seem that encoding has caused problems for some users, but I think that encoding detection is beyond the scope of langid.py. There are other modules out there that perform this function, such as chardet. If this continues to give users difficulty, I may consider attempting to detect the situation and warn the user.
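
As a rough illustration of doing the detection outside langid.py, the sketch below assumes Python 3 and uses the third-party chardet package when it is installed. The function decode_with_detection is hypothetical; chardet.detect takes bytes and returns a dictionary whose "encoding" entry may be None when no guess can be made.

```python
def decode_with_detection(raw, default="utf-8"):
    """Decode bytes to str, using chardet's guess when it is available."""
    try:
        import chardet  # third-party; optional in this sketch
    except ImportError:
        chardet = None

    guess = chardet.detect(raw)["encoding"] if chardet else None
    try:
        return raw.decode(guess or default, errors="replace")
    except LookupError:
        # The guessed name is not a codec Python knows: use the default.
        return raw.decode(default, errors="replace")


text = decode_with_detection(b"hello world")
utf8_input = text.encode("utf-8")  # bytes in the encoding langid.py assumes
print(utf8_input)
```

Detection is a statistical guess and can be wrong on short inputs, so passing the encoding explicitly is preferable when it is known.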

I wish you all the best with your Java implementation, but I think it is best that you maintain your own repository. I would be happy to provide a link to your project when it is complete. The compiled model should not be difficult to convert; it is essentially just a large heap of numbers. I am planning to write a paper soon to describe the implementation of langid.py from a more technical perspective.

All the best!

Cheers
Marco
