Hello! Thanks so much for the great tool and paper, it is really hel

Hindi, Arabic, Korean and Japanese about langid.py HOT 3 CLOSED

sinjax commented on August 29, 2024

Hindi, Arabic, Korean and Japanese

from langid.py.

Comments (3)

saffsd commented on August 29, 2024

Hello Sina!

The model is trained on Hindi and Korean, and my testing shows that
langid.py works correctly for both the strings you provided.

यह एक स्ट्रिंग है कि मैं हिंदी में लिखना है
('hi', 0.0044859052608894794)
이것은 아랍어 문자열입니다
('ko', 0.0047176192477553112)

My testing above was done with langid.py used interactively in a
terminal session. I suspect that you are facing an encoding issue.
langid.py does not perform encoding detection; it assumes that the text
provided is UTF-8 encoded. Could you please provide me a bit more
detail about how you are using langid.py?

Cheers
Marco


sinjax commented on August 29, 2024

Hey Marco,

Thanks for the quick reply! It seems that I, like the reporter of the only other issue on this project, was having encoding problems!

If I did the following:
s = u"\u092F\u0939 \u090F\u0915 \u0938\u094D\u091F\u094D\u0930\u093F\u0902\u0917 \u0939\u0948 \u0915\u093F \u092E\u0948\u0902 \u0939\u093F\u0902\u0926\u0940 \u092E\u0947\u0902 \u0932\u093F\u0916\u0928\u093E \u0939\u0948"
classify(s)

the ordinal values (Unicode code points) of the string s are:
[2351, 2361, 32, 2319, 2325, 32, 2360, 2381, 2335, 2381, 2352, 2367, 2306, 2327, 32, 2361, 2376, 32, 2325, 2367, 32, 2350, 2376, 2306, 32, 2361, 2367, 2306, 2342, 2368, 32, 2350, 2375, 2306, 32, 2354, 2367, 2326, 2344, 2366, 32, 2361, 2376]

Because you trained your naive Bayes classifier on byte features, this resulted in all-zero feature vectors. One way to solve this problem is to define s with the original characters, as a plain byte string rather than a unicode literal, such as:

s = "यह एक स्ट्रिंग है कि मैं हिंदी में लिखना है"
for which the ordinal values are:
[224, 164, 175, 224, 164, 185, 32, 224, 164, 143, 224, 164, 149, 32, 224, 164, 184, 224, 165, 141, 224, 164, 159, 224, 165, 141, 224, 164, 176, 224, 164, 191, 224, 164, 130, 224, 164, 151, 32, 224, 164, 185, 224, 165, 136, 32, 224, 164, 149, 224, 164, 191, 32, 224, 164, 174, 224, 165, 136, 224, 164, 130, 32, 224, 164, 185, 224, 164, 191, 224, 164, 130, 224, 164, 166, 224, 165, 128, 32, 224, 164, 174, 224, 165, 135, 224, 164, 130, 32, 224, 164, 178, 224, 164, 191, 224, 164, 150, 224, 164, 168, 224, 164, 190, 32, 224, 164, 185, 224, 165, 136]

and the features are correct.
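
The difference between the two ordinal lists above can be reproduced with a short sketch. It is written for Python 3 (the snippets in this thread are Python 2), where `str` holds code points and `bytes` holds the encoded form. The variable names are illustrative only and are not part of langid.py.

```python
# Python 3 sketch: code points vs. UTF-8 bytes for the first word ("यह")
# of the Hindi string discussed above.
text = "\u092f\u0939"

# Iterating over a unicode string yields code points (values above 255).
code_points = [ord(ch) for ch in text]

# Encoding to UTF-8 first yields the byte values a byte-feature model expects.
utf8_bytes = list(text.encode("utf-8"))

print(code_points)  # [2351, 2361]
print(utf8_bytes)   # [224, 164, 175, 224, 164, 185]
```

Each Devanagari code point becomes three bytes in UTF-8, which is why the second list in the comment is roughly three times as long as the first.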

I think there is a reasonable fix for this: you can encode a string coming into the tokenize function like so:
map(ord,s.encode("utf-8"))

if the string is a unicode object. Knowing in advance whether a string is already encoded is difficult, but you can use the technique outlined here: http://code.activestate.com/recipes/466341-guaranteed-conversion-to-unicode-or-byte-string/ which should let you encode any string as bytes safely.
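
As a rough illustration of the idea, here is a minimal Python 3 sketch of a helper that coerces its input to UTF-8 bytes before classification. The name `to_utf8_bytes` and its `assumed_encoding` parameter are hypothetical; this is neither langid.py's API nor the code from the linked recipe, and it does not detect encodings, it only applies the one the caller states.

```python
def to_utf8_bytes(value, assumed_encoding="utf-8"):
    """Return UTF-8 encoded bytes for a str or bytes-like input.

    Hypothetical helper: byte input is decoded with `assumed_encoding`
    (an assumption supplied by the caller, not detected) and re-encoded.
    """
    if isinstance(value, str):
        # Unicode text: encode directly.
        return value.encode("utf-8")
    if isinstance(value, (bytes, bytearray)):
        # Raises UnicodeDecodeError if the bytes do not match the assumption.
        return bytes(value).decode(assumed_encoding).encode("utf-8")
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


# Byte values that a byte-feature classifier would then see.
features_input = list(to_utf8_bytes("\u092f\u0939"))
```

Failing loudly on a decode error, rather than silently replacing characters, keeps a wrong encoding assumption from turning into a wrong classification.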

Thanks very much for your help.

On a side note, I am taking your great project and porting it to Java for my work. I'd really love to feed back to your project; shall I do this by branching and adding the Java classes? I haven't ported the feature selector and trainer yet, so I was converting the language model to a format readable in Java. There are some kinks to iron out, but I will let you know what I come up with if you are interested :-). Of course, my Java classes attribute you as the original author etc.

Thanks again!

Sina


saffsd commented on August 29, 2024

Hello Sina

Thank you for your input. It does seem that encoding has caused problems for some users, but I think that encoding detection is beyond the scope of langid.py. There are other modules out there to perform this function, such as chardet. If this continues to give users difficulty, I may consider attempting to detect the situation and warn the user.

I wish you all the best in your Java implementation, but I think that it is best that you maintain your own repository. I would be happy to provide a link to your project when it is complete. The compiled model should not be difficult to convert; it is essentially just a large heap of numbers. I am planning to write a paper soon to describe the implementation of langid.py from a more technical perspective.

All the best!

Cheers
Marco
