Different result when giving the same text about langid.py HOT 6 CLOSED

Stophface commented on August 29, 2024

Different result when giving the same text

from langid.py.

Comments (6)

saffsd commented on August 29, 2024

Hi Stophface,

Thanks for reporting the issue. The algorithm used by langid.py is
entirely deterministic, so the only way to get two different outputs is to
provide it with two different inputs. An encoding issue would be my first
thought: is it possible that your database is returning text that is not
UTF-8-encoded? Another possibility is some oddity in the whitespace
characters. In any case, what I think is happening is that the string
returned by your database looks the same as the one manually entered, but
is actually different when you compare them at the byte level.
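A minimal sketch of what such a byte-level comparison could look like in Python 3. The two strings and the helper `first_difference` are hypothetical and chosen only for illustration; the no-break space stands in for any invisible difference and is not taken from the reporter's data.

```python
def first_difference(a: str, b: str):
    """Return (index, char_a, char_b) for the first differing position, or None."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i, ca, cb
    if len(a) != len(b):
        # One string is a prefix of the other; report where the shorter one ends.
        i = min(len(a), len(b))
        return i, a[i:i + 1], b[i:i + 1]
    return None


typed = "noir et blanc"
# Hypothetical database value: U+00A0 NO-BREAK SPACE instead of a plain space.
from_db = "noir\u00a0et blanc"

print(typed == from_db)            # False
print(typed.encode("utf-8"))       # plain space is the single byte 0x20
print(from_db.encode("utf-8"))     # no-break space is the two bytes 0xC2 0xA0
print(first_difference(typed, from_db))
```

If the two strings compare equal and their encoded bytes match, the classifier is receiving the same input in both cases and the difference must come from elsewhere in the script.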

As an aside, what do you expect to be the correct output for that input?

Cheers,
Marco

On Wed, Mar 25, 2015 at 11:04 PM, Stophface [email protected]
wrote:

I have a database from which I read. I want to identify the language in a
specific cell, defined by column.

I read from my database like this:

connector = sqlite3.connect("somedb.db")
selecter = connector.cursor()
selecter.execute(''' SELECT tags FROM sometable''')
for row in selecter:  # iterate through all the rows in db
    # print(type(row))  # tuple
    rf = str(row)
    # print(type(rf))  # string
    lan = langid.classify("{}".format(rf))

Technically, it works. It identifies the languages used and later on (not
displayed here) writes the identified language back into the database.

So, now comes the weird part.
I wanted to double check some results manually. So I have these words:

a = "shadow party people bw music mer white man black france men art nature monochrome french fun shoe sand nikon europe noir noiretblanc sable playa poetic nb ombre shade contraste plage blanc saxophone dunkerque nord homme musique saxo artiste artistique musicien chaussure blancandwhite d90 saxophoniste zyudcoote"

When I perform the language identification on the database, it writes
Portuguese into the database.
But when I perform it like this:

a = "shadow party people bw music mer white man black france men art nature monochrome french fun shoe sand nikon europe noir noiretblanc sable playa poetic nb ombre shade contraste plage blanc saxophone dunkerque nord homme musique saxo artiste artistique musicien chaussure blancandwhite d90 saxophoniste zyudcoote"
lan = langid.classify(a)

Well, that returns French. Apart from the fact that the text is neither French nor
Portuguese, why are different results returned?!


Stophface commented on August 29, 2024

Hey,
thanks for your fast reply.
When I create the database the field which is populated with the words I showed here is set as TEXT.

TEXT. The value is a text string, stored using the database encoding (UTF-8, UTF-16BE or UTF-16LE).
https://www.sqlite.org/datatype3.html

I insert the data into the database with the
"prepared statement"
https://en.wikipedia.org/wiki/Prepared_statement
http://stackoverflow.com/questions/3727688/what-does-a-question-mark-represent-in-sql-queries

The values returned when I read them from the database are Python tuples. I then convert them to a string, as you can see.
According to this SO post http://stackoverflow.com/questions/4182603/python-how-to-convert-a-string-to-utf-8 Python uses ASCII.
Do you recommend converting it to UTF-8?

saffsd commented on August 29, 2024

Given the text you showed there should be no difference between ASCII and
UTF-8 (all ASCII is valid UTF-8 by design). Are you on Python 2 or
Python 3? The best thing to do would be to look at the output of your
database as a sequence of bytes. In Python 2 this would be something like
print(map(ord, rf)).

However, looking at your code more closely, I notice that you do
rf = str(row). This produces the string representation of the tuple row,
which includes the quotes and parentheses. Is this your
intention? Or did you intend for this to be rf = ' '.join(row)?
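To illustrate the difference, here is a small self-contained sketch using an in-memory SQLite database. The table contents are made up for the example, and the call to langid is left out so that the snippet runs without the library installed; whichever string is built here is what would be passed to langid.classify.

```python
import sqlite3

# In-memory database with one hypothetical row, mirroring the query in the report.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sometable (tags TEXT)")
conn.execute("INSERT INTO sometable (tags) VALUES (?)", ("noir blanc plage",))

row = conn.execute("SELECT tags FROM sometable").fetchone()  # a 1-tuple

as_repr = str(row)       # representation of the tuple, with quotes and parentheses
as_text = " ".join(row)  # the text itself
first_col = row[0]       # equivalent here, since only one column is selected

print(repr(as_repr))     # "('noir blanc plage',)"
print(repr(as_text))     # 'noir blanc plage'

conn.close()
```

The extra characters in the str(row) form become part of the classifier's input, which is one way the database path and the manually typed string can end up differing.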

Stophface commented on August 29, 2024

There will be different text: Farsi, Pashto, Arabic. Basically any spoken language might be in the variable I pass to langid.
I am not a programmer, as you might have recognized already. Is there a difference between ASCII and UTF-8 when passing text to langid in a variable?

My intention is to pass text to langid that is as clean as possible.
So rf = ''.join(row) is the better thing to do. I had that before, but I started editing my code and it got lost. Thanks for mentioning it.
However, passing it to langid with join(row) does not do the trick.
I am working in Python 3.3. Could you specify what you mean by "looking at it byte by byte"?
print(map(ord, rf)) is for Python 2. I cannot find the syntax for Python 3 since I do not know exactly what to look for.

That's my output when looking at it byte by byte:

b = rf.encode('utf-8')
print (b)

b'shadow party people bw music mer white man black france men art nature monochrome french fun shoe sand nikon europe noir noiretblanc sable playa poetic nb ombre shade contraste plage blanc saxophone dunkerque nord homme musique saxo artiste artistique musicien chaussure blancandwhite d90 saxophoniste zyudcoote'
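For reference, a rough Python 3 counterpart to the Python 2 one-liner might look like the sketch below. In Python 3, map returns a lazy iterator, so the values have to be materialized before printing. The short string is a stand-in for the real value of rf.

```python
rf = "d90 saxo"  # short stand-in for the text read from the database

# Unicode code point of every character in the string.
code_points = [ord(ch) for ch in rf]

# Integer value of every byte in the UTF-8 encoding.
byte_values = list(rf.encode("utf-8"))

print(code_points)
print(byte_values)

# ascii() escapes every non-ASCII character, so invisible or unusual
# characters show up as \x.., \u.... or \U........ sequences.
print(ascii(rf))
```

For pure ASCII text the two lists are identical; they diverge as soon as the string contains a character outside the ASCII range.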

Thanks for providing such a great tool!

Stophface commented on August 29, 2024

Solved.
There was a problem further down with my script....

Ah, and I expected the language to be identified as French :) All good now!

If you're interested: I am using your library on the Flickr API :)

saffsd commented on August 29, 2024

Closing as @Stophface indicated issue is solved.
