
Comments (6)

saffsd commented on August 29, 2024

Hi Stophface,

Thanks for reporting the issue. The algorithm used by langid.py is
entirely deterministic, so the only way to get two different outputs is to
provide it with two different inputs. An encoding issue would be my first
thought: is it possible that your database is returning text that is not
UTF-8 encoded? Another possibility is some weirdness in the whitespace
characters. In any case, what I think is happening is that the string
returned by your database looks the same as the one manually entered, but
is actually different when you compare them at the byte level.
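
A minimal sketch of what "different at the byte level" can look like, assuming Python 3. The two strings are made up for illustration and are not taken from the reported database: one uses a regular space, the other a non-breaking space (U+00A0), so they may look alike on screen without being equal.

```python
# Two strings that can look identical when printed (hypothetical sample data).
typed = "noir blanc"          # regular space, U+0020
from_db = "noir\u00a0blanc"   # non-breaking space, U+00A0

# They are not equal, even though they may render the same.
same = typed == from_db

# Encoding to UTF-8 makes the difference visible.
typed_bytes = typed.encode("utf-8")
from_db_bytes = from_db.encode("utf-8")

print(same)
print(typed_bytes)
print(from_db_bytes)
```

Comparing the encoded bytes (or simply `typed == from_db`) is a quick way to rule this kind of mismatch in or out before looking at the classifier.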

As an aside, what do you expect to be the correct output for that input?

Cheers,
Marco

On Wed, Mar 25, 2015 at 11:04 PM, Stophface wrote:

I have a database from which I read. I want to identify the language in a
specific cell, defined by column.

I read from my database like this:

connector = sqlite3.connect("somedb.db")
selecter = connector.cursor()
selecter.execute(''' SELECT tags FROM sometable''')
for row in selecter:  # iterate through all the rows in db
    # print(type(row))  # tuple
    rf = str(row)
    # print(type(rf))  # string
    lan = langid.classify("{}".format(rf))

Technically, it works. It identifies the languages used and later on (not
displayed here) writes the identified language back into the database.

So, now comes the weird part.
I wanted to double check some results manually. So I have these words:

a = "shadow party people bw music mer white man black france men art nature monochrome french fun shoe sand nikon europe noir noiretblanc sable playa poetic nb ombre shade contraste plage blanc saxophone dunkerque nord homme musique saxo artiste artistique musicien chaussure blancandwhite d90 saxophoniste zyudcoote"

When I perform the language identification on the database, it writes
Portuguese into the database.
But, performing it like this:

a = "shadow party people bw music mer white man black france men art nature monochrome french fun shoe sand nikon europe noir noiretblanc sable playa poetic nb ombre shade contraste plage blanc saxophone dunkerque nord homme musique saxo artiste artistique musicien chaussure blancandwhite d90 saxophoniste zyudcoote"
lan = langid.classify(a)

Well, that returns French. Apart from the fact that it is neither French nor
Portuguese, why does it return different results?!

from langid.py.

Stophface commented on August 29, 2024

Hey,
thanks for your fast reply.
When I create the database the field which is populated with the words I showed here is set as TEXT.

TEXT. The value is a text string, stored using the database encoding (UTF-8, UTF-16BE or UTF-16LE).
https://www.sqlite.org/datatype3.html

I insert the data into the database with the
"prepared statement"
https://en.wikipedia.org/wiki/Prepared_statement
http://stackoverflow.com/questions/3727688/what-does-a-question-mark-represent-in-sql-queries

The values returned when I read them from the database are a Python tuple. I then convert them to a string, as you can see.
According to this SO post http://stackoverflow.com/questions/4182603/python-how-to-convert-a-string-to-utf-8 Python is ASCII.
Do you recommend converting it to UTF-8?


saffsd commented on August 29, 2024

Given the text you showed, there should be no difference between ASCII and
UTF-8 (all ASCII is valid UTF-8 by design of UTF-8). Are you on Python 2 or
Python 3? The best thing to do would be to look at the output of your
database as a sequence of bytes. In Python 2 this would be something like
print(map(ord, rf)).
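
For Python 3, a rough equivalent might look like the sketch below. The value of `rf` here is a shortened, hypothetical sample rather than the real database output. In Python 3, `map()` returns a lazy iterator, so it has to be turned into a list before printing.

```python
rf = "shadow party people"  # hypothetical sample; substitute the database value

# Python 3: map() returns a lazy iterator, so materialize it before printing.
code_points = list(map(ord, rf))

# For pure-ASCII text the ASCII and UTF-8 encodings are byte-for-byte identical.
ascii_bytes = rf.encode("ascii")
utf8_bytes = rf.encode("utf-8")

print(code_points[:6])
print(ascii_bytes == utf8_bytes)
```

Any value above 127 in `code_points` would indicate a non-ASCII character, which is where the two encodings start to differ.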

However, looking at your code more closely, I notice that you do
rf = str(row). This gives the representation of the tuple row as a
string, which will include quotes and parentheses. Is this your
intention? Or did you intend for this to be rf = ' '.join(row)?
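
To illustrate the difference, here is a small self-contained sketch, assuming Python 3 and the standard-library sqlite3 module. It uses an in-memory database in place of somedb.db, and the row content is shortened sample data. The table and column names follow the original post.

```python
import sqlite3

# In-memory database standing in for "somedb.db"; the row is sample data.
connector = sqlite3.connect(":memory:")
connector.execute("CREATE TABLE sometable (tags TEXT)")
connector.execute("INSERT INTO sometable (tags) VALUES (?)", ("noir blanc plage",))

row = connector.execute("SELECT tags FROM sometable").fetchone()

as_repr = str(row)        # representation of the tuple, with quotes and parentheses
as_text = " ".join(row)   # the column value itself

print(as_repr)   # ('noir blanc plage',)
print(as_text)   # noir blanc plage
connector.close()
```

The extra punctuation in `as_repr` is part of the input the classifier sees, so the two strings are different inputs. Note that `" ".join(row)` assumes every selected column is a non-NULL string; with a single column, `row[0]` would work as well.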


Stophface commented on August 29, 2024

There will be different text: Farsi, Pashto, Arabic. Basically, any spoken language might be in the variable I pass to langid.
I am not a programmer, as you might have recognized already. Is there a difference between ASCII and UTF-8 when passing it to langid in a variable?

My intention is to pass to langid text, as clean as possible.
So rf = ''.join(row) is the better thing to do. I had that before, but I started editing my code and it got lost. Thanks for mentioning it.
However, passing it to langid with join(row) does not do the trick.
I am working in Python 3.3. Could you specify what you mean by "looking at it byte by byte"?
print(map(ord, rf)) is for Python 2. I cannot find the syntax for Python 3 since I do not know exactly what to look for.

That's my output when looking at it byte by byte:

b = rf.encode('utf-8')
print (b)

b'shadow party people bw music mer white man black france men art nature monochrome french fun shoe sand nikon europe noir noiretblanc sable playa poetic nb ombre shade contraste plage blanc saxophone dunkerque nord homme musique saxo artiste artistique musicien chaussure blancandwhite d90 saxophoniste zyudcoote'

Thanks for providing such a great tool!


Stophface commented on August 29, 2024

Solved.
There was a problem further down in my script...

Ah, and I expected the language to be identified as French :) All good now!

If you're interested: I am using your library on the Flickr API :)


saffsd commented on August 29, 2024

Closing as @Stophface indicated issue is solved.
