
langid.py's Issues

Not detecting Hindi at all

Hi,

First of all, I could not work out how to give input for the Hindi language, even though it is in the list of pre-trained languages.

So I tried these test sentences for Hindi:

  1. "tum kahaan ja rahe ho" - normalized result: ('fi', 0.8528)
     I gave the above sentence on the Windows command line.
  2. "तुम कहाँ जा रहे हो" - normalized result: ('zh', 0.9999)
     I gave the above Hindi sentence via a "test.docx" file as input:
     python langid.py -n < test.docx

What could be the possible reasons for Hindi not being detected?

I tried with slightly longer sentences too; it does not seem to work. What am I missing?

Caching Resource

Running

import langid
langid.classify("India is a country") ## statement number 1

takes a lot of time to run "statement number 1".

but,

import langid
langid.classify("I like cricket")
langid.classify("India is a country") ## statement number 2

does not take a lot of time to run "statement number 2".

So, does langid.classify cache some information? Can we manage that? Thanks.
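For what it's worth, the slow first call looks like the one-off cost of unpacking the bundled model; a sketch of paying that cost once at startup (assuming the LanguageIdentifier.from_modelstring API in langid.langid):

from langid.langid import LanguageIdentifier, model

# Pay the model-loading cost once, at startup...
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

# ...then every call reuses the already-unpacked model.
print(identifier.classify("I like cricket"))
print(identifier.classify("India is a country"))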

Misclassification of text

Hi, we're using langid to find the language in which a press release has been written. Our text is thus in HTML.

It generally works pretty well, but it misclassified this page as being in "ky" instead of "es":

http://www.anpenavarra.com/index.php/comunicacion/notas-de-prensa/259-nota-de-prensa-encuentro-hispano-luso-de-docentes

>>> import langid
>>> import urllib2
>>> x = urllib2.urlopen('http://www.anpenavarra.com/index.php/comunicacion/notas-de-prensa/259-nota-de-prensa-encuentro-hispano-luso-de-docentes').read()
>>> langid.classify(x)
('ky', 1.0)

We're a bit puzzled by this misclassification. Is there anything we can do to help langid work around whatever is causing it, such as cleaning the text before we pass it to langid?
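One thing that may help is stripping the markup before classification; a minimal sketch using only the standard library (the regexes here are illustrative, not something langid provides):

import re
import langid

def strip_html(html):
    # drop script/style blocks, then remove remaining tags and collapse whitespace
    html = re.sub(r'(?is)<(script|style).*?>.*?</\1>', ' ', html)
    text = re.sub(r'(?s)<[^>]+>', ' ', html)
    return re.sub(r'\s+', ' ', text).strip()

print(langid.classify(strip_html(x)))  # x is the raw HTML fetched above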

Thanks!

Weird result

I got this as a result:

this is a test
('en', -40.536659240722656)

Could you please tell me what I should do to fix this?
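For context, the number here is an unnormalized log-probability rather than something wrong with the input; a sketch of asking for a normalized confidence in [0, 1] instead (assuming the norm_probs option on LanguageIdentifier, which is what the -n flag used in other issues maps to):

from langid.langid import LanguageIdentifier, model

# norm_probs=True turns the raw log-probability into a probability in [0, 1]
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
print(identifier.classify("this is a test"))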

Unable to do a pip install

When I try to do a pip install, I get the following error:

pip install langid
Downloading/unpacking langid
  Could not find a version that satisfies the requirement langid (from versions: 1.0dev, 1.1.1dev, 1.1.2dev, 1.1.3dev, 1.1.4dev, 1.1dev)
Cleaning up...
No distributions matching the version for langid

Different classification for upper/lower-case sentences

I'm noticing something very peculiar with all-caps strings.

Running langid -n:

>>> ceci est une phrase française
('fr', 0.9999966296917099)
>>> CECI EST UNE PHRASE FRANÇAISE
('pt', 0.4985860132092562)

>>> this is an english phrase
('en', 0.9999771953554634)
>>> THIS IS AN ENGLISH PHRASE
('en', 0.16946150595865334)

>>> ali bongo veut jouer la montre
('fr', 0.9999628977673475)
>>> ALI BONGO VEUT JOUER LA MONTRE
('en', 0.16946150595865334)

Note the ('en', 0.169461...): this value shows up often for all-caps words and phrases.

>>> mange
('en', 0.16946150595865334)
>>> MANGE
('en', 0.16946150595865334)

>>> fille
('fr', 0.3545074995041862)
>>> FILLE
('en', 0.16946150595865334)

>>> femme
('da', 0.5528002996661378)
>>> FEMME
('en', 0.16946150595865334)

I understand that I could always lowercase the input string, but I haven't seen this mentioned anywhere in the issue tracker, so I'm submitting this issue just so this can be documented.
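For reference, the lowercasing workaround is a one-line preprocessing step in front of classify (a sketch, not something langid does itself):

import langid

def classify_caseless(text):
    # normalize case before classification to sidestep the all-caps pathology
    return langid.classify(text.lower())

print(classify_caseless("ALI BONGO VEUT JOUER LA MONTRE"))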

Hard-coded lookup for very short strings?

It's understandable that performance for very short strings is poor. Could we create a mapping with hand-assigned weights for those?

I believe strings like 'yeah', 'no', 'si', 'haha', 'hehe' and so on should always be classified reasonably. I am happy to donate my mapping for this.
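A sketch of what such a wrapper could look like (the mapping, threshold and labels below are placeholders, not a proposed final list):

import langid

# hand-assigned labels for very short, ambiguous strings (illustrative only)
SHORT_STRING_LANGS = {
    'yeah': 'en',
    'si': 'es',
    'haha': 'en',
    'hehe': 'en',
}

def classify_short_aware(text):
    key = text.strip().lower()
    if len(key) <= 4 and key in SHORT_STRING_LANGS:
        return (SHORT_STRING_LANGS[key], 1.0)
    return langid.classify(text)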

Class probability computation is very inefficient (patch enclosed)

The following patch produces the same output with a 4.4-fold speedup for language identification (not counting startup time) in --line mode given 650-byte average line lengths, and a 33-fold speedup with 62-byte average line lengths when using the default language model. Larger models with more features show an even larger speedup.

The speedup results from avoiding a matrix multiplication against a feature-count vector which is mostly zeros. You may wish to tweak the cut-over from "short" to "long" texts by adjusting the self.nb_numfeats/10; it could probably be moved higher, but I was being conservative.

259a260,302
>   # optimized version by Ralf Brown
>   def instance2classprobs(self, text):
>     """
>     Compute class probabilities for an instance according to the trained model
>     """
>     if isinstance(text, unicode):
>       text = text.encode('utf8')
>
>     # Convert the text to a sequence of ascii values
>     ords = map(ord, text)
>
>     state = 0
>     if len(ords) < self.nb_numfeats / 10:
>       # for very short texts, just apply each production every time the
>       # state changes, rather than counting the number of occurrences of
>       # each state
>       pdc = np.zeros(len(self.nb_classes))
>       for letter in ords:
>         state = self.tk_nextmove[(state << 8) + letter]
>         for index in self.tk_output.get(state, []):
>           # compute the dot product incrementally, avoiding lots
>           # of multiplications by zero with a sparse
>           # feature-count vector
>           pdc += self.nb_ptc[index]
>     else:
>       # Count the number of times we enter each state
>       statecount = defaultdict(int)
>       for letter in ords:
>         state = self.tk_nextmove[(state << 8) + letter]
>         statecount[state] += 1
>
>       # Update all the productions corresponding to the state
>       arr = np.zeros((self.nb_numfeats,), dtype='uint32')
>       for state in statecount:
>         for index in self.tk_output.get(state, []):
>           arr[index] += statecount[state]
>       # compute the partial log-probability of the document given each class
>       pdc = np.dot(arr, self.nb_ptc)
>
>     # compute the partial log-probability of the document in each class
>     pd = pdc + self.nb_pc
>     return pd
271,272c314,315
<     fv = self.instance2fv(text)
<     probs = self.norm_probs(self.nb_classprobs(fv))
---
>     probs = self.instance2classprobs(text)
>     probs = self.norm_probs(probs)
282,283c325,326
<     fv = self.instance2fv(text)
<     probs = self.norm_probs(self.nb_classprobs(fv))
---
>     probs = self.instance2classprobs(text)
>     probs = self.norm_probs(probs)

Different result when giving the same text

I have a database from which I read. I want to identify the language in a specific cell, defined by column.

I read from my database like this:

connector = sqlite3.connect("somedb.db")
selecter = connector.cursor()
selecter.execute(''' SELECT tags FROM sometable''')
for row in selecter: #iterate through all the rows in db
    #print (type(row)) #tuple
    rf = str(row)
    #print (type(rf)) #string
    lan = langid.classify("{}".format(rf))

Technically, it works. It identifies the languages used and later on (not displayed here) writes the identified language back into the database.

So, now comes the weird part. I wanted to double-check some results manually, so I have these words:

a = "shadow party people bw music mer white man black france men art nature monochrome french fun shoe sand nikon europe noir noiretblanc sable playa poetic nb ombre shade contraste plage blanc saxophone dunkerque nord homme musique saxo artiste artistique musicien chaussure blancandwhite d90 saxophoniste zyudcoote"

When I perform the language identification through the database loop, it writes Portuguese into the database.
But, performing it like this:

a = "shadow party people bw music mer white man black france men art nature monochrome french fun shoe sand nikon europe noir noiretblanc sable playa poetic nb ombre shade contraste plage blanc saxophone dunkerque nord homme musique saxo artiste artistique musicien chaussure blancandwhite d90 saxophoniste zyudcoote"
lan = langid.classify(a)

Well, that returns French. Leaving aside that the text is neither French nor Portuguese, why do I get different results?
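A likely explanation is that row is a tuple, so str(row) hands langid the tuple's repr (with parentheses, quotes and a trailing comma) rather than the cell text itself; a sketch of the difference (values shortened for illustration):

import langid

row = ("shadow party people bw music mer white man black france ...",)  # what the cursor yields
print(str(row))   # "('shadow party people bw music ...',)" -- includes the tuple punctuation
print(row[0])     # the actual cell contents

lan = langid.classify(row[0])   # classify the cell text, not the tuple repr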

Fix pypi classifiers

Your setup.py file does not define any PyPI classifiers, which I think prevents pip3 install --pre langid. Here is the list: http://pypi.python.org/pypi?%3Aaction=list_classifiers. I think we should start with:

Programming Language :: Python :: 2
Programming Language :: Python :: 2.7
Programming Language :: Python :: 3

This would fix the Python 3 issue. But since we're on such a good start, we could continue.

You then need to choose a status. Either Beta or Production/Stable, probably Production/Stable.

Development Status :: 4 - Beta
Development Status :: 5 - Production/Stable

Then, an intended audience. I suggest:

Intended Audience :: Developers
Intended Audience :: Science/Research

Then the license. Is your custom license OSI Approved?

And finally a topic:

Topic :: Scientific/Engineering :: Artificial Intelligence

Once you tell me your choices, I can send a pull request. Thanks!
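For concreteness, a sketch of how the classifiers proposed above might be declared in setup.py (illustrative only; the final choices are up to you):

from setuptools import setup

setup(
    name='langid',
    # ... existing arguments ...
    classifiers=[
        'Development Status :: 5 - Production/Stable',
        'Intended Audience :: Developers',
        'Intended Audience :: Science/Research',
        'Programming Language :: Python :: 2',
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3',
        'Topic :: Scientific/Engineering :: Artificial Intelligence',
    ],
)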

Hindi, Arabic, Korean and Japanese

Hello!

Thanks so much for the great tool and paper; it is really helping me learn about this stuff.
I had a question about the language model provided in the code and what it was trained on.

I'm finding that strings from languages like Korean and Hindi are not being handled correctly. For example:
Hindi: यह एक स्ट्रिंग है कि मैं हिंदी में लिखना है
Korean: 이것은 아랍어 문자열입니다

both get incorrectly matched as English with a confidence of 0.0196.

Upon further inspection, I find that the dot product in the nb_classify function returns a vector of zeros. I would take this to mean that the model simply wasn't trained on these languages. However, upon closer inspection I found that the nb_pc vector (which I think holds the prior probabilities for each class?) is non-zero for Hindi.

Am I misunderstanding something? Was the basic model trained on Hindi etc.?

Thanks

- sina

Unicode support

The classifier apparently assumes utf8-encoded text:

>>> langid.classify('mangé')
('fr', 0.0016145239007830734)
>>> langid.classify(u'mangé')
('zh', -0.0063975259411870739)

adding something like:

if isinstance(text, unicode):
  text = text.encode('utf8')

would help avoid unexpected behavior.

wrong detection

Hello,

With the English text "Ángel Di María: Louis van Gaal dynamic was why I left Manchester United", the classifier returns ('la', 0.9665266986710674) because "Ángel Di María" is a Latin name.

Is there any way to overcome this situation?

Thanks in advance,
Canh
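One workaround, if the candidate languages are known in advance, is to restrict the model to that subset so 'la' is never considered; a sketch (the language list is just an example, and depending on the installed version set_languages may live on a LanguageIdentifier instance rather than on the module, as another issue below notes):

from langid.langid import LanguageIdentifier, model

identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
identifier.set_languages(['en', 'es', 'fr', 'de'])   # 'la' excluded on purpose
print(identifier.classify(u'Ángel Di María: Louis van Gaal dynamic was why I left Manchester United'))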

Poor classification with mono-cased text

I am classifying 3 million texts, and usually classification performance is good. However, for a few texts that are all uppercase, performance is very poor. Here is a toy example:

>>> import langid
>>> print langid.classify('The quick brown fox jumps over the lazy dog')
('en', 0.9999999999138445)
>>> print langid.classify('THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG')
('pl', 0.4406207245841342)

Also, all-lowercase text can have poor performance:

>>> print langid.classify('I do not speak English. Do you?')
('en', 0.9071235348010346)
>>> print langid.classify('i do not speak English. do you?')
('pt', 0.9283325957561123)

Do you think langid should be retrained on all-capitalized and all-lowercase examples? This is analogous to how image classification training sets are augmented by distorting the original images through transformations such as mirroring or warping.
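If retraining is attempted, generating the augmented examples is the easy part; a sketch that writes lowercased and uppercased copies of every training document next to the original (the "one document per file" corpus layout assumed here is an assumption, not a langid requirement):

import io
import os

def augment_case(corpus_dir):
    # add all-lower and all-upper variants of each training document
    for root, _dirs, files in os.walk(corpus_dir):
        for name in files:
            if name.endswith(('.lower', '.upper')):
                continue  # skip files we already generated
            path = os.path.join(root, name)
            with io.open(path, encoding='utf-8', errors='ignore') as f:
                text = f.read()
            for suffix, variant in (('.lower', text.lower()), ('.upper', text.upper())):
                with io.open(path + suffix, 'w', encoding='utf-8') as out:
                    out.write(variant)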

Should langid.py return probabilities?

When running langid.py directly, I get negative numbers, which are different from the 0.99 mentioned in the docs. Should I expect a number between zero and one?

$ python3 langid.py < ../README.rst
('en', -21016.57295703888)
$ python2.7 langid.py < ../README.rst
('en', -21016.57295703888)

Do the smallest values win?

$ python langid.py
>>> the
('en', 9.061840057373047)
>>> the the the the the the the the the the
('en', -240.49303674697876)

Thanks!
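For context, the raw numbers printed by default are unnormalized log-probabilities; they are not comparable across inputs of different length, and within a single input the largest value wins, not the smallest. The -n flag seen in other issues asks for a confidence normalized into [0, 1], e.g. (a sketch, output omitted):

$ python langid.py -n < ../README.rst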

Can you tell me how to identify those 97 languages?

As stated in the README, langid.py comes pre-trained on 97 languages. How can I reproduce that? I gave the ug language a try, but it was identified as zh. When I attempted to test the ja language, it reported this error:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xaf in position 21: illegal multibyte sequence

So the question is: how can I reproduce your results? Thanks in advance.
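The gbk error suggests the text is being decoded with the Windows console codepage rather than as UTF-8; a sketch of classifying a UTF-8 file directly from Python instead of piping it through the console (the file name is a placeholder):

# -*- coding: utf-8 -*-
import io
import langid

with io.open('sample_ja.txt', encoding='utf-8') as f:
    text = f.read()

print(langid.classify(text))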

Fix versioning

Your versioning scheme is not PEP 440 compliant, which means that many people, as you have noticed, have issues when installing langid.py with pip.

First, 1.1.4dev is not valid, you need to use 1.1.4.dev0 instead.

Second, is there any good reason to choose a dev suffix? Can you simply remove tag_build = dev from setup.cfg before releases?

Thanks!
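For illustration, the PEP 440-compliant development version would look like this, with tag_build = dev removed from setup.cfg before a release (a sketch of the suggestion above, not the project's current files):

from setuptools import setup

setup(
    name='langid',
    version='1.1.4.dev0',   # PEP 440-compliant; plain '1.1.4' for the release itself
    # ... existing arguments ...
)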

NBtrain.py gzip issue

When training langid.py on my own dataset, I got the following issue when I reached the NBtrain.py step:

Traceback (most recent call last):
  File "langid/train/NBtrain.py", line 278, in <module>
    nb_ptc = learn_ptc(paths, tk_nextmove, tk_output, cm, temp_path, args)
  File "langid/train/NBtrain.py", line 199, in learn_ptc
    reads, ids, prods = zip(*pass_ptc_out)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 655, in next
    raise value
IOError: Not a gzipped file

I will work on fixing it and ping you back.

train.py uses excessive memory (patch enclosed)

The following patch adds a bunch of {var}=None statements to let Python reuse memory that is no longer needed. This reduces memory use by more than a factor of two.
The patch also bypasses generation of domain_dist_vec when --no_domain_ig is specified, since it is never used in that case.

*** ./train.py  2013-06-25 19:12:19.000000000 -0400
--- ../new/train.py 2013-08-01 20:45:35.867486680 -0400
***************
*** 123,126 ****
--- 123,127 ----

    items = [ (d,l,p) for (d,l,n,p) in indexer.items ]
+   indexer = None
    if args.debug:
      # output the language index
***************
*** 191,198 ****
          write_weights(doc_count, doc_count_path)
          print "wrote DF counts for all features to:", doc_count_path
- 
      if DFfeats is None:
        # Choose the first-stage features
        DFfeats = ngram_select(doc_count, args.max_order, args.df_tokens)

      if args.debug:
--- 192,199 ----
          write_weights(doc_count, doc_count_path)
          print "wrote DF counts for all features to:", doc_count_path
        if DFfeats is None:
          # Choose the first-stage features
          DFfeats = ngram_select(doc_count, args.max_order, args.df_tokens)
+       doc_count = None

      if args.debug:
***************
*** 213,222 ****
--- 214,227 ----
      DF_scanner = Scanner(DFfeats)
      b_dirs = build_index(items, DF_scanner, buckets_dir, args.buckets, args.jobs, args.chunksize)
+     DF_scanner = None

      # Build vectors of domain and language distributions for use in IG calculation
+     if not args.no_domain_ig:
        domain_dist_vec = numpy.array([ domain_dist[domain_index[d]]
                 for d in sorted(domain_index, key=domain_index.get)], dtype=int)
+     domain_dist = None
      lang_dist_vec = numpy.array([ lang_dist[lang_index[l]]
              for l in sorted(lang_index.keys(), key=lang_index.get)], dtype=int)
+     lang_dist = None

      # Compute IG
***************
*** 235,241 ****
--- 240,249 ----
          write_weights(ig, weights_path)
        ig_vals[label] = dict((row[0], numpy.array(row[1].flat)) for row in ig)
+       ig = None
+     DFfeats = None

      # Select features according to the LD criteria
      features_per_lang = select_LD_features(ig_vals['lang'], ig_vals.get('domain'), args.feats_per_lang, ignore_domain = args.no_domain_ig)
+     ig_vals = None
      LDfeats = reduce(set.union, map(set, features_per_lang.values()))
      print 'selected %d features' % len(LDfeats)
***************
*** 251,254 ****
--- 259,263 ----
            writer.writerow(map(repr,features_per_lang[i]))
        print 'wrote LD.perlang features to "%s"' % feature_path + '.perlang'
+     features_per_lang = None

    # Compile a scanner for the LDfeats
***************
*** 259,277 ****
        cPickle.dump((tk_nextmove, tk_output, LDfeats), f)
      print "wrote scanner to {0}".format(scanner_path)

    # Assemble the NB model
    langs = sorted(lang_index, key=lang_index.get)

    cm = generate_cm([ (l,p) for d,l,p in items], len(langs))
    paths = zip(*items)[2]

    nb_classes = langs
    nb_pc = learn_pc(cm)
    nb_ptc = learn_ptc(paths, tk_nextmove, tk_output, cm, buckets_dir, args)

    # output the model
    output_path = os.path.join(model_dir, 'model')
    model = nb_ptc, nb_pc, nb_classes, tk_nextmove, tk_output
!   string = base64.b64encode(bz2.compress(cPickle.dumps(model)))
    with open(output_path, 'w') as f:
      f.write(string)
--- 268,298 ----
        cPickle.dump((tk_nextmove, tk_output, LDfeats), f)
      print "wrote scanner to {0}".format(scanner_path)
+   LDfeats = None

    # Assemble the NB model
    langs = sorted(lang_index, key=lang_index.get)
+   lang_index = None

    cm = generate_cm([ (l,p) for d,l,p in items], len(langs))
    paths = zip(*items)[2]
+   items = None

    nb_classes = langs
+   langs = None
    nb_pc = learn_pc(cm)
    nb_ptc = learn_ptc(paths, tk_nextmove, tk_output, cm, buckets_dir, args)
+   paths = None
+   cm = None

    # output the model
    output_path = os.path.join(model_dir, 'model')
    model = nb_ptc, nb_pc, nb_classes, tk_nextmove, tk_output
!   dump = cPickle.dumps(model)
!   tk_nextmove = None
!   tk_output = None
!   nb_pc = None
!   nb_classes = None
!   model = None
!   string = base64.b64encode(bz2.compress(dump))
    with open(output_path, 'w') as f:
      f.write(string)

Diff to fix the ValueError issue

diff --git a/langid/langid.py b/langid/langid.py
index 3c39275..36a9159 100644
--- a/langid/langid.py
+++ b/langid/langid.py
@@ -226,7 +226,7 @@
       # to speed up processing.
       for lang in langs:
         if lang not in nb_classes:
-          raise ValueError, "Unknown language code %s" % lang
+          raise ValueError("Unknown language code %s" % lang)
 
       subset_mask = np.fromiter((l in langs for l in nb_classes), dtype=bool)
       self.nb_classes = [ c for c in nb_classes if c in langs ]

error installing langid (with buildout)

Page at http://pypi.python.org/simple/langid/ links to .py file(s) without version info; an index scan is required.
Getting distribution for 'langid'.
While:
Installing client1.
Getting distribution for 'langid'.
Error: Couldn't find a distribution for 'langid'.

Maybe you could also add the source (tar.gz) distribution on PyPI (python setup.py sdist).

some quotes ("️) cause classification as Chinese

I have noticed that 'wunderbar' is classified as Chinese, but only sometimes. Well, you see why:

>>> langid.rank(' wunderbar')
[('de', 0.9778415187189662), ('ms', 0.010616691993507496), ('rw', 0.005629123117595187), ('jv', 0.002381279333979642), ('en', 0.0012907605583217631), ('xh', 0.0007049424071661806), ('zu', 0.0005406729256266108), ('pl', 0.00033030266440511896), ('zh', 0.00012068794780089398), ('lb', 8.737333954204245e-05), ('ku', 7.951447183144349e-05), ('et', 4.4845458685428935e-05), ('it', 3.176656196745408e-05), ('qu', 3.148454827117921e-05), ('se', 2.1674043639160725e-05), ('ht', 2.1155926636937293e-05), ('la', 2.0910529457339416e-05), ('cy', 2.0232843521824427e-05), ('mt', 1.844486496954288e-05), ('nl', 1.4311942620855673e-05), ('lt', 1.2972404851637802e-05), ('no', 1.2377528866306211e-05), ('oc', 1.1861996353338544e-05), ('tl', 1.1204590626950813e-05), ('fo', 1.1022700757718407e-05), ('an', 8.329868622385504e-06), ('sw', 7.053226420262944e-06), ('af', 6.599790475577181e-06), ('lo', 6.0179664803498765e-06), ('vo', 5.556359238025519e-06), ('es', 4.8332113719175565e-06), ('ko', 4.777415287420478e-06), ('id', 4.391051103680497e-06), ('ky', 3.818515217450824e-06), ('br', 3.5191782008061838e-06), ('eo', 3.4779281046734827e-06), ('mg', 3.466248539257764e-06), ('am', 3.229264725976488e-06), ('is', 3.074756887874788e-06), ('fr', 2.8661019894490493e-06), ('ga', 2.537918306496631e-06), ('ps', 1.5075166885307474e-06), ('wa', 1.4472239069400242e-06), ('ar', 1.3391240428809958e-06), ('bs', 1.1932147667325364e-06), ('si', 1.153093780014326e-06), ('tr', 1.0981617162215484e-06), ('eu', 1.0016241426877151e-06), ('az', 9.650942971628987e-07), ('ka', 9.325534232313878e-07), ('hr', 8.631094356760578e-07), ('pt', 6.746837968524499e-07), ('sk', 6.693679308944927e-07), ('nn', 6.645596091493626e-07), ('hy', 4.715304772325221e-07), ('nb', 4.182398787212912e-07), ('ja', 3.924990952428872e-07), ('lv', 3.7621043483560154e-07), ('ug', 3.60804060385915e-07), ('sq', 3.2894525454536553e-07), ('sv', 2.952765918502936e-07), ('fi', 2.675271490842995e-07), ('kk', 2.636153614232699e-07), ('he', 2.6168240445933393e-07), ('ur', 2.4873320080711625e-07), ('ca', 2.431349486376223e-07), ('sl', 2.2170225585481938e-07), ('fa', 1.6505684273619443e-07), ('gl', 1.6464847504277298e-07), ('km', 1.4689751678433855e-07), ('ro', 1.3200234729630338e-07), ('vi', 1.1652110654659623e-07), ('mn', 1.1039890158774906e-07), ('da', 9.616333939918114e-08), ('el', 8.865316433779712e-08), ('hu', 8.476379472092254e-08), ('bn', 7.533416864544197e-08), ('th', 7.145543835277235e-08), ('gu', 5.8311942506196446e-08), ('as', 4.503109767538008e-08), ('ru', 4.463188951350857e-08), ('ml', 3.794148223120744e-08), ('pa', 3.216318214492323e-08), ('ta', 3.194030431437515e-08), ('mr', 2.6181357262362214e-08), ('te', 2.500582090488879e-08), ('or', 2.0345725506260737e-08), ('cs', 1.7323718652257704e-08), ('hi', 1.482088515139454e-08), ('mk', 1.2543768035052028e-08), ('be', 1.1737055387189823e-08), ('kn', 9.089409826915146e-09), ('uk', 5.491771809519326e-09), ('ne', 5.147506514007945e-09), ('sr', 4.828762668563497e-09), ('dz', 2.9114805800159365e-09), ('bg', 1.2081540813937604e-09)]
>>> langid.rank("️wunderbar")
[('zh', 0.9928204705819458), ('ja', 0.0065785229592405705), ('ko', 0.00038846511466535897), ('ar', 4.1408258069092674e-05), ('jv', 3.5224872068297655e-05), ('en', 1.6612952762747267e-05), ('de', 1.4317074941863288e-05), ('qu', 1.244498916800758e-05), ('ru', 1.2175607654316526e-05), ('lb', 1.2110582975832658e-05), ('ms', 1.0621617421597854e-05), ('el', 7.080716875555405e-06), ('la', 6.181768270527322e-06), ('bg', 5.722886583220138e-06), ('rw', 5.383872388674567e-06), ('no', 4.087949465473779e-06), ('fo', 3.544647301692967e-06), ('th', 3.225987949444041e-06), ('ky', 2.2206115855257554e-06), ('nl', 1.607727472526966e-06), ('hy', 1.4007056676162333e-06), ('zu', 1.3214916491765556e-06), ('sw', 1.2498255366416236e-06), ('si', 1.1304218558942313e-06), ('ro', 9.505256464019786e-07), ('cy', 8.460239489735446e-07), ('an', 7.772158405223286e-07), ('ps', 7.303143910130873e-07), ('ht', 6.840844861484241e-07), ('pl', 6.501235175370973e-07), ('id', 6.439492422441017e-07), ('oc', 6.416247130132337e-07), ('fr', 6.145676129535128e-07), ('eo', 6.112717723472432e-07), ('xh', 5.864737297378976e-07), ('it', 5.754600361422645e-07), ('he', 4.604095080825112e-07), ('lo', 4.3820524153448525e-07), ('nn', 4.372386499345569e-07), ('es', 4.3404833143189604e-07), ('tl', 3.3903924149577855e-07), ('mn', 3.1742695156126506e-07), ('uk', 3.111113663287929e-07), ('pt', 2.974118745522462e-07), ('km', 2.45400261412045e-07), ('nb', 2.44311393421879e-07), ('br', 2.361262091649684e-07), ('is', 1.6744621965858352e-07), ('vo', 1.1623863610897373e-07), ('sl', 1.1255238112569407e-07), ('sk', 1.0188625377914079e-07), ('se', 8.937200752241362e-08), ('ca', 8.220039832460332e-08), ('am', 7.377499976691904e-08), ('bs', 6.242983133332428e-08), ('da', 5.65589837074933e-08), ('tr', 4.7950588802830366e-08), ('bn', 4.1764456763934405e-08), ('kk', 3.696808528126726e-08), ('gu', 3.486169811098909e-08), ('hr', 2.7307941417741773e-08), ('pa', 2.7150978056364916e-08), ('as', 2.542018389827033e-08), ('az', 2.4425081481129985e-08), ('ka', 2.2423528126520916e-08), ('lt', 2.0958811561874612e-08), ('eu', 2.0343842634025164e-08), ('mr', 1.7358803103263064e-08), ('ur', 1.5349310111786227e-08), ('et', 1.5281946788061445e-08), ('or', 1.3650943533411975e-08), ('mk', 1.3538199094114394e-08), ('hi', 1.126846960438557e-08), ('fa', 1.1236394202902213e-08), ('sr', 1.1179996653889562e-08), ('ml', 1.0757890112433114e-08), ('te', 9.296997663016953e-09), ('lv', 8.609086486177823e-09), ('kn', 7.472269511116448e-09), ('af', 7.136002539508618e-09), ('fi', 6.996691832623385e-09), ('ku', 6.931737527143431e-09), ('sv', 5.184914721072085e-09), ('mt', 5.101252483893451e-09), ('ne', 4.379418035731972e-09), ('mg', 4.041719069597912e-09), ('hu', 3.848573321874355e-09), ('sq', 3.831580821056942e-09), ('vi', 3.205823430957484e-09), ('ug', 1.6902740438470224e-09), ('ga', 1.1843145772621335e-09), ('ta', 1.0153927556584555e-09), ('gl', 9.850247142407677e-10), ('cs', 9.611533181894737e-10), ('wa', 8.327958886567568e-10), ('be', 5.0586464217473225e-11), ('dz', 4.876116023858696e-14)]
>>> langid.rank("️wunderbar")
[('zh', 0.9928204705819458), ('ja', 0.0065785229592405705), ('ko', 0.00038846511466535897), ('ar', 4.1408258069092674e-05), ('jv', 3.5224872068297655e-05), ('en', 1.6612952762747267e-05), ('de', 1.4317074941863288e-05), ('qu', 1.244498916800758e-05), ('ru', 1.2175607654316526e-05), ('lb', 1.2110582975832658e-05), ('ms', 1.0621617421597854e-05), ('el', 7.080716875555405e-06), ('la', 6.181768270527322e-06), ('bg', 5.722886583220138e-06), ('rw', 5.383872388674567e-06), ('no', 4.087949465473779e-06), ('fo', 3.544647301692967e-06), ('th', 3.225987949444041e-06), ('ky', 2.2206115855257554e-06), ('nl', 1.607727472526966e-06), ('hy', 1.4007056676162333e-06), ('zu', 1.3214916491765556e-06), ('sw', 1.2498255366416236e-06), ('si', 1.1304218558942313e-06), ('ro', 9.505256464019786e-07), ('cy', 8.460239489735446e-07), ('an', 7.772158405223286e-07), ('ps', 7.303143910130873e-07), ('ht', 6.840844861484241e-07), ('pl', 6.501235175370973e-07), ('id', 6.439492422441017e-07), ('oc', 6.416247130132337e-07), ('fr', 6.145676129535128e-07), ('eo', 6.112717723472432e-07), ('xh', 5.864737297378976e-07), ('it', 5.754600361422645e-07), ('he', 4.604095080825112e-07), ('lo', 4.3820524153448525e-07), ('nn', 4.372386499345569e-07), ('es', 4.3404833143189604e-07), ('tl', 3.3903924149577855e-07), ('mn', 3.1742695156126506e-07), ('uk', 3.111113663287929e-07), ('pt', 2.974118745522462e-07), ('km', 2.45400261412045e-07), ('nb', 2.44311393421879e-07), ('br', 2.361262091649684e-07), ('is', 1.6744621965858352e-07), ('vo', 1.1623863610897373e-07), ('sl', 1.1255238112569407e-07), ('sk', 1.0188625377914079e-07), ('se', 8.937200752241362e-08), ('ca', 8.220039832460332e-08), ('am', 7.377499976691904e-08), ('bs', 6.242983133332428e-08), ('da', 5.65589837074933e-08), ('tr', 4.7950588802830366e-08), ('bn', 4.1764456763934405e-08), ('kk', 3.696808528126726e-08), ('gu', 3.486169811098909e-08), ('hr', 2.7307941417741773e-08), ('pa', 2.7150978056364916e-08), ('as', 2.542018389827033e-08), ('az', 2.4425081481129985e-08), ('ka', 2.2423528126520916e-08), ('lt', 2.0958811561874612e-08), ('eu', 2.0343842634025164e-08), ('mr', 1.7358803103263064e-08), ('ur', 1.5349310111786227e-08), ('et', 1.5281946788061445e-08), ('or', 1.3650943533411975e-08), ('mk', 1.3538199094114394e-08), ('hi', 1.126846960438557e-08), ('fa', 1.1236394202902213e-08), ('sr', 1.1179996653889562e-08), ('ml', 1.0757890112433114e-08), ('te', 9.296997663016953e-09), ('lv', 8.609086486177823e-09), ('kn', 7.472269511116448e-09), ('af', 7.136002539508618e-09), ('fi', 6.996691832623385e-09), ('ku', 6.931737527143431e-09), ('sv', 5.184914721072085e-09), ('mt', 5.101252483893451e-09), ('ne', 4.379418035731972e-09), ('mg', 4.041719069597912e-09), ('hu', 3.848573321874355e-09), ('sq', 3.831580821056942e-09), ('vi', 3.205823430957484e-09), ('ug', 1.6902740438470224e-09), ('ga', 1.1843145772621335e-09), ('ta', 1.0153927556584555e-09), ('gl', 9.850247142407677e-10), ('cs', 9.611533181894737e-10), ('wa', 8.327958886567568e-10), ('be', 5.0586464217473225e-11), ('dz', 4.876116023858696e-14)]
>>> langid.rank("️wunderbar")
[('zh', 0.9928204705819458), ('ja', 0.0065785229592405705), ('ko', 0.00038846511466535897), ('ar', 4.1408258069092674e-05), ('jv', 3.5224872068297655e-05), ('en', 1.6612952762747267e-05), ('de', 1.4317074941863288e-05), ('qu', 1.244498916800758e-05), ('ru', 1.2175607654316526e-05), ('lb', 1.2110582975832658e-05), ('ms', 1.0621617421597854e-05), ('el', 7.080716875555405e-06), ('la', 6.181768270527322e-06), ('bg', 5.722886583220138e-06), ('rw', 5.383872388674567e-06), ('no', 4.087949465473779e-06), ('fo', 3.544647301692967e-06), ('th', 3.225987949444041e-06), ('ky', 2.2206115855257554e-06), ('nl', 1.607727472526966e-06), ('hy', 1.4007056676162333e-06), ('zu', 1.3214916491765556e-06), ('sw', 1.2498255366416236e-06), ('si', 1.1304218558942313e-06), ('ro', 9.505256464019786e-07), ('cy', 8.460239489735446e-07), ('an', 7.772158405223286e-07), ('ps', 7.303143910130873e-07), ('ht', 6.840844861484241e-07), ('pl', 6.501235175370973e-07), ('id', 6.439492422441017e-07), ('oc', 6.416247130132337e-07), ('fr', 6.145676129535128e-07), ('eo', 6.112717723472432e-07), ('xh', 5.864737297378976e-07), ('it', 5.754600361422645e-07), ('he', 4.604095080825112e-07), ('lo', 4.3820524153448525e-07), ('nn', 4.372386499345569e-07), ('es', 4.3404833143189604e-07), ('tl', 3.3903924149577855e-07), ('mn', 3.1742695156126506e-07), ('uk', 3.111113663287929e-07), ('pt', 2.974118745522462e-07), ('km', 2.45400261412045e-07), ('nb', 2.44311393421879e-07), ('br', 2.361262091649684e-07), ('is', 1.6744621965858352e-07), ('vo', 1.1623863610897373e-07), ('sl', 1.1255238112569407e-07), ('sk', 1.0188625377914079e-07), ('se', 8.937200752241362e-08), ('ca', 8.220039832460332e-08), ('am', 7.377499976691904e-08), ('bs', 6.242983133332428e-08), ('da', 5.65589837074933e-08), ('tr', 4.7950588802830366e-08), ('bn', 4.1764456763934405e-08), ('kk', 3.696808528126726e-08), ('gu', 3.486169811098909e-08), ('hr', 2.7307941417741773e-08), ('pa', 2.7150978056364916e-08), ('as', 2.542018389827033e-08), ('az', 2.4425081481129985e-08), ('ka', 2.2423528126520916e-08), ('lt', 2.0958811561874612e-08), ('eu', 2.0343842634025164e-08), ('mr', 1.7358803103263064e-08), ('ur', 1.5349310111786227e-08), ('et', 1.5281946788061445e-08), ('or', 1.3650943533411975e-08), ('mk', 1.3538199094114394e-08), ('hi', 1.126846960438557e-08), ('fa', 1.1236394202902213e-08), ('sr', 1.1179996653889562e-08), ('ml', 1.0757890112433114e-08), ('te', 9.296997663016953e-09), ('lv', 8.609086486177823e-09), ('kn', 7.472269511116448e-09), ('af', 7.136002539508618e-09), ('fi', 6.996691832623385e-09), ('ku', 6.931737527143431e-09), ('sv', 5.184914721072085e-09), ('mt', 5.101252483893451e-09), ('ne', 4.379418035731972e-09), ('mg', 4.041719069597912e-09), ('hu', 3.848573321874355e-09), ('sq', 3.831580821056942e-09), ('vi', 3.205823430957484e-09), ('ug', 1.6902740438470224e-09), ('ga', 1.1843145772621335e-09), ('ta', 1.0153927556584555e-09), ('gl', 9.850247142407677e-10), ('cs', 9.611533181894737e-10), ('wa', 8.327958886567568e-10), ('be', 5.0586464217473225e-11), ('dz', 4.876116023858696e-14)]
>>> langid.rank("️wunderbar ")
[('zh', 0.9928204705819458), ('ja', 0.0065785229592405705), ('ko', 0.00038846511466535897), ('ar', 4.1408258069092674e-05), ('jv', 3.5224872068297655e-05), ('en', 1.6612952762747267e-05), ('de', 1.4317074941863288e-05), ('qu', 1.244498916800758e-05), ('ru', 1.2175607654316526e-05), ('lb', 1.2110582975832658e-05), ('ms', 1.0621617421597854e-05), ('el', 7.080716875555405e-06), ('la', 6.181768270527322e-06), ('bg', 5.722886583220138e-06), ('rw', 5.383872388674567e-06), ('no', 4.087949465473779e-06), ('fo', 3.544647301692967e-06), ('th', 3.225987949444041e-06), ('ky', 2.2206115855257554e-06), ('nl', 1.607727472526966e-06), ('hy', 1.4007056676162333e-06), ('zu', 1.3214916491765556e-06), ('sw', 1.2498255366416236e-06), ('si', 1.1304218558942313e-06), ('ro', 9.505256464019786e-07), ('cy', 8.460239489735446e-07), ('an', 7.772158405223286e-07), ('ps', 7.303143910130873e-07), ('ht', 6.840844861484241e-07), ('pl', 6.501235175370973e-07), ('id', 6.439492422441017e-07), ('oc', 6.416247130132337e-07), ('fr', 6.145676129535128e-07), ('eo', 6.112717723472432e-07), ('xh', 5.864737297378976e-07), ('it', 5.754600361422645e-07), ('he', 4.604095080825112e-07), ('lo', 4.3820524153448525e-07), ('nn', 4.372386499345569e-07), ('es', 4.3404833143189604e-07), ('tl', 3.3903924149577855e-07), ('mn', 3.1742695156126506e-07), ('uk', 3.111113663287929e-07), ('pt', 2.974118745522462e-07), ('km', 2.45400261412045e-07), ('nb', 2.44311393421879e-07), ('br', 2.361262091649684e-07), ('is', 1.6744621965858352e-07), ('vo', 1.1623863610897373e-07), ('sl', 1.1255238112569407e-07), ('sk', 1.0188625377914079e-07), ('se', 8.937200752241362e-08), ('ca', 8.220039832460332e-08), ('am', 7.377499976691904e-08), ('bs', 6.242983133332428e-08), ('da', 5.65589837074933e-08), ('tr', 4.7950588802830366e-08), ('bn', 4.1764456763934405e-08), ('kk', 3.696808528126726e-08), ('gu', 3.486169811098909e-08), ('hr', 2.7307941417741773e-08), ('pa', 2.7150978056364916e-08), ('as', 2.542018389827033e-08), ('az', 2.4425081481129985e-08), ('ka', 2.2423528126520916e-08), ('lt', 2.0958811561874612e-08), ('eu', 2.0343842634025164e-08), ('mr', 1.7358803103263064e-08), ('ur', 1.5349310111786227e-08), ('et', 1.5281946788061445e-08), ('or', 1.3650943533411975e-08), ('mk', 1.3538199094114394e-08), ('hi', 1.126846960438557e-08), ('fa', 1.1236394202902213e-08), ('sr', 1.1179996653889562e-08), ('ml', 1.0757890112433114e-08), ('te', 9.296997663016953e-09), ('lv', 8.609086486177823e-09), ('kn', 7.472269511116448e-09), ('af', 7.136002539508618e-09), ('fi', 6.996691832623385e-09), ('ku', 6.931737527143431e-09), ('sv', 5.184914721072085e-09), ('mt', 5.101252483893451e-09), ('ne', 4.379418035731972e-09), ('mg', 4.041719069597912e-09), ('hu', 3.848573321874355e-09), ('sq', 3.831580821056942e-09), ('vi', 3.205823430957484e-09), ('ug', 1.6902740438470224e-09), ('ga', 1.1843145772621335e-09), ('ta', 1.0153927556584555e-09), ('gl', 9.850247142407677e-10), ('cs', 9.611533181894737e-10), ('wa', 8.327958886567568e-10), ('be', 5.0586464217473225e-11), ('dz', 4.876116023858696e-14)]
>>> langid.rank("️wunderbar wunderbar")
[('zh', 0.8292747652785193), ('jv', 0.08584567244536759), ('de', 0.05102290406741519), ('ms', 0.01846538938672487), ('rw', 0.011278761798545766), ('zu', 0.001671397137805495), ('lb', 0.0010829355842848353), ('qu', 0.0005832810973334379), ('xh', 0.00039822810842440767), ('la', 9.621298010155744e-05), ('fo', 7.108813895775326e-05), ('ko', 7.034607461032755e-05), ('en', 4.073465186853253e-05), ('ky', 1.9835709320975084e-05), ('ja', 1.5937070356101868e-05), ('ht', 1.4811547418357725e-05), ('sw', 9.623311954126167e-06), ('an', 7.5723443536668756e-06), ('cy', 5.496016488282588e-06), ('lo', 4.798035215605e-06), ('se', 4.531292165616887e-06), ('pl', 3.3810621649029084e-06), ('no', 2.367282541317847e-06), ('oc', 1.8883109960008556e-06), ('ar', 1.6070814782777182e-06), ('tl', 1.4810688852051008e-06), ('ps', 1.287723280469018e-06), ('hy', 8.319399193182018e-07), ('si', 6.467997220342434e-07), ('nl', 3.201203064538715e-07), ('br', 2.6680496108931717e-07), ('eo', 2.5979373360746573e-07), ('ku', 2.3751078530068452e-07), ('id', 2.1944025320190693e-07), ('vo', 1.5784998639426843e-07), ('is', 1.5328583556787233e-07), ('it', 1.387115378800745e-07), ('am', 1.0543621642926642e-07), ('nn', 4.710963987507007e-08), ('km', 3.68933519449979e-08), ('th', 3.0197194884313876e-08), ('af', 2.4877234695711385e-08), ('mn', 2.295340614913973e-08), ('bs', 2.0674611361045597e-08), ('et', 1.7926741007217306e-08), ('mg', 1.6386129512777282e-08), ('el', 1.543390389861973e-08), ('he', 1.0329136258015913e-08), ('nb', 1.0140603281735252e-08), ('es', 7.768440450072785e-09), ('ru', 7.205224634719422e-09), ('fr', 7.2035413291483e-09), ('lt', 6.776419793400924e-09), ('az', 6.7718846384117305e-09), ('kk', 6.649133365127158e-09), ('ka', 5.110711821284464e-09), ('mt', 3.11261687857195e-09), ('tr', 3.0043952397806197e-09), ('ro', 2.9819815991412353e-09), ('hr', 2.9688620536567503e-09), ('ur', 2.083914839620625e-09), ('pt', 1.827458771087846e-09), ('sk', 1.6569108132171535e-09), ('eu', 1.496275461961667e-09), ('sl', 1.0752757039619155e-09), ('ca', 9.350437689656174e-10), ('wa', 8.223215529098318e-10), ('ga', 7.69030137603146e-10), ('ug', 7.681815073650124e-10), ('bn', 7.576498721150876e-10), ('as', 6.69440195142181e-10), ('gu', 5.94424522293588e-10), ('sq', 2.5479751515843107e-10), ('fa', 2.354233192136297e-10), ('pa', 1.833278456497039e-10), ('bg', 1.712831336781061e-10), ('lv', 1.5552927754246522e-10), ('or', 1.4212304185680015e-10), ('mr', 1.0481709860685237e-10), ('uk', 9.61421980553661e-11), ('da', 8.40995816588907e-11), ('ml', 8.05270920986517e-11), ('te', 5.7679102107932226e-11), ('fi', 3.483024399647947e-11), ('hi', 3.2173534391263656e-11), ('mk', 3.1599836395053165e-11), ('kn', 3.089330122267583e-11), ('vi', 1.8878947474031977e-11), ('sv', 1.7945383911693068e-11), ('gl', 9.909427701342852e-12), ('ne', 7.854058383954012e-12), ('hu', 7.708246189105303e-12), ('ta', 6.398434520445388e-12), ('sr', 4.727314889403123e-12), ('cs', 2.770877495246804e-13), ('be', 1.279259144409701e-13), ('dz', 4.078417757612451e-17)]
>>> langid.rank('wunderbar')
[('de', 0.6503315690434672), ('en', 0.11596045044921321), ('zh', 0.0699736942100247), ('pl', 0.05783840321052503), ('xh', 0.018145338183298606), ('ms', 0.011032574595175405), ('nl', 0.00893483035667102), ('es', 0.008071226823772837), ('rw', 0.007877222399353033), ('mt', 0.0075871671223098625), ('it', 0.00685146390768504), ('fr', 0.0052717838386513146), ('zu', 0.004737085004634088), ('jv', 0.004131051251327707), ('pt', 0.0024858543866690752), ('ar', 0.002234159234979442), ('ko', 0.0017976489711238274), ('et', 0.0017118559384868492), ('ja', 0.0013351516450480208), ('cy', 0.0010983563884041131), ('lt', 0.0009588132812860693), ('no', 0.000951013381983051), ('lb', 0.0008369332838098044), ('id', 0.0007867488463552349), ('sk', 0.0006723864083168762), ('ku', 0.0005425678843744174), ('tr', 0.00045886496392803837), ('eo', 0.0004411268627940554), ('sv', 0.0004150147400791327), ('oc', 0.0003616025821079207), ('la', 0.0002771880561695966), ('fi', 0.0002623161454800777), ('br', 0.0002448507642931113), ('qu', 0.00024050601384825444), ('eu', 0.00023655454563857104), ('hu', 0.00022148160257201647), ('ro', 0.00021347242595583783), ('lv', 0.0002070085834320212), ('tl', 0.00019823868588437078), ('ca', 0.00018849873963471987), ('vo', 0.00017631869249845042), ('ht', 0.00016812866637903447), ('is', 0.0001680612569300022), ('hr', 0.00016766064521404303), ('sw', 0.00016280436223774648), ('da', 0.00015972724722411151), ('he', 0.00015825059510499629), ('vi', 0.00014855581572915312), ('sl', 0.00014527916537300623), ('an', 0.00013956089703051055), ('af', 0.0001307108551913882), ('ru', 0.00012991449800521366), ('ga', 0.00011646476578160291), ('se', 0.000112560167112669), ('nb', 0.00010735828554482774), ('nn', 0.00010587862156044339), ('ka', 0.00010215390040238935), ('am', 9.988917179840933e-05), ('fa', 9.428928623042383e-05), ('fo', 9.101878230507523e-05), ('el', 8.599944639792726e-05), ('bs', 7.667452330378166e-05), ('az', 6.777789021556405e-05), ('lo', 6.725303600762494e-05), ('mg', 6.365897350608821e-05), ('th', 6.106936829422913e-05), ('gl', 6.070326505516053e-05), ('si', 5.63709107752398e-05), ('wa', 5.3856638038600104e-05), ('cs', 5.335790265486965e-05), ('ky', 4.7245670884845804e-05), ('sq', 4.7170457494920505e-05), ('ps', 4.198177975397085e-05), ('bn', 3.5824236071314225e-05), ('ur', 3.530273629808421e-05), ('ml', 2.8088117524393276e-05), ('gu', 2.3353586302661594e-05), ('uk', 2.3104600159857556e-05), ('kk', 2.2985646975380097e-05), ('hy', 2.2625192930896517e-05), ('ta', 2.1042104415353486e-05), ('te', 2.0333812825109273e-05), ('bg', 2.0003375454411503e-05), ('ug', 1.979124500466659e-05), ('mr', 1.762000965093424e-05), ('mn', 1.5181620903080626e-05), ('as', 1.4511621463594436e-05), ('pa', 1.4474134160880863e-05), ('km', 1.4009833344404556e-05), ('mk', 1.3578006175154581e-05), ('hi', 1.0256809421089112e-05), ('sr', 8.683693637506968e-06), ('be', 8.630821071501332e-06), ('or', 7.3735619034962e-06), ('kn', 5.227389383813129e-06), ('ne', 4.494828285537928e-06), ('dz', 3.7227178371875574e-06)]

Training on Windows returns error at DFfeatureselect.py step

I'm trying to train a new language identifier model on my own language dataset. Unfortunately, it crashes at the DFfeatureselect.py step with a "TypeError: marshal.load() arg must be file" error message. Below is the log up to the crash point.

C:\langid.py-master\langid\train>C:\Python27\python.exe train.py corpus
corpus path: corpus
model path: ..model
langs(22): el(26) eo(42) en(1674) af(285) ca(287) am(2426) an(226) cy(79) ar(82) cs(432) et(449) az(534) es(457) be(292) bg(818) bn(65) de(2795) da(90) dz(220) br(532) bs(493) as(101)
domains(1): domain(12405)
identified 12405 documents
will tokenize 12405 documents
using byte NGram tokenizer, max_order: 4
chunk size: 50 (249 chunks)
job count: 8
whole-document tokenization
tokenized chunk (1/249) [11880 keys]
tokenized chunk (2/249) [12305 keys]
tokenized chunk (3/249) [10517 keys]
tokenized chunk (4/249) [18799 keys]
tokenized chunk (5/249) [17955 keys]
tokenized chunk (6/249) [6092 keys]
tokenized chunk (7/249) [21901 keys]
tokenized chunk (8/249) [11344 keys]
tokenized chunk (9/249) [6342 keys]
tokenized chunk (10/249) [6499 keys]
tokenized chunk (11/249) [5452 keys]
tokenized chunk (12/249) [5734 keys]
tokenized chunk (13/249) [6204 keys]
tokenized chunk (14/249) [5252 keys]
tokenized chunk (15/249) [6565 keys]
tokenized chunk (16/249) [3035 keys]
tokenized chunk (17/249) [2157 keys]
tokenized chunk (18/249) [9931 keys]
tokenized chunk (19/249) [8004 keys]
tokenized chunk (20/249) [5949 keys]
tokenized chunk (21/249) [8345 keys]
tokenized chunk (22/249) [13381 keys]
tokenized chunk (23/249) [18026 keys]
tokenized chunk (24/249) [15978 keys]
tokenized chunk (25/249) [12526 keys]
tokenized chunk (26/249) [17599 keys]
tokenized chunk (27/249) [11572 keys]
tokenized chunk (28/249) [18360 keys]
tokenized chunk (29/249) [8206 keys]
tokenized chunk (30/249) [11074 keys]
tokenized chunk (31/249) [14938 keys]
tokenized chunk (32/249) [12470 keys]
tokenized chunk (33/249) [10483 keys]
tokenized chunk (34/249) [14454 keys]
tokenized chunk (35/249) [9515 keys]
tokenized chunk (36/249) [10757 keys]
tokenized chunk (37/249) [8575 keys]
tokenized chunk (38/249) [13322 keys]
tokenized chunk (39/249) [8586 keys]
tokenized chunk (40/249) [8388 keys]
tokenized chunk (41/249) [16794 keys]
tokenized chunk (42/249) [6053 keys]
tokenized chunk (43/249) [8165 keys]
tokenized chunk (44/249) [4032 keys]
tokenized chunk (45/249) [3898 keys]
tokenized chunk (46/249) [3113 keys]
tokenized chunk (47/249) [2738 keys]
tokenized chunk (48/249) [12874 keys]
tokenized chunk (49/249) [7597 keys]
tokenized chunk (50/249) [4921 keys]
tokenized chunk (51/249) [3117 keys]
tokenized chunk (52/249) [8515 keys]
tokenized chunk (53/249) [9234 keys]
tokenized chunk (54/249) [13384 keys]
tokenized chunk (55/249) [13649 keys]
tokenized chunk (56/249) [13531 keys]
tokenized chunk (57/249) [12832 keys]
tokenized chunk (58/249) [12293 keys]
tokenized chunk (59/249) [25620 keys]
tokenized chunk (60/249) [6443 keys]
tokenized chunk (61/249) [15453 keys]
tokenized chunk (62/249) [10807 keys]
tokenized chunk (63/249) [19978 keys]
tokenized chunk (64/249) [44970 keys]
tokenized chunk (65/249) [14168 keys]
tokenized chunk (66/249) [12106 keys]
tokenized chunk (67/249) [27309 keys]
tokenized chunk (68/249) [12115 keys]
tokenized chunk (69/249) [20707 keys]
tokenized chunk (70/249) [19919 keys]
tokenized chunk (71/249) [11967 keys]
tokenized chunk (72/249) [16046 keys]
tokenized chunk (73/249) [8409 keys]
tokenized chunk (74/249) [20964 keys]
tokenized chunk (75/249) [12275 keys]
tokenized chunk (76/249) [16301 keys]
tokenized chunk (77/249) [12272 keys]
tokenized chunk (78/249) [21592 keys]
tokenized chunk (79/249) [19530 keys]
tokenized chunk (80/249) [17342 keys]
tokenized chunk (81/249) [19946 keys]
tokenized chunk (82/249) [15298 keys]
tokenized chunk (83/249) [17531 keys]
tokenized chunk (84/249) [17299 keys]
tokenized chunk (85/249) [24131 keys]
tokenized chunk (86/249) [16513 keys]
tokenized chunk (87/249) [19510 keys]
tokenized chunk (88/249) [14266 keys]
tokenized chunk (89/249) [22952 keys]
tokenized chunk (90/249) [15482 keys]
tokenized chunk (91/249) [15573 keys]
tokenized chunk (92/249) [20496 keys]
tokenized chunk (93/249) [18156 keys]
tokenized chunk (94/249) [22490 keys]
tokenized chunk (95/249) [29002 keys]
tokenized chunk (96/249) [20352 keys]
tokenized chunk (97/249) [44165 keys]
tokenized chunk (98/249) [34627 keys]
tokenized chunk (99/249) [49905 keys]
tokenized chunk (100/249) [53103 keys]
tokenized chunk (101/249) [51983 keys]
tokenized chunk (102/249) [31038 keys]
tokenized chunk (103/249) [31409 keys]
tokenized chunk (104/249) [33165 keys]
tokenized chunk (105/249) [37822 keys]
tokenized chunk (106/249) [10940 keys]
tokenized chunk (107/249) [71118 keys]
tokenized chunk (108/249) [38858 keys]
tokenized chunk (109/249) [37634 keys]
tokenized chunk (110/249) [51967 keys]
tokenized chunk (111/249) [56836 keys]
tokenized chunk (112/249) [27115 keys]
tokenized chunk (113/249) [15849 keys]
tokenized chunk (114/249) [14734 keys]
tokenized chunk (115/249) [26009 keys]
tokenized chunk (116/249) [19294 keys]
tokenized chunk (117/249) [32044 keys]
tokenized chunk (118/249) [29201 keys]
tokenized chunk (119/249) [39628 keys]
tokenized chunk (120/249) [6244 keys]
tokenized chunk (121/249) [7435 keys]
tokenized chunk (122/249) [21227 keys]
tokenized chunk (123/249) [29732 keys]
tokenized chunk (124/249) [35250 keys]
tokenized chunk (125/249) [10271 keys]
tokenized chunk (126/249) [32891 keys]
tokenized chunk (127/249) [7873 keys]
tokenized chunk (128/249) [10418 keys]
tokenized chunk (129/249) [7311 keys]
tokenized chunk (130/249) [9516 keys]
tokenized chunk (131/249) [11074 keys]
tokenized chunk (132/249) [15263 keys]
tokenized chunk (133/249) [11205 keys]
tokenized chunk (134/249) [8567 keys]
tokenized chunk (135/249) [7678 keys]
tokenized chunk (136/249) [44950 keys]
tokenized chunk (137/249) [21967 keys]
tokenized chunk (138/249) [35438 keys]
tokenized chunk (139/249) [49606 keys]
tokenized chunk (140/249) [55683 keys]
tokenized chunk (141/249) [49369 keys]
tokenized chunk (142/249) [48286 keys]
tokenized chunk (143/249) [44039 keys]
tokenized chunk (144/249) [11811 keys]
tokenized chunk (145/249) [41120 keys]
tokenized chunk (146/249) [69629 keys]
tokenized chunk (147/249) [70067 keys]
tokenized chunk (148/249) [46883 keys]
tokenized chunk (149/249) [52358 keys]
tokenized chunk (150/249) [127523 keys]
tokenized chunk (151/249) [37044 keys]
tokenized chunk (152/249) [74712 keys]
tokenized chunk (153/249) [63824 keys]
tokenized chunk (154/249) [55408 keys]
tokenized chunk (155/249) [61234 keys]
tokenized chunk (156/249) [54418 keys]
tokenized chunk (157/249) [39921 keys]
tokenized chunk (158/249) [62581 keys]
tokenized chunk (159/249) [71439 keys]
tokenized chunk (160/249) [53094 keys]
tokenized chunk (161/249) [76232 keys]
tokenized chunk (162/249) [36778 keys]
tokenized chunk (163/249) [71083 keys]
tokenized chunk (164/249) [71121 keys]
tokenized chunk (165/249) [54315 keys]
tokenized chunk (166/249) [62550 keys]
tokenized chunk (167/249) [67024 keys]
tokenized chunk (168/249) [69247 keys]
tokenized chunk (169/249) [66758 keys]
tokenized chunk (170/249) [54992 keys]
tokenized chunk (171/249) [62659 keys]
tokenized chunk (172/249) [60409 keys]
tokenized chunk (173/249) [44923 keys]
tokenized chunk (174/249) [43095 keys]
tokenized chunk (175/249) [50332 keys]
tokenized chunk (176/249) [62506 keys]
tokenized chunk (177/249) [51782 keys]
tokenized chunk (178/249) [71541 keys]
tokenized chunk (179/249) [63289 keys]
tokenized chunk (180/249) [85046 keys]
tokenized chunk (181/249) [63942 keys]
tokenized chunk (182/249) [58598 keys]
tokenized chunk (183/249) [63150 keys]
tokenized chunk (184/249) [47424 keys]
tokenized chunk (185/249) [65839 keys]
tokenized chunk (186/249) [93418 keys]
tokenized chunk (187/249) [12910 keys]
tokenized chunk (188/249) [53958 keys]
tokenized chunk (189/249) [37259 keys]
tokenized chunk (190/249) [11532 keys]
tokenized chunk (191/249) [52861 keys]
tokenized chunk (192/249) [14390 keys]
tokenized chunk (193/249) [11546 keys]
tokenized chunk (194/249) [43913 keys]
tokenized chunk (195/249) [66130 keys]
tokenized chunk (196/249) [10962 keys]
tokenized chunk (197/249) [9993 keys]
tokenized chunk (198/249) [11903 keys]
tokenized chunk (199/249) [28550 keys]
tokenized chunk (200/249) [10199 keys]
tokenized chunk (201/249) [11053 keys]
tokenized chunk (202/249) [11845 keys]
tokenized chunk (203/249) [10557 keys]
tokenized chunk (204/249) [10736 keys]
tokenized chunk (205/249) [19925 keys]
tokenized chunk (206/249) [18973 keys]
tokenized chunk (207/249) [22198 keys]
tokenized chunk (208/249) [13544 keys]
tokenized chunk (209/249) [12096 keys]
tokenized chunk (210/249) [10717 keys]
tokenized chunk (211/249) [23275 keys]
tokenized chunk (212/249) [11339 keys]
tokenized chunk (213/249) [11669 keys]
tokenized chunk (214/249) [12482 keys]
tokenized chunk (215/249) [15175 keys]
tokenized chunk (216/249) [53832 keys]
tokenized chunk (217/249) [52319 keys]
tokenized chunk (218/249) [51782 keys]
tokenized chunk (219/249) [48032 keys]
tokenized chunk (220/249) [44353 keys]
tokenized chunk (221/249) [47209 keys]
tokenized chunk (222/249) [43914 keys]
tokenized chunk (223/249) [48074 keys]
tokenized chunk (224/249) [27881 keys]
tokenized chunk (225/249) [39001 keys]
tokenized chunk (226/249) [41330 keys]
tokenized chunk (227/249) [45242 keys]
tokenized chunk (228/249) [51633 keys]
tokenized chunk (229/249) [38759 keys]
tokenized chunk (230/249) [33628 keys]
tokenized chunk (231/249) [37245 keys]
tokenized chunk (232/249) [28676 keys]
tokenized chunk (233/249) [40631 keys]
tokenized chunk (234/249) [37609 keys]
tokenized chunk (235/249) [41072 keys]
tokenized chunk (236/249) [39166 keys]
tokenized chunk (237/249) [42001 keys]
tokenized chunk (238/249) [14521 keys]
tokenized chunk (239/249) [43873 keys]
tokenized chunk (240/249) [5256 keys]
tokenized chunk (241/249) [5307 keys]
tokenized chunk (242/249) [15233 keys]
tokenized chunk (243/249) [34008 keys]
tokenized chunk (244/249) [16667 keys]
tokenized chunk (245/249) [7618 keys]
tokenized chunk (246/249) [18999 keys]
tokenized chunk (247/249) [17754 keys]
tokenized chunk (248/249) [22048 keys]
tokenized chunk (249/249) [21140 keys]
Traceback (most recent call last):
  File "train.py", line 196, in <module>
    doc_count = tally(b_dirs, args.jobs)
  File "C:\langid.py-master\langid\train\DFfeatureselect.py", line 92, in tally
    for i, keycount in enumerate(pass_sum_df_out):
  File "C:\Python27\lib\multiprocessing\pool.py", line 620, in next
    raise value
TypeError: marshal.load() arg must be file

pip install langid does not work

Hello,
I am trying to install the library with pip on OS X 10.6, Python 3.3.
Running the pip command, I receive the following:

$ pip install langid
Downloading/unpacking langid
Could not find a version that satisfies the requirement langid (from versions: 1.0dev, 1.1.1dev, 1.1.2dev, 1.1.3dev, 1.1.4dev, 1.1dev)
Cleaning up...
No distributions matching the version for langid
Storing debug log for failure in /Users/xx/.pip/pip.log

So I downloaded the package and tried to install it without pip

$ python setup.py install

Looked good until here:

Extracting langid-1.1.4dev-py2.6.egg to /Library/Python/2.6/site-packages
SyntaxError: ('invalid syntax', ('/Library/Python/2.6/site-packages/langid-1.1.4dev-py2.6.egg/langid/train/common.py', 39, 34, "  with gzip.open(path, 'rb') as f, tempfile.TemporaryFile() as t:\n"))

Adding langid 1.1.4dev to easy-install.pth file
error: /Library/Python/2.6/site-packages/easy-install.pth: Permission denied

Why isn't it installing? Furthermore, I don't want it to install with Python 2.6 (I guess that's the default Python version of OS X 10.6) but with Python 3.3.

incorrect classification

written in Russian, classified correctly

>>> langid.classify('пожалуйста')
('ru', 0.9189426432579942)

written in Russian, classified incorrectly

>>> langid.classify('пример текста на русском')
('sr', 0.9431929502384854)

written in Russian, classified incorrectly

>>> langid.classify('вомбат')
('uk', 0.4581717399240493)

written in Hebrew, classified incorrectly

>>> langid.classify('וְאָהַבְתָּ אֵת יְיָ | אֱלֹהֶיךָ, בְּכָל-לְבָֽבְךָ, וּבְכָל-נַפְשְׁךָ, וּבְכָל-מְאֹדֶֽךָ. וְהָיוּ הַדְּבָרִים הָאֵלֶּה, אֲשֶׁר | אָֽנֹכִי מְצַוְּ')
('ko', 1.0)

written in Old English, classified incorrectly (not sure that this example is a fair one)

>>> langid.classify(' On þyssum geare man halgode þet mynster æt Westmynstre on Cyldamæsse dæg 7 se cyng Eadward forðferde on Twelfts mæsse æfen 7 hine mann bebyrgede on Twelftan')
('is', 0.9999936552704334)

Mixed languages polarized by "en"

I have the following text, which is a mix of English and Sesotho:

>>>Ska rebona re phela\nKgale re sokola rona re phelela mmino\nO skang potja ka dilo\nKgale re sokola rona re phelela mmino (×2)\nWe five minutes from freedom\nSomebody tell my mama

With the whole text as-is, I get a wrong identification:

>>> Ska rebona re phela\nKgale re sokola rona re phelela mmino\nO skang potja ka dilo\nKgale re sokola rona re phelela mmino (×2)\nWe five minutes from freedom\nSomebody tell my mama
('en', -233.38300132751465)

Removing all the English sentences, I get the right ISO 639-1 language code, sl:

>>> Ska rebona re phela\nKgale re sokola rona re phelela mmino\nO skang potja ka dilo\nKgale re sokola rona re phelela mmino                                                       
('sl', -154.34662437438965)

Even keeping one English sentence, the right language is still recognized:

>>> Ska rebona re phela\nKgale re sokola rona re phelela mmino\nO skang potja ka dilo\nKgale re sokola rona re phelela mmino (×2)\nWe five minutes from freedom                    
('sl', -217.8226833343506)

So, it seems that the detector is being "polarized" by the en sentences in this phrase.
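One possible mitigation for mixed-language text like this is to classify smaller units (for example, per line) and look at the distribution of labels, rather than letting the dominant language absorb the whole document; a sketch (just a heuristic on top of classify, not a langid feature):

import langid

text = ("Ska rebona re phela\n"
        "Kgale re sokola rona re phelela mmino\n"
        "We five minutes from freedom\n"
        "Somebody tell my mama")

# classify each line separately instead of the whole blob
for line in text.splitlines():
    print("%s -> %s" % (line, langid.classify(line)))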

training data

Thanks for making langid available! It's awesome! We (researchers at Carnegie Mellon University) would like to augment the training data with more languages. Shall we send you the data so that you can retrain the models when your time permits? Alternatively, feel free to send us the data and we would retrain the models ourselves.

many thanks!
waleed ammar

-d option causes syntax error with -b

The -d option produces a NameError if you attempt to use it with -b; it refers to a variable nb_classes which was probably in scope before the code was refactored. (It does exist in some other methods, but I am unable to figure out how to stitch it back together.)

 vnix$ ./langid/langid.py -d -b sample.txt
Traceback (most recent call last):
  File "./langid/langid.py", line 587, in <module>
    main()
  File "./langid/langid.py", line 558, in main
    writer.writerow(['path']+nb_classes)
NameError: global name 'nb_classes' is not defined

error in running IGweight.py

Hello

I get the following error when I run IGweight.py:

computing information gain
Traceback (most recent call last):
  File "/home/motaz/tmp/langid.py/langid/train/IGweight.py", line 246, in <module>
    ig = compute_IG(bucketlist, features, dist, args.binarize, suffix, args.jobs)
  File "/home/motaz/tmp/langid.py/langid/train/IGweight.py", line 164, in compute_IG
    for i, (t, w) in enumerate(pass_IG_out):
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 668, in next
    raise value
IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

What do you think is the cause of this error? How can I fix it?

Thanks

Runtimewarning freeze uwsgi worker

Thank you for this nice lib.
We are using langid in a web service and are facing a really weird problem: from time to time, the uwsgi worker is killed because of this portion of code:

with np.errstate(over='ignore'):
    pd = (1/np.exp(pd[None,:] - pd[:,None]).sum(1))
return pd

After some googling, we found this message:
http://stackoverflow.com/questions/19213768/numpy-runtime-warning-causing-apache-workers-to-freeze-in-sending-reply-state
It seems there is an issue with how numpy emits warnings.

As we are on nginx, the proposed fix does not work for us.
Do you have any idea how we could work around this?
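
Not a confirmed fix, but if the hang is triggered by numpy writing RuntimeWarnings to a stderr pipe the worker never drains, one possible workaround (a sketch; np.seterr and warnings.filterwarnings are standard numpy/stdlib calls, whether this actually addresses the uwsgi case is an assumption) is to silence the floating-point warnings before the worker starts serving:

import warnings
import numpy as np
import langid

np.seterr(all='ignore')                                     # stop numpy raising FP RuntimeWarnings
warnings.filterwarnings('ignore', category=RuntimeWarning)  # and drop any that still get raised

def detect(text):
    # plain call; nothing should be written to stderr during classification now
    return langid.classify(text)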

Detection accuracy

**功夫是一门博大精深的武学艺术 , **功夫app , 介绍**功夫的分类、特点、器材、门派等与**功夫有关的内容!让广大读者能够更完整的了解**功夫的精华!

If I run the above snippet through your detection tool, I get "en" as the answer. This seems to be due to a three-letter word in the snippet ("app"). Is it possible to fix this issue?
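
As a caller-side workaround (a sketch of my own, not a change to langid; the shortened sample string is illustrative), one can strip the short Latin-script tokens before classifying predominantly CJK text:

# -*- coding: utf-8 -*-
import re
import langid

text = u'功夫是一门博大精深的武学艺术，功夫app，介绍功夫的分类、特点、器材、门派等'  # shortened stand-in
cleaned = re.sub(u'[A-Za-z]+', u'', text)      # drop Latin-script tokens such as "app"
print(langid.classify(text))
print(langid.classify(cleaned))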

Detection error when encountering full-width characters

langid mistakes full-width English text such as 'ｈｅｌｌｏ ｗｏｒｌｄ' for a CJK language.
>>> import langid
>>> langid.classify('ｈｅｌｌｏ ｗｏｒｌｄ')
('zh', 0.9339664571825803)
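
A possible caller-side workaround (a sketch; unicodedata.normalize is standard library, and this is not a fix inside langid itself) is to fold full-width Latin characters to their ASCII equivalents before classifying:

# -*- coding: utf-8 -*-
import unicodedata
import langid

fullwidth = u'ｈｅｌｌｏ ｗｏｒｌｄ'
ascii_text = unicodedata.normalize('NFKC', fullwidth)   # -> u'hello world'
print(langid.classify(ascii_text))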

Training a new language on Windows doesn't work

I am trying to train it on some language files I downloaded from the internet. But unfortunately no matter what I try, it always crashes.

D:\Django\langid\Scripts>python.exe LDfeatureselect.py -c d:\corpus\wikipedia\langid\corpus -o features -j 1
output path: features
temp path: c:\users\nick\appdata\local\temp
corpus path: d:\corpus\wikipedia\langid\corpus
will tokenize 2 files
langs: ['am', 'af']
domains: ['domain1']
chunk size: 1 (3 chunks)
Traceback (most recent call last):
  File "LDfeatureselect.py", line 533, in <module>
    chunk_paths, features, chunk_offsets = build_inverted_index(paths, options)
  File "LDfeatureselect.py", line 423, in build_inverted_index
    for i, keycount in enumerate(pass1_out):
  File "C:\Python27\Lib\multiprocessing\pool.py", line 626, in next
    raise value
OSError: [Errno 9] Bad file descriptor

I am using Python 2.7.3 on Windows 7 64bit and the latest version of langid.

langid.set_languages(['en','it'])

langid.set_languages is used a few times in the readme file but when running it I receive this error:
Traceback (most recent call last):
  File "<pyshell#26>", line 1, in <module>
    langid.set_languages(['en','it'])
AttributeError: 'module' object has no attribute 'set_languages'

By looking at the code I think langid.identifier.set_languages(['en','it']) would do the job.
I thought maybe this needs to be corrected.
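
For anyone hitting the same AttributeError, a sketch of the workaround via the LanguageIdentifier class (assuming langid.langid exposes LanguageIdentifier and model, as in recent releases):

from langid.langid import LanguageIdentifier, model

identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
identifier.set_languages(['en', 'it'])          # restrict the candidate language set
print(identifier.classify("questo è un piccolo test"))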

Weird classification for Latvian & ex-Yougoslavian languages

Hi,

I noticed some weird classification problems for Latvian: the text (a subtitle) is typically classified as Luxembourgish (!). As the codes for the two languages (lv and lb) follow each other, I was wondering whether this might be due to a bug (in the training data or the classifier)?

Bosnian, Croatian and Serbian are also difficult to classify (these languages being very close, and until recently often classified as one single language). However, it seems that the classification of Croatian "overweights" the two other languages -- if I try to classify Bosnian and Serbian subtitles, I typically get Croatian as the only possible language.

Strange classifications

I get these results for English text:

>>> feeling
('en', 0.16946150595865342)
>>> good
('en', 0.16946150595865342)
>>> feeling good
('de', 0.2691886134361688)
>>> 

Am I right to assume that these results could be more accurate for English if I improved the training data for the English language?

Getting an error on performing import

This is the error I'm getting. I looked at the code and everything seems fine, so I'm stumped.

>>> import langid
Traceback (most recent call last):
  File "", line 1, in <module>
    import langid
  File "C:\Python33\lib\langid\__init__.py", line 1, in <module>
    from langid import classify, rank
ImportError: cannot import name classify
>>>
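
A guess at the cause, from the traceback alone: under Python 3 the line from langid import classify, rank inside langid/__init__.py re-imports the half-initialised package itself rather than the langid/langid.py submodule, because Python 2's implicit relative imports are gone. If that is right, a one-line sketch of the fix in __init__.py would be:

# langid/__init__.py -- make the relative import explicit so Python 3 finds the submodule
from .langid import classify, rank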

Make redirection in a stream fashion

Hello,

I am trying out this fancy language detection script, and I am wondering: do you think it makes more sense to change
for line in sys.stdin.readlines():
into
for line in sys.stdin:

in langid.py?
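
For context, a sketch of the difference (this loop is a simplified stand-in for langid.py's own stdin handling, not its actual code): readlines() waits for all of standard input and holds it in memory before the first result is printed, whereas iterating over sys.stdin yields lines as they arrive.

import sys
import langid

# for line in sys.stdin.readlines():   # buffers the entire input before producing any output
for line in sys.stdin:                  # processes input roughly line by line as it streams in
    lang, score = langid.classify(line)
    print(lang, score)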

Repetition of words causes detection error

When I input strings like 'hello world hello world hello world', langid can't identify them as English text.
>>> import langid
>>> langid.classify('hello world hello world hello world')
('af', 0.683057652874482)

DeprecationWarning: using a non-integer number instead of an integer

I'm receiving these warnings while using langid:

Warning (from warnings module):
  File "C:\Python34\lib\site-packages\langid\langid.py", line 167
    nb_ptc = np.array(nb_ptc).reshape(len(nb_ptc)/len(nb_pc), len(nb_pc))
DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future

Warning (from warnings module):
  File "C:\Python34\lib\site-packages\langid\langid.py", line 243
    arr = np.zeros((self.nb_numfeats,), dtype='uint32')
DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future

I'm using python 3.4.1, numpy 1.8.1, and langid 1.1.4dev (python 3 branch)

Python3 branch does not work with python 3(.5) Fix is here

Hi Saffsd,

langid fails on line 167 of langid.py with the very example provided on the front page.
This is because in Python 3, dividing two integers with / produces a float. The fix is to change line 163 from

nb_numfeats = len(nb_ptc) / len(nb_pc)

to

nb_numfeats = int(len(nb_ptc) / len(nb_pc))

and to change line 167 to this:

nb_ptc = np.array(nb_ptc).reshape(nb_numfeats, len(nb_pc))
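
An equivalent variant of the first change (a sketch, not tested against the branch) is to use floor division, which keeps the result an integer without a separate cast:

nb_numfeats = len(nb_ptc) // len(nb_pc)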

This fix should be easy to make.
Thanks.

Seeking advice regarding classification problem only present with Chinese

Hello,

I have some sample texts, which originate in PDFs, with my goal being to classify the language automatically. I've extracted the text content with pdfminer and, whilst langid works excellently with all my samples in a variety of languages, it seems to have problems for me when I run it with Chinese (I have samples in both simplified and traditional), because it always suggests 'en'.

Does anyone have any advice on how I should approach investigating what the problem might be?

Are there any standard example documents that I could try that would confirm there isn't something quirky with my PDF extraction?

I could be wrong, but I don't think it's necessarily a UTF-8 encoding issue as I have managed to get it working with other non-Latin texts (eg Cyrillic).
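
One way to investigate (a sketch of my own; cjk_ratio and the sample string are illustrative, and text would be whatever pdfminer returned) is to check whether CJK codepoints actually survived extraction before blaming the classifier:

# -*- coding: utf-8 -*-
import langid

def cjk_ratio(text):
    if isinstance(text, bytes):                       # decode first if the extractor returned bytes
        text = text.decode('utf-8', 'replace')
    cjk = sum(1 for ch in text if u'\u4e00' <= ch <= u'\u9fff')   # CJK Unified Ideographs block
    return float(cjk) / max(len(text), 1)

sample = u'功夫是一门博大精深的武学艺术'               # stand-in for the pdfminer output
print(cjk_ratio(sample), langid.classify(sample))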

The languages that I've found to work with my samples, so far, are: en, it, de, ru. I will be checking pt, fr, pl and ja ones shortly.

There is a tiny portion of English in the header section, but that does not throw off the language detection for the other samples and I have tried focusing on pages where the body of the text is entirely Chinese and present in significantly larger quantities than in the header.

It also makes no difference if I preselect the languages (unfortunately the false suggestion of English needs to be in the list, as there are likely to be samples in English present)

langid.set_languages(['en','es','pt','fr','ru','pl','de','it','ja', 'zh'])

Even if I take English out, it merely suggests a different wrong language (e.g. German), although the confidence level is fairly low (typically 0.16 to 0.25, whether it guesses English or German).

My setup is Windows 7 with Python 2.7 (needed because of PDFMiner, although I could try Python 3.5 if that were thought likely to solve the issue).

Many thanks,
Neil

-b option does not appear to work

I tried to use the -b option to classify a bunch of files, but it doesn't appear to be doing anything. It starts up a multitude of subprocesses which appear to want to use standard input.

vnix$ chmod +x ./langid/langid.py

vnix$ ./langid/langid.py -b path/to/sample.txt
^C
Process PoolWorker-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
    task = get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 374, in get
    racquire()
KeyboardInterrupt
[the same KeyboardInterrupt traceback repeats, interleaved, for the remaining workers, PoolWorker-2 through PoolWorker-31]

(etc for another hundred or so lines, lost track so stopped copy/pasting). The process still cannot be stopped; logging in in a separate window reveals a Python process with some 30 child Python processes. Killing the top process stops the program, but appears to leave the child processes still running, now orphaned.

Repeating string yields different results

Knowing that most lang id systems perform worse on short strings, I have been experimenting with normalising the length:

import langid

# s is the (short) input string to identify
MIN_LEN = 30
id = langid.rank(s)[0]               # top (language, score) before length normalisation
print(id)
while len(s) < MIN_LEN:
    s += '  ' + s                    # keep doubling the string until it reaches MIN_LEN
    print(langid.rank(s)[0])
len_norm_id = langid.rank(s)[0]      # top (language, score) after length normalisation

I have noticed the following:

If id, i.e. the original prediction, was correct, the probability increases significantly after length normalisation.

If not, the probability only increases by less than ~10%, or the identified language changes (usually to another incorrect language).

It is not a golden rule, but it is reliable enough that we could use it (see the sketch after this list) to:

  • increase probability on short strings
  • return 'und' in the cases where it is very fickle
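
A sketch of that heuristic (MIN_LEN, GAIN and the 'und' fallback are illustrative choices of mine, not part of langid; it assumes probability-normalised scores via LanguageIdentifier):

from langid.langid import LanguageIdentifier, model

identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

MIN_LEN = 30    # illustrative length threshold
GAIN = 0.10     # what counts as a "significant" confidence gain

def classify_short(s):
    lang, p = identifier.rank(s)[0]
    padded = s
    while len(padded) < MIN_LEN:
        padded += '  ' + padded                 # double the string, as above
    lang2, p2 = identifier.rank(padded)[0]
    if lang2 == lang and p2 - p >= GAIN:
        return lang2, p2                        # stable and strengthened: trust it
    return 'und', max(p, p2)                    # fickle: report undetermined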
