
langid.py's Issues

Not detecting Hindi at all

Hi,

First of all, I could not work out how to give input for the Hindi language, even though it is in the list of pre-trained languages.

So I tried these test sentences for Hindi:

  1. "tum kahaan ja rahe ho" - normalized result: ('fi', 0.8528)
     I gave the above sentence on the Windows command line.
  2. "तुम कहाँ जा रहे हो" - normalized result: ('zh', 0.9999)
     I gave the above Hindi sentence via a "test.docx" file as input:
     python langid.py -n < test.docx

What could be the possible reasons for Hindi not being detected?

I tried with slightly longer sentences too; it does not seem to work. What am I missing?

Caching Resource

Running

import langid
langid.classify("India is a country") ## statement number 1

takes a lot of time to run "statement number 1".

but,

import langid
langid.classify("I like cricket")
langid.classify("India is a country") ## statement number 2

does not take a lot of time to run "statement number 2".

So, does langid.classify cache some information? Can we manage that? Thanks.
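For what it's worth, the slow first call looks like the one-off cost of unpacking the bundled model; a sketch of paying that cost once at startup (assuming the LanguageIdentifier.from_modelstring API in langid.langid):

from langid.langid import LanguageIdentifier, model

# Pay the model-loading cost once, at startup...
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

# ...then every call reuses the already-unpacked model.
print(identifier.classify("I like cricket"))
print(identifier.classify("India is a country"))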

Misclassification of text

Hi, we're using langid to find the language in which a press release has been written. Our text is thus in HTML.

It generally works pretty well, but it misclassified this page as being in "ky" instead of "es":

http://www.anpenavarra.com/index.php/comunicacion/notas-de-prensa/259-nota-de-prensa-encuentro-hispano-luso-de-docentes

>>> import langid
>>> import urllib2
>>> x = urllib2.urlopen('http://www.anpenavarra.com/index.php/comunicacion/notas-de-prensa/259-nota-de-prensa-encuentro-hispano-luso-de-docentes').read()
>>> langid.classify(x)
('ky', 1.0)

We're a bit puzzled by this misclassification. Is there anything we can do to help langid work around whatever is causing it, such as cleaning the text before we pass it to langid?
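One thing that may help is stripping the markup before classification; a minimal sketch using only the standard library (the regexes here are illustrative, not something langid provides):

import re
import langid

def strip_html(html):
    # drop script/style blocks, then remove remaining tags and collapse whitespace
    html = re.sub(r'(?is)<(script|style).*?>.*?</\1>', ' ', html)
    text = re.sub(r'(?s)<[^>]+>', ' ', html)
    return re.sub(r'\s+', ' ', text).strip()

print(langid.classify(strip_html(x)))  # x is the raw HTML fetched above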

Thanks!

Weird result

I got this as a result:

this is a test
('en', -40.536659240722656)

Could you please tell me what I should do to fix this?
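For context, the number here is an unnormalized log-probability rather than something wrong with the input; a sketch of asking for a normalized confidence in [0, 1] instead (assuming the norm_probs option on LanguageIdentifier, which is what the -n flag used in other issues maps to):

from langid.langid import LanguageIdentifier, model

# norm_probs=True turns the raw log-probability into a probability in [0, 1]
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
print(identifier.classify("this is a test"))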

Unable to do a pip install

When I try to do a pip install, I get the following error:

pip install langid
Downloading/unpacking langid
  Could not find a version that satisfies the requirement langid (from versions: 1.0dev, 1.1.1dev, 1.1.2dev, 1.1.3dev, 1.1.4dev, 1.1dev)
Cleaning up...
No distributions matching the version for langid

Different classification for upper/lower-case sentences

I'm noticing something very peculiar with all-caps strings.

Running langid -n:

>>> ceci est une phrase française
('fr', 0.9999966296917099)
>>> CECI EST UNE PHRASE FRANÇAISE
('pt', 0.4985860132092562)

>>> this is an english phrase
('en', 0.9999771953554634)
>>> THIS IS AN ENGLISH PHRASE
('en', 0.16946150595865334)

>>> ali bongo veut jouer la montre
('fr', 0.9999628977673475)
>>> ALI BONGO VEUT JOUER LA MONTRE
('en', 0.16946150595865334)

Note the ('en', 0.169461...): this value shows up often for all-caps words and phrases.

>>> mange
('en', 0.16946150595865334)
>>> MANGE
('en', 0.16946150595865334)

>>> fille
('fr', 0.3545074995041862)
>>> FILLE
('en', 0.16946150595865334)

>>> femme
('da', 0.5528002996661378)
>>> FEMME
('en', 0.16946150595865334)

I understand that I could always lowercase the input string, but I haven't seen this mentioned anywhere in the issue tracker, so I'm submitting this issue just so this can be documented.
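For reference, the lowercasing workaround is a one-line preprocessing step in front of classify (a sketch, not something langid does itself):

import langid

def classify_caseless(text):
    # normalize case before classification to sidestep the all-caps pathology
    return langid.classify(text.lower())

print(classify_caseless("ALI BONGO VEUT JOUER LA MONTRE"))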

Hard-coded lookup for very short strings?

It's understandable that performance for very short strings is poor. Could we create a mapping with hand-assigned weights for those?

I believe strings like 'yeah', 'no', 'si', 'haha', 'hehe' and so on should always be classified reasonably. I am happy to donate my mapping for this.
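A sketch of what such a wrapper could look like (the mapping, threshold and labels below are placeholders, not a proposed final list):

import langid

# hand-assigned labels for very short, ambiguous strings (illustrative only)
SHORT_STRING_LANGS = {
    'yeah': 'en',
    'si': 'es',
    'haha': 'en',
    'hehe': 'en',
}

def classify_short_aware(text):
    key = text.strip().lower()
    if len(key) <= 4 and key in SHORT_STRING_LANGS:
        return (SHORT_STRING_LANGS[key], 1.0)
    return langid.classify(text)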

Class probability computation is very inefficient (patch enclosed)

The following patch produces the same output with a 4.4-fold speedup for language identification (not counting startup time) in --line mode given 650-byte average line lengths, and a 33-fold speedup with 62-byte average line lengths when using the default language model. Larger models with more features show an even larger speedup.

The speedup results from avoiding a matrix multiplication against a feature-count vector which is mostly zeros. You may wish to tweak the cut-over from "short" to "long" texts by adjusting the self.nb_numfeats/10; it could probably be moved higher, but I was being conservative.

259a260,302
>   # optimized version by Ralf Brown
>   def instance2classprobs(self, text):
>     """
>     Compute class probabilities for an instance according to the trained model
>     """
>     if isinstance(text, unicode):
>       text = text.encode('utf8')
>
>     # Convert the text to a sequence of ascii values
>     ords = map(ord, text)
>
>     state = 0
>     if len(ords) < self.nb_numfeats / 10:
>       # for very short texts, just apply each production every time the
>       # state changes, rather than counting the number of occurrences of
>       # each state
>       pdc = np.zeros(len(self.nb_classes))
>       for letter in ords:
>         state = self.tk_nextmove[(state << 8) + letter]
>         for index in self.tk_output.get(state, []):
>           # compute the dot product incrementally, avoiding lots
>           # of multiplications by zero with a sparse
>           # feature-count vector
>           pdc += self.nb_ptc[index]
>     else:
>       # Count the number of times we enter each state
>       statecount = defaultdict(int)
>       for letter in ords:
>         state = self.tk_nextmove[(state << 8) + letter]
>         statecount[state] += 1
>
>       # Update all the productions corresponding to the state
>       arr = np.zeros((self.nb_numfeats,), dtype='uint32')
>       for state in statecount:
>         for index in self.tk_output.get(state, []):
>           arr[index] += statecount[state]
>       # compute the partial log-probability of the document given each class
>       pdc = np.dot(arr, self.nb_ptc)
>
>     # compute the partial log-probability of the document in each class
>     pd = pdc + self.nb_pc
>     return pd
271,272c314,315
<     fv = self.instance2fv(text)
<     probs = self.norm_probs(self.nb_classprobs(fv))
---
>     probs = self.instance2classprobs(text)
>     probs = self.norm_probs(probs)
282,283c325,326
<     fv = self.instance2fv(text)
<     probs = self.norm_probs(self.nb_classprobs(fv))
---
>     probs = self.instance2classprobs(text)
>     probs = self.norm_probs(probs)

Different result when giving the same text

I have a database from which I read. I want to identify the language in a specific cell, defined by column.

I read from my database like this:

connector = sqlite3.connect("somedb.db")
selecter = connector.cursor()
selecter.execute(''' SELECT tags FROM sometable''')
for row in selecter: #iterate through all the rows in db
    #print (type(row)) #tuple
    rf = str(row)
    #print (type(rf)) #string
    lan = langid.classify("{}".format(rf))

Technically, it works. It identifies the languages used and later on (not displayed here) writes the identified language back into the database.

So, now comes the weird part. I wanted to double-check some results manually, so I have these words:

a = "shadow party people bw music mer white man black france men art nature monochrome french fun shoe sand nikon europe noir noiretblanc sable playa poetic nb ombre shade contraste plage blanc saxophone dunkerque nord homme musique saxo artiste artistique musicien chaussure blancandwhite d90 saxophoniste zyudcoote"

When I perform the language identification through the database loop, it writes Portuguese into the database.
But, performing it like this:

a = "shadow party people bw music mer white man black france men art nature monochrome french fun shoe sand nikon europe noir noiretblanc sable playa poetic nb ombre shade contraste plage blanc saxophone dunkerque nord homme musique saxo artiste artistique musicien chaussure blancandwhite d90 saxophoniste zyudcoote"
lan = langid.classify(a)

Well, that returns French. Leaving aside that the text is neither French nor Portuguese, why do I get different results?
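A likely explanation is that row is a tuple, so str(row) hands langid the tuple's repr (with parentheses, quotes and a trailing comma) rather than the cell text itself; a sketch of the difference (values shortened for illustration):

import langid

row = ("shadow party people bw music mer white man black france ...",)  # what the cursor yields
print(str(row))   # "('shadow party people bw music ...',)" -- includes the tuple punctuation
print(row[0])     # the actual cell contents

lan = langid.classify(row[0])   # classify the cell text, not the tuple repr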

Fix pypi classifiers

Your setup.py file does not define any PyPI classifiers, which I think prevents pip3 install --pre langid. Here is the list: http://pypi.python.org/pypi?%3Aaction=list_classifiers. I think we should start with:

Programming Language :: Python :: 2
Programming Language :: Python :: 2.7
Programming Language :: Python :: 3

This would fix the Python 3 issue. But since we're on such a good start, we could continue.

You then need to choose a status. Either Beta or Production/Stable, probably Production/Stable.

Development Status :: 4 - Beta
Development Status :: 5 - Production/Stable

Then, an intended audience. I suggest:

Intended Audience :: Developers
Intended Audience :: Science/Research

Then the license. Is your custom license OSI Approved?

And finally a topic:

Topic :: Scientific/Engineering :: Artificial Intelligence

Once you tell me your choices, I can send a pull request. Thanks!
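For concreteness, a sketch of how the classifiers proposed above might be declared in setup.py (illustrative only; the final choices are up to you):

from setuptools import setup

setup(
    name='langid',
    # ... existing arguments ...
    classifiers=[
        'Development Status :: 5 - Production/Stable',
        'Intended Audience :: Developers',
        'Intended Audience :: Science/Research',
        'Programming Language :: Python :: 2',
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3',
        'Topic :: Scientific/Engineering :: Artificial Intelligence',
    ],
)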

Hindi, Arabic, Korean and Japanese

Hello!

Thanks so much for the great tool and paper; it is really helping me learn about this stuff.
I had a question about the language model provided in the code and what it was trained on.

I'm finding that strings from languages like Korean and Hindi are not being handled correctly. For example:
Hindi: यह एक स्ट्रिंग है कि मैं हिंदी में लिखना है
Korean: 이것은 아랍어 문자열입니다

both get incorrectly matched as English with a confidence of 0.0196.

Upon further inspection, I find that the dot product in the nb_classify function returns a vector of zeros. I would take this to mean that the model simply wasn't trained on these languages. However, upon closer inspection I found that the nb_pc vector (which I think holds the prior probabilities for each class?) is non-zero for Hindi.

Am I misunderstanding something? Was the basic model trained on Hindi etc.?

Thanks

- sina

Unicode support

The classifier apparently assumes utf8-encoded text:

>>> langid.classify('mangé')
('fr', 0.0016145239007830734)
>>> langid.classify(u'mangé')
('zh', -0.0063975259411870739)

adding something like:

if isinstance(text, unicode):
  text = text.encode('utf8')

would help avoid unexpected behavior.

wrong detection

Hello,

With the English text "Ángel Di María: Louis van Gaal dynamic was why I left Manchester United", the classifier returns ('la', 0.9665266986710674) because "Ángel Di María" is a Latin name.

Is there any way to overcome this situation?

Thanks in advance,
Canh
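One workaround, if the candidate languages are known in advance, is to restrict the model to that subset so 'la' is never considered; a sketch (the language list is just an example, and depending on the installed version set_languages may live on a LanguageIdentifier instance rather than on the module, as another issue below notes):

from langid.langid import LanguageIdentifier, model

identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
identifier.set_languages(['en', 'es', 'fr', 'de'])   # 'la' excluded on purpose
print(identifier.classify(u'Ángel Di María: Louis van Gaal dynamic was why I left Manchester United'))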

Poor classification with mono-cased text

I am classifying 3 million texts, and usually classification performance is good. However, for a few texts that are all uppercase, performance is very poor. Here is a toy example:

>>> import langid
>>> print langid.classify('The quick brown fox jumps over the lazy dog')
('en', 0.9999999999138445)
>>> print langid.classify('THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG')
('pl', 0.4406207245841342)

Also, all-lowercase text can have poor performance:

>>> print langid.classify('I do not speak English. Do you?')
('en', 0.9071235348010346)
>>> print langid.classify('i do not speak English. do you?')
('pt', 0.9283325957561123)

Do you think langid should be retrained on all-capitalized and all-lowercase examples? This is analogous to how image classification training sets are augmented by distorting the original images through transformations such as mirroring or warping.
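If retraining is attempted, generating the augmented examples is the easy part; a sketch that writes lowercased and uppercased copies of every training document next to the original (the "one document per file" corpus layout assumed here is an assumption, not a langid requirement):

import io
import os

def augment_case(corpus_dir):
    # add all-lower and all-upper variants of each training document
    for root, _dirs, files in os.walk(corpus_dir):
        for name in files:
            if name.endswith(('.lower', '.upper')):
                continue  # skip files we already generated
            path = os.path.join(root, name)
            with io.open(path, encoding='utf-8', errors='ignore') as f:
                text = f.read()
            for suffix, variant in (('.lower', text.lower()), ('.upper', text.upper())):
                with io.open(path + suffix, 'w', encoding='utf-8') as out:
                    out.write(variant)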

Should langid.py return probabilities?

When running langid.py directly, I get negative numbers, which are different from the 0.99 mentioned in the docs. Should I expect a number between zero and one?

$ python3 langid.py < ../README.rst
('en', -21016.57295703888)
$ python2.7 langid.py < ../README.rst
('en', -21016.57295703888)

Do the smallest values win?

$ python langid.py
>>> the
('en', 9.061840057373047)
>>> the the the the the the the the the the
('en', -240.49303674697876)

Thanks!
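For context, the raw numbers printed by default are unnormalized log-probabilities; they are not comparable across inputs of different length, and within a single input the largest value wins, not the smallest. The -n flag seen in other issues asks for a confidence normalized into [0, 1], e.g. (a sketch, output omitted):

$ python langid.py -n < ../README.rst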

Can you tell me how to identify those 97 languages?

As stated in the README, langid.py comes pre-trained on 97 languages. How can I reproduce that? I gave the ug language a try, but it was identified as zh. When I attempted to test the ja language, it reported this error:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xaf in position 21: illegal multibyte sequence

So the question is: how can I reproduce your results? Thanks in advance.
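The gbk error suggests the text is being decoded with the Windows console codepage rather than as UTF-8; a sketch of classifying a UTF-8 file directly from Python instead of piping it through the console (the file name is a placeholder):

# -*- coding: utf-8 -*-
import io
import langid

with io.open('sample_ja.txt', encoding='utf-8') as f:
    text = f.read()

print(langid.classify(text))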

Fix versioning

Your versioning scheme is not PEP 440 compliant, which means that many people, as you have noticed, have issues when installing langid.py with pip.

First, 1.1.4dev is not valid, you need to use 1.1.4.dev0 instead.

Second, is there any good reason to choose a dev suffix? Can you simply remove tag_build = dev from setup.cfg before releases?

Thanks!
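For illustration, the PEP 440-compliant development version would look like this, with tag_build = dev removed from setup.cfg before a release (a sketch of the suggestion above, not the project's current files):

from setuptools import setup

setup(
    name='langid',
    version='1.1.4.dev0',   # PEP 440-compliant; plain '1.1.4' for the release itself
    # ... existing arguments ...
)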

NBtrain.py gzip issue

When training langid.py on my own dataset, I got the following issue when I reached the NBtrain.py step:

Traceback (most recent call last):
  File "langid/train/NBtrain.py", line 278, in <module>
    nb_ptc = learn_ptc(paths, tk_nextmove, tk_output, cm, temp_path, args)
  File "langid/train/NBtrain.py", line 199, in learn_ptc
    reads, ids, prods = zip(*pass_ptc_out)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 655, in next
    raise value
IOError: Not a gzipped file

I will work on fixing it and ping you back.

train.py uses excessive memory (patch enclosed)

The following patch adds a bunch of {var}=None statements to let Python reuse memory that is no longer needed. This reduces memory use by more than a factor of two.
The patch also bypasses generation of domain_dist_vec when --no_domain_ig is specified, since it is never used in that case.

*** ./train.py  2013-06-25 19:12:19.000000000 -0400
--- ../new/train.py 2013-08-01 20:45:35.867486680 -0400
***************
*** 123,126 ****
--- 123,127 ----

    items = [ (d,l,p) for (d,l,n,p) in indexer.items ]
+   indexer = None
    if args.debug:
      # output the language index
***************
*** 191,198 ****
          write_weights(doc_count, doc_count_path)
          print "wrote DF counts for all features to:", doc_count_path
- 
      if DFfeats is None:
        # Choose the first-stage features
        DFfeats = ngram_select(doc_count, args.max_order, args.df_tokens)

      if args.debug:
--- 192,199 ----
          write_weights(doc_count, doc_count_path)
          print "wrote DF counts for all features to:", doc_count_path
        if DFfeats is None:
          # Choose the first-stage features
          DFfeats = ngram_select(doc_count, args.max_order, args.df_tokens)
+       doc_count = None

      if args.debug:
***************
*** 213,222 ****
--- 214,227 ----
      DF_scanner = Scanner(DFfeats)
      b_dirs = build_index(items, DF_scanner, buckets_dir, args.buckets, args.jobs, args.chunksize)
+     DF_scanner = None

      # Build vectors of domain and language distributions for use in IG calculation
+     if not args.no_domain_ig:
        domain_dist_vec = numpy.array([ domain_dist[domain_index[d]]
                 for d in sorted(domain_index, key=domain_index.get)], dtype=int)
+     domain_dist = None
      lang_dist_vec = numpy.array([ lang_dist[lang_index[l]]
              for l in sorted(lang_index.keys(), key=lang_index.get)], dtype=int)
+     lang_dist = None

      # Compute IG
***************
*** 235,241 ****
--- 240,249 ----
          write_weights(ig, weights_path)
        ig_vals[label] = dict((row[0], numpy.array(row[1].flat)) for row in ig)
+       ig = None
+     DFfeats = None

      # Select features according to the LD criteria
      features_per_lang = select_LD_features(ig_vals['lang'], ig_vals.get('domain'), args.feats_per_lang, ignore_domain = args.no_domain_ig)
+     ig_vals = None
      LDfeats = reduce(set.union, map(set, features_per_lang.values()))
      print 'selected %d features' % len(LDfeats)
***************
*** 251,254 ****
--- 259,263 ----
            writer.writerow(map(repr,features_per_lang[i]))
        print 'wrote LD.perlang features to "%s"' % feature_path + '.perlang'
+     features_per_lang = None

    # Compile a scanner for the LDfeats
***************
*** 259,277 ****
        cPickle.dump((tk_nextmove, tk_output, LDfeats), f)
      print "wrote scanner to {0}".format(scanner_path)

    # Assemble the NB model
    langs = sorted(lang_index, key=lang_index.get)

    cm = generate_cm([ (l,p) for d,l,p in items], len(langs))
    paths = zip(*items)[2]

    nb_classes = langs
    nb_pc = learn_pc(cm)
    nb_ptc = learn_ptc(paths, tk_nextmove, tk_output, cm, buckets_dir, args)

    # output the model
    output_path = os.path.join(model_dir, 'model')
    model = nb_ptc, nb_pc, nb_classes, tk_nextmove, tk_output
!   string = base64.b64encode(bz2.compress(cPickle.dumps(model)))
    with open(output_path, 'w') as f:
      f.write(string)
--- 268,298 ----
        cPickle.dump((tk_nextmove, tk_output, LDfeats), f)
      print "wrote scanner to {0}".format(scanner_path)
+   LDfeats = None

    # Assemble the NB model
    langs = sorted(lang_index, key=lang_index.get)
+   lang_index = None

    cm = generate_cm([ (l,p) for d,l,p in items], len(langs))
    paths = zip(*items)[2]
+   items = None

    nb_classes = langs
+   langs = None
    nb_pc = learn_pc(cm)
    nb_ptc = learn_ptc(paths, tk_nextmove, tk_output, cm, buckets_dir, args)
+   paths = None
+   cm = None

    # output the model
    output_path = os.path.join(model_dir, 'model')
    model = nb_ptc, nb_pc, nb_classes, tk_nextmove, tk_output
!   dump = cPickle.dumps(model)
!   tk_nextmove = None
!   tk_output = None
!   nb_pc = None
!   nb_classes = None
!   model = None
!   string = base64.b64encode(bz2.compress(dump))
    with open(output_path, 'w') as f:
      f.write(string)

Diff to fix the ValueError issue

diff --git a/langid/langid.py b/langid/langid.py
index 3c39275..36a9159 100644
--- a/langid/langid.py
+++ b/langid/langid.py
@@ -226,7 +226,7 @@
       # to speed up processing.
       for lang in langs:
         if lang not in nb_classes:
-          raise ValueError, "Unknown language code %s" % lang
+          raise ValueError("Unknown language code %s" % lang)
 
       subset_mask = np.fromiter((l in langs for l in nb_classes), dtype=bool)
       self.nb_classes = [ c for c in nb_classes if c in langs ]

error installing langid (with buildout)

Page at http://pypi.python.org/simple/langid/ links to .py file(s) without version info; an index scan is required.
Getting distribution for 'langid'.
While:
Installing client1.
Getting distribution for 'langid'.
Error: Couldn't find a distribution for 'langid'.

Maybe you could also add the source (tar.gz) distribution on PyPI (python setup.py sdist).

some quotes ("️) cause classification as Chinese

I have noticed that 'wunderbar' is classified as Chinese, but only sometimes. Well, you see why:

>>> langid.rank(' wunderbar')
[('de', 0.9778415187189662), ('ms', 0.010616691993507496), ('rw', 0.005629123117595187), ('jv', 0.002381279333979642), ('en', 0.0012907605583217631), ('xh', 0.0007049424071661806), ('zu', 0.0005406729256266108), ('pl', 0.00033030266440511896), ('zh', 0.00012068794780089398), ('lb', 8.737333954204245e-05), ('ku', 7.951447183144349e-05), ('et', 4.4845458685428935e-05), ('it', 3.176656196745408e-05), ('qu', 3.148454827117921e-05), ('se', 2.1674043639160725e-05), ('ht', 2.1155926636937293e-05), ('la', 2.0910529457339416e-05), ('cy', 2.0232843521824427e-05), ('mt', 1.844486496954288e-05), ('nl', 1.4311942620855673e-05), ('lt', 1.2972404851637802e-05), ('no', 1.2377528866306211e-05), ('oc', 1.1861996353338544e-05), ('tl', 1.1204590626950813e-05), ('fo', 1.1022700757718407e-05), ('an', 8.329868622385504e-06), ('sw', 7.053226420262944e-06), ('af', 6.599790475577181e-06), ('lo', 6.0179664803498765e-06), ('vo', 5.556359238025519e-06), ('es', 4.8332113719175565e-06), ('ko', 4.777415287420478e-06), ('id', 4.391051103680497e-06), ('ky', 3.818515217450824e-06), ('br', 3.5191782008061838e-06), ('eo', 3.4779281046734827e-06), ('mg', 3.466248539257764e-06), ('am', 3.229264725976488e-06), ('is', 3.074756887874788e-06), ('fr', 2.8661019894490493e-06), ('ga', 2.537918306496631e-06), ('ps', 1.5075166885307474e-06), ('wa', 1.4472239069400242e-06), ('ar', 1.3391240428809958e-06), ('bs', 1.1932147667325364e-06), ('si', 1.153093780014326e-06), ('tr', 1.0981617162215484e-06), ('eu', 1.0016241426877151e-06), ('az', 9.650942971628987e-07), ('ka', 9.325534232313878e-07), ('hr', 8.631094356760578e-07), ('pt', 6.746837968524499e-07), ('sk', 6.693679308944927e-07), ('nn', 6.645596091493626e-07), ('hy', 4.715304772325221e-07), ('nb', 4.182398787212912e-07), ('ja', 3.924990952428872e-07), ('lv', 3.7621043483560154e-07), ('ug', 3.60804060385915e-07), ('sq', 3.2894525454536553e-07), ('sv', 2.952765918502936e-07), ('fi', 2.675271490842995e-07), ('kk', 2.636153614232699e-07), ('he', 2.6168240445933393e-07), ('ur', 2.4873320080711625e-07), ('ca', 2.431349486376223e-07), ('sl', 2.2170225585481938e-07), ('fa', 1.6505684273619443e-07), ('gl', 1.6464847504277298e-07), ('km', 1.4689751678433855e-07), ('ro', 1.3200234729630338e-07), ('vi', 1.1652110654659623e-07), ('mn', 1.1039890158774906e-07), ('da', 9.616333939918114e-08), ('el', 8.865316433779712e-08), ('hu', 8.476379472092254e-08), ('bn', 7.533416864544197e-08), ('th', 7.145543835277235e-08), ('gu', 5.8311942506196446e-08), ('as', 4.503109767538008e-08), ('ru', 4.463188951350857e-08), ('ml', 3.794148223120744e-08), ('pa', 3.216318214492323e-08), ('ta', 3.194030431437515e-08), ('mr', 2.6181357262362214e-08), ('te', 2.500582090488879e-08), ('or', 2.0345725506260737e-08), ('cs', 1.7323718652257704e-08), ('hi', 1.482088515139454e-08), ('mk', 1.2543768035052028e-08), ('be', 1.1737055387189823e-08), ('kn', 9.089409826915146e-09), ('uk', 5.491771809519326e-09), ('ne', 5.147506514007945e-09), ('sr', 4.828762668563497e-09), ('dz', 2.9114805800159365e-09), ('bg', 1.2081540813937604e-09)]
>>> langid.rank("️wunderbar")
[('zh', 0.9928204705819458), ('ja', 0.0065785229592405705), ('ko', 0.00038846511466535897), ('ar', 4.1408258069092674e-05), ('jv', 3.5224872068297655e-05), ('en', 1.6612952762747267e-05), ('de', 1.4317074941863288e-05), ('qu', 1.244498916800758e-05), ('ru', 1.2175607654316526e-05), ('lb', 1.2110582975832658e-05), ('ms', 1.0621617421597854e-05), ('el', 7.080716875555405e-06), ('la', 6.181768270527322e-06), ('bg', 5.722886583220138e-06), ('rw', 5.383872388674567e-06), ('no', 4.087949465473779e-06), ('fo', 3.544647301692967e-06), ('th', 3.225987949444041e-06), ('ky', 2.2206115855257554e-06), ('nl', 1.607727472526966e-06), ('hy', 1.4007056676162333e-06), ('zu', 1.3214916491765556e-06), ('sw', 1.2498255366416236e-06), ('si', 1.1304218558942313e-06), ('ro', 9.505256464019786e-07), ('cy', 8.460239489735446e-07), ('an', 7.772158405223286e-07), ('ps', 7.303143910130873e-07), ('ht', 6.840844861484241e-07), ('pl', 6.501235175370973e-07), ('id', 6.439492422441017e-07), ('oc', 6.416247130132337e-07), ('fr', 6.145676129535128e-07), ('eo', 6.112717723472432e-07), ('xh', 5.864737297378976e-07), ('it', 5.754600361422645e-07), ('he', 4.604095080825112e-07), ('lo', 4.3820524153448525e-07), ('nn', 4.372386499345569e-07), ('es', 4.3404833143189604e-07), ('tl', 3.3903924149577855e-07), ('mn', 3.1742695156126506e-07), ('uk', 3.111113663287929e-07), ('pt', 2.974118745522462e-07), ('km', 2.45400261412045e-07), ('nb', 2.44311393421879e-07), ('br', 2.361262091649684e-07), ('is', 1.6744621965858352e-07), ('vo', 1.1623863610897373e-07), ('sl', 1.1255238112569407e-07), ('sk', 1.0188625377914079e-07), ('se', 8.937200752241362e-08), ('ca', 8.220039832460332e-08), ('am', 7.377499976691904e-08), ('bs', 6.242983133332428e-08), ('da', 5.65589837074933e-08), ('tr', 4.7950588802830366e-08), ('bn', 4.1764456763934405e-08), ('kk', 3.696808528126726e-08), ('gu', 3.486169811098909e-08), ('hr', 2.7307941417741773e-08), ('pa', 2.7150978056364916e-08), ('as', 2.542018389827033e-08), ('az', 2.4425081481129985e-08), ('ka', 2.2423528126520916e-08), ('lt', 2.0958811561874612e-08), ('eu', 2.0343842634025164e-08), ('mr', 1.7358803103263064e-08), ('ur', 1.5349310111786227e-08), ('et', 1.5281946788061445e-08), ('or', 1.3650943533411975e-08), ('mk', 1.3538199094114394e-08), ('hi', 1.126846960438557e-08), ('fa', 1.1236394202902213e-08), ('sr', 1.1179996653889562e-08), ('ml', 1.0757890112433114e-08), ('te', 9.296997663016953e-09), ('lv', 8.609086486177823e-09), ('kn', 7.472269511116448e-09), ('af', 7.136002539508618e-09), ('fi', 6.996691832623385e-09), ('ku', 6.931737527143431e-09), ('sv', 5.184914721072085e-09), ('mt', 5.101252483893451e-09), ('ne', 4.379418035731972e-09), ('mg', 4.041719069597912e-09), ('hu', 3.848573321874355e-09), ('sq', 3.831580821056942e-09), ('vi', 3.205823430957484e-09), ('ug', 1.6902740438470224e-09), ('ga', 1.1843145772621335e-09), ('ta', 1.0153927556584555e-09), ('gl', 9.850247142407677e-10), ('cs', 9.611533181894737e-10), ('wa', 8.327958886567568e-10), ('be', 5.0586464217473225e-11), ('dz', 4.876116023858696e-14)]
>>> langid.rank("️wunderbar")
[('zh', 0.9928204705819458), ('ja', 0.0065785229592405705), ('ko', 0.00038846511466535897), ('ar', 4.1408258069092674e-05), ('jv', 3.5224872068297655e-05), ('en', 1.6612952762747267e-05), ('de', 1.4317074941863288e-05), ('qu', 1.244498916800758e-05), ('ru', 1.2175607654316526e-05), ('lb', 1.2110582975832658e-05), ('ms', 1.0621617421597854e-05), ('el', 7.080716875555405e-06), ('la', 6.181768270527322e-06), ('bg', 5.722886583220138e-06), ('rw', 5.383872388674567e-06), ('no', 4.087949465473779e-06), ('fo', 3.544647301692967e-06), ('th', 3.225987949444041e-06), ('ky', 2.2206115855257554e-06), ('nl', 1.607727472526966e-06), ('hy', 1.4007056676162333e-06), ('zu', 1.3214916491765556e-06), ('sw', 1.2498255366416236e-06), ('si', 1.1304218558942313e-06), ('ro', 9.505256464019786e-07), ('cy', 8.460239489735446e-07), ('an', 7.772158405223286e-07), ('ps', 7.303143910130873e-07), ('ht', 6.840844861484241e-07), ('pl', 6.501235175370973e-07), ('id', 6.439492422441017e-07), ('oc', 6.416247130132337e-07), ('fr', 6.145676129535128e-07), ('eo', 6.112717723472432e-07), ('xh', 5.864737297378976e-07), ('it', 5.754600361422645e-07), ('he', 4.604095080825112e-07), ('lo', 4.3820524153448525e-07), ('nn', 4.372386499345569e-07), ('es', 4.3404833143189604e-07), ('tl', 3.3903924149577855e-07), ('mn', 3.1742695156126506e-07), ('uk', 3.111113663287929e-07), ('pt', 2.974118745522462e-07), ('km', 2.45400261412045e-07), ('nb', 2.44311393421879e-07), ('br', 2.361262091649684e-07), ('is', 1.6744621965858352e-07), ('vo', 1.1623863610897373e-07), ('sl', 1.1255238112569407e-07), ('sk', 1.0188625377914079e-07), ('se', 8.937200752241362e-08), ('ca', 8.220039832460332e-08), ('am', 7.377499976691904e-08), ('bs', 6.242983133332428e-08), ('da', 5.65589837074933e-08), ('tr', 4.7950588802830366e-08), ('bn', 4.1764456763934405e-08), ('kk', 3.696808528126726e-08), ('gu', 3.486169811098909e-08), ('hr', 2.7307941417741773e-08), ('pa', 2.7150978056364916e-08), ('as', 2.542018389827033e-08), ('az', 2.4425081481129985e-08), ('ka', 2.2423528126520916e-08), ('lt', 2.0958811561874612e-08), ('eu', 2.0343842634025164e-08), ('mr', 1.7358803103263064e-08), ('ur', 1.5349310111786227e-08), ('et', 1.5281946788061445e-08), ('or', 1.3650943533411975e-08), ('mk', 1.3538199094114394e-08), ('hi', 1.126846960438557e-08), ('fa', 1.1236394202902213e-08), ('sr', 1.1179996653889562e-08), ('ml', 1.0757890112433114e-08), ('te', 9.296997663016953e-09), ('lv', 8.609086486177823e-09), ('kn', 7.472269511116448e-09), ('af', 7.136002539508618e-09), ('fi', 6.996691832623385e-09), ('ku', 6.931737527143431e-09), ('sv', 5.184914721072085e-09), ('mt', 5.101252483893451e-09), ('ne', 4.379418035731972e-09), ('mg', 4.041719069597912e-09), ('hu', 3.848573321874355e-09), ('sq', 3.831580821056942e-09), ('vi', 3.205823430957484e-09), ('ug', 1.6902740438470224e-09), ('ga', 1.1843145772621335e-09), ('ta', 1.0153927556584555e-09), ('gl', 9.850247142407677e-10), ('cs', 9.611533181894737e-10), ('wa', 8.327958886567568e-10), ('be', 5.0586464217473225e-11), ('dz', 4.876116023858696e-14)]
>>> langid.rank("️wunderbar")
[('zh', 0.9928204705819458), ('ja', 0.0065785229592405705), ('ko', 0.00038846511466535897), ('ar', 4.1408258069092674e-05), ('jv', 3.5224872068297655e-05), ('en', 1.6612952762747267e-05), ('de', 1.4317074941863288e-05), ('qu', 1.244498916800758e-05), ('ru', 1.2175607654316526e-05), ('lb', 1.2110582975832658e-05), ('ms', 1.0621617421597854e-05), ('el', 7.080716875555405e-06), ('la', 6.181768270527322e-06), ('bg', 5.722886583220138e-06), ('rw', 5.383872388674567e-06), ('no', 4.087949465473779e-06), ('fo', 3.544647301692967e-06), ('th', 3.225987949444041e-06), ('ky', 2.2206115855257554e-06), ('nl', 1.607727472526966e-06), ('hy', 1.4007056676162333e-06), ('zu', 1.3214916491765556e-06), ('sw', 1.2498255366416236e-06), ('si', 1.1304218558942313e-06), ('ro', 9.505256464019786e-07), ('cy', 8.460239489735446e-07), ('an', 7.772158405223286e-07), ('ps', 7.303143910130873e-07), ('ht', 6.840844861484241e-07), ('pl', 6.501235175370973e-07), ('id', 6.439492422441017e-07), ('oc', 6.416247130132337e-07), ('fr', 6.145676129535128e-07), ('eo', 6.112717723472432e-07), ('xh', 5.864737297378976e-07), ('it', 5.754600361422645e-07), ('he', 4.604095080825112e-07), ('lo', 4.3820524153448525e-07), ('nn', 4.372386499345569e-07), ('es', 4.3404833143189604e-07), ('tl', 3.3903924149577855e-07), ('mn', 3.1742695156126506e-07), ('uk', 3.111113663287929e-07), ('pt', 2.974118745522462e-07), ('km', 2.45400261412045e-07), ('nb', 2.44311393421879e-07), ('br', 2.361262091649684e-07), ('is', 1.6744621965858352e-07), ('vo', 1.1623863610897373e-07), ('sl', 1.1255238112569407e-07), ('sk', 1.0188625377914079e-07), ('se', 8.937200752241362e-08), ('ca', 8.220039832460332e-08), ('am', 7.377499976691904e-08), ('bs', 6.242983133332428e-08), ('da', 5.65589837074933e-08), ('tr', 4.7950588802830366e-08), ('bn', 4.1764456763934405e-08), ('kk', 3.696808528126726e-08), ('gu', 3.486169811098909e-08), ('hr', 2.7307941417741773e-08), ('pa', 2.7150978056364916e-08), ('as', 2.542018389827033e-08), ('az', 2.4425081481129985e-08), ('ka', 2.2423528126520916e-08), ('lt', 2.0958811561874612e-08), ('eu', 2.0343842634025164e-08), ('mr', 1.7358803103263064e-08), ('ur', 1.5349310111786227e-08), ('et', 1.5281946788061445e-08), ('or', 1.3650943533411975e-08), ('mk', 1.3538199094114394e-08), ('hi', 1.126846960438557e-08), ('fa', 1.1236394202902213e-08), ('sr', 1.1179996653889562e-08), ('ml', 1.0757890112433114e-08), ('te', 9.296997663016953e-09), ('lv', 8.609086486177823e-09), ('kn', 7.472269511116448e-09), ('af', 7.136002539508618e-09), ('fi', 6.996691832623385e-09), ('ku', 6.931737527143431e-09), ('sv', 5.184914721072085e-09), ('mt', 5.101252483893451e-09), ('ne', 4.379418035731972e-09), ('mg', 4.041719069597912e-09), ('hu', 3.848573321874355e-09), ('sq', 3.831580821056942e-09), ('vi', 3.205823430957484e-09), ('ug', 1.6902740438470224e-09), ('ga', 1.1843145772621335e-09), ('ta', 1.0153927556584555e-09), ('gl', 9.850247142407677e-10), ('cs', 9.611533181894737e-10), ('wa', 8.327958886567568e-10), ('be', 5.0586464217473225e-11), ('dz', 4.876116023858696e-14)]
>>> langid.rank("️wunderbar ")
[('zh', 0.9928204705819458), ('ja', 0.0065785229592405705), ('ko', 0.00038846511466535897), ('ar', 4.1408258069092674e-05), ('jv', 3.5224872068297655e-05), ('en', 1.6612952762747267e-05), ('de', 1.4317074941863288e-05), ('qu', 1.244498916800758e-05), ('ru', 1.2175607654316526e-05), ('lb', 1.2110582975832658e-05), ('ms', 1.0621617421597854e-05), ('el', 7.080716875555405e-06), ('la', 6.181768270527322e-06), ('bg', 5.722886583220138e-06), ('rw', 5.383872388674567e-06), ('no', 4.087949465473779e-06), ('fo', 3.544647301692967e-06), ('th', 3.225987949444041e-06), ('ky', 2.2206115855257554e-06), ('nl', 1.607727472526966e-06), ('hy', 1.4007056676162333e-06), ('zu', 1.3214916491765556e-06), ('sw', 1.2498255366416236e-06), ('si', 1.1304218558942313e-06), ('ro', 9.505256464019786e-07), ('cy', 8.460239489735446e-07), ('an', 7.772158405223286e-07), ('ps', 7.303143910130873e-07), ('ht', 6.840844861484241e-07), ('pl', 6.501235175370973e-07), ('id', 6.439492422441017e-07), ('oc', 6.416247130132337e-07), ('fr', 6.145676129535128e-07), ('eo', 6.112717723472432e-07), ('xh', 5.864737297378976e-07), ('it', 5.754600361422645e-07), ('he', 4.604095080825112e-07), ('lo', 4.3820524153448525e-07), ('nn', 4.372386499345569e-07), ('es', 4.3404833143189604e-07), ('tl', 3.3903924149577855e-07), ('mn', 3.1742695156126506e-07), ('uk', 3.111113663287929e-07), ('pt', 2.974118745522462e-07), ('km', 2.45400261412045e-07), ('nb', 2.44311393421879e-07), ('br', 2.361262091649684e-07), ('is', 1.6744621965858352e-07), ('vo', 1.1623863610897373e-07), ('sl', 1.1255238112569407e-07), ('sk', 1.0188625377914079e-07), ('se', 8.937200752241362e-08), ('ca', 8.220039832460332e-08), ('am', 7.377499976691904e-08), ('bs', 6.242983133332428e-08), ('da', 5.65589837074933e-08), ('tr', 4.7950588802830366e-08), ('bn', 4.1764456763934405e-08), ('kk', 3.696808528126726e-08), ('gu', 3.486169811098909e-08), ('hr', 2.7307941417741773e-08), ('pa', 2.7150978056364916e-08), ('as', 2.542018389827033e-08), ('az', 2.4425081481129985e-08), ('ka', 2.2423528126520916e-08), ('lt', 2.0958811561874612e-08), ('eu', 2.0343842634025164e-08), ('mr', 1.7358803103263064e-08), ('ur', 1.5349310111786227e-08), ('et', 1.5281946788061445e-08), ('or', 1.3650943533411975e-08), ('mk', 1.3538199094114394e-08), ('hi', 1.126846960438557e-08), ('fa', 1.1236394202902213e-08), ('sr', 1.1179996653889562e-08), ('ml', 1.0757890112433114e-08), ('te', 9.296997663016953e-09), ('lv', 8.609086486177823e-09), ('kn', 7.472269511116448e-09), ('af', 7.136002539508618e-09), ('fi', 6.996691832623385e-09), ('ku', 6.931737527143431e-09), ('sv', 5.184914721072085e-09), ('mt', 5.101252483893451e-09), ('ne', 4.379418035731972e-09), ('mg', 4.041719069597912e-09), ('hu', 3.848573321874355e-09), ('sq', 3.831580821056942e-09), ('vi', 3.205823430957484e-09), ('ug', 1.6902740438470224e-09), ('ga', 1.1843145772621335e-09), ('ta', 1.0153927556584555e-09), ('gl', 9.850247142407677e-10), ('cs', 9.611533181894737e-10), ('wa', 8.327958886567568e-10), ('be', 5.0586464217473225e-11), ('dz', 4.876116023858696e-14)]
>>> langid.rank("️wunderbar wunderbar")
[('zh', 0.8292747652785193), ('jv', 0.08584567244536759), ('de', 0.05102290406741519), ('ms', 0.01846538938672487), ('rw', 0.011278761798545766), ('zu', 0.001671397137805495), ('lb', 0.0010829355842848353), ('qu', 0.0005832810973334379), ('xh', 0.00039822810842440767), ('la', 9.621298010155744e-05), ('fo', 7.108813895775326e-05), ('ko', 7.034607461032755e-05), ('en', 4.073465186853253e-05), ('ky', 1.9835709320975084e-05), ('ja', 1.5937070356101868e-05), ('ht', 1.4811547418357725e-05), ('sw', 9.623311954126167e-06), ('an', 7.5723443536668756e-06), ('cy', 5.496016488282588e-06), ('lo', 4.798035215605e-06), ('se', 4.531292165616887e-06), ('pl', 3.3810621649029084e-06), ('no', 2.367282541317847e-06), ('oc', 1.8883109960008556e-06), ('ar', 1.6070814782777182e-06), ('tl', 1.4810688852051008e-06), ('ps', 1.287723280469018e-06), ('hy', 8.319399193182018e-07), ('si', 6.467997220342434e-07), ('nl', 3.201203064538715e-07), ('br', 2.6680496108931717e-07), ('eo', 2.5979373360746573e-07), ('ku', 2.3751078530068452e-07), ('id', 2.1944025320190693e-07), ('vo', 1.5784998639426843e-07), ('is', 1.5328583556787233e-07), ('it', 1.387115378800745e-07), ('am', 1.0543621642926642e-07), ('nn', 4.710963987507007e-08), ('km', 3.68933519449979e-08), ('th', 3.0197194884313876e-08), ('af', 2.4877234695711385e-08), ('mn', 2.295340614913973e-08), ('bs', 2.0674611361045597e-08), ('et', 1.7926741007217306e-08), ('mg', 1.6386129512777282e-08), ('el', 1.543390389861973e-08), ('he', 1.0329136258015913e-08), ('nb', 1.0140603281735252e-08), ('es', 7.768440450072785e-09), ('ru', 7.205224634719422e-09), ('fr', 7.2035413291483e-09), ('lt', 6.776419793400924e-09), ('az', 6.7718846384117305e-09), ('kk', 6.649133365127158e-09), ('ka', 5.110711821284464e-09), ('mt', 3.11261687857195e-09), ('tr', 3.0043952397806197e-09), ('ro', 2.9819815991412353e-09), ('hr', 2.9688620536567503e-09), ('ur', 2.083914839620625e-09), ('pt', 1.827458771087846e-09), ('sk', 1.6569108132171535e-09), ('eu', 1.496275461961667e-09), ('sl', 1.0752757039619155e-09), ('ca', 9.350437689656174e-10), ('wa', 8.223215529098318e-10), ('ga', 7.69030137603146e-10), ('ug', 7.681815073650124e-10), ('bn', 7.576498721150876e-10), ('as', 6.69440195142181e-10), ('gu', 5.94424522293588e-10), ('sq', 2.5479751515843107e-10), ('fa', 2.354233192136297e-10), ('pa', 1.833278456497039e-10), ('bg', 1.712831336781061e-10), ('lv', 1.5552927754246522e-10), ('or', 1.4212304185680015e-10), ('mr', 1.0481709860685237e-10), ('uk', 9.61421980553661e-11), ('da', 8.40995816588907e-11), ('ml', 8.05270920986517e-11), ('te', 5.7679102107932226e-11), ('fi', 3.483024399647947e-11), ('hi', 3.2173534391263656e-11), ('mk', 3.1599836395053165e-11), ('kn', 3.089330122267583e-11), ('vi', 1.8878947474031977e-11), ('sv', 1.7945383911693068e-11), ('gl', 9.909427701342852e-12), ('ne', 7.854058383954012e-12), ('hu', 7.708246189105303e-12), ('ta', 6.398434520445388e-12), ('sr', 4.727314889403123e-12), ('cs', 2.770877495246804e-13), ('be', 1.279259144409701e-13), ('dz', 4.078417757612451e-17)]
>>> langid.rank('wunderbar')
[('de', 0.6503315690434672), ('en', 0.11596045044921321), ('zh', 0.0699736942100247), ('pl', 0.05783840321052503), ('xh', 0.018145338183298606), ('ms', 0.011032574595175405), ('nl', 0.00893483035667102), ('es', 0.008071226823772837), ('rw', 0.007877222399353033), ('mt', 0.0075871671223098625), ('it', 0.00685146390768504), ('fr', 0.0052717838386513146), ('zu', 0.004737085004634088), ('jv', 0.004131051251327707), ('pt', 0.0024858543866690752), ('ar', 0.002234159234979442), ('ko', 0.0017976489711238274), ('et', 0.0017118559384868492), ('ja', 0.0013351516450480208), ('cy', 0.0010983563884041131), ('lt', 0.0009588132812860693), ('no', 0.000951013381983051), ('lb', 0.0008369332838098044), ('id', 0.0007867488463552349), ('sk', 0.0006723864083168762), ('ku', 0.0005425678843744174), ('tr', 0.00045886496392803837), ('eo', 0.0004411268627940554), ('sv', 0.0004150147400791327), ('oc', 0.0003616025821079207), ('la', 0.0002771880561695966), ('fi', 0.0002623161454800777), ('br', 0.0002448507642931113), ('qu', 0.00024050601384825444), ('eu', 0.00023655454563857104), ('hu', 0.00022148160257201647), ('ro', 0.00021347242595583783), ('lv', 0.0002070085834320212), ('tl', 0.00019823868588437078), ('ca', 0.00018849873963471987), ('vo', 0.00017631869249845042), ('ht', 0.00016812866637903447), ('is', 0.0001680612569300022), ('hr', 0.00016766064521404303), ('sw', 0.00016280436223774648), ('da', 0.00015972724722411151), ('he', 0.00015825059510499629), ('vi', 0.00014855581572915312), ('sl', 0.00014527916537300623), ('an', 0.00013956089703051055), ('af', 0.0001307108551913882), ('ru', 0.00012991449800521366), ('ga', 0.00011646476578160291), ('se', 0.000112560167112669), ('nb', 0.00010735828554482774), ('nn', 0.00010587862156044339), ('ka', 0.00010215390040238935), ('am', 9.988917179840933e-05), ('fa', 9.428928623042383e-05), ('fo', 9.101878230507523e-05), ('el', 8.599944639792726e-05), ('bs', 7.667452330378166e-05), ('az', 6.777789021556405e-05), ('lo', 6.725303600762494e-05), ('mg', 6.365897350608821e-05), ('th', 6.106936829422913e-05), ('gl', 6.070326505516053e-05), ('si', 5.63709107752398e-05), ('wa', 5.3856638038600104e-05), ('cs', 5.335790265486965e-05), ('ky', 4.7245670884845804e-05), ('sq', 4.7170457494920505e-05), ('ps', 4.198177975397085e-05), ('bn', 3.5824236071314225e-05), ('ur', 3.530273629808421e-05), ('ml', 2.8088117524393276e-05), ('gu', 2.3353586302661594e-05), ('uk', 2.3104600159857556e-05), ('kk', 2.2985646975380097e-05), ('hy', 2.2625192930896517e-05), ('ta', 2.1042104415353486e-05), ('te', 2.0333812825109273e-05), ('bg', 2.0003375454411503e-05), ('ug', 1.979124500466659e-05), ('mr', 1.762000965093424e-05), ('mn', 1.5181620903080626e-05), ('as', 1.4511621463594436e-05), ('pa', 1.4474134160880863e-05), ('km', 1.4009833344404556e-05), ('mk', 1.3578006175154581e-05), ('hi', 1.0256809421089112e-05), ('sr', 8.683693637506968e-06), ('be', 8.630821071501332e-06), ('or', 7.3735619034962e-06), ('kn', 5.227389383813129e-06), ('ne', 4.494828285537928e-06), ('dz', 3.7227178371875574e-06)]

Training on Windows returns error at DFfeatureselect.py step

I'm trying to train a new language identifier model on my own language dataset. Unfortunately, it crashes at the DFfeatureselect.py step with a "TypeError: marshal.load() arg must be file" error message. Below is the log up to the crash point.

C:\langid.py-master\langid\train>C:\Python27\python.exe train.py corpus
corpus path: corpus
model path: ..model
langs(22): el(26) eo(42) en(1674) af(285) ca(287) am(2426) an(226) cy(79) ar(82) cs(432) et(449) az(534) es(457) be(292) bg(818) bn(65) de(2795) da(90) dz(220) br(532) bs(493) as(101)
domains(1): domain(12405)
identified 12405 documents
will tokenize 12405 documents
using byte NGram tokenizer, max_order: 4
chunk size: 50 (249 chunks)
job count: 8
whole-document tokenization
tokenized chunk (1/249) [11880 keys]
tokenized chunk (2/249) [12305 keys]
tokenized chunk (3/249) [10517 keys]
tokenized chunk (4/249) [18799 keys]
tokenized chunk (5/249) [17955 keys]
tokenized chunk (6/249) [6092 keys]
tokenized chunk (7/249) [21901 keys]
tokenized chunk (8/249) [11344 keys]
tokenized chunk (9/249) [6342 keys]
tokenized chunk (10/249) [6499 keys]
tokenized chunk (11/249) [5452 keys]
tokenized chunk (12/249) [5734 keys]
tokenized chunk (13/249) [6204 keys]
tokenized chunk (14/249) [5252 keys]
tokenized chunk (15/249) [6565 keys]
tokenized chunk (16/249) [3035 keys]
tokenized chunk (17/249) [2157 keys]
tokenized chunk (18/249) [9931 keys]
tokenized chunk (19/249) [8004 keys]
tokenized chunk (20/249) [5949 keys]
tokenized chunk (21/249) [8345 keys]
tokenized chunk (22/249) [13381 keys]
tokenized chunk (23/249) [18026 keys]
tokenized chunk (24/249) [15978 keys]
tokenized chunk (25/249) [12526 keys]
tokenized chunk (26/249) [17599 keys]
tokenized chunk (27/249) [11572 keys]
tokenized chunk (28/249) [18360 keys]
tokenized chunk (29/249) [8206 keys]
tokenized chunk (30/249) [11074 keys]
tokenized chunk (31/249) [14938 keys]
tokenized chunk (32/249) [12470 keys]
tokenized chunk (33/249) [10483 keys]
tokenized chunk (34/249) [14454 keys]
tokenized chunk (35/249) [9515 keys]
tokenized chunk (36/249) [10757 keys]
tokenized chunk (37/249) [8575 keys]
tokenized chunk (38/249) [13322 keys]
tokenized chunk (39/249) [8586 keys]
tokenized chunk (40/249) [8388 keys]
tokenized chunk (41/249) [16794 keys]
tokenized chunk (42/249) [6053 keys]
tokenized chunk (43/249) [8165 keys]
tokenized chunk (44/249) [4032 keys]
tokenized chunk (45/249) [3898 keys]
tokenized chunk (46/249) [3113 keys]
tokenized chunk (47/249) [2738 keys]
tokenized chunk (48/249) [12874 keys]
tokenized chunk (49/249) [7597 keys]
tokenized chunk (50/249) [4921 keys]
tokenized chunk (51/249) [3117 keys]
tokenized chunk (52/249) [8515 keys]
tokenized chunk (53/249) [9234 keys]
tokenized chunk (54/249) [13384 keys]
tokenized chunk (55/249) [13649 keys]
tokenized chunk (56/249) [13531 keys]
tokenized chunk (57/249) [12832 keys]
tokenized chunk (58/249) [12293 keys]
tokenized chunk (59/249) [25620 keys]
tokenized chunk (60/249) [6443 keys]
tokenized chunk (61/249) [15453 keys]
tokenized chunk (62/249) [10807 keys]
tokenized chunk (63/249) [19978 keys]
tokenized chunk (64/249) [44970 keys]
tokenized chunk (65/249) [14168 keys]
tokenized chunk (66/249) [12106 keys]
tokenized chunk (67/249) [27309 keys]
tokenized chunk (68/249) [12115 keys]
tokenized chunk (69/249) [20707 keys]
tokenized chunk (70/249) [19919 keys]
tokenized chunk (71/249) [11967 keys]
tokenized chunk (72/249) [16046 keys]
tokenized chunk (73/249) [8409 keys]
tokenized chunk (74/249) [20964 keys]
tokenized chunk (75/249) [12275 keys]
tokenized chunk (76/249) [16301 keys]
tokenized chunk (77/249) [12272 keys]
tokenized chunk (78/249) [21592 keys]
tokenized chunk (79/249) [19530 keys]
tokenized chunk (80/249) [17342 keys]
tokenized chunk (81/249) [19946 keys]
tokenized chunk (82/249) [15298 keys]
tokenized chunk (83/249) [17531 keys]
tokenized chunk (84/249) [17299 keys]
tokenized chunk (85/249) [24131 keys]
tokenized chunk (86/249) [16513 keys]
tokenized chunk (87/249) [19510 keys]
tokenized chunk (88/249) [14266 keys]
tokenized chunk (89/249) [22952 keys]
tokenized chunk (90/249) [15482 keys]
tokenized chunk (91/249) [15573 keys]
tokenized chunk (92/249) [20496 keys]
tokenized chunk (93/249) [18156 keys]
tokenized chunk (94/249) [22490 keys]
tokenized chunk (95/249) [29002 keys]
tokenized chunk (96/249) [20352 keys]
tokenized chunk (97/249) [44165 keys]
tokenized chunk (98/249) [34627 keys]
tokenized chunk (99/249) [49905 keys]
tokenized chunk (100/249) [53103 keys]
tokenized chunk (101/249) [51983 keys]
tokenized chunk (102/249) [31038 keys]
tokenized chunk (103/249) [31409 keys]
tokenized chunk (104/249) [33165 keys]
tokenized chunk (105/249) [37822 keys]
tokenized chunk (106/249) [10940 keys]
tokenized chunk (107/249) [71118 keys]
tokenized chunk (108/249) [38858 keys]
tokenized chunk (109/249) [37634 keys]
tokenized chunk (110/249) [51967 keys]
tokenized chunk (111/249) [56836 keys]
tokenized chunk (112/249) [27115 keys]
tokenized chunk (113/249) [15849 keys]
tokenized chunk (114/249) [14734 keys]
tokenized chunk (115/249) [26009 keys]
tokenized chunk (116/249) [19294 keys]
tokenized chunk (117/249) [32044 keys]
tokenized chunk (118/249) [29201 keys]
tokenized chunk (119/249) [39628 keys]
tokenized chunk (120/249) [6244 keys]
tokenized chunk (121/249) [7435 keys]
tokenized chunk (122/249) [21227 keys]
tokenized chunk (123/249) [29732 keys]
tokenized chunk (124/249) [35250 keys]
tokenized chunk (125/249) [10271 keys]
tokenized chunk (126/249) [32891 keys]
tokenized chunk (127/249) [7873 keys]
tokenized chunk (128/249) [10418 keys]
tokenized chunk (129/249) [7311 keys]
tokenized chunk (130/249) [9516 keys]
tokenized chunk (131/249) [11074 keys]
tokenized chunk (132/249) [15263 keys]
tokenized chunk (133/249) [11205 keys]
tokenized chunk (134/249) [8567 keys]
tokenized chunk (135/249) [7678 keys]
tokenized chunk (136/249) [44950 keys]
tokenized chunk (137/249) [21967 keys]
tokenized chunk (138/249) [35438 keys]
tokenized chunk (139/249) [49606 keys]
tokenized chunk (140/249) [55683 keys]
tokenized chunk (141/249) [49369 keys]
tokenized chunk (142/249) [48286 keys]
tokenized chunk (143/249) [44039 keys]
tokenized chunk (144/249) [11811 keys]
tokenized chunk (145/249) [41120 keys]
tokenized chunk (146/249) [69629 keys]
tokenized chunk (147/249) [70067 keys]
tokenized chunk (148/249) [46883 keys]
tokenized chunk (149/249) [52358 keys]
tokenized chunk (150/249) [127523 keys]
tokenized chunk (151/249) [37044 keys]
tokenized chunk (152/249) [74712 keys]
tokenized chunk (153/249) [63824 keys]
tokenized chunk (154/249) [55408 keys]
tokenized chunk (155/249) [61234 keys]
tokenized chunk (156/249) [54418 keys]
tokenized chunk (157/249) [39921 keys]
tokenized chunk (158/249) [62581 keys]
tokenized chunk (159/249) [71439 keys]
tokenized chunk (160/249) [53094 keys]
tokenized chunk (161/249) [76232 keys]
tokenized chunk (162/249) [36778 keys]
tokenized chunk (163/249) [71083 keys]
tokenized chunk (164/249) [71121 keys]
tokenized chunk (165/249) [54315 keys]
tokenized chunk (166/249) [62550 keys]
tokenized chunk (167/249) [67024 keys]
tokenized chunk (168/249) [69247 keys]
tokenized chunk (169/249) [66758 keys]
tokenized chunk (170/249) [54992 keys]
tokenized chunk (171/249) [62659 keys]
tokenized chunk (172/249) [60409 keys]
tokenized chunk (173/249) [44923 keys]
tokenized chunk (174/249) [43095 keys]
tokenized chunk (175/249) [50332 keys]
tokenized chunk (176/249) [62506 keys]
tokenized chunk (177/249) [51782 keys]
tokenized chunk (178/249) [71541 keys]
tokenized chunk (179/249) [63289 keys]
tokenized chunk (180/249) [85046 keys]
tokenized chunk (181/249) [63942 keys]
tokenized chunk (182/249) [58598 keys]
tokenized chunk (183/249) [63150 keys]
tokenized chunk (184/249) [47424 keys]
tokenized chunk (185/249) [65839 keys]
tokenized chunk (186/249) [93418 keys]
tokenized chunk (187/249) [12910 keys]
tokenized chunk (188/249) [53958 keys]
tokenized chunk (189/249) [37259 keys]
tokenized chunk (190/249) [11532 keys]
tokenized chunk (191/249) [52861 keys]
tokenized chunk (192/249) [14390 keys]
tokenized chunk (193/249) [11546 keys]
tokenized chunk (194/249) [43913 keys]
tokenized chunk (195/249) [66130 keys]
tokenized chunk (196/249) [10962 keys]
tokenized chunk (197/249) [9993 keys]
tokenized chunk (198/249) [11903 keys]
tokenized chunk (199/249) [28550 keys]
tokenized chunk (200/249) [10199 keys]
tokenized chunk (201/249) [11053 keys]
tokenized chunk (202/249) [11845 keys]
tokenized chunk (203/249) [10557 keys]
tokenized chunk (204/249) [10736 keys]
tokenized chunk (205/249) [19925 keys]
tokenized chunk (206/249) [18973 keys]
tokenized chunk (207/249) [22198 keys]
tokenized chunk (208/249) [13544 keys]
tokenized chunk (209/249) [12096 keys]
tokenized chunk (210/249) [10717 keys]
tokenized chunk (211/249) [23275 keys]
tokenized chunk (212/249) [11339 keys]
tokenized chunk (213/249) [11669 keys]
tokenized chunk (214/249) [12482 keys]
tokenized chunk (215/249) [15175 keys]
tokenized chunk (216/249) [53832 keys]
tokenized chunk (217/249) [52319 keys]
tokenized chunk (218/249) [51782 keys]
tokenized chunk (219/249) [48032 keys]
tokenized chunk (220/249) [44353 keys]
tokenized chunk (221/249) [47209 keys]
tokenized chunk (222/249) [43914 keys]
tokenized chunk (223/249) [48074 keys]
tokenized chunk (224/249) [27881 keys]
tokenized chunk (225/249) [39001 keys]
tokenized chunk (226/249) [41330 keys]
tokenized chunk (227/249) [45242 keys]
tokenized chunk (228/249) [51633 keys]
tokenized chunk (229/249) [38759 keys]
tokenized chunk (230/249) [33628 keys]
tokenized chunk (231/249) [37245 keys]
tokenized chunk (232/249) [28676 keys]
tokenized chunk (233/249) [40631 keys]
tokenized chunk (234/249) [37609 keys]
tokenized chunk (235/249) [41072 keys]
tokenized chunk (236/249) [39166 keys]
tokenized chunk (237/249) [42001 keys]
tokenized chunk (238/249) [14521 keys]
tokenized chunk (239/249) [43873 keys]
tokenized chunk (240/249) [5256 keys]
tokenized chunk (241/249) [5307 keys]
tokenized chunk (242/249) [15233 keys]
tokenized chunk (243/249) [34008 keys]
tokenized chunk (244/249) [16667 keys]
tokenized chunk (245/249) [7618 keys]
tokenized chunk (246/249) [18999 keys]
tokenized chunk (247/249) [17754 keys]
tokenized chunk (248/249) [22048 keys]
tokenized chunk (249/249) [21140 keys]
Traceback (most recent call last):
  File "train.py", line 196, in <module>
    doc_count = tally(b_dirs, args.jobs)
  File "C:\langid.py-master\langid\train\DFfeatureselect.py", line 92, in tally
    for i, keycount in enumerate(pass_sum_df_out):
  File "C:\Python27\lib\multiprocessing\pool.py", line 620, in next
    raise value
TypeError: marshal.load() arg must be file

pip install langid does not work

Hello,
I am trying to install the library with pip on OS X 10.6, Python 3.3.
Running the pip command, I receive the following:

$ pip install langid
Downloading/unpacking langid
Could not find a version that satisfies the requirement langid (from versions: 1.0dev, 1.1.1dev, 1.1.2dev, 1.1.3dev, 1.1.4dev, 1.1dev)
Cleaning up...
No distributions matching the version for langid
Storing debug log for failure in /Users/xx/.pip/pip.log

So I downloaded the package and tried to install it without pip

$ python setup.py install

Looked good until here:

Extracting langid-1.1.4dev-py2.6.egg to /Library/Python/2.6/site-packages
SyntaxError: ('invalid syntax', ('/Library/Python/2.6/site-packages/langid-1.1.4dev-py2.6.egg/langid/train/common.py', 39, 34, "  with gzip.open(path, 'rb') as f, tempfile.TemporaryFile() as t:\n"))

Adding langid 1.1.4dev to easy-install.pth file
error: /Library/Python/2.6/site-packages/easy-install.pth: Permission denied

Why isn't it installing? Furthermore, I don't want it to install with Python 2.6 (I guess that's the default Python version of OS X 10.6) but with Python 3.3.

incorrect classification

written in Russian, classified correctly

>>> langid.classify('пожалуйста')
('ru', 0.9189426432579942)

written in Russian, classified incorrectly

>>> langid.classify('пример текста на русском')
('sr', 0.9431929502384854)

written in Russian, classified incorrectly

>>> langid.classify('вомбат')
('uk', 0.4581717399240493)

written in Hebrew, classified incorrectly

>>> langid.classify('וְאָהַבְתָּ אֵת יְיָ | אֱלֹהֶיךָ, בְּכָל-לְבָֽבְךָ, וּבְכָל-נַפְשְׁךָ, וּבְכָל-מְאֹדֶֽךָ. וְהָיוּ הַדְּבָרִים הָאֵלֶּה, אֲשֶׁר | אָֽנֹכִי מְצַוְּ')
('ko', 1.0)

written in Old English, classified incorrectly (not sure that this example is a fair one)

>>> langid.classify(' On þyssum geare man halgode þet mynster æt Westmynstre on Cyldamæsse dæg 7 se cyng Eadward forðferde on Twelfts mæsse æfen 7 hine mann bebyrgede on Twelftan')
('is', 0.9999936552704334)

Mixed languages polarized by "en"

I have the following text, which is a mix of English and Sesotho:

>>>Ska rebona re phela\nKgale re sokola rona re phelela mmino\nO skang potja ka dilo\nKgale re sokola rona re phelela mmino (×2)\nWe five minutes from freedom\nSomebody tell my mama

With the whole text as-is, I get a wrong identification:

>>> Ska rebona re phela\nKgale re sokola rona re phelela mmino\nO skang potja ka dilo\nKgale re sokola rona re phelela mmino (×2)\nWe five minutes from freedom\nSomebody tell my mama
('en', -233.38300132751465)

Removing all the English sentences, I get the right ISO 639-1 language code, sl:

>>> Ska rebona re phela\nKgale re sokola rona re phelela mmino\nO skang potja ka dilo\nKgale re sokola rona re phelela mmino                                                       
('sl', -154.34662437438965)

Even keeping one English sentence, the right language is still recognized:

>>> Ska rebona re phela\nKgale re sokola rona re phelela mmino\nO skang potja ka dilo\nKgale re sokola rona re phelela mmino (×2)\nWe five minutes from freedom                    
('sl', -217.8226833343506)

So, it seems that the detector is being "polarized" by the en sentences in this phrase.
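One possible mitigation for mixed-language text like this is to classify smaller units (for example, per line) and look at the distribution of labels, rather than letting the dominant language absorb the whole document; a sketch (just a heuristic on top of classify, not a langid feature):

import langid

text = ("Ska rebona re phela\n"
        "Kgale re sokola rona re phelela mmino\n"
        "We five minutes from freedom\n"
        "Somebody tell my mama")

# classify each line separately instead of the whole blob
for line in text.splitlines():
    print("%s -> %s" % (line, langid.classify(line)))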

training data

Thanks for making langid available! It's awesome! We (researchers at Carnegie Mellon University) would like to augment the training data with more languages. Shall we send you the data so that you can retrain the models when your time permits? Alternatively, feel free to send us the data and we would retrain the models ourselves.

many thanks!
waleed ammar

-d option causes syntax error with -b

The -d option produces a NameError if you attempt to use it with -b; it refers to a variable nb_classes which was probably in scope before the code was refactored. (It does exist in some other methods, but I am unable to figure out how to stitch it back together.)

 vnix$ ./langid/langid.py -d -b sample.txt
Traceback (most recent call last):
  File "./langid/langid.py", line 587, in <module>
    main()
  File "./langid/langid.py", line 558, in main
    writer.writerow(['path']+nb_classes)
NameError: global name 'nb_classes' is not defined

error in running IGweight.py

Hello

I get the following error when I run IGweight.py:

computing information gain
Traceback (most recent call last):
  File "/home/motaz/tmp/langid.py/langid/train/IGweight.py", line 246, in <module>
    ig = compute_IG(bucketlist, features, dist, args.binarize, suffix, args.jobs)
  File "/home/motaz/tmp/langid.py/langid/train/IGweight.py", line 164, in compute_IG
    for i, (t, w) in enumerate(pass_IG_out):
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 668, in next
    raise value
IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

What do you think is the cause of this error? How can I fix it?

Thanks

Runtimewarning freeze uwsgi worker

Thank you for this nice lib.
We are using langid in a web service and are facing a really weird problem: from time to time, the uwsgi worker is killed because of this portion of code:

with np.errstate(over='ignore'):
    pd = (1/np.exp(pd[None,:] - pd[:,None]).sum(1))
return pd

After some googling, we found this message:
http://stackoverflow.com/questions/19213768/numpy-runtime-warning-causing-apache-workers-to-freeze-in-sending-reply-state
It seems there is an issue with how numpy emits warnings.

As we are on nginx, the proposed fix does not work for us.
Do you have any idea how we could work around this?
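
Not a confirmed fix, but if the hang is triggered by numpy writing RuntimeWarnings to a stderr pipe the worker never drains, one possible workaround (a sketch; np.seterr and warnings.filterwarnings are standard numpy/stdlib calls, whether this actually addresses the uwsgi case is an assumption) is to silence the floating-point warnings before the worker starts serving:

import warnings
import numpy as np
import langid

np.seterr(all='ignore')                                     # stop numpy raising FP RuntimeWarnings
warnings.filterwarnings('ignore', category=RuntimeWarning)  # and drop any that still get raised

def detect(text):
    # plain call; nothing should be written to stderr during classification now
    return langid.classify(text)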

Detection accuracy

**功夫是一门博大精深的武学艺术 , **功夫app , 介绍**功夫的分类、特点、器材、门派等与**功夫有关的内容!让广大读者能够更完整的了解**功夫的精华!

If I run the above snippet through your detection tool, I get "en" as the answer. This seems to be due to a three-letter word in the snippet ("app"). Is it possible to fix this issue?
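
As a caller-side workaround (a sketch of my own, not a change to langid; the shortened sample string is illustrative), one can strip the short Latin-script tokens before classifying predominantly CJK text:

# -*- coding: utf-8 -*-
import re
import langid

text = u'功夫是一门博大精深的武学艺术，功夫app，介绍功夫的分类、特点、器材、门派等'  # shortened stand-in
cleaned = re.sub(u'[A-Za-z]+', u'', text)      # drop Latin-script tokens such as "app"
print(langid.classify(text))
print(langid.classify(cleaned))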

Detection error when encountering full-width characters

langid mistakes full-width English text such as 'ｈｅｌｌｏ ｗｏｒｌｄ' for a CJK language.
>>> import langid
>>> langid.classify('ｈｅｌｌｏ ｗｏｒｌｄ')
('zh', 0.9339664571825803)
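
A possible caller-side workaround (a sketch; unicodedata.normalize is standard library, and this is not a fix inside langid itself) is to fold full-width Latin characters to their ASCII equivalents before classifying:

# -*- coding: utf-8 -*-
import unicodedata
import langid

fullwidth = u'ｈｅｌｌｏ ｗｏｒｌｄ'
ascii_text = unicodedata.normalize('NFKC', fullwidth)   # -> u'hello world'
print(langid.classify(ascii_text))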

Training a new language on Windows doesn't work

I am trying to train it on some language files I downloaded from the internet. But unfortunately no matter what I try, it always crashes.

D:\Django\langid\Scripts>python.exe LDfeatureselect.py -c d:\corpus\wikipedia\langid\corpus -o features -j 1
output path: features
temp path: c:\users\nick\appdata\local\temp
corpus path: d:\corpus\wikipedia\langid\corpus
will tokenize 2 files
langs: ['am', 'af']
domains: ['domain1']
chunk size: 1 (3 chunks)
Traceback (most recent call last):
  File "LDfeatureselect.py", line 533, in <module>
    chunk_paths, features, chunk_offsets = build_inverted_index(paths, options)
  File "LDfeatureselect.py", line 423, in build_inverted_index
    for i, keycount in enumerate(pass1_out):
  File "C:\Python27\Lib\multiprocessing\pool.py", line 626, in next
    raise value
OSError: [Errno 9] Bad file descriptor

I am using Python 2.7.3 on Windows 7 64bit and the latest version of langid.

langid.set_languages(['en','it'])

langid.set_languages is used a few times in the readme file but when running it I receive this error:
Traceback (most recent call last):
  File "<pyshell#26>", line 1, in <module>
    langid.set_languages(['en','it'])
AttributeError: 'module' object has no attribute 'set_languages'

By looking at the code I think langid.identifier.set_languages(['en','it']) would do the job.
I thought maybe this needs to be corrected.
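
For anyone hitting the same AttributeError, a sketch of the workaround via the LanguageIdentifier class (assuming langid.langid exposes LanguageIdentifier and model, as in recent releases):

from langid.langid import LanguageIdentifier, model

identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
identifier.set_languages(['en', 'it'])          # restrict the candidate language set
print(identifier.classify("questo è un piccolo test"))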

Weird classification for Latvian & ex-Yougoslavian languages

Hi,

I noticed some weird classification problems for Latvian: the text (a subtitle) is typically classified as Luxembourgish (!). As the codes for the two languages (lv and lb) follow each other, I was wondering whether this might be due to a bug (in the training data or the classifier)?

Bosnian, Croatian and Serbian are also difficult to classify (these languages being very close, and until recently often classified as one single language). However, it seems that the classification of Croatian "overweights" the two other languages -- if I try to classify Bosnian and Serbian subtitles, I typically get Croatian as the only possible language.

Strange classifications

I get these results for English text:

>>> feeling
('en', 0.16946150595865342)
>>> good
('en', 0.16946150595865342)
>>> feeling good
('de', 0.2691886134361688)
>>> 

Am I right to assume that these results could be more accurate for English if I improved the training data for the English language?

Getting an error on performing import

This is the error I'm getting. I looked at the code and everything seems fine, so I'm stumped.

>>> import langid
Traceback (most recent call last):
  File "", line 1, in <module>
    import langid
  File "C:\Python33\lib\langid\__init__.py", line 1, in <module>
    from langid import classify, rank
ImportError: cannot import name classify
>>>
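
A guess at the cause, from the traceback alone: under Python 3 the line from langid import classify, rank inside langid/__init__.py re-imports the half-initialised package itself rather than the langid/langid.py submodule, because Python 2's implicit relative imports are gone. If that is right, a one-line sketch of the fix in __init__.py would be:

# langid/__init__.py -- make the relative import explicit so Python 3 finds the submodule
from .langid import classify, rank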

Make redirection in a stream fashion

Hello,

I am trying out this fancy language detection script, and I am wondering: do you think it makes more sense to change
for line in sys.stdin.readlines():
into
for line in sys.stdin:

in langid.py?
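
For context, a sketch of the difference (this loop is a simplified stand-in for langid.py's own stdin handling, not its actual code): readlines() waits for all of standard input and holds it in memory before the first result is printed, whereas iterating over sys.stdin yields lines as they arrive.

import sys
import langid

# for line in sys.stdin.readlines():   # buffers the entire input before producing any output
for line in sys.stdin:                  # processes input roughly line by line as it streams in
    lang, score = langid.classify(line)
    print(lang, score)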

Repetition of words causes detection error

When I input strings like 'hello world hello world hello world', langid can't identify them as English text.
>>> import langid
>>> langid.classify('hello world hello world hello world')
('af', 0.683057652874482)

DeprecationWarning: using a non-integer number instead of an integer

I'm receiving these warnings while using langid:

Warning (from warnings module):
  File "C:\Python34\lib\site-packages\langid\langid.py", line 167
    nb_ptc = np.array(nb_ptc).reshape(len(nb_ptc)/len(nb_pc), len(nb_pc))
DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future

Warning (from warnings module):
  File "C:\Python34\lib\site-packages\langid\langid.py", line 243
    arr = np.zeros((self.nb_numfeats,), dtype='uint32')
DeprecationWarning: using a non-integer number instead of an integer will result in an error in the future

I'm using python 3.4.1, numpy 1.8.1, and langid 1.1.4dev (python 3 branch)

Python3 branch does not work with python 3(.5) Fix is here

Hi Saffsd,

langid fails on line 167 of langid.py with the very example provided on the front page.
This is because in Python 3, dividing two integers with / produces a float. The fix is to change line 163 from

nb_numfeats = len(nb_ptc) / len(nb_pc)

to

nb_numfeats = int(len(nb_ptc) / len(nb_pc))

and to change line 167 to this:

nb_ptc = np.array(nb_ptc).reshape(nb_numfeats, len(nb_pc))
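
An equivalent variant of the first change (a sketch, not tested against the branch) is to use floor division, which keeps the result an integer without a separate cast:

nb_numfeats = len(nb_ptc) // len(nb_pc)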

This fix should be easy to make.
Thanks.

Seeking advice regarding classification problem only present with Chinese

Hello,

I have some sample texts, which originate in PDFs, with my goal being to classify the language automatically. I've extracted the text content with pdfminer and, whilst langid works excellently with all my samples in a variety of languages, it seems to have problems for me when I run it with Chinese (I have samples in both simplified and traditional), because it always suggests 'en'.

Does anyone have any advice on how I should approach investigating what the problem might be?

Are there any standard example documents that I could try that would confirm there isn't something quirky with my PDF extraction?

I could be wrong, but I don't think it's necessarily a UTF-8 encoding issue as I have managed to get it working with other non-Latin texts (eg Cyrillic).
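
One way to investigate (a sketch of my own; cjk_ratio and the sample string are illustrative, and text would be whatever pdfminer returned) is to check whether CJK codepoints actually survived extraction before blaming the classifier:

# -*- coding: utf-8 -*-
import langid

def cjk_ratio(text):
    if isinstance(text, bytes):                       # decode first if the extractor returned bytes
        text = text.decode('utf-8', 'replace')
    cjk = sum(1 for ch in text if u'\u4e00' <= ch <= u'\u9fff')   # CJK Unified Ideographs block
    return float(cjk) / max(len(text), 1)

sample = u'功夫是一门博大精深的武学艺术'               # stand-in for the pdfminer output
print(cjk_ratio(sample), langid.classify(sample))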

The languages that I've found to work with my samples, so far, are: en, it, de, ru. I will be checking pt, fr, pl and ja ones shortly.

There is a tiny portion of English in the header section, but that does not throw off the language detection for the other samples and I have tried focusing on pages where the body of the text is entirely Chinese and present in significantly larger quantities than in the header.

It also makes no difference if I preselect the languages (unfortunately the false suggestion of English needs to be in the list, as there are likely to be samples in English present)

langid.set_languages(['en','es','pt','fr','ru','pl','de','it','ja', 'zh'])

Even if I take English out, it merely suggests a different wrong language (e.g. German), although the confidence level is fairly low (typically 0.16 to 0.25, whether it guesses English or German).

My setup is Windows 7 with Python 2.7 (needed because of PDFMiner, although I could try Python 3.5 if that were thought likely to solve the issue).

Many thanks,
Neil

-b option does not appear to work

I tried to use the -b option to classify a bunch of files, but it doesn't appear to be doing anything. It starts up a multitude of subprocesses which appear to want to use standard input.

vnix$ chmod +x ./langid/langid.py

vnix$ ./langid/langid.py -b path/to/sample.txt
^C
Process PoolWorker-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
    task = get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 374, in get
    racquire()
KeyboardInterrupt
[the same KeyboardInterrupt traceback repeats, interleaved, for the remaining workers, PoolWorker-2 through PoolWorker-31]

(etc for another hundred or so lines, lost track so stopped copy/pasting). The process still cannot be stopped; logging in in a separate window reveals a Python process with some 30 child Python processes. Killing the top process stops the program, but appears to leave the child processes still running, now orphaned.

Repeating string yields different results

Knowing that most lang id systems perform worse on short strings, I have been experimenting with normalising the length:

import langid

# s is the (short) input string to identify
MIN_LEN = 30
id = langid.rank(s)[0]               # top (language, score) before length normalisation
print(id)
while len(s) < MIN_LEN:
    s += '  ' + s                    # keep doubling the string until it reaches MIN_LEN
    print(langid.rank(s)[0])
len_norm_id = langid.rank(s)[0]      # top (language, score) after length normalisation

I have noticed the following:

If id, i.e. the original prediction, was correct, the probability increases significantly after length normalisation.

If not, the probability only increases by less than ~10%, or the identified language changes (usually to another incorrect language).

It is not a golden rule, but it is reliable enough that we could use it (see the sketch after this list) to:

  • increase probability on short strings
  • return 'und' in the cases where it is very fickle
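
A sketch of that heuristic (MIN_LEN, GAIN and the 'und' fallback are illustrative choices of mine, not part of langid; it assumes probability-normalised scores via LanguageIdentifier):

from langid.langid import LanguageIdentifier, model

identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

MIN_LEN = 30    # illustrative length threshold
GAIN = 0.10     # what counts as a "significant" confidence gain

def classify_short(s):
    lang, p = identifier.rank(s)[0]
    padded = s
    while len(padded) < MIN_LEN:
        padded += '  ' + padded                 # double the string, as above
    lang2, p2 = identifier.rank(padded)[0]
    if lang2 == lang and p2 - p >= GAIN:
        return lang2, p2                        # stable and strengthened: trust it
    return 'und', max(p, p2)                    # fickle: report undetermined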
