
Comments (7)

ralfbrown commented on August 29, 2024

Oops, let's try that again (first-time poster on GitHub):

   ***************
   *** 258,261 ****
   --- 258,304 ----
         return arr

   +   ## optimized version by Ralf Brown
   +   def instance2classprobs(self, text):
   +     """
   +     Compute class probabilities for an instance according to the trained model
   +     """
   +     if isinstance(text, unicode):
   +       text = text.encode('utf8')
   +
   +     # Convert the text to a sequence of ascii values
   +     ords = map(ord, text)
   +
   +     state = 0
   +     if len(ords) < self.nb_numfeats / 10:
   +         # for very short texts, just apply each production every time the
   +         # state changes, rather than counting the number of occurrences of
   +         # each state
   +         pdc = np.zeros(len(self.nb_classes))
   +         for letter in ords:
   +             state = self.tk_nextmove[(state << 8) + letter]
   +             for index in self.tk_output.get(state, []):
   +                 # compute the dot product incrementally, avoiding lots
   +                 # of multiplications by zero with a sparse
   +                 # feature-count vector
   +                 pdc += self.nb_ptc[index]
   +     else:
   +         # Count the number of times we enter each state
   +         statecount = defaultdict(int)
   +         for letter in ords:
   +             state = self.tk_nextmove[(state << 8) + letter]
   +             statecount[state] += 1
   +
   +         # Update all the productions corresponding to the state
   +         arr = np.zeros((self.nb_numfeats,), dtype='uint32')
   +         for state in statecount:
   +             for index in self.tk_output.get(state, []):
   +                 arr[index] += statecount[state]
   +         # compute the partial log-probability of the document given each class
   +         pdc = np.dot(arr,self.nb_ptc)
   +
   +     # compute the partial log-probability of the document in each class
   +     pd = pdc + self.nb_pc
   +     return pd
   +
       def nb_classprobs(self, fv):
         # compute the partial log-probability of the document given each class
   ***************
   *** 269,274 ****
         Classify an instance.
         """
   !     fv = self.instance2fv(text)
   !     probs = self.norm_probs(self.nb_classprobs(fv))
         cl = np.argmax(probs)
         conf = float(probs[cl])
   --- 312,317 ----
         Classify an instance.
         """
   !     probs = self.instance2classprobs(text)
   !     probs = self.norm_probs(probs)
         cl = np.argmax(probs)
         conf = float(probs[cl])
   ***************
   *** 280,285 ****
         Return a list of languages in order of likelihood.
         """
   !     fv = self.instance2fv(text)
   !     probs = self.norm_probs(self.nb_classprobs(fv))
         return [(str(k),float(v)) for (v,k) in sorted(zip(probs, self.nb_classes), reverse=True)]

   --- 323,328 ----
         Return a list of languages in order of likelihood.
         """
   !     probs = self.instance2classprobs(text)
   !     probs = self.norm_probs(probs)
         return [(str(k),float(v)) for (v,k) in sorted(zip(probs, self.nb_classes), reverse=True)]


saffsd commented on August 29, 2024

Hi Ralf! Thanks for your patch. I've applied it to a new branch. Dawid Weiss addresses the same issue (sparse feature vector) in his Java implementation (https://github.com/carrotsearch/langid-java). His solution uses a different data structure. I do appreciate the simplicity of your patch, handling "long" and "short" strings separately.


saffsd commented on August 29, 2024

I did some experimentation to try to determine a suitable threshold for transitioning from "short" to "long" handling. Your initial estimate using the built-in feature set was num_feats / 10, which worked out to 748. I've made a slight change to that, allowing the user to specify the document length in bytes where the changeover should occur. I experimented with samples from a European government dataset, sampling 1000 strings of n bytes, for 100 <= n <= 3000 in increments of 100. The break-even point in my experiments appears to be right around 650 bytes, and I've made this change in the aforementioned branch.


ralfbrown commented on August 29, 2024

I'm glad I could help to improve your code. The speed issue came up when I went to compare the performance of langid.py against my own language identifier (a module within http://la-strings.sourceforge.net) on the very large set of languages I'm working with. Given 50,000+ features and 1290 classes (1188 languages, some with multiple encodings), langid.py was taking 0.7 seconds per string, which meant that an evaluation run would take over a week, compared to less than two minutes for my program. It will still take a full day based on a 70-second startup time with the big models (180-200MB on disk, ~2.5GB RAM after gc); I haven't decided yet whether it will be worth my time to run langid.py in server mode and write a client for my evaluation script to use.

The solution in my patch is in fact the same approach I use in my own 'whatlang' to incrementally compute the cosine similarity between the input text and each language's model. 'whatlang' has been optimized for short strings from the beginning.

I took a quick look at langid-java, but it seems to be a raw library without a main() to let it run standalone, so I haven't tried running it yet.

BTW, I set the switchover point relative to num_feats because it will be higher for larger feature sets. My initial estimate was based on the fact that most states in the FSA seem to fire two or three productions, and an N-byte input can reach no more than N states, so a string of num_feats/10 bytes would yield 30% or fewer non-zero entries in the feature count vector. OTOH, 650 bytes is big enough that few individual lines in real text will exceed it (Europarl isn't your everyday text :-).
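
For concreteness, here is that estimate worked through as a small sketch. The 7480-feature figure is inferred from the 748 = num_feats/10 value mentioned earlier in the thread, and the 3 productions per state is the upper end of the rough observation above, so treat both as ballpark numbers:

    # Back-of-the-envelope check of the num_feats/10 switchover estimate.
    # Assumed inputs: ~7480 features in the built-in model (10 * 748) and
    # roughly 3 productions fired per FSA state, per the discussion above.
    num_feats = 7480
    productions_per_state = 3

    n_bytes = num_feats // 10    # a 748-byte input...
    max_states = n_bytes         # ...enters at most 748 distinct states
    max_nonzero = max_states * productions_per_state

    # fraction of feature-count entries that can be non-zero: at most ~0.3
    print(float(max_nonzero) / num_feats)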


saffsd commented on August 29, 2024

I'm familiar with your work; I read your paper when it came out. Great to see others releasing open-source tools!

I've never pushed this implementation of langid.py to the limits you're taking it to, so I'll be very interested to see the final outcome of your comparison. I hope you are planning to publish it? I'm not sure server mode will help; its main advantage is avoiding unpacking the model multiple times. Are you doing that? If so, it would be faster to access langid.py as a library, creating a single instance of the LanguageIdentifier class and then reusing it. I think this implementation of langid.py is as fast as I'm able to make it while sticking with numpy; I suspect I could do a fair bit better if I implemented the core estimation as a C extension module.
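
As a rough illustration of that library-style usage, here is a minimal sketch based on the constructor langid.py's README documents; the list of test files is hypothetical:

    # Minimal sketch: unpack the model once, then classify many documents.
    # Assumes langid.py's LanguageIdentifier.from_modelstring() constructor
    # and its packaged `model` string; `test_files` is a hypothetical list.
    from langid.langid import LanguageIdentifier, model

    identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

    test_files = ['sample1.txt', 'sample2.txt']   # hypothetical evaluation files
    for path in test_files:
        with open(path, 'rb') as f:
            text = f.read()
        lang, conf = identifier.classify(text)    # (language code, confidence)
        print('%s\t%s\t%.4f' % (path, lang, conf))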

Ah, I see your rationale for num_feats/10; that is quite sensible.

As an aside, I've experimented with la-strings briefly. Is there any way to access the language identifier component directly? What I ended up doing was invoking it as "la-strings -i -I 1 ", and then taking the most common prediction as the prediction for the whole document, which seems like a rather roundabout way of doing it.


ralfbrown commented on August 29, 2024

"I'm not sure server mode will help, the main advantage is to avoid unpacking the model multiple times."

That's in fact what is happening, because my evaluation script calls the selected language identification program once for each of the test files (currently 1203).

"I've experimented with la-strings briefly. Is there any way to access the language identifier component directly?"

The language identification code is all in the langident/ subdirectory and separately compilable, with a front-end program called 'whatlang'. You can tell 'whatlang' to give identification on slightly-overlapping fixed-size blocks, the "entire" file (up to the first 500,000 bytes), or line-by-line. There's also a LanguageIdentifier class for incorporation into other programs.

"I'll be very interested to see the final outcome of your comparison."

I have a paper at the Text, Speech and Dialogue conference next month. The published version compares whatlang against mguesser, libtextcat, and Shuyo's Java LangDetect. I'm planning to add langid.py for the oral presentation. Based on evaluating a 29-file subset of the test set using the full model, it looks like langid.py will be the only one to give whatlang a run for its money on accuracy if I can get 6-grams working -- hence yesterday's memory-reduction patch. With 5-grams and 2000 features per class (--df_tokens 20000), langid.py made 586 classification errors out of 21607 versus 478 for whatlang using 6-grams and 3500 features per class. For a bunch of the languages, byte 5-grams are effectively character bigrams.
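
For reference, those counts work out to the following approximate error rates:

    # Simple arithmetic on the figures quoted above (21607 test items)
    print(586 / 21607.0)   # ~0.027 -> about 2.7% errors for langid.py (5-grams, 2000 feats/class)
    print(478 / 21607.0)   # ~0.022 -> about 2.2% errors for whatlang (6-grams, 3500 feats/class)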


saffsd commented on August 29, 2024

I've added handling of the --line option in --batch mode; this should be the most effective way for you to avoid unpacking the model over and over. You can find the modified version of langid.py in the new branch I opened for this issue. The file itself is here. Please let me know if this is helpful.

Congratulations on your publication! I will look for it when the proceedings come online.

