berkeleylm's Issues

Are the log probabilities comparable across language models?

I am training multiple language models using Kneser-Ney on different corpora, and then trying to classify new sentences by scoring them with each language model and taking the highest score (Naive Bayes).

Does this work using this library's Kneser-Ney smoothing? As in, are the 
distributions properly normalized so that I can compare scores across language 
models?
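In case it makes the setup concrete, here is a minimal sketch of what is being attempted, assuming the per-class models have already been written to ARPA files (the file names are made up) and that the returned model exposes scoreSentence(List<String>), as mentioned in a later report; whether these scores can be compared across models is exactly the open question:

    import java.util.Arrays;
    import java.util.List;

    import edu.berkeley.nlp.lm.ContextEncodedNgramLanguageModel;
    import edu.berkeley.nlp.lm.io.LmReaders;

    public class NaiveBayesOverLms {
        public static void main(String[] args) {
            // One Kneser-Ney model per class, each trained on its own corpus (hypothetical file names).
            ContextEncodedNgramLanguageModel<String> lmA = LmReaders.readContextEncodedLmFromArpa("classA.arpa");
            ContextEncodedNgramLanguageModel<String> lmB = LmReaders.readContextEncodedLmFromArpa("classB.arpa");

            List<String> sentence = Arrays.asList("a new sentence to classify".split(" "));

            // Naive Bayes with uniform priors: pick the model assigning the highest log probability.
            float scoreA = lmA.scoreSentence(sentence);
            float scoreB = lmB.scoreSentence(sentence);
            System.out.println(scoreA >= scoreB ? "class A" : "class B");
        }
    }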

Original issue reported on code.google.com by [email protected] on 18 Jul 2013 at 12:45

MakeKneserNeyArpaFromText throws ArrayIndexOutOfBoundsException

I am running edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText on some German text but keep running into an ArrayIndexOutOfBoundsException. If I try to build a model from very limited data, no such error arises. Is there a limit on the number of distinct characters the input text can contain? The out-of-bounds index is 256, which is suspiciously the number of distinct values a byte can hold.

I have attached the input file (German wikipedia data prepared for a character 
level n-gram model).

Here is the output I am seeing:

Reading text files [de-test.txt] and writing to file en-test.model {
    Reading from files [de-test.txt] {
        On line 0
        Writing ARPA {
            On order 1
            Writing line 0
            On order 2
            Writing line 0
            On order 3
            Writing line 0
            Writing line 0
            On order 4
            Writing line 0
[WARNING] 
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:297)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 256
    at java.lang.Long.valueOf(Long.java:548)
    at edu.berkeley.nlp.lm.map.ExplicitWordHashMap$KeyIterator.next(ExplicitWordHashMap.java:132)
    at edu.berkeley.nlp.lm.map.ExplicitWordHashMap$KeyIterator.next(ExplicitWordHashMap.java:113)
    at edu.berkeley.nlp.lm.collections.Iterators$Transform.next(Iterators.java:107)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.writeToPrintWriter(KneserNeyLmReaderCallback.java:130)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.cleanup(KneserNeyLmReaderCallback.java:111)
    at edu.berkeley.nlp.lm.io.TextReader.countNgrams(TextReader.java:85)
    at edu.berkeley.nlp.lm.io.TextReader.readFromFiles(TextReader.java:51)
    at edu.berkeley.nlp.lm.io.TextReader.parse(TextReader.java:44)
    at edu.berkeley.nlp.lm.io.LmReaders.createKneserNeyLmFromTextFiles(LmReaders.java:280)
    at edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText.main(MakeKneserNeyArpaFromText.java:55)


Original issue reported on code.google.com by [email protected] on 9 Aug 2012 at 4:48

Attachments:

-mx1000m not appropriate for Google n-grams

make-binary-from-google.sh currently uses -mx1000m

java -ea -mx1000m -server -cp ../src 
edu.berkeley.nlp.lm.io.MakeLmBinaryFromGoogle 
../test/edu/berkeley/nlp/lm/io/googledir google.binary

However, I quickly run out of heap space.

I tried -mx4000m, but that also ran out of heap space after about 2.5 hours.

What is an appropriate -mx setting for training on all 5 grams?
What size EC2 instance should I spin up?
How long will it take to train on all 5grams?

Original issue reported on code.google.com by [email protected] on 24 Nov 2011 at 2:59

Trying to build a language model on higher-order n-grams.

What steps will reproduce the problem?
1. An n-gram dataset in Google Web1T format, but with no unigrams or bigrams (because I am only interested in higher-order n-grams).
2. To conform to the required format, place an empty vocab_cs.gz file under subdir "1gms", and create an empty subdir named "2gms" with one empty file in it called "2gm-0001".
3. The file names under the subdirs for higher-order n-grams do not start with <n>gm-0001 (for example, the files under 3gms start with 3gm-0021).

What is the expected output? What do you see instead?
Expected output:
    the expected binary file.
What actually happens:
    after reading and adding the n-grams, the following error is thrown:
    <a really big number> missing suffixes or prefixes were found, doing another pass to add n-grams {
    Exception in thread "main" java.lang.NullPointerException
            at edu.berkeley.nlp.lm.io.LmReaders.buildMapCommon(LmReaders.java:473)
            at edu.berkeley.nlp.lm.io.LmReaders.secondPassGoogle(LmReaders.java:417)
            at edu.berkeley.nlp.lm.io.LmReaders.readLmFromGoogleNgramDir(LmReaders.java:228)
            at edu.berkeley.nlp.lm.io.LmReaders.readLmFromGoogleNgramDir(LmReaders.java:204)
            at edu.berkeley.nlp.lm.io.MakeLmBinaryFromGoogle.main(MakeLmBinaryFromGoogle.java:36)

From the source code, I can see that the null pointer exception is thrown at 
the line which says
    numNgramsForEachWord[ngramOrder].incrementCount(headWord, 1);

What version of the product are you using? On what operating system?
    Tried with 1.1.2 and 1.1.5, both on Ubuntu 12.04

Please provide any additional information below.
    I am unable to share the dataset here, but I did manage to reproduce the error by making changes in the folder "/test/edu/berkeley/nlp/lm/io/googledir". These changes are the ones I describe in steps 1, 2 and 3 above. It seems that the empty vocab_cs.gz is what is causing this.

So the core of my question is this:

    What should I do if I only want to build a language model on 3-, 4- and 5-grams?

Original issue reported on code.google.com by [email protected] on 21 Nov 2014 at 3:57

Creating and reading ARPA files (1) is locale-dependent, (2) seems to have problems with multiple tabs in the text, and (3) seems to have problems with the lack of newlines

I'm working with text files extracted from the "Reuters-21578, Distribution 1.0" dataset and have had trouble creating and then reading an ARPA file from it.
1. The code seems to depend on "." being the decimal separator, so using a German locale results in this error:
Exception in thread "main" java.lang.NumberFormatException: For input string: "-2,624282"
    at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
    at java.lang.Float.parseFloat(Unknown Source)
    at edu.berkeley.nlp.lm.io.ArpaLmReader.parseLine(ArpaLmReader.java:176)
    at edu.berkeley.nlp.lm.io.ArpaLmReader.parseNGrams(ArpaLmReader.java:148)
    at edu.berkeley.nlp.lm.io.ArpaLmReader.parse(ArpaLmReader.java:78)
    at edu.berkeley.nlp.lm.io.ArpaLmReader.parse(ArpaLmReader.java:18)
    at edu.berkeley.nlp.lm.io.LmReaders.firstPassCommon(LmReaders.java:549)
    at edu.berkeley.nlp.lm.io.LmReaders.firstPassArpa(LmReaders.java:526)
    at edu.berkeley.nlp.lm.io.LmReaders.readContextEncodedLmFromArpa(LmReaders.java:136)
    at edu.berkeley.nlp.lm.io.LmReaders.readContextEncodedLmFromArpa(LmReaders.java:131)
    at edu.berkeley.nlp.lm.io.LmReaders.readContextEncodedLmFromArpa(LmReaders.java:112)
    at edu.berkeley.nlp.lm.io.LmReaders.readContextEncodedLmFromArpa(LmReaders.java:108)
    at [...]
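As a self-contained illustration of the suspected mechanism (this is my guess; I have not confirmed that the ARPA writer uses a locale-sensitive format call), the following reproduces the exact failure mode outside the library:

    import java.util.Locale;

    public class LocaleDemo {
        public static void main(String[] args) {
            Locale.setDefault(Locale.GERMANY);
            // Locale-sensitive formatting writes the probability with a comma as the decimal separator...
            String prob = String.format("%f", -2.624282);  // "-2,624282"
            // ...but Float.parseFloat accepts only a dot, so reading the value back fails.
            Float.parseFloat(prob);  // java.lang.NumberFormatException: For input string: "-2,624282"
        }
    }

Forcing Locale.US before writing and reading, as in the main method at the end of this report, works around it for now.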

2. Using text files with multiple tabs results in this exception:
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String 
index out of range: -4
    at java.lang.String.substring(Unknown Source)
    at edu.berkeley.nlp.lm.io.ArpaLmReader.parseNGram(ArpaLmReader.java:200)
    at edu.berkeley.nlp.lm.io.ArpaLmReader.parseLine(ArpaLmReader.java:172)
    at edu.berkeley.nlp.lm.io.ArpaLmReader.parseNGrams(ArpaLmReader.java:148)
    at edu.berkeley.nlp.lm.io.ArpaLmReader.parse(ArpaLmReader.java:78)
    at edu.berkeley.nlp.lm.io.ArpaLmReader.parse(ArpaLmReader.java:18)
    at edu.berkeley.nlp.lm.io.LmReaders.firstPassCommon(LmReaders.java:549)
    at edu.berkeley.nlp.lm.io.LmReaders.firstPassArpa(LmReaders.java:526)
    at edu.berkeley.nlp.lm.io.LmReaders.readContextEncodedLmFromArpa(LmReaders.java:136)
    at edu.berkeley.nlp.lm.io.LmReaders.readContextEncodedLmFromArpa(LmReaders.java:131)
    at edu.berkeley.nlp.lm.io.LmReaders.readContextEncodedLmFromArpa(LmReaders.java:112)
    at edu.berkeley.nlp.lm.io.LmReaders.readContextEncodedLmFromArpa(LmReaders.java:108)
    at [...]

3. Stripping all duplicate whitespace characters and replacing them with a single space resulted in another error:
Exception in thread "main" java.lang.RuntimeException: Hash map is full with 
100 keys. Should never happen.
    at edu.berkeley.nlp.lm.map.ExplicitWordHashMap.put(ExplicitWordHashMap.java:56)
    at edu.berkeley.nlp.lm.map.HashNgramMap.putHelpWithSuffixIndex(HashNgramMap.java:283)
    at edu.berkeley.nlp.lm.map.HashNgramMap.putWithOffsetAndSuffix(HashNgramMap.java:247)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.addNgram(KneserNeyLmReaderCallback.java:171)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.call(KneserNeyLmReaderCallback.java:148)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.call(KneserNeyLmReaderCallback.java:37)
    at edu.berkeley.nlp.lm.io.TextReader.countNgrams(TextReader.java:80)
    at edu.berkeley.nlp.lm.io.TextReader.readFromFiles(TextReader.java:53)
    at edu.berkeley.nlp.lm.io.TextReader.parse(TextReader.java:47)
    at edu.berkeley.nlp.lm.io.LmReaders.createKneserNeyLmFromTextFiles(LmReaders.java:301)
    at [...]
I could work around this issue by adding a newline character at the start of each text file.

I'm creating and reading the model with the following code:

    static void createModel(File dir, File arpa) {
        // Collect the absolute paths of all text files in the directory.
        List<String> files = new LinkedList<>();
        for (File file : dir.listFiles())
            files.add(file.getAbsolutePath());
        // Index words as strings, using the standard ARPA sentinel symbols.
        final StringWordIndexer wordIndexer = new StringWordIndexer();
        wordIndexer.setStartSymbol(ArpaLmReader.START_SYMBOL);
        wordIndexer.setEndSymbol(ArpaLmReader.END_SYMBOL);
        wordIndexer.setUnkSymbol(ArpaLmReader.UNK_SYMBOL);
        // Estimate a trigram Kneser-Ney model and write it to the ARPA file.
        LmReaders.createKneserNeyLmFromTextFiles(files, wordIndexer, 3, arpa, new ConfigOptions());
    }

    public static void main(String[] args) throws IOException {
        Locale.setDefault(Locale.US);  // workaround for the locale problem described above
        File arpa = new File([...]);
        File directory = new File([...]);
        createModel(directory, arpa);
        ContextEncodedNgramLanguageModel<String> lm = LmReaders.readContextEncodedLmFromArpa(arpa.getAbsolutePath());
    }


Original issue reported on code.google.com by [email protected] on 16 Oct 2012 at 8:52

Attachments:

Cannot train unigram model with Kneser-Ney

When I try to train a unigram Kneser-Ney model, I get the exception below. This 
is the offending line:

dotdotTypeCounts = new LongArray[maxNgramOrder - 2];

here is the exception:

Exception in thread "main" java.lang.NegativeArraySizeException
    at edu.berkeley.nlp.lm.values.KneserNeyCountValueContainer.<init>(KneserNeyCountValueContainer.java:85)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.<init>(KneserNeyLmReaderCallback.java:123)
    at edu.berkeley.nlp.lm.io.LmReaders.createKneserNeyLmFromTextFiles(LmReaders.java:301)
    at edu.berkeley.nlp.lm.io.LmReaders.readKneserNeyLmFromTextFile(LmReaders.java:283)
    at edu.berkeley.nlp.lm.io.LmReaders.readKneserNeyLmFromTextFile(LmReaders.java:272)
    at dragon.lm.NGramLanguageModel.<init>(NGramLanguageModel.java:85)
    at dragon.ml.NaiveBayesClassifier.initalizeLanguageModels(NaiveBayesClassifier.java:154)
    at dragon.ml.NaiveBayesClassifier.main(NaiveBayesClassifier.java:189)
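The arithmetic behind this seems simple enough: for a unigram model, maxNgramOrder is 1, so the allocation in the offending line asks for an array of length -1. A trivial standalone reproduction (my own sketch, outside the library):

    public class NegativeSizeDemo {
        public static void main(String[] args) {
            int maxNgramOrder = 1;  // a unigram model
            // Mirrors "dotdotTypeCounts = new LongArray[maxNgramOrder - 2]" with a plain array type:
            Object[] dotdotTypeCounts = new Object[maxNgramOrder - 2];  // throws NegativeArraySizeException
            System.out.println(dotdotTypeCounts.length);
        }
    }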

Original issue reported on code.google.com by [email protected] on 18 Jul 2013 at 11:52

ArrayIndexOutOfBoundsException when reading in a large ARPA file.

I've written my own ARPA file generator, and when I create a small test file 
with it, reading it in by doing:

    NGramLanguageModel arpaLm = new NGramLanguageModel(arpaLmFilePath);

everything works fine. For ARPA files generated from a larger data set (see attached), I get an ArrayIndexOutOfBoundsException:

    Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
        at edu.berkeley.nlp.lm.io.ArpaLmReader.parseNGram(ArpaLmReader.java:201)
        at edu.berkeley.nlp.lm.io.ArpaLmReader.parseLine(ArpaLmReader.java:172)
        at edu.berkeley.nlp.lm.io.ArpaLmReader.parseNGrams(ArpaLmReader.java:148)
        at edu.berkeley.nlp.lm.io.ArpaLmReader.parse(ArpaLmReader.java:78)
        at edu.berkeley.nlp.lm.io.ArpaLmReader.parse(ArpaLmReader.java:18)
        at edu.berkeley.nlp.lm.io.LmReaders.firstPassCommon(LmReaders.java:549)
        at edu.berkeley.nlp.lm.io.LmReaders.firstPassArpa(LmReaders.java:526)
        at edu.berkeley.nlp.lm.io.LmReaders.readArrayEncodedLmFromArpa(LmReaders.java:171)
        at edu.berkeley.nlp.lm.io.LmReaders.readArrayEncodedLmFromArpa(LmReaders.java:151)
        at dragon.lm.NGramLanguageModel.<init>(NGramLanguageModel.java:68)
        at dragon.lm.NGramLanguageModel.main(NGramLanguageModel.java:191)

Any guidance you could give me would be appreciated! The file is encoded as 
UTF-8.

Thanks.

Here's the version of Java I'm using:

    $ java -version
     java version "1.7.0_09"
     Java(TM) SE Runtime Environment (build 1.7.0_09-b05)
     Java HotSpot(TM) 64-Bit Server VM (build 23.5-b02, mixed mode)

Original issue reported on code.google.com by [email protected] on 17 Jul 2013 at 10:48

Runtime exception: Hash map is full with 100 keys. Should never happen.

What steps will reproduce the problem?
1. Building an LM over some input files consistently generates this exception.

What is the expected output? What do you see instead?

The expected output is a learned LM written to a file. Instead, I get the 
exception:
Runtime exception: Hash map is full with 100 keys. Should never happen.


What version of the product are you using? On what operating system?
berkeleylm 1.1.3 on Windows 7

Please provide any additional information below.

java.lang.RuntimeException: Hash map is full with 100 keys. Should never happen.
    at edu.berkeley.nlp.lm.map.ExplicitWordHashMap.put(ExplicitWordHashMap.java:56)
    at edu.berkeley.nlp.lm.map.HashNgramMap.putHelpWithSuffixIndex(HashNgramMap.java:283)
    at edu.berkeley.nlp.lm.map.HashNgramMap.putWithOffsetAndSuffix(HashNgramMap.java:247)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.addNgram(KneserNeyLmReaderCallback.java:171)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.call(KneserNeyLmReaderCallback.java:148)
    at edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback.call(KneserNeyLmReaderCallback.java:37)
    at edu.berkeley.nlp.lm.io.TextReader.countNgrams(TextReader.java:80)
    at edu.berkeley.nlp.lm.io.TextReader.readFromFiles(TextReader.java:53)
    at edu.berkeley.nlp.lm.io.TextReader.parse(TextReader.java:47)
    at edu.berkeley.nlp.lm.io.LmReaders.createKneserNeyLmFromTextFiles(LmReaders.java:301)
    at edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText.main(MakeKneserNeyArpaFromText.java:57)
    at yr.haifa.NLP.lm.BerkleyLanguageModel.train(BerkleyLanguageModel.java:51)

Original issue reported on code.google.com by [email protected] on 24 Apr 2013 at 1:14

ArrayIndexOutOfBoundsException while calling getLogProb


I have two ngram language models, A and B. B is a 3-gram LM trained on a 
super-set of the data used to train the 5-gram LM A. When I use B to estimate 
the likelihood of some sequences, the following exception is raised very 
frequently:

java.lang.ArrayIndexOutOfBoundsException: 2
    at edu.berkeley.nlp.lm.map.HashNgramMap.getOffsetHelpFromMap(HashNgramMap.java:405)
    at edu.berkeley.nlp.lm.map.HashNgramMap.getOffsetForContextEncoding(HashNgramMap.java:396)
    at edu.berkeley.nlp.lm.map.HashNgramMap.getValueAndOffset(HashNgramMap.java:294)
    at edu.berkeley.nlp.lm.ArrayEncodedProbBackoffLm.getBackoffSum(ArrayEncodedProbBackoffLm.java:133)
    at edu.berkeley.nlp.lm.ArrayEncodedProbBackoffLm.getLogProb(ArrayEncodedProbBackoffLm.java:97)
    at edu.berkeley.nlp.lm.ArrayEncodedNgramLanguageModel$DefaultImplementations.getLogProb(ArrayEncodedNgramLanguageModel.java:65)
    at edu.berkeley.nlp.lm.ArrayEncodedProbBackoffLm.getLogProb(ArrayEncodedProbBackoffLm.java:163)

The exception is not raised when using A.
Interestingly, when using B the exception is not _always_ raised, even for very similar strings. For example, the string:

"till you drive over the telly ."

does not generate an exception, while

"till you drive over the failure ."

does.

Even though it should not be relevant, both "telly" and "failure" are observed 
unigrams.

I am using berkeleylm 1.1.2 on OSX 10.8.2.
java -version:
 java version "1.6.0_37"
 Java(TM) SE Runtime Environment (build 1.6.0_37-b06-434-11M3909)
 Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01-434, mixed mode)

Both language models are estimated with make-kneserney-arpa-from-raw-text and 
subsequently converted to binary using make-binary-from-arpa. 

The problematic language model is quite large, so uploading it for testing 
could be complicated. I am wondering whether anyone has ever observed a similar 
error and has any clue about the cause of the problem.

Thanks!

Original issue reported on code.google.com by [email protected] on 3 Feb 2013 at 2:40

What is the status of the code?

When would it be possible to use the code?

Could you provide scripts so that one can easily import and use the Google 
n-gram corpus?

Original issue reported on code.google.com by [email protected] on 12 May 2011 at 7:19

Unrealistic perplexity

I'm trying to evaluate a 5-gram model on a Vietnamese corpus, but the perplexity doesn't seem to be right...


What steps will reproduce the problem?
1. Download and extract problem.zip
2. Follow the README file


What is the expected output? What do you see instead?

The results from BerkeleyLM and SRILM should be comparable, but in fact BerkeleyLM returns an unrealistic perplexity of around 1.


What version of the product are you using? On what operating system?

1.1.5 on Ubuntu.

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 12 Feb 2014 at 3:27

Attachments:

broken link on http://tomato.banatao.berkeley.edu:8080/berkeleylm_binaries/

The link for the Google Books corpus in Web1T format currently points to:
http://tomato.banatao.berkeley.edu:8080/google_books_dirs/books_google_ngrams_gre.tar.gz

... but it should be books_google_ngrams_ger.tar.gz.

Original issue reported on code.google.com by alex.rudnick on 11 Mar 2014 at 6:37

Exception in thread "main" java.lang.NoClassDefFoundError: edu/berkeley/nlp/lm/io/MakeKneserNeyArpaFromText

What steps will reproduce the problem?
1. Download berkeleylm-1.0.0 or berkeleylm-1.0b3
2. Run examples\make-kneserney-arpa-from-raw-text.sh without the -server option

What is the expected output? What do you see instead?
Generate the ngram arpa file

What version of the product are you using? On what operating system?
1.0.0 or 1.0b3

Error message:
--------------
Exception in thread "main" java.lang.NoClassDefFoundError: edu/berkeley/nlp/lm/io/MakeKneserNeyArpaFromText
Caused by: java.lang.ClassNotFoundException: edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
Could not find the main class: edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText. Program will exit.

Original issue reported on code.google.com by [email protected] on 19 Feb 2012 at 12:38

Frequency Map

Good Afternoon,

How do I generate a map of n-gram frequencies?

Thank you.
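In case it helps, here is a library-agnostic sketch that counts n-gram frequencies from tokenized sentences in plain Java; it does not use berkeleylm's own counting machinery, and the sentence data is made up for the example:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class NgramFrequencyMap {
        /** Counts all n-grams of the given order in the tokenized sentences. */
        static Map<String, Integer> countNgrams(List<List<String>> sentences, int order) {
            Map<String, Integer> counts = new HashMap<>();
            for (List<String> sentence : sentences) {
                for (int i = 0; i + order <= sentence.size(); i++) {
                    String ngram = String.join(" ", sentence.subList(i, i + order));
                    counts.merge(ngram, 1, Integer::sum);
                }
            }
            return counts;
        }

        public static void main(String[] args) {
            List<List<String>> sentences = Arrays.asList(
                    Arrays.asList("how", "are", "you"),
                    Arrays.asList("how", "are", "they"));
            System.out.println(countNgrams(sentences, 2));  // {how are=2, are you=1, are they=1}
        }
    }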

Original issue reported on code.google.com by [email protected] on 8 Dec 2014 at 4:48

ArrayIndexOutOfBoundsException when running MakeLmBinaryFromGoogle

When running MakeLmBinaryFromGoogle I get the exception below (last lines of 
logger output also pasted).
The same exception is thrown if I call readLmFromGoogleNgramDir(path, compress) 
directly with compress set to true.
I have not yet been able to figure out what is going on.
Do you have any clues?

-Torsten

<trace ---------------------------------------------------------->
                Line 13587000
                Line 13588000
            } [1m14s]
        } [1m14s]
        Reading ngrams of order 2 {
Exception in thread "main"      } [0s]
java.lang.ArrayIndexOutOfBoundsException: 1
    at edu.berkeley.nlp.lm.map.CompressedNgramMap.handleNgramsFinished(CompressedNgramMap.java:135)
    at edu.berkeley.nlp.lm.io.NgramMapAddingCallback.handleNgramOrderFinished(NgramMapAddingCallback.java:40)
    at edu.berkeley.nlp.lm.io.GoogleLmReader.parse(GoogleLmReader.java:99)
    at edu.berkeley.nlp.lm.io.GoogleLmReader.parse(GoogleLmReader.java:25)
    at edu.berkeley.nlp.lm.io.LmReaders.buildMapCommon(LmReaders.java:437)
    at edu.berkeley.nlp.lm.io.LmReaders.secondPassGoogle(LmReaders.java:391)
    at edu.berkeley.nlp.lm.io.LmReaders.readLmFromGoogleNgramDir(LmReaders.java:210)
    at edu.berkeley.nlp.lm.io.LmReaders.readLmFromGoogleNgramDir(LmReaders.java:193)
    at de.tudarmstadt.ukp.dkpro.teaching.frequency.berkeleylm.CreateGoogleBinary.run(CreateGoogleBinary.java:25)
    at de.tudarmstadt.ukp.dkpro.teaching.frequency.berkeleylm.CreateGoogleBinary.main(CreateGoogleBinary.java:18)

</trace ---------------------------------------------------------->

Original issue reported on code.google.com by [email protected] on 29 Jun 2011 at 8:20

Getting NaN on last trigram when using the Google binary

Hi
Following up on my previous posts in issue 19, I am trying to use the Google binary (built from Google Books) to get log probabilities of trigrams for some text, and I am getting NaN for the last trigram. Attached is the code of what I am trying to do; I have slightly modified these files and added some System.out.println calls to see the outputs.

The text I am testing with is "Hello how are you", which gives a sent array of [7380255 15474 152 26 45 7380256], where 7380255 is the start symbol and 7380256 is the stop symbol.

I am first getting the log probability of the bigram 7380255 15474 by passing startpos 0 and endpos 2. Thereafter I am getting the log probabilities of trigrams starting at startpos 0, as in the code below:

for (int i = 0; i <= sent.length - 3; i++) {
    System.out.println("Getting score from " + sent[i] + " to " + sent[i + 2]);
    score = lm_.getLogProb(sent, i, i + 3);
    System.out.println("score " + score);
    if (Float.isNaN(score))
        System.out.println("Returned NaN");
    else
        sentScore += score;
}

The problem happens within StupidBackoffLm at the following line:

    probContext = localMap.getValueAndOffset(probContext, probContextOrder, ngram[i], scratch);

but only for the last trigram, when startpos is 3 and endpos is 6. scratch.value returns -1 when ngram[i] is the end symbol (7380256), which results in a NaN log probability.

I tried the same with scoreSentence; it gives the same problem.


Can you please help me understand what mistake I am making?

Thanks
Regards
Debanjan

Original issue reported on code.google.com by [email protected] on 24 Mar 2014 at 11:36

Attachments:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException

Hello, I want to build a Chinese language model from an ARPA file. However, it fails as follows:

    Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 8641
    at edu.berkeley.nlp.lm.map.ImplicitWordHashMap.setWordRanges(ImplicitWordHashMap.java:84)
    at edu.berkeley.nlp.lm.map.ImplicitWordHashMap.<init>(ImplicitWordHashMap.java:52)
    at edu.berkeley.nlp.lm.map.HashNgramMap.<init>(HashNgramMap.java:66)
    at edu.berkeley.nlp.lm.map.HashNgramMap.createImplicitWordHashNgramMap(HashNgramMap.java:49)
    at edu.berkeley.nlp.lm.io.LmReaders.createNgramMap(LmReaders.java:473)
    at edu.berkeley.nlp.lm.io.LmReaders.buildMapCommon(LmReaders.java:439)
    at edu.berkeley.nlp.lm.io.LmReaders.buildMapArpa(LmReaders.java:419)
    at edu.berkeley.nlp.lm.io.LmReaders.secondPassArrayEncoded(LmReaders.java:383)
    at edu.berkeley.nlp.lm.io.LmReaders.readArrayEncodedLmFromArpa(LmReaders.java:160)

But when I use a smaller file, it works fine. Is there an argument or size that I need to adjust?

Original issue reported on code.google.com by [email protected] on 4 Oct 2011 at 3:48

StartPos and EndPos for n-gram log probability

Hi

I am a bit confused about how to find the log probabilities of n-grams. In PerplexityTest.java the code looks like this:

for (i = 1; i <= sent.length - lm_.getLmOrder(); ++i) {
    final float score = lm_.getLogProb(sent, i, i + lm_.getLmOrder());
    sentScore += score;
}
What I am not getting is why it starts from 1, why the end index is i + lm_.getLmOrder(), and why sent is only the number of words in the line + 2.

I was expecting sent to be the number of words in the line + 3. So for the sentence "Hello how are you", sent should be START START Hello how are you STOP, and the first trigram should be START START Hello. To find the log probability of the first trigram I would then use startpos 0 and endpos 2; the last trigram would be "are you STOP", with startpos 4 and endpos 6.

Obviously I am making some assumptions here. I tried to dig through the code to prove myself wrong, but unfortunately could not get much insight in this context.
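Tracing the loop above by hand for my example, under the assumption that sent is padded with a single start symbol and a single end symbol (so its length is the number of words + 2), the windows it scores would be:

    // sent = [<s>, Hello, how, are, you, </s>]   (length 6, i.e. number of words + 2)
    // lm_.getLmOrder() = 3, so i runs from 1 to sent.length - 3 = 3, and each call scores:
    //   i = 1 -> getLogProb(sent, 1, 4) : (Hello, how, are)
    //   i = 2 -> getLogProb(sent, 2, 5) : (how, are, you)
    //   i = 3 -> getLogProb(sent, 3, 6) : (are, you, </s>)
    // No window ever starts at index 0, so (<s>, Hello, how) is not scored by this particular loop.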

I will be grateful for any help on this.

Regards
Deb

Original issue reported on code.google.com by [email protected] on 20 Mar 2014 at 9:03

How to train on Google n-grams

I see the example file for training on the Google n-grams.
However, I don't know how the Google n-gram directory should be laid out.

What directory structure should I have?
This is how I currently have things laid out:
.
./web_5gram_2
./web_5gram_2/data
./web_5gram_2/data/3gms
./web_5gram_2/data/4gms
./web_5gram_2/docs
./web_5gram_v1_1.btw
./web_5gram_v1_1.btw/data
./web_5gram_v1_1.btw/data/1gms
./web_5gram_v1_1.btw/data/2gms
./web_5gram_v1_1.btw/data/3gms
./web_5gram_v1_1.btw/docs
./web_5gram_4
./web_5gram_4/data
./web_5gram_4/data/4gms
./web_5gram_4/data/5gms
./web_5gram_4/docs
./web_5gram_5
./web_5gram_5/data
./web_5gram_5/data/5gms
./web_5gram_5/docs
./web_5gram_6
./web_5gram_6/data
./web_5gram_6/data/5gms
./web_5gram_6/docs
./web_5gram_3
./web_5gram_3/data
./web_5gram_3/data/4gms
./web_5gram_3/docs


From looking at src/edu/berkeley/nlp/lm/io/GoogleLmReader.java
it seemed that I should make one directory, alldata/, and put every data file 
in there. However, this didn't work either.

What is the correct way to lay out the ngram directory?
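For reference, here is the layout I would guess is expected, pieced together from GoogleLmReader.java, the test data under test/edu/berkeley/nlp/lm/io/googledir, and the file-naming details in the other reports above (this is an assumption on my part, not something I have verified): a single root directory whose subdirectories are named <n>gms, with the vocabulary in 1gms/vocab_cs.gz and the n-gram files named <n>gm-0001 and so on:

    googledir/
        1gms/
            vocab_cs.gz
        2gms/
            2gm-0001
        3gms/
            3gm-0001
        4gms/
            4gm-0001
        5gms/
            5gm-0001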

Original issue reported on code.google.com by [email protected] on 19 Nov 2011 at 11:55
