The description on the statmt website and the text above the example in the README say that output.tar contains probabilities. However, the values are often greater than 1 and correlate negatively with the expected probabilities. One might speculate that they are absolute values of log probabilities, but the term "cost", used both in the file and in the README, is not encouraging.
Can you clarify what these values mean?
I also checked the Google papers that come up when searching for the "ProdLM" language model, but they do not answer this.
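To make my speculation concrete: if the costs really are absolute values of log probabilities, converting back would look like the sketch below. The log base is an assumption on my part (SRILM-style tooling usually reports log10, but natural log is also plausible):

```python
def cost_to_prob(cost, base=10.0):
    """Interpret a 'cost' as the absolute value of a log probability
    and convert it back to a probability. The base is an assumption:
    log10 is common in SRILM-style tools, natural log elsewhere."""
    return base ** (-cost)

# Under the log10 assumption, a cost of 2.0 would mean P = 0.01.
print(cost_to_prob(2.0))
```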
I noticed that there are ~50 files under the heldout-monolingual.tokenized.shuffled folder. Which of them are meant to be the test data? Is heldout-monolingual.tokenized.shuffled/news.en.heldout-00000-of-00050 intended for testing, while the rest of heldout-monolingual.tokenized.shuffled/news.en.heldout-00* can be used as the validation set?
Hello, thank you for this awesome work.
Could you share the sources of the dataset, e.g. the websites it was collected from? I want to learn how to collect text data for building large language models. Thanks in advance.
While using the preprocessed data from http://www.statmt.org/lm-benchmark/, I noticed that some of the training data is duplicated in the heldout (a.k.a. test) set. This is in addition to train/news.en-00000-of-00100, which appears to be a complete copy of all the heldout data.
Using a simple Python script to put the sentences into a dict, I see 303,465 unique heldout sentences, of which 3,223 are duplicates of sentences in the training directory. Attached is a file, bw_duplicates.txt, with the duplicates; you can easily verify this by grepping for them in the training directory.
Is this a known issue? My concern is that many people use this data for benchmarking language models, and the test data has about 1% of the training data mixed into it. That probably won't change the results much, but it isn't desirable either.
I thought I'd point out that the page linking to this GitHub repo (http://www.statmt.org/lm-benchmark/) contains a dead link (to the bash and Perl scripts) pointing at this repo's old code.google.com home instead of GitHub. The scripts are all here, but the link should be updated if possible.
Hi,
I have a question about the "prepare data" script.
I downloaded the news.20XX.en.shuffled data from 2007 to 2011, and it does not yield the 2.9B words
stated in the paper and the README page; it is far less. Does this mean I need to download additional monolingual data from WMT11? If so, that step is not included in the script.
The reason I am asking is that I am trying to do the same thing for 2008-2015, and I come up with 2.8B words, and 2.6B words after dedup/sorting.
Also, for the interpolated KN 5-gram, was it just SRILM that was used? Plain ngram-count, or via make-big-lm?