idio / wiki2vec

Generating Vectors for DBpedia Entities via Word2Vec and Wikipedia Dumps. Questions? https://gitter.im/idio-opensource/Lobby

Scala 35.61% Shell 9.69% Python 15.63% Java 39.07%

wiki2vec's People

Contributors

ansiiso, dav009, fedorn, keynmol, munichong, pengowray, phdowling, tgalery


wiki2vec's Issues

Unable to process the wiki

Hi,

I am trying out your solution, but it keeps failing when I try to execute

sudo java -Xmx10G -Xms10G -cp /datadrive/wiki2vec/target/scala-2.10/wiki2vec-assembly-1.0.jar org.idio.wikipedia.dumps.CreateReadableWiki //datadrive/data/wiki-latest-pages-articles-multistream.xml.bz2

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:54)
at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)

# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 7158628352 bytes for committing reserved memory.
# Possible reasons:
# The system is out of physical RAM or swap space
# In 32 bit mode, the process size limit was hit
# Possible solutions:
# Reduce memory load on the system
# Increase physical memory or swap space
# Check if swap backing store is full

My machine has 14 GB of RAM, runs Ubuntu 14.04 LTS, and the Java version is 1.7.0_76.
I tried playing with the -Xmx and -Xms arguments and running in 64-bit mode with -d64, but all to no avail.

Help with determining window size and min count

I'm trying to train a model that would include certain topics.
Relying on the default parameters somehow keeps those topics out of the model.
I was thinking of changing the window size to 5 and the min count to 5 to get more granular results. However, I don't actually know what the effect of changing these parameters would be. Could someone please shed some light on their impact?
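
For reference, here is a minimal gensim sketch (not taken from this repository's scripts; the corpus path and parameter values are hypothetical) showing where window and min_count enter the training call. In general, a lower min_count keeps rarer words and entities in the vocabulary, while window controls how far from the target word context words are sampled.

    # Hypothetical training example with gensim (pre-4.0 API, where the
    # dimensionality argument is called `size`; gensim 4.x renamed it `vector_size`).
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    sentences = LineSentence("working/enwiki.corpus")  # one tokenized article per line (assumed path)

    model = Word2Vec(
        sentences,
        size=500,      # dimensionality of the vectors
        window=5,      # max distance between a target word and its context words
        min_count=5,   # words seen fewer than 5 times are dropped from the vocabulary
        workers=4,
    )
    model.save("en.model")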

Provide ChronicleMap output implementation

Depends on #25

If an implementation can be found, it's worth being able to produce a ChronicleMap version of the word2vec model. It has a much lower memory footprint, and the access speed is sufficient for most experiments.

It also plays nicely with Spark, so more complicated stuff with vectors can be expressed more efficiently.

Memory problem in building wiki2vec model via gensim

Hi,
Did you have memory problems loading the trained wiki2vec model in gensim?
I trained with size=500, window=10, min_count=10 on the latest English Wikipedia dump, which produced a 13 GB wiki2vec model. When I load it in gensim I get a MemoryError.
Do you have any idea how much memory I need?
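
Not an answer from the maintainers, but one common mitigation is to load the saved model memory-mapped, so the large vector arrays are paged in from disk rather than copied into RAM. A minimal sketch, assuming the model was saved with gensim's native save() (the path and query key are hypothetical):

    from gensim.models import Word2Vec

    # mmap only helps if the model was saved with its large numpy arrays in
    # separate .npy files, which gensim's save() does for big models.
    model = Word2Vec.load("en_500/en.model", mmap="r")
    print(model.most_similar("DBPEDIA_ID/Earthquake", topn=5))  # hypothetical key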

Not able to get vectors for Wikipedia Articles

Hi!

I'm trying to get word vectors for Wikipedia topic names. For instance, I'd like to know the vector embedding generated for Barack Obama. I've tried the following queries:

  • model['dbpedia/Barack_Obama']
  • model['DbpediaID/Barack_Obama']
  • model['Barack_Obama']

but each of them gives me a key error. Could you tell me what the correct way to query the model would be?
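
For reference, later issues in this thread indicate that article vectors are keyed with a DBPEDIA_ID/ prefix (e.g. "DBPEDIA_ID/Vanilla_Ice" is reported to be in the vocabulary). A minimal, hypothetical gensim sketch:

    from gensim.models import Word2Vec

    model = Word2Vec.load("en_1000_no_stem/en.model")

    # Wikipedia articles are keyed with the DBPEDIA_ID/ prefix; plain tokens have no prefix.
    vector = model["DBPEDIA_ID/Barack_Obama"]   # raises KeyError if the entity is not in the vocabulary
    print(model.most_similar("DBPEDIA_ID/Barack_Obama", topn=5))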

Problems generating the corpus file

Hello,

We are using prepare.sh to generate the corpus file, but the corpus file we generate is empty. Could you please give us some suggestions on how to solve the problem?

Thank you very much

Question about <WINDOW_SIZE>

Hello,

In the command you mentioned, wiki2vec.sh pathToCorpus pathToOutputFile <MIN_WORD_COUNT> <VECTOR_SIZE> <WINDOW_SIZE>, what does <WINDOW_SIZE> represent?

In terms of training, is it using CBOW, skip-gram, or log-linear?

Thank you
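
For reference, the prebuilt model names mentioned later in this thread ("10skipgram") suggest skip-gram training with a window of 10. In gensim the choice between CBOW and skip-gram is made with the sg flag; a hypothetical sketch (corpus path and other parameter values are assumptions):

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    sentences = LineSentence("working/enwiki.corpus")  # assumed corpus path

    # sg=1 selects skip-gram, sg=0 (the default) selects CBOW.
    # window is the maximum distance between the target word and a context word.
    model = Word2Vec(sentences, size=300, window=10, min_count=10, sg=1)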

Dockerize and use subcommands

We can simplify usage massively by dockerising: setting up all the necessary dependencies and exposing only a subcommand-based interface.

This way, setup will only consist of installing Docker and pulling the container; prepare.sh won't be necessary anymore.

enwiki-latest-pages-articles-multistream.xml.bz2 not a valid bz2

Hi,

I was running the prepare.sh file for en-US, and it throws the following exception, because of which the generated corpus is empty. Can you please suggest an alternative solution?

Exception in thread "main" java.io.IOException: Stream is not in the BZip2 format
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.init(BZip2CompressorInputStream.java:255)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.<init>(BZip2CompressorInputStream.java:138)
at org.idio.wikipedia.dumps.ReadableWiki.getWikipediaStream(ReadableWiki.scala:19)
at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:31)
at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)

DeepLearning4J unable to load `en.model`

I'm trying to load en.model with deeplearning4j's Word2Vec implementation.

The following code is used:

return WordVectorSerializer.readWord2VecModel(new File("/home/tom/FYP/en_1000_no_stem/en.model"));

but unfortunately this exception is thrown:

java.lang.RuntimeException: Unable to guess input file format. Please use corresponding loader directly
	at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2480)
	at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2266)
	at xyz.tomclarke.fyp.nlp.word2vec.Word2VecProcessor.loadPreTrainedData(Word2VecProcessor.java:36)
	at xyz.tomclarke.fyp.nlp.word2vec.TestWord2Vec.testLoadWiki2Vec(TestWord2Vec.java:172)

Running your Python 'quick start' example works fine, so I'm unsure where the problem lies: either with me for not loading it correctly in some way, or with DL4J (in which case I apologise for raising an issue here).

Has this issue been seen before? Do you know the correct way to load your data with DL4J's implementation? Thank you for any help.
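
Not a confirmed fix, but en.model is a gensim-native pickle rather than the word2vec C format, which is likely why DL4J's generic loader cannot guess the format. One possible workaround is to re-export the vectors from gensim in the standard word2vec binary format first (gensim pre-4.0 API assumed; paths are from the report above):

    from gensim.models import Word2Vec

    model = Word2Vec.load("/home/tom/FYP/en_1000_no_stem/en.model")

    # Export to the standard word2vec binary format; very old gensim versions
    # expose save_word2vec_format on the model itself instead of model.wv.
    model.wv.save_word2vec_format("/home/tom/FYP/en_1000_no_stem/en.model.bin", binary=True)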

Support jsonpedia as input source

A lot of the logic for cleaning Wikipedia markup is already implemented in json-wikipedia, and in general it's much easier to work with because annotations are explicitly specified separately from the text of the article.

We should add an option to use jsonpedia directly, without pre-processing the XML dump.

Issue creating corpus - Invalid byte 2 of 4-byte UTF-8 sequence

I'm trying to manually create a corpus, using the following command:
java -Xmx10G -Xms10G -cp target/scala-2.10/wiki2vec-assembly-1.0.jar org.idio.wikipedia.dumps.CreateReadableWiki working/enwiki-latest-pages-articles-multistream.xml.bz2 /mnt/hd0/Arthur/data/en-wiki-latest.lines

resulting in the following error:

[Fatal Error] :965698439:106: Invalid byte 2 of 4-byte UTF-8 sequence.
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 965698439; columnNumber: 106; Invalid byte 2 of 4-byte UTF-8 sequence.
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at com.stratio.parsers.XMLDumpParser.parse(XMLDumpParser.java:49)
at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:45)
at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.scanName(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
... 5 more

I'm using the latest Wikipedia dump, and the sha1sum matches.

Any idea what could be causing this?

Wikipedia articles coverage

Hi @dav009, very promising work here!

I wrote a simple script to test the coverage of the prebuilt model for English Wikipedia articles. I used the Wikipedia article titles from a preprocessed Wikipedia Miner March 2014 dump.

Out of 4342357 articles, only 226319 had a matching vector (~5%). I have noticed that the model usually covers popular entities but does not cover tail entities. I guess this might be because words below a certain count were ignored and because of errors in preprocessing.

Any ideas on this? I have noticed that your TODOs include resolving redirects and also co-reference resolution inside the articles, but I guess we would expect better coverage even without these.

Thanks.
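
The coverage script itself isn't attached to the issue; a rough sketch of what such a check might look like (the title file, title normalisation, and key prefix are assumptions):

    from gensim.models import Word2Vec

    model = Word2Vec.load("en_1000_no_stem/en.model")

    covered, total = 0, 0
    with open("article_titles.txt", encoding="utf-8") as titles:  # hypothetical list of article titles
        for line in titles:
            title = line.strip().replace(" ", "_")
            total += 1
            if "DBPEDIA_ID/" + title in model:   # same prefix the corpus uses for article tokens
                covered += 1

    print("coverage: %d / %d (%.1f%%)" % (covered, total, 100.0 * covered / total))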

Include a word2vec implementation

Information about deeplearning4j and Spark's word2vec implementations being inadequate/buggy is two years old.

It's worth checking whether any of those issues have been at least partially fixed, and providing a tool that produces a workable word2vec model end to end with no intermediate steps.

null as title of the article

When the Wikipedia dump is processed into the word2vec corpus, the title of each page (the first word of each line) is null, so basically all pages start with "null..". Which part of the code takes care of this, and how can we change it so that the page title appears instead?

Prepare.sh problem

Hi, I am trying to run prepare.sh. I am using a Mac.

This is how I run it: sudo sh prepare.sh en_US data/
Downloading the wiki dump and installing packages (e.g. Hadoop and Spark) all go fine. Compiling wiki2vec also reports a lot of "SUCCESSFUL".
However, I started to receive exceptions when the program tried to create the readable wiki:
Creating Readable Wiki..
Exception in thread "main" java.io.FileNotFoundException: data/enwiki-latest.lines (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:101)
at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:30)
at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
Creating Word2vec Corpus
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Path to Readable Wikipedia: file://data//enwiki-latest.lines
Path to Wikipedia Redirects: fakePathToRedirect/file.nt
Path to Output Corpus : file:///Users/cwang/Downloads/wiki2vec-master/working/enwiki
2015-06-13 00:37:07 INFO SecurityManager:59 - Changing view acls to: root
2015-06-13 00:37:07 INFO SecurityManager:59 - Changing modify acls to: root
2015-06-13 00:37:07 INFO SecurityManager:59 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
2015-06-13 00:37:08 INFO Slf4jLogger:80 - Slf4jLogger started
2015-06-13 00:37:08 INFO Remoting:74 - Starting remoting
2015-06-13 00:37:09 INFO Remoting:74 - Remoting started; listening on addresses :[akka.tcp://[email protected]:53495]
2015-06-13 00:37:09 INFO Utils:59 - Successfully started service 'sparkDriver' on port 53495.
2015-06-13 00:37:09 INFO SparkEnv:59 - Registering MapOutputTracker
2015-06-13 00:37:09 INFO SparkEnv:59 - Registering BlockManagerMaster
2015-06-13 00:37:09 INFO DiskBlockManager:59 - Created local directory at /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/spark-local-20150613003709-d48b
2015-06-13 00:37:10 INFO MemoryStore:59 - MemoryStore started with capacity 265.1 MB
2015-06-13 00:37:12 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-06-13 00:37:13 INFO HttpFileServer:59 - HTTP File server directory is /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/spark-b6dd5609-bb7d-4b8c-974c-272f3c32fd76
2015-06-13 00:37:13 INFO HttpServer:59 - Starting HTTP Server
2015-06-13 00:37:13 INFO Server:272 - jetty-8.y.z-SNAPSHOT
2015-06-13 00:37:13 INFO AbstractConnector:338 - Started [email protected]:53496
2015-06-13 00:37:13 INFO Utils:59 - Successfully started service 'HTTP file server' on port 53496.
2015-06-13 00:37:13 INFO Server:272 - jetty-8.y.z-SNAPSHOT
2015-06-13 00:37:13 INFO AbstractConnector:338 - Started [email protected]:4040
2015-06-13 00:37:13 INFO Utils:59 - Successfully started service 'SparkUI' on port 4040.
2015-06-13 00:37:13 INFO SparkUI:59 - Started SparkUI at http://res1cwang-m1.home:4040
2015-06-13 00:37:15 INFO SparkContext:59 - Added JAR file:/Users/cwang/Downloads/wiki2vec-master/target/scala-2.10/wiki2vec-assembly-1.0.jar at http://192.168.1.8:53496/jars/wiki2vec-assembly-1.0.jar with timestamp 1434170235402
2015-06-13 00:37:16 INFO AkkaUtils:59 - Connecting to HeartbeatReceiver: akka.tcp://[email protected]:53495/user/HeartbeatReceiver
2015-06-13 00:37:16 INFO NettyBlockTransferService:59 - Server created on 53497
2015-06-13 00:37:16 INFO BlockManagerMaster:59 - Trying to register BlockManager
2015-06-13 00:37:16 INFO BlockManagerMasterActor:59 - Registering block manager localhost:53497 with 265.1 MB RAM, BlockManagerId(, localhost, 53497)
2015-06-13 00:37:16 INFO BlockManagerMaster:59 - Registered BlockManager
java.io.FileNotFoundException: fakePathToRedirect/file.nt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:90)
at scala.io.Source$.fromFile(Source.scala:75)
at scala.io.Source$.fromFile(Source.scala:53)
at scala.io.Source$.fromFile(Source.scala:59)
at org.idio.wikipedia.redirects.RedirectStore$.readFile(RedirectStore.scala:58)
at org.idio.wikipedia.redirects.MapRedirectStore.<init>(RedirectStore.scala:34)
at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:172)
at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
using empty redirect store..
2015-06-13 00:37:17 INFO MemoryStore:59 - ensureFreeSpace(159118) called with curMem=0, maxMem=278019440
2015-06-13 00:37:17 INFO MemoryStore:59 - Block broadcast_0 stored as values in memory (estimated size 155.4 KB, free 265.0 MB)
2015-06-13 00:37:17 INFO MemoryStore:59 - ensureFreeSpace(22692) called with curMem=159118, maxMem=278019440
2015-06-13 00:37:17 INFO MemoryStore:59 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 265.0 MB)
2015-06-13 00:37:17 INFO BlockManagerInfo:59 - Added broadcast_0_piece0 in memory on localhost:53497 (size: 22.2 KB, free: 265.1 MB)
2015-06-13 00:37:17 INFO BlockManagerMaster:59 - Updated info of block broadcast_0_piece0
2015-06-13 00:37:17 INFO SparkContext:59 - Created broadcast 0 from textFile at Word2VecCorpus.scala:25
2015-06-13 00:37:17 INFO MemoryStore:59 - ensureFreeSpace(40) called with curMem=181810, maxMem=278019440
2015-06-13 00:37:17 INFO MemoryStore:59 - Block broadcast_1 stored as values in memory (estimated size 40.0 B, free 265.0 MB)
2015-06-13 00:37:17 INFO MemoryStore:59 - ensureFreeSpace(86) called with curMem=181850, maxMem=278019440
2015-06-13 00:37:17 INFO MemoryStore:59 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 86.0 B, free 265.0 MB)
2015-06-13 00:37:17 INFO BlockManagerInfo:59 - Added broadcast_1_piece0 in memory on localhost:53497 (size: 86.0 B, free: 265.1 MB)
2015-06-13 00:37:17 INFO BlockManagerMaster:59 - Updated info of block broadcast_1_piece0
2015-06-13 00:37:17 INFO SparkContext:59 - Created broadcast 1 from broadcast at Word2VecCorpus.scala:30
2015-06-13 00:37:18 INFO deprecation:1009 - mapred.tip.id is deprecated. Instead, use mapreduce.task.id
2015-06-13 00:37:18 INFO deprecation:1009 - mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
2015-06-13 00:37:18 INFO deprecation:1009 - mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
2015-06-13 00:37:18 INFO deprecation:1009 - mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
2015-06-13 00:37:18 INFO deprecation:1009 - mapred.job.id is deprecated. Instead, use mapreduce.job.id
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file://data/enwiki-latest.lines, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:79)
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:506)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
at org.apache.hadoop.fs.Globber.glob(Globber.java:248)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1642)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1321)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1074)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:940)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:849)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1164)
at org.idio.wikipedia.word2vec.Word2VecCorpus.getWord2vecCorpus(Word2VecCorpus.scala:139)
at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:185)
at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Joining corpus..
prepare.sh: line 89: data//enwiki.corpus: No such file or directory
^___^ corpus : data//enwiki.corpus

I guessed the reason was that I passed a relative path, not an absolute path. So I commented out some lines in prepare.sh, because the wiki dump and packages were already downloaded. Then I re-ran it like this:
sudo sh prepare.sh en_US /Users/cwang/Downloads/wiki2vec-master/data
Below is what I got:
Language: en
Working directory: /Users/cwang/Downloads/wiki2vec-master/working
Creating Readable Wiki..
Exception in thread "main" java.io.FileNotFoundException: /Users/cwang/Downloads/wiki2vec-master/data/enwiki-latest.lines (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:101)
at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:30)
at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
Creating Word2vec Corpus
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Path to Readable Wikipedia: file:///Users/cwang/Downloads/wiki2vec-master/data/enwiki-latest.lines
Path to Wikipedia Redirects: fakePathToRedirect/file.nt
Path to Output Corpus : file:///Users/cwang/Downloads/wiki2vec-master/working/enwiki
2015-06-13 10:14:03 INFO SecurityManager:59 - Changing view acls to: root
2015-06-13 10:14:03 INFO SecurityManager:59 - Changing modify acls to: root
2015-06-13 10:14:03 INFO SecurityManager:59 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
2015-06-13 10:14:03 INFO Slf4jLogger:80 - Slf4jLogger started
2015-06-13 10:14:03 INFO Remoting:74 - Starting remoting
2015-06-13 10:14:03 INFO Remoting:74 - Remoting started; listening on addresses :[akka.tcp://[email protected]:58893]
2015-06-13 10:14:03 INFO Utils:59 - Successfully started service 'sparkDriver' on port 58893.
2015-06-13 10:14:03 INFO SparkEnv:59 - Registering MapOutputTracker
2015-06-13 10:14:03 INFO SparkEnv:59 - Registering BlockManagerMaster
2015-06-13 10:14:03 INFO DiskBlockManager:59 - Created local directory at /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/spark-local-20150613101403-e43b
2015-06-13 10:14:03 INFO MemoryStore:59 - MemoryStore started with capacity 265.1 MB
2015-06-13 10:14:04 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-06-13 10:14:04 INFO HttpFileServer:59 - HTTP File server directory is /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/spark-846c5c00-63bc-4070-b1ef-903b6fcd3567
2015-06-13 10:14:04 INFO HttpServer:59 - Starting HTTP Server
2015-06-13 10:14:04 INFO Server:272 - jetty-8.y.z-SNAPSHOT
2015-06-13 10:14:04 INFO AbstractConnector:338 - Started [email protected]:58894
2015-06-13 10:14:04 INFO Utils:59 - Successfully started service 'HTTP file server' on port 58894.
2015-06-13 10:14:04 INFO Server:272 - jetty-8.y.z-SNAPSHOT
2015-06-13 10:14:04 INFO AbstractConnector:338 - Started [email protected]:4040
2015-06-13 10:14:04 INFO Utils:59 - Successfully started service 'SparkUI' on port 4040.
2015-06-13 10:14:04 INFO SparkUI:59 - Started SparkUI at http://res1cwang-m1.home:4040
2015-06-13 10:14:05 INFO SparkContext:59 - Added JAR file:/Users/cwang/Downloads/wiki2vec-master/target/scala-2.10/wiki2vec-assembly-1.0.jar at http://192.168.1.8:58894/jars/wiki2vec-assembly-1.0.jar with timestamp 1434204845068
2015-06-13 10:14:05 INFO AkkaUtils:59 - Connecting to HeartbeatReceiver: akka.tcp://[email protected]:58893/user/HeartbeatReceiver
2015-06-13 10:14:05 INFO NettyBlockTransferService:59 - Server created on 58895
2015-06-13 10:14:05 INFO BlockManagerMaster:59 - Trying to register BlockManager
2015-06-13 10:14:05 INFO BlockManagerMasterActor:59 - Registering block manager localhost:58895 with 265.1 MB RAM, BlockManagerId(, localhost, 58895)
2015-06-13 10:14:05 INFO BlockManagerMaster:59 - Registered BlockManager
java.io.FileNotFoundException: fakePathToRedirect/file.nt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:90)
at scala.io.Source$.fromFile(Source.scala:75)
at scala.io.Source$.fromFile(Source.scala:53)
at scala.io.Source$.fromFile(Source.scala:59)
at org.idio.wikipedia.redirects.RedirectStore$.readFile(RedirectStore.scala:58)
at org.idio.wikipedia.redirects.MapRedirectStore.<init>(RedirectStore.scala:34)
at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:172)
at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
using empty redirect store..
2015-06-13 10:14:05 INFO MemoryStore:59 - ensureFreeSpace(159118) called with curMem=0, maxMem=278019440
2015-06-13 10:14:05 INFO MemoryStore:59 - Block broadcast_0 stored as values in memory (estimated size 155.4 KB, free 265.0 MB)
2015-06-13 10:14:05 INFO MemoryStore:59 - ensureFreeSpace(22692) called with curMem=159118, maxMem=278019440
2015-06-13 10:14:05 INFO MemoryStore:59 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 265.0 MB)
2015-06-13 10:14:05 INFO BlockManagerInfo:59 - Added broadcast_0_piece0 in memory on localhost:58895 (size: 22.2 KB, free: 265.1 MB)
2015-06-13 10:14:05 INFO BlockManagerMaster:59 - Updated info of block broadcast_0_piece0
2015-06-13 10:14:05 INFO SparkContext:59 - Created broadcast 0 from textFile at Word2VecCorpus.scala:25
2015-06-13 10:14:05 INFO MemoryStore:59 - ensureFreeSpace(40) called with curMem=181810, maxMem=278019440
2015-06-13 10:14:05 INFO MemoryStore:59 - Block broadcast_1 stored as values in memory (estimated size 40.0 B, free 265.0 MB)
2015-06-13 10:14:05 INFO MemoryStore:59 - ensureFreeSpace(86) called with curMem=181850, maxMem=278019440
2015-06-13 10:14:05 INFO MemoryStore:59 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 86.0 B, free 265.0 MB)
2015-06-13 10:14:05 INFO BlockManagerInfo:59 - Added broadcast_1_piece0 in memory on localhost:58895 (size: 86.0 B, free: 265.1 MB)
2015-06-13 10:14:05 INFO BlockManagerMaster:59 - Updated info of block broadcast_1_piece0
2015-06-13 10:14:05 INFO SparkContext:59 - Created broadcast 1 from broadcast at Word2VecCorpus.scala:30
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/cwang/Downloads/wiki2vec-master/working/enwiki already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1041)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:940)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:849)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1164)
at org.idio.wikipedia.word2vec.Word2VecCorpus.getWord2vecCorpus(Word2VecCorpus.scala:139)
at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:185)
at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Joining corpus..
prepare.sh: line 89: /Users/cwang/Downloads/wiki2vec-master/data/enwiki.corpus: No such file or directory
^^ corpus : /Users/cwang/Downloads/wiki2vec-master/data/enwiki.corpus
RES1CWANG-M1:wiki2vec-master cwang$ sudo sh prepare.sh en_US /Users/cwang/Downloads/wiki2vec-master/data
prepare.sh: line 23: 2: No such file or directory
Language: en
Working directory: /Users/cwang/Downloads/wiki2vec-master/working
Creating Readable Wiki..
Exception in thread "main" java.io.FileNotFoundException: /Users/cwang/Downloads/wiki2vec-master/data/enwiki-latest.lines (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:101)
at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:30)
at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
Creating Word2vec Corpus
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Path to Readable Wikipedia: file:///Users/cwang/Downloads/wiki2vec-master/data/enwiki-latest.lines
Path to Wikipedia Redirects: fakePathToRedirect/file.nt
Path to Output Corpus : file:///Users/cwang/Downloads/wiki2vec-master/working/enwiki
2015-06-13 10:15:35 INFO SecurityManager:59 - Changing view acls to: root
2015-06-13 10:15:35 INFO SecurityManager:59 - Changing modify acls to: root
2015-06-13 10:15:35 INFO SecurityManager:59 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
2015-06-13 10:15:36 INFO Slf4jLogger:80 - Slf4jLogger started
2015-06-13 10:15:36 INFO Remoting:74 - Starting remoting
2015-06-13 10:15:36 INFO Remoting:74 - Remoting started; listening on addresses :[akka.tcp://[email protected]:58941]
2015-06-13 10:15:36 INFO Utils:59 - Successfully started service 'sparkDriver' on port 58941.
2015-06-13 10:15:36 INFO SparkEnv:59 - Registering MapOutputTracker
2015-06-13 10:15:36 INFO SparkEnv:59 - Registering BlockManagerMaster
2015-06-13 10:15:36 INFO DiskBlockManager:59 - Created local directory at /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/spark-local-20150613101536-86b2
2015-06-13 10:15:36 INFO MemoryStore:59 - MemoryStore started with capacity 265.1 MB
2015-06-13 10:15:36 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-06-13 10:15:36 INFO HttpFileServer:59 - HTTP File server directory is /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/spark-914594ab-f207-47cf-b092-1bb988a1cd0c
2015-06-13 10:15:36 INFO HttpServer:59 - Starting HTTP Server
2015-06-13 10:15:37 INFO Server:272 - jetty-8.y.z-SNAPSHOT
2015-06-13 10:15:37 INFO AbstractConnector:338 - Started [email protected]:58942
2015-06-13 10:15:37 INFO Utils:59 - Successfully started service 'HTTP file server' on port 58942.
2015-06-13 10:15:37 INFO Server:272 - jetty-8.y.z-SNAPSHOT
2015-06-13 10:15:37 INFO AbstractConnector:338 - Started [email protected]:4040
2015-06-13 10:15:37 INFO Utils:59 - Successfully started service 'SparkUI' on port 4040.
2015-06-13 10:15:37 INFO SparkUI:59 - Started SparkUI at http://res1cwang-m1.home:4040
2015-06-13 10:15:37 INFO SparkContext:59 - Added JAR file:/Users/cwang/Downloads/wiki2vec-master/target/scala-2.10/wiki2vec-assembly-1.0.jar at http://192.168.1.8:58942/jars/wiki2vec-assembly-1.0.jar with timestamp 1434204937819
2015-06-13 10:15:37 INFO AkkaUtils:59 - Connecting to HeartbeatReceiver: akka.tcp://[email protected]:58941/user/HeartbeatReceiver
2015-06-13 10:15:37 INFO NettyBlockTransferService:59 - Server created on 58943
2015-06-13 10:15:37 INFO BlockManagerMaster:59 - Trying to register BlockManager
2015-06-13 10:15:37 INFO BlockManagerMasterActor:59 - Registering block manager localhost:58943 with 265.1 MB RAM, BlockManagerId(, localhost, 58943)
2015-06-13 10:15:37 INFO BlockManagerMaster:59 - Registered BlockManager
java.io.FileNotFoundException: fakePathToRedirect/file.nt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:90)
at scala.io.Source$.fromFile(Source.scala:75)
at scala.io.Source$.fromFile(Source.scala:53)
at scala.io.Source$.fromFile(Source.scala:59)
at org.idio.wikipedia.redirects.RedirectStore$.readFile(RedirectStore.scala:58)
at org.idio.wikipedia.redirects.MapRedirectStore.<init>(RedirectStore.scala:34)
at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:172)
at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
using empty redirect store..
2015-06-13 10:15:38 INFO MemoryStore:59 - ensureFreeSpace(159118) called with curMem=0, maxMem=278019440
2015-06-13 10:15:38 INFO MemoryStore:59 - Block broadcast_0 stored as values in memory (estimated size 155.4 KB, free 265.0 MB)
2015-06-13 10:15:38 INFO MemoryStore:59 - ensureFreeSpace(22692) called with curMem=159118, maxMem=278019440
2015-06-13 10:15:38 INFO MemoryStore:59 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 265.0 MB)
2015-06-13 10:15:38 INFO BlockManagerInfo:59 - Added broadcast_0_piece0 in memory on localhost:58943 (size: 22.2 KB, free: 265.1 MB)
2015-06-13 10:15:38 INFO BlockManagerMaster:59 - Updated info of block broadcast_0_piece0
2015-06-13 10:15:38 INFO SparkContext:59 - Created broadcast 0 from textFile at Word2VecCorpus.scala:25
2015-06-13 10:15:38 INFO MemoryStore:59 - ensureFreeSpace(40) called with curMem=181810, maxMem=278019440
2015-06-13 10:15:38 INFO MemoryStore:59 - Block broadcast_1 stored as values in memory (estimated size 40.0 B, free 265.0 MB)
2015-06-13 10:15:38 INFO MemoryStore:59 - ensureFreeSpace(86) called with curMem=181850, maxMem=278019440
2015-06-13 10:15:38 INFO MemoryStore:59 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 86.0 B, free 265.0 MB)
2015-06-13 10:15:38 INFO BlockManagerInfo:59 - Added broadcast_1_piece0 in memory on localhost:58943 (size: 86.0 B, free: 265.1 MB)
2015-06-13 10:15:38 INFO BlockManagerMaster:59 - Updated info of block broadcast_1_piece0
2015-06-13 10:15:38 INFO SparkContext:59 - Created broadcast 1 from broadcast at Word2VecCorpus.scala:30
2015-06-13 10:15:38 INFO deprecation:1009 - mapred.tip.id is deprecated. Instead, use mapreduce.task.id
2015-06-13 10:15:38 INFO deprecation:1009 - mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
2015-06-13 10:15:38 INFO deprecation:1009 - mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
2015-06-13 10:15:38 INFO deprecation:1009 - mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
2015-06-13 10:15:38 INFO deprecation:1009 - mapred.job.id is deprecated. Instead, use mapreduce.job.id
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/cwang/Downloads/wiki2vec-master/data/enwiki-latest.lines
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1321)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1074)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:940)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:849)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1164)
at org.idio.wikipedia.word2vec.Word2VecCorpus.getWord2vecCorpus(Word2VecCorpus.scala:139)
at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:185)
at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Joining corpus..
prepare.sh: line 88: /Users/cwang/Downloads/wiki2vec-master/data/enwiki.corpus: No such file or directory
^___^ corpus : /Users/cwang/Downloads/wiki2vec-master/data/enwiki.corpus

There is no data folder in my wiki2vec-master folder.
Could you please advise how I can fix it?

Thanks in advance!

English pretrained embeddings of size 300

Would it be possible for you to share pretrained embeddings of size 300 in English?

I'm trying to train some models using these embeddings, but given the size of my dataset, 1000 dimensions is very high and my machine cannot support it.

Error in parsing wikipedia

Hi,

After page 2029599, I get this error:
.......
2029599
Exception in thread "main" java.io.IOException: block overrun
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.getAndMoveToFrontDecode(BZip2CompressorInputStream.java:700)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.initBlock(BZip2CompressorInputStream.java:326)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartA(BZip2CompressorInputStream.java:884)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartB(BZip2CompressorInputStream.java:933)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read0(BZip2CompressorInputStream.java:228)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read(BZip2CompressorInputStream.java:179)
at org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown Source)
at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at com.stratio.parsers.XMLDumpParser.parse(XMLDumpParser.java:49)
at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:45)
at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)

Can you figure out what might be causing this?

Pre-trained model accuracy

Hi David,

I have tested the accuracy of your pre-trained model with the "questions-words.txt" dataset from Google. The results are:

2016-05-23 12:20:41,450 : INFO : family: 0.0% (0/272)
2016-05-23 12:20:45,831 : INFO : gram1-adjective-to-adverb: 3.8% (23/600)
2016-05-23 12:20:47,155 : INFO : gram2-opposite: 14.3% (26/182)
2016-05-23 12:20:53,058 : INFO : gram3-comparative: 0.0% (0/812)
2016-05-23 12:20:55,041 : INFO : gram4-superlative: 5.5% (15/272)
2016-05-23 12:21:00,144 : INFO : gram5-present-participle: 0.0% (0/702)
2016-05-23 12:21:08,831 : INFO : gram7-past-tense: 4.1% (49/1190)
2016-05-23 12:21:14,729 : INFO : gram8-plural: 0.0% (0/812)
2016-05-23 12:21:18,418 : INFO : gram9-plural-verbs: 3.6% (18/507)
2016-05-23 12:21:18,419 : INFO : total: 2.4% (131/5349)

The accuracy is quite low (2.4%); is that normal?
Thanks
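
For reference, a sketch of how this kind of evaluation is usually run in gensim; the method name depends on the gensim version (older releases, matching the log format above, expose accuracy on the model, newer ones use evaluate_word_analogies). Paths are hypothetical:

    from gensim.models import Word2Vec

    model = Word2Vec.load("en_1000_no_stem/en.model")

    # Older gensim (produces section-by-section log lines like those above):
    sections = model.accuracy("questions-words.txt")

    # Newer gensim versions instead use:
    # score, sections = model.wv.evaluate_word_analogies("questions-words.txt")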

Issue creating corpus

Getting this error:

[info] Assembly up to date: /home/rg203/work/scripts/wiki2vec/target/scala-2.10/wiki2vec-assembly-1.0.jar
[success] Total time: 2 s, completed Jan 5, 2017 7:29:26 AM
Creating Readable Wiki..
Exception in thread "main" java.io.IOException: Stream is not in the BZip2 format
	at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.init(BZip2CompressorInputStream.java:255)
	at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.<init>(BZip2CompressorInputStream.java:138)
	at org.idio.wikipedia.dumps.ReadableWiki.getWikipediaStream(ReadableWiki.scala:19)
	at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:31)
	at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
	at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
Creating Word2vec Corpus
/home/rg203/work/scripts/wiki2vec/working/spark-1.2.0-bin-hadoop2.4/bin/spark-class: line 113: [: : integer expression expected
/home/rg203/work/scripts/wiki2vec/working/spark-1.2.0-bin-hadoop2.4/bin/spark-class: line 187: /usr/lib/jvm/java-8-oracle/jre/bin/java/bin/java: Not a directory
/home/rg203/work/scripts/wiki2vec/working/spark-1.2.0-bin-hadoop2.4/bin/spark-class: line 187: exec: /usr/lib/jvm/java-8-oracle/jre/bin/java/bin/java: cannot execute: Not a directory
Joining corpus..
cat: 'part*': No such file or directory
 ^___^ corpus : /home/rg203/work/scripts/wiki2vec/spanish_output//eswiki.corpus

Any ideas? Thanks for the help!

Chinese Wikipedia StackOverflowError

Chinese Wikipedia pops this error when creating the word2vec corpus using the org.idio.wikipedia.word2vec.Word2VecCorpus class.

   java.lang.StackOverflowError
   at java.util.regex.Pattern$CharProperty.match(Pattern.java:3705)
   at java.util.regex.Pattern$Curly.match0(Pattern.java:4160)
   at java.util.regex.Pattern$Curly.match0(Pattern.java:4173)
       .........
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4173)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4173)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4144)
    at java.util.regex.Pattern$Slice.match(Pattern.java:3882)
    at java.util.regex.Pattern$Start.match(Pattern.java:3420)
    at java.util.regex.Matcher.search(Matcher.java:1211)
    at java.util.regex.Matcher.find(Matcher.java:604)
    at java.util.regex.Matcher.replaceAll(Matcher.java:914)
    at scala.util.matching.Regex.replaceAllIn(Regex.scala:298)
    at org.idio.wikipedia.word2vec.ArticleCleaner$.cleanStyle(ArticleCleaner.scala:69)
    at org.idio.wikipedia.word2vec.Word2VecCorpus$$anonfun$cleanArticles$1.apply(Word2VecCorpus.scala:65)
    at org.idio.wikipedia.word2vec.Word2VecCorpus$$anonfun$cleanArticles$1.apply(Word2VecCorpus.scala:56)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1060)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1157)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:627)
    at java.lang.Thread.run(Thread.java:809)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2016-01-25 16:25:08 WARN  TaskSetManager:71 - Lost task 57.0 in stage 0.0 (TID 57, localhost): TaskKilled (killed intentionally)

Redirects handling

Hi,
Well, it's a question rather than an issue. I'm not a Scala programmer, but I want to know how you handle the redirects. I mean, what do you mean by 'handling redirects'? Do you replace the redirect with the entity it redirects to? What kind of handling is done?
The second question is: how do you deal with the non-entity pages (navigation, maintenance, and discussion pages)?
Thanks!

Support request: tokenization

It would be great to add org.apache.lucene.analysis for smarter tokenization across all languages. That way, processing other languages such as Chinese would be more sensible with your library.

Whether "dbc:Earthquakes" or "dbr:Vanilla_Ice" available in English Wikipedia (Feb 2015) 1000 dimension - No stemming - 10skipgram

Hello,

We are currently using the dataset "English Wikipedia (Feb 2015) 1000 dimension - No stemming - 10skipgram" and are searching for some test cases from http://dbpedia.org/page/Earthquake. We have confirmed that "DBPEDIA_ID/Vanilla_Ice" is available in the dataset, but when we try "dbc:Earthquakes" or "dbr:Vanilla_Ice" we get KeyError messages: "dbc:Earthquakes" not in vocabulary and "dbr:Vanilla_Ice" not in vocabulary. We are wondering whether the dataset stores data with a "dbc:" or "dbo:" prefix?

Thank you
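
Based on the earlier issues in this thread, the model keys articles with a DBPEDIA_ID/ prefix rather than dbr:/dbc: CURIEs, so a lookup would need to translate the identifier first. A small hypothetical helper:

    def to_model_key(curie):
        """Map a DBpedia CURIE such as 'dbr:Vanilla_Ice' to the DBPEDIA_ID/ key format."""
        _, _, local_name = curie.partition(":")
        return "DBPEDIA_ID/" + local_name

    print(to_model_key("dbr:Vanilla_Ice"))   # -> DBPEDIA_ID/Vanilla_Ice
    # Category identifiers such as dbc:Earthquakes correspond to Wikipedia Category:
    # pages and may simply not be present in the vocabulary at all.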
