idio / wiki2vec
Generating Vectors for DBpedia Entities via Word2Vec and Wikipedia Dumps. Questions? https://gitter.im/idio-opensource/Lobby
Hi,
I am trying out your solution but it keeps on failing when I try to execute
sudo java -Xmx10G -Xms10G -cp /datadrive/wiki2vec/target/scala-2.10/wiki2vec-assembly-1.0.jar org.idio.wikipedia.dumps.CreateReadableWiki //datadrive/data/wiki-latest-pages-articles-multistream.xml.bz2
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
    at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:54)
    at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 7158628352 bytes for committing reserved
memory.
# Possible reasons:
# The system is out of physical RAM or swap space
# In 32 bit mode, the process size limit was hit
# Possible solutions:
# Reduce memory load on the system
# Increase physical memory or swap space
# Check if swap backing store is full
My machine has 14gb of RAM, runs Ubuntu 14.04 LTS, and java version is 1.7.0_76.
I tried playing with the -Xmx and -Xms arguments, running in 64bit mode with -d64 but all to no avail.
I'm trying to train a model that would include certain topics.
Relying on the default parameters somehow keeps those topics out of the model.
I was thinking of changing the window size to 5 and the min count to 5 to get more granular results. However, I don't actually know what the effect of changing these parameters would be. Could someone please shed some light on their impact?
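For what it's worth, a rough intuition, with a toy sketch of what the two parameters control (an illustration only, not this project's actual training code): a larger window tends to capture broader topical similarity, a smaller window more syntactic/functional similarity; and min_count drops every word seen fewer than that many times, which is the usual reason a rare topic vanishes from the model entirely.

```python
# Toy sketch: what window and min_count mean for word2vec-style training.
# Words below min_count are removed before context pairs are formed;
# window controls how many neighbours each word is paired with.
from collections import Counter

def training_pairs(tokens, window, min_count):
    counts = Counter(tokens)
    # Vocabulary pruning: rare words never reach training at all.
    kept = [t for t in tokens if counts[t] >= min_count]
    pairs = []
    for i, center in enumerate(kept):
        for j in range(max(0, i - window), min(len(kept), i + window + 1)):
            if j != i:
                pairs.append((center, kept[j]))
    return pairs

toks = ["the", "cat", "sat", "the", "cat", "ran", "off"]
# min_count=2 keeps only "the" and "cat"; window=1 pairs adjacent survivors.
print(training_pairs(toks, window=1, min_count=2))
```

So lowering min_count to 5 should help keep your topic in the vocabulary, at the cost of noisier vectors for very rare words.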
We are trying to use the DBpedia vectors available at https://github.com/idio/wiki2vec#prebuilt-models
English Wikipedia (Feb 2015) 1000 dimension - No stemming - 10skipgram
Would you mind letting us know whether the vectors include multi-word entities (e.g. Barack_Obama), or only single words? Thanks.
Depends on #25
If an implementation can be found, it would be worth producing a ChronicleMap version of the word2vec model. It has a much lower memory footprint, and the access speed is enough for most experiments.
It also plays nicely with Spark, so more complicated operations on the vectors can be expressed more efficiently.
Do you know of any project aiming to support saving and loading wiki2vec or word2vec model files into a database (like MySQL) and providing an adapter to query the vectors?
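I'm not aware of an existing project for this, but a minimal sketch of the idea using stdlib SQLite (the table layout and JSON encoding are arbitrary choices of mine, not an existing adapter):

```python
# Minimal sketch: store word vectors in SQLite, query them back by word.
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for persistence
conn.execute("CREATE TABLE vectors (word TEXT PRIMARY KEY, vec TEXT)")
conn.execute(
    "INSERT INTO vectors VALUES (?, ?)",
    ("Barack_Obama", json.dumps([0.1, 0.2, 0.3])),
)

row = conn.execute(
    "SELECT vec FROM vectors WHERE word = ?", ("Barack_Obama",)
).fetchone()
print(json.loads(row[0]))  # -> [0.1, 0.2, 0.3]
```

For similarity queries you would still need to scan (or pre-index) the vectors, so a plain relational store only solves the lookup side of the problem.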
Hi
Did you have memory problems loading the trained wiki2vec model in gensim?
I trained with size=500, window=10, min_count=10 on the latest enwiki dump, which produced a 13 GB wiki2vec model. When loading it in gensim I get a MemoryError.
Do you have any idea how much memory I need?
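As a rough rule of thumb (assumptions on my part: float32 vectors, a full gensim model that also keeps the output-layer weights, and a guessed ~500 bytes of vocabulary overhead per word):

```python
# Back-of-envelope RAM estimate for loading a full word2vec model in gensim.
# overhead_per_word is a guess at per-word vocab/dict object overhead.
def estimate_ram_gb(vocab_size, dim, overhead_per_word=500):
    vectors = vocab_size * dim * 4   # float32 input vectors
    syn1 = vocab_size * dim * 4      # output weights, kept by full models
    vocab = vocab_size * overhead_per_word
    return (vectors + syn1 + vocab) / 1e9

# e.g. ~6M vocabulary entries at 500 dimensions:
print(round(estimate_ram_gb(6_000_000, 500), 1))  # -> 27.0
```

So a model that is 13 GB on disk can easily need roughly twice that in RAM once loaded; a MemoryError on a 16 GB machine would not be surprising.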
https://github.com/idio/wiki2vec/raw/master/torrents/enwiki-gensim-word2vec-1000-nostem-10cbow.torrent
Torrent Not Found!
Any other sources than torrent download?
Hi!
I'm trying to get word vectors for Wikipedia topic names. For instance, I'd like to know the vector embedding generated for Barack Obama. I've tried the following queries:
model['dbpedia/Barack_Obama']
model['DbpediaID/Barack_Obama']
model['Barack_Obama']
but each of them gives me a key error. Could you tell me what the correct way to query the model would be?
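If it helps: in the prebuilt models, as far as I can tell, entity tokens are prefixed with DBPEDIA_ID/ and plain words are lowercased, so a lookup helper might look like this (the stub dict below only illustrates the key format; load the real model with gensim as in the quick start):

```python
# Hedged sketch: try the entity-prefixed key, then the plain lowercase word.
def lookup(model, title):
    for key in ("DBPEDIA_ID/" + title, title.lower()):
        if key in model:
            return model[key]
    raise KeyError(title)

# Stub standing in for the real model, to show the assumed key format:
stub = {"DBPEDIA_ID/Barack_Obama": [0.1, 0.2], "obama": [0.3, 0.4]}
print(lookup(stub, "Barack_Obama"))  # -> [0.1, 0.2]
```

If the prefixed key still raises KeyError, it may simply be that the entity fell below the min_count threshold at training time.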
In prepare.sh, line 23:
if [ $# < 2 ]
should be replaced with
if [ $# -lt 2 ]
(inside single brackets, < is treated as an input redirection rather than a numeric comparison, which is what produces the "line 23: 2: No such file or directory" error reported elsewhere.)
I can't import model.txt as the word2vec model using gensim. How can I do that?
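With recent gensim, a plain-text model is normally loaded via KeyedVectors.load_word2vec_format("model.txt", binary=False); the most common failure is a missing "vocab_size dim" header line, which that loader requires. A stdlib sketch of the expected layout (my own helper, not a gensim API):

```python
# Parse the word2vec text format: a "vocab_size dim" header line, then one
# "word v1 v2 ... vN" line per word. gensim's loader expects this header.
import io

def read_word2vec_text(fileobj):
    header = fileobj.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in fileobj:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    assert len(vectors) == vocab_size
    return dim, vectors

sample = io.StringIO("2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
dim, vecs = read_word2vec_text(sample)
print(dim, vecs["cat"])  # -> 3 [0.1, 0.2, 0.3]
```

If your model.txt lacks that first header line, adding one (word count, then dimension) should make gensim accept it.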
Hello,
We are currently using the DBpedia dataset "English Wikipedia (Feb 2015) 1000 dimension - No stemming - 10skipgram". Do the DBpedia vectors include vectors for the predicates?
Thank you
Hello,
We are using prepare.sh to generate the corpus file, but the corpus file we generate is empty. Could you please give us some suggestions on how to solve this problem?
Thank you very much
Let's say I'm looking for a specific concept which doesn't exist per se in the model file.
How would I be able to find the nearest representative vector based on the currently existing vectors?
@dav009 (I hope it's alright I'm tagging you here, thought of grabbing your attention if you're still available).
Thanks a lot!
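One common workaround (a sketch of the general technique, not something the model file supports directly): build a query vector by averaging the vectors of the concept's component words, then rank existing vectors by cosine similarity. With gensim this is roughly what model.most_similar(positive=[...]) does; a dependency-free version:

```python
# Approximate a missing concept by averaging its component words' vectors,
# then return the existing word whose vector is closest by cosine similarity.
import math

def average(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest(model, query_vec):
    return max(model, key=lambda w: cosine(model[w], query_vec))

# Toy 2-d model; the real vectors would come from en.model.
model = {"physics": [1.0, 0.0], "poetry": [0.0, 1.0]}
query = average([[0.9, 0.1], [0.8, 0.2]])  # vectors of the concept's words
print(nearest(model, query))  # -> physics
```

Averaging is crude but works surprisingly well for short multi-word concepts; weighting rare words more heavily is a common refinement.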
Hello,
In the command you mentioned, wiki2vec.sh pathToCorpus pathToOutputFile <MIN_WORD_COUNT> <VECTOR_SIZE> <WINDOW_SIZE>, what does <WINDOW_SIZE> represent?
In terms of training, is it using CBOW, skip-gram, or log-linear?
Thank you
We can simplify usage massively by dockerising: set up all the necessary dependencies inside a container and expose only a subcommand-based interface.
That way setup consists solely of installing Docker and pulling the container; prepare.sh won't be necessary anymore.
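A minimal sketch of what the image could look like (the base image, paths, and the wiki2vec dispatcher script are all hypothetical; this is a starting point, not a tested build):

```dockerfile
FROM openjdk:8-jdk
RUN apt-get update && apt-get install -y python3-pip && \
    pip3 install gensim
# Pre-built assembly jar baked into the image:
COPY target/scala-2.10/wiki2vec-assembly-1.0.jar /opt/wiki2vec/
# Hypothetical dispatcher script exposing subcommands (corpus, train, ...):
COPY scripts/wiki2vec /usr/local/bin/wiki2vec
ENTRYPOINT ["wiki2vec"]
```

Usage would then be something like docker run -v "$PWD/data:/data" idio/wiki2vec corpus /data/enwiki-latest-pages-articles.xml.bz2, with all Spark/Hadoop setup hidden inside the image.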
Hi,
I was running prepare.sh for en-US and it's throwing the following exception, because of which the generated corpus is empty. Can you please suggest an alternative solution?
Exception in thread "main" java.io.IOException: Stream is not in the BZip2 format
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.init(BZip2CompressorInputStream.java:255)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.<init>(BZip2CompressorInputStream.java:138)
at org.idio.wikipedia.dumps.ReadableWiki.getWikipediaStream(ReadableWiki.scala:19)
at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:31)
at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
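"Stream is not in the BZip2 format" usually means the downloaded dump is truncated, or is an HTML error page saved under a .bz2 name, rather than a real bzip2 file. A quick sanity check (a real bzip2 stream starts with the magic bytes "BZh"):

```python
# Check whether a file plausibly is a bzip2 stream by its 3-byte magic header.
import bz2
import os
import tempfile

def looks_like_bzip2(path):
    with open(path, "rb") as f:
        return f.read(3) == b"BZh"

# Demo against a genuine in-memory bzip2 stream written to a temp file:
with tempfile.NamedTemporaryFile(delete=False, suffix=".bz2") as tmp:
    tmp.write(bz2.compress(b"hello"))
print(looks_like_bzip2(tmp.name))  # -> True
os.unlink(tmp.name)
```

If the check fails, re-download the dump; if it passes but the error persists, verify the full file against the published checksum, since the magic bytes only prove the start of the file is intact.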
Hi,
Can you please consider moving the model to another file host?
Maybe the new GitHub Large File Storage, or Dropbox.
The torrent is extremely slow.
Thanks!
I'm trying to load en.model with deeplearning4j's Word2Vec implementation.
The following code is used:
return WordVectorSerializer.readWord2VecModel(new File("/home/tom/FYP/en_1000_no_stem/en.model"));
but unfortunately this exception is thrown:
java.lang.RuntimeException: Unable to guess input file format. Please use corresponding loader directly
at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2480)
at org.deeplearning4j.models.embeddings.loader.WordVectorSerializer.readWord2VecModel(WordVectorSerializer.java:2266)
at xyz.tomclarke.fyp.nlp.word2vec.Word2VecProcessor.loadPreTrainedData(Word2VecProcessor.java:36)
at xyz.tomclarke.fyp.nlp.word2vec.TestWord2Vec.testLoadWiki2Vec(TestWord2Vec.java:172)
Running your Python 'quick start' example works fine, so I'm unsure where the problem lies - either with me for not loading it correctly in some way or with DL4J (in which case I apologise for making an issue here).
Has this issue been seen before? Do you know the correct way to load your data with DL4J's implementation? Thank you for any help.
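A guess at the cause: gensim's model.save() writes a Python pickle (plus .npy sidecar files), which DL4J's readWord2VecModel cannot auto-detect. Re-exporting from gensim to the standard word2vec format, e.g. model.wv.save_word2vec_format("en.bin", binary=True), should give DL4J something it recognises. A quick way to check which format a file actually is (my own sniffer, not a library API):

```python
# Distinguish a gensim-native pickle from a standard word2vec model by
# inspecting the first bytes of the file.
import pickle

def sniff(header_bytes):
    if header_bytes[:1] == b"\x80":  # pickle protocol >= 2 opcode
        return "gensim pickle"
    first_line = header_bytes.split(b"\n", 1)[0].split()
    if len(first_line) == 2 and all(p.isdigit() for p in first_line):
        return "word2vec"  # starts with a "vocab_size dim" header
    return "unknown"

print(sniff(pickle.dumps({"vocab": 1})))  # -> gensim pickle
print(sniff(b"2 3\ncat 0.1 0.2 0.3\n"))   # -> word2vec
```

If en.model sniffs as a pickle, that would explain why both DL4J's auto-detection fails and the gensim quick start works.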
A lot of the logic for cleaning Wikipedia markup is already implemented in json-wikipedia, and it's generally much easier to work with because annotations are specified explicitly, separately from the article text.
We should add an option to use it directly, without pre-processing the XML dump.
I'm trying to manually create a corpus, using the following command:
java -Xmx10G -Xms10G -cp target/scala-2.10/wiki2vec-assembly-1.0.jar org.idio.wikipedia.dumps.CreateReadableWiki working/enwiki-latest-pages-articles-multistream.xml.bz2 /mnt/hd0/Arthur/data/en-wiki-latest.lines
resulting in the following error:
[Fatal Error] :965698439:106: Invalid byte 2 of 4-byte UTF-8 sequence.
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 965698439; columnNumber: 106; Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at com.stratio.parsers.XMLDumpParser.parse(XMLDumpParser.java:49)
    at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:45)
    at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
    at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
    at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.scanName(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    ... 5 more
I'm using the latest wikipedia dump, and the sha1sum matches.
Any idea on what can be causing this?
Hi @dav009, very promising work here!
I wrote a simple script to test the coverage of the prebuilt model for English Wikipedia articles. I used the Wikipedia article titles from a preprocessed Wikipedia Miner March 2014 dump.
Out of 4,342,357 articles, only 226,319 had a matching vector (~5%). I have noticed that the model usually covers popular entities but does not cover tail entities. I guess this might be because words below a certain count were ignored, and because of errors in preprocessing.
Any ideas on this? I have noticed that your TODOs include resolving redirects and also co-reference resolution inside the articles, but I guess we would expect better coverage even without these.
Thanks.
The information about deeplearning4j's and Spark's word2vec implementations being inadequate/buggy is two years old.
It's worth checking whether any of those issues have been at least partially fixed, and providing a tool that produces a workable word2vec model end to end, with no intermediate steps.
When the Wikipedia dump is processed into the word2vec corpus, the title of each page (the first word of each line) is null, so all pages start with "null..". Which part of the code takes care of that, and how can we change it so the page title is used instead?
Hi, I am trying to run prepare.sh on a Mac.
This is how I run it: sudo sh prepare.sh en_US data/
Downloading the wiki dump and installing packages (e.g. Hadoop and Spark) all work fine. Compiling wiki2vec also reports plenty of "SUCCESSFUL" messages.
However, I started to receive exceptions when the program tried to create the readable wiki:
Creating Readable Wiki..
Exception in thread "main" java.io.FileNotFoundException: data/enwiki-latest.lines (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:101)
at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:30)
at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
Creating Word2vec Corpus
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Path to Readable Wikipedia: file://data//enwiki-latest.lines
Path to Wikipedia Redirects: fakePathToRedirect/file.nt
Path to Output Corpus : file:///Users/cwang/Downloads/wiki2vec-master/working/enwiki
2015-06-13 00:37:07 INFO SecurityManager:59 - Changing view acls to: root
2015-06-13 00:37:07 INFO SecurityManager:59 - Changing modify acls to: root
2015-06-13 00:37:07 INFO SecurityManager:59 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
2015-06-13 00:37:08 INFO Slf4jLogger:80 - Slf4jLogger started
2015-06-13 00:37:08 INFO Remoting:74 - Starting remoting
2015-06-13 00:37:09 INFO Remoting:74 - Remoting started; listening on addresses :[akka.tcp://[email protected]:53495]
2015-06-13 00:37:09 INFO Utils:59 - Successfully started service 'sparkDriver' on port 53495.
2015-06-13 00:37:09 INFO SparkEnv:59 - Registering MapOutputTracker
2015-06-13 00:37:09 INFO SparkEnv:59 - Registering BlockManagerMaster
2015-06-13 00:37:09 INFO DiskBlockManager:59 - Created local directory at /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/spark-local-20150613003709-d48b
2015-06-13 00:37:10 INFO MemoryStore:59 - MemoryStore started with capacity 265.1 MB
2015-06-13 00:37:12 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-06-13 00:37:13 INFO HttpFileServer:59 - HTTP File server directory is /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/spark-b6dd5609-bb7d-4b8c-974c-272f3c32fd76
2015-06-13 00:37:13 INFO HttpServer:59 - Starting HTTP Server
2015-06-13 00:37:13 INFO Server:272 - jetty-8.y.z-SNAPSHOT
2015-06-13 00:37:13 INFO AbstractConnector:338 - Started [email protected]:53496
2015-06-13 00:37:13 INFO Utils:59 - Successfully started service 'HTTP file server' on port 53496.
2015-06-13 00:37:13 INFO Server:272 - jetty-8.y.z-SNAPSHOT
2015-06-13 00:37:13 INFO AbstractConnector:338 - Started [email protected]:4040
2015-06-13 00:37:13 INFO Utils:59 - Successfully started service 'SparkUI' on port 4040.
2015-06-13 00:37:13 INFO SparkUI:59 - Started SparkUI at http://res1cwang-m1.home:4040
2015-06-13 00:37:15 INFO SparkContext:59 - Added JAR file:/Users/cwang/Downloads/wiki2vec-master/target/scala-2.10/wiki2vec-assembly-1.0.jar at http://192.168.1.8:53496/jars/wiki2vec-assembly-1.0.jar with timestamp 1434170235402
2015-06-13 00:37:16 INFO AkkaUtils:59 - Connecting to HeartbeatReceiver: akka.tcp://[email protected]:53495/user/HeartbeatReceiver
2015-06-13 00:37:16 INFO NettyBlockTransferService:59 - Server created on 53497
2015-06-13 00:37:16 INFO BlockManagerMaster:59 - Trying to register BlockManager
2015-06-13 00:37:16 INFO BlockManagerMasterActor:59 - Registering block manager localhost:53497 with 265.1 MB RAM, BlockManagerId(<driver>, localhost, 53497)
2015-06-13 00:37:16 INFO BlockManagerMaster:59 - Registered BlockManager
java.io.FileNotFoundException: fakePathToRedirect/file.nt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:90)
at scala.io.Source$.fromFile(Source.scala:75)
at scala.io.Source$.fromFile(Source.scala:53)
at scala.io.Source$.fromFile(Source.scala:59)
at org.idio.wikipedia.redirects.RedirectStore$.readFile(RedirectStore.scala:58)
at org.idio.wikipedia.redirects.MapRedirectStore.<init>(RedirectStore.scala:34)
at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:172)
at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
using empty redirect store..
2015-06-13 00:37:17 INFO MemoryStore:59 - ensureFreeSpace(159118) called with curMem=0, maxMem=278019440
2015-06-13 00:37:17 INFO MemoryStore:59 - Block broadcast_0 stored as values in memory (estimated size 155.4 KB, free 265.0 MB)
2015-06-13 00:37:17 INFO MemoryStore:59 - ensureFreeSpace(22692) called with curMem=159118, maxMem=278019440
2015-06-13 00:37:17 INFO MemoryStore:59 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 265.0 MB)
2015-06-13 00:37:17 INFO BlockManagerInfo:59 - Added broadcast_0_piece0 in memory on localhost:53497 (size: 22.2 KB, free: 265.1 MB)
2015-06-13 00:37:17 INFO BlockManagerMaster:59 - Updated info of block broadcast_0_piece0
2015-06-13 00:37:17 INFO SparkContext:59 - Created broadcast 0 from textFile at Word2VecCorpus.scala:25
2015-06-13 00:37:17 INFO MemoryStore:59 - ensureFreeSpace(40) called with curMem=181810, maxMem=278019440
2015-06-13 00:37:17 INFO MemoryStore:59 - Block broadcast_1 stored as values in memory (estimated size 40.0 B, free 265.0 MB)
2015-06-13 00:37:17 INFO MemoryStore:59 - ensureFreeSpace(86) called with curMem=181850, maxMem=278019440
2015-06-13 00:37:17 INFO MemoryStore:59 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 86.0 B, free 265.0 MB)
2015-06-13 00:37:17 INFO BlockManagerInfo:59 - Added broadcast_1_piece0 in memory on localhost:53497 (size: 86.0 B, free: 265.1 MB)
2015-06-13 00:37:17 INFO BlockManagerMaster:59 - Updated info of block broadcast_1_piece0
2015-06-13 00:37:17 INFO SparkContext:59 - Created broadcast 1 from broadcast at Word2VecCorpus.scala:30
2015-06-13 00:37:18 INFO deprecation:1009 - mapred.tip.id is deprecated. Instead, use mapreduce.task.id
2015-06-13 00:37:18 INFO deprecation:1009 - mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
2015-06-13 00:37:18 INFO deprecation:1009 - mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
2015-06-13 00:37:18 INFO deprecation:1009 - mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
2015-06-13 00:37:18 INFO deprecation:1009 - mapred.job.id is deprecated. Instead, use mapreduce.job.id
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file://data/enwiki-latest.lines, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:79)
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:506)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
at org.apache.hadoop.fs.Globber.glob(Globber.java:248)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1642)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1321)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1074)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:940)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:849)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1164)
at org.idio.wikipedia.word2vec.Word2VecCorpus.getWord2vecCorpus(Word2VecCorpus.scala:139)
at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:185)
at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Joining corpus..
prepare.sh: line 89: data//enwiki.corpus: No such file or directory
^___^ corpus : data//enwiki.corpus
I guessed the reason was that I passed a relative path, not an absolute path. So I commented out some lines in prepare.sh, because the wiki dump and packages had already been downloaded. Then I re-ran it like this:
sudo sh prepare.sh en_US /Users/cwang/Downloads/wiki2vec-master/data
Below is what I got:
Language: en
Working directory: /Users/cwang/Downloads/wiki2vec-master/working
Creating Readable Wiki..
Exception in thread "main" java.io.FileNotFoundException: /Users/cwang/Downloads/wiki2vec-master/data/enwiki-latest.lines (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:101)
at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:30)
at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
Creating Word2vec Corpus
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Path to Readable Wikipedia: file:///Users/cwang/Downloads/wiki2vec-master/data/enwiki-latest.lines
Path to Wikipedia Redirects: fakePathToRedirect/file.nt
Path to Output Corpus : file:///Users/cwang/Downloads/wiki2vec-master/working/enwiki
2015-06-13 10:14:03 INFO SecurityManager:59 - Changing view acls to: root
2015-06-13 10:14:03 INFO SecurityManager:59 - Changing modify acls to: root
2015-06-13 10:14:03 INFO SecurityManager:59 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
2015-06-13 10:14:03 INFO Slf4jLogger:80 - Slf4jLogger started
2015-06-13 10:14:03 INFO Remoting:74 - Starting remoting
2015-06-13 10:14:03 INFO Remoting:74 - Remoting started; listening on addresses :[akka.tcp://[email protected]:58893]
2015-06-13 10:14:03 INFO Utils:59 - Successfully started service 'sparkDriver' on port 58893.
2015-06-13 10:14:03 INFO SparkEnv:59 - Registering MapOutputTracker
2015-06-13 10:14:03 INFO SparkEnv:59 - Registering BlockManagerMaster
2015-06-13 10:14:03 INFO DiskBlockManager:59 - Created local directory at /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/spark-local-20150613101403-e43b
2015-06-13 10:14:03 INFO MemoryStore:59 - MemoryStore started with capacity 265.1 MB
2015-06-13 10:14:04 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-06-13 10:14:04 INFO HttpFileServer:59 - HTTP File server directory is /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/spark-846c5c00-63bc-4070-b1ef-903b6fcd3567
2015-06-13 10:14:04 INFO HttpServer:59 - Starting HTTP Server
2015-06-13 10:14:04 INFO Server:272 - jetty-8.y.z-SNAPSHOT
2015-06-13 10:14:04 INFO AbstractConnector:338 - Started [email protected]:58894
2015-06-13 10:14:04 INFO Utils:59 - Successfully started service 'HTTP file server' on port 58894.
2015-06-13 10:14:04 INFO Server:272 - jetty-8.y.z-SNAPSHOT
2015-06-13 10:14:04 INFO AbstractConnector:338 - Started [email protected]:4040
2015-06-13 10:14:04 INFO Utils:59 - Successfully started service 'SparkUI' on port 4040.
2015-06-13 10:14:04 INFO SparkUI:59 - Started SparkUI at http://res1cwang-m1.home:4040
2015-06-13 10:14:05 INFO SparkContext:59 - Added JAR file:/Users/cwang/Downloads/wiki2vec-master/target/scala-2.10/wiki2vec-assembly-1.0.jar at http://192.168.1.8:58894/jars/wiki2vec-assembly-1.0.jar with timestamp 1434204845068
2015-06-13 10:14:05 INFO AkkaUtils:59 - Connecting to HeartbeatReceiver: akka.tcp://[email protected]:58893/user/HeartbeatReceiver
2015-06-13 10:14:05 INFO NettyBlockTransferService:59 - Server created on 58895
2015-06-13 10:14:05 INFO BlockManagerMaster:59 - Trying to register BlockManager
2015-06-13 10:14:05 INFO BlockManagerMasterActor:59 - Registering block manager localhost:58895 with 265.1 MB RAM, BlockManagerId(<driver>, localhost, 58895)
2015-06-13 10:14:05 INFO BlockManagerMaster:59 - Registered BlockManager
java.io.FileNotFoundException: fakePathToRedirect/file.nt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:90)
at scala.io.Source$.fromFile(Source.scala:75)
at scala.io.Source$.fromFile(Source.scala:53)
at scala.io.Source$.fromFile(Source.scala:59)
at org.idio.wikipedia.redirects.RedirectStore$.readFile(RedirectStore.scala:58)
at org.idio.wikipedia.redirects.MapRedirectStore.<init>(RedirectStore.scala:34)
at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:172)
at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
using empty redirect store..
2015-06-13 10:14:05 INFO MemoryStore:59 - ensureFreeSpace(159118) called with curMem=0, maxMem=278019440
2015-06-13 10:14:05 INFO MemoryStore:59 - Block broadcast_0 stored as values in memory (estimated size 155.4 KB, free 265.0 MB)
2015-06-13 10:14:05 INFO MemoryStore:59 - ensureFreeSpace(22692) called with curMem=159118, maxMem=278019440
2015-06-13 10:14:05 INFO MemoryStore:59 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 265.0 MB)
2015-06-13 10:14:05 INFO BlockManagerInfo:59 - Added broadcast_0_piece0 in memory on localhost:58895 (size: 22.2 KB, free: 265.1 MB)
2015-06-13 10:14:05 INFO BlockManagerMaster:59 - Updated info of block broadcast_0_piece0
2015-06-13 10:14:05 INFO SparkContext:59 - Created broadcast 0 from textFile at Word2VecCorpus.scala:25
2015-06-13 10:14:05 INFO MemoryStore:59 - ensureFreeSpace(40) called with curMem=181810, maxMem=278019440
2015-06-13 10:14:05 INFO MemoryStore:59 - Block broadcast_1 stored as values in memory (estimated size 40.0 B, free 265.0 MB)
2015-06-13 10:14:05 INFO MemoryStore:59 - ensureFreeSpace(86) called with curMem=181850, maxMem=278019440
2015-06-13 10:14:05 INFO MemoryStore:59 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 86.0 B, free 265.0 MB)
2015-06-13 10:14:05 INFO BlockManagerInfo:59 - Added broadcast_1_piece0 in memory on localhost:58895 (size: 86.0 B, free: 265.1 MB)
2015-06-13 10:14:05 INFO BlockManagerMaster:59 - Updated info of block broadcast_1_piece0
2015-06-13 10:14:05 INFO SparkContext:59 - Created broadcast 1 from broadcast at Word2VecCorpus.scala:30
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/cwang/Downloads/wiki2vec-master/working/enwiki already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1041)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:940)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:849)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1164)
at org.idio.wikipedia.word2vec.Word2VecCorpus.getWord2vecCorpus(Word2VecCorpus.scala:139)
at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:185)
at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Joining corpus..
prepare.sh: line 89: /Users/cwang/Downloads/wiki2vec-master/data/enwiki.corpus: No such file or directory
^^ corpus : /Users/cwang/Downloads/wiki2vec-master/data/enwiki.corpus
RES1CWANG-M1:wiki2vec-master cwang$ sudo sh prepare.sh en_US /Users/cwang/Downloads/wiki2vec-master/data
prepare.sh: line 23: 2: No such file or directory
Language: en
Working directory: /Users/cwang/Downloads/wiki2vec-master/working
Creating Readable Wiki..
Exception in thread "main" java.io.FileNotFoundException: /Users/cwang/Downloads/wiki2vec-master/data/enwiki-latest.lines (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:101)
at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:30)
at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
Creating Word2vec Corpus
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Path to Readable Wikipedia: file:///Users/cwang/Downloads/wiki2vec-master/data/enwiki-latest.lines
Path to Wikipedia Redirects: fakePathToRedirect/file.nt
Path to Output Corpus : file:///Users/cwang/Downloads/wiki2vec-master/working/enwiki
2015-06-13 10:15:35 INFO SecurityManager:59 - Changing view acls to: root
2015-06-13 10:15:35 INFO SecurityManager:59 - Changing modify acls to: root
2015-06-13 10:15:35 INFO SecurityManager:59 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
2015-06-13 10:15:36 INFO Slf4jLogger:80 - Slf4jLogger started
2015-06-13 10:15:36 INFO Remoting:74 - Starting remoting
2015-06-13 10:15:36 INFO Remoting:74 - Remoting started; listening on addresses :[akka.tcp://[email protected]:58941]
2015-06-13 10:15:36 INFO Utils:59 - Successfully started service 'sparkDriver' on port 58941.
2015-06-13 10:15:36 INFO SparkEnv:59 - Registering MapOutputTracker
2015-06-13 10:15:36 INFO SparkEnv:59 - Registering BlockManagerMaster
2015-06-13 10:15:36 INFO DiskBlockManager:59 - Created local directory at /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/spark-local-20150613101536-86b2
2015-06-13 10:15:36 INFO MemoryStore:59 - MemoryStore started with capacity 265.1 MB
2015-06-13 10:15:36 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-06-13 10:15:36 INFO HttpFileServer:59 - HTTP File server directory is /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/spark-914594ab-f207-47cf-b092-1bb988a1cd0c
2015-06-13 10:15:36 INFO HttpServer:59 - Starting HTTP Server
2015-06-13 10:15:37 INFO Server:272 - jetty-8.y.z-SNAPSHOT
2015-06-13 10:15:37 INFO AbstractConnector:338 - Started [email protected]:58942
2015-06-13 10:15:37 INFO Utils:59 - Successfully started service 'HTTP file server' on port 58942.
2015-06-13 10:15:37 INFO Server:272 - jetty-8.y.z-SNAPSHOT
2015-06-13 10:15:37 INFO AbstractConnector:338 - Started [email protected]:4040
2015-06-13 10:15:37 INFO Utils:59 - Successfully started service 'SparkUI' on port 4040.
2015-06-13 10:15:37 INFO SparkUI:59 - Started SparkUI at http://res1cwang-m1.home:4040
2015-06-13 10:15:37 INFO SparkContext:59 - Added JAR file:/Users/cwang/Downloads/wiki2vec-master/target/scala-2.10/wiki2vec-assembly-1.0.jar at http://192.168.1.8:58942/jars/wiki2vec-assembly-1.0.jar with timestamp 1434204937819
2015-06-13 10:15:37 INFO AkkaUtils:59 - Connecting to HeartbeatReceiver: akka.tcp://[email protected]:58941/user/HeartbeatReceiver
2015-06-13 10:15:37 INFO NettyBlockTransferService:59 - Server created on 58943
2015-06-13 10:15:37 INFO BlockManagerMaster:59 - Trying to register BlockManager
2015-06-13 10:15:37 INFO BlockManagerMasterActor:59 - Registering block manager localhost:58943 with 265.1 MB RAM, BlockManagerId(, localhost, 58943)
2015-06-13 10:15:37 INFO BlockManagerMaster:59 - Registered BlockManager
java.io.FileNotFoundException: fakePathToRedirect/file.nt (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.(FileInputStream.java:138)
at scala.io.Source$.fromFile(Source.scala:90)
at scala.io.Source$.fromFile(Source.scala:75)
at scala.io.Source$.fromFile(Source.scala:53)
at scala.io.Source$.fromFile(Source.scala:59)
at org.idio.wikipedia.redirects.RedirectStore$.readFile(RedirectStore.scala:58)
at org.idio.wikipedia.redirects.MapRedirectStore.(RedirectStore.scala:34)
at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:172)
at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
using empty redirect store..
2015-06-13 10:15:38 INFO MemoryStore:59 - ensureFreeSpace(159118) called with curMem=0, maxMem=278019440
2015-06-13 10:15:38 INFO MemoryStore:59 - Block broadcast_0 stored as values in memory (estimated size 155.4 KB, free 265.0 MB)
2015-06-13 10:15:38 INFO MemoryStore:59 - ensureFreeSpace(22692) called with curMem=159118, maxMem=278019440
2015-06-13 10:15:38 INFO MemoryStore:59 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 265.0 MB)
2015-06-13 10:15:38 INFO BlockManagerInfo:59 - Added broadcast_0_piece0 in memory on localhost:58943 (size: 22.2 KB, free: 265.1 MB)
2015-06-13 10:15:38 INFO BlockManagerMaster:59 - Updated info of block broadcast_0_piece0
2015-06-13 10:15:38 INFO SparkContext:59 - Created broadcast 0 from textFile at Word2VecCorpus.scala:25
2015-06-13 10:15:38 INFO MemoryStore:59 - ensureFreeSpace(40) called with curMem=181810, maxMem=278019440
2015-06-13 10:15:38 INFO MemoryStore:59 - Block broadcast_1 stored as values in memory (estimated size 40.0 B, free 265.0 MB)
2015-06-13 10:15:38 INFO MemoryStore:59 - ensureFreeSpace(86) called with curMem=181850, maxMem=278019440
2015-06-13 10:15:38 INFO MemoryStore:59 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 86.0 B, free 265.0 MB)
2015-06-13 10:15:38 INFO BlockManagerInfo:59 - Added broadcast_1_piece0 in memory on localhost:58943 (size: 86.0 B, free: 265.1 MB)
2015-06-13 10:15:38 INFO BlockManagerMaster:59 - Updated info of block broadcast_1_piece0
2015-06-13 10:15:38 INFO SparkContext:59 - Created broadcast 1 from broadcast at Word2VecCorpus.scala:30
2015-06-13 10:15:38 INFO deprecation:1009 - mapred.tip.id is deprecated. Instead, use mapreduce.task.id
2015-06-13 10:15:38 INFO deprecation:1009 - mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
2015-06-13 10:15:38 INFO deprecation:1009 - mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
2015-06-13 10:15:38 INFO deprecation:1009 - mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
2015-06-13 10:15:38 INFO deprecation:1009 - mapred.job.id is deprecated. Instead, use mapreduce.job.id
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/cwang/Downloads/wiki2vec-master/data/enwiki-latest.lines
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1321)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1074)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:940)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:849)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1164)
at org.idio.wikipedia.word2vec.Word2VecCorpus.getWord2vecCorpus(Word2VecCorpus.scala:139)
at org.idio.wikipedia.word2vec.Word2VecCorpus$.main(Word2VecCorpus.scala:185)
at org.idio.wikipedia.word2vec.Word2VecCorpus.main(Word2VecCorpus.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Joining corpus..
prepare.sh: line 88: /Users/cwang/Downloads/wiki2vec-master/data/enwiki.corpus: No such file or directory
^^ corpus : /Users/cwang/Downloads/wiki2vec-master/data/enwiki.corpus
There is no data folder in my wiki2vec-master folder.
Could you please advise how I can fix it?
Thanks in advance!
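Judging from the log, prepare.sh expects `data` and `working` directories inside the wiki2vec checkout and fails with "No such file or directory" when they are missing. A minimal sketch of creating them up front (the layout is inferred from the paths in the log, not documented behaviour; a temporary directory stands in for the real checkout):

```python
import os
import tempfile

# Sketch only: prepare.sh appears to expect "data" and "working"
# directories inside the wiki2vec checkout (paths inferred from the
# log above). Creating them before the first run avoids the
# "No such file or directory" failures.
base = tempfile.mkdtemp()  # stands in for the wiki2vec-master folder
for sub in ("data", "working"):
    os.makedirs(os.path.join(base, sub), exist_ok=True)

created = sorted(os.listdir(base))
print(created)  # ['data', 'working']
```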
Would it be possible for you to share pretrained embeddings of size 300 in English?
I'm trying to train some models using these embeddings, but given the size of my dataset, 1000 dimensions is too high and my machine cannot handle it.
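One possible workaround while no 300-dimension artifact is published: project the 1000-dimension vectors down with a truncated SVD (PCA). Some precision is lost, but memory drops by roughly 70%. This is a sketch under assumptions, not an official recipe; random vectors stand in for the real model matrix:

```python
import numpy as np

# Hypothetical stand-in for the pretrained (vocab_size x 1000) matrix.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(500, 1000))

# PCA via truncated SVD: keep the 300 directions of highest variance.
centered = vectors - vectors.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:300].T

print(reduced.shape)  # (500, 300)
```

Nearest-neighbour rankings are usually well preserved under this kind of projection, though analogy arithmetic can degrade.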
Hi,
After page 2029599, I get this error:
.......
2029599
Exception in thread "main" java.io.IOException: block overrun
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.getAndMoveToFrontDecode(BZip2CompressorInputStream.java:700)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.initBlock(BZip2CompressorInputStream.java:326)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartA(BZip2CompressorInputStream.java:884)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.setupNoRandPartB(BZip2CompressorInputStream.java:933)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read0(BZip2CompressorInputStream.java:228)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.read(BZip2CompressorInputStream.java:179)
at org.apache.xerces.impl.XMLEntityManager$RewindableInputStream.read(Unknown Source)
at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at com.stratio.parsers.XMLDumpParser.parse(XMLDumpParser.java:49)
at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:45)
at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
Can you suggest what might be causing this error?
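A "block overrun" deep inside BZip2CompressorInputStream usually means the dump file itself is truncated or corrupted, typically from an interrupted download. Before re-running the pipeline it is worth streaming the whole archive once to confirm it decompresses cleanly (the equivalent of `bzip2 -t`). A small helper, assuming nothing beyond the standard library:

```python
import bz2

def bz2_is_intact(path, chunk=1 << 20):
    """Stream-decompress the whole .bz2 file; False on any corruption."""
    try:
        with bz2.open(path, "rb") as f:
            # bz2.open handles multistream archives like the
            # pages-articles-multistream dumps.
            while f.read(chunk):
                pass
        return True
    except (OSError, EOFError):
        # OSError: invalid data stream; EOFError: truncated file.
        return False
```

If this returns False, re-downloading the dump (and comparing against the checksums Wikimedia publishes alongside it) is the likely fix.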
Hi David,
I have tested the accuracy of your pre-trained model with the "questions-words.txt" dataset from Google. The results are:
2016-05-23 12:20:41,450 : INFO : family: 0.0% (0/272)
2016-05-23 12:20:45,831 : INFO : gram1-adjective-to-adverb: 3.8% (23/600)
2016-05-23 12:20:47,155 : INFO : gram2-opposite: 14.3% (26/182)
2016-05-23 12:20:53,058 : INFO : gram3-comparative: 0.0% (0/812)
2016-05-23 12:20:55,041 : INFO : gram4-superlative: 5.5% (15/272)
2016-05-23 12:21:00,144 : INFO : gram5-present-participle: 0.0% (0/702)
2016-05-23 12:21:08,831 : INFO : gram7-past-tense: 4.1% (49/1190)
2016-05-23 12:21:14,729 : INFO : gram8-plural: 0.0% (0/812)
2016-05-23 12:21:18,418 : INFO : gram9-plural-verbs: 3.6% (18/507)
2016-05-23 12:21:18,419 : INFO : total: 2.4% (131/5349)
The accuracy is quite low (2.4%). Is that normal?
Thanks
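For context, the figures above come from the 3CosAdd analogy test: for "a is to b as c is to ?", the predicted answer is the vocabulary entry whose vector is closest (by cosine) to b - a + c. One plausible explanation for the low score, to be confirmed against the actual model, is that the wiki2vec vocabulary mixes plain words with `DBPEDIA_ID/` entity tokens and preserves casing, so many question words simply miss or collide. A toy vocabulary illustrates the test itself:

```python
import numpy as np

# Minimal 3CosAdd sketch with a hand-built toy vocabulary; the real
# evaluation (e.g. gensim's) does the same over ~20k questions.
vocab = ["king", "queen", "man", "woman"]
vecs = np.array([[1.0, 1.0], [1.0, -1.0], [0.0, 1.0], [0.0, -1.0]])
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
idx = {w: i for i, w in enumerate(vocab)}

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' by cosine similarity."""
    target = vecs[idx[b]] - vecs[idx[a]] + vecs[idx[c]]
    target /= np.linalg.norm(target)
    scores = vecs @ target
    for w in (a, b, c):            # exclude the question words
        scores[idx[w]] = -np.inf
    return vocab[int(np.argmax(scores))]

print(analogy("man", "king", "woman"))  # queen
```

An answer only counts as correct if the exact vocabulary string matches, which is why casing and tokenization conventions dominate the score.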
Getting this error:
[info] Assembly up to date: /home/rg203/work/scripts/wiki2vec/target/scala-2.10/wiki2vec-assembly-1.0.jar
[success] Total time: 2 s, completed Jan 5, 2017 7:29:26 AM
Creating Readable Wiki..
Exception in thread "main" java.io.IOException: Stream is not in the BZip2 format
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.init(BZip2CompressorInputStream.java:255)
at org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream.<init>(BZip2CompressorInputStream.java:138)
at org.idio.wikipedia.dumps.ReadableWiki.getWikipediaStream(ReadableWiki.scala:19)
at org.idio.wikipedia.dumps.ReadableWiki.createReadableWiki(ReadableWiki.scala:31)
at org.idio.wikipedia.dumps.CreateReadableWiki$.main(ReadableWiki.scala:55)
at org.idio.wikipedia.dumps.CreateReadableWiki.main(ReadableWiki.scala)
Creating Word2vec Corpus
/home/rg203/work/scripts/wiki2vec/working/spark-1.2.0-bin-hadoop2.4/bin/spark-class: line 113: [: : integer expression expected
/home/rg203/work/scripts/wiki2vec/working/spark-1.2.0-bin-hadoop2.4/bin/spark-class: line 187: /usr/lib/jvm/java-8-oracle/jre/bin/java/bin/java: Not a directory
/home/rg203/work/scripts/wiki2vec/working/spark-1.2.0-bin-hadoop2.4/bin/spark-class: line 187: exec: /usr/lib/jvm/java-8-oracle/jre/bin/java/bin/java: cannot execute: Not a directory
Joining corpus..
cat: 'part*': No such file or directory
^___^ corpus : /home/rg203/work/scripts/wiki2vec/spanish_output//eswiki.corpus
Any ideas? Thanks for the help!
Chinese Wikipedia throws this error when creating the word2vec corpus with the org.idio.wikipedia.word2vec.Word2VecCorpus class:
java.lang.StackOverflowError
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3705)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4160)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4173)
.........
at java.util.regex.Pattern$Curly.match0(Pattern.java:4173)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4173)
at java.util.regex.Pattern$Curly.match(Pattern.java:4144)
at java.util.regex.Pattern$Slice.match(Pattern.java:3882)
at java.util.regex.Pattern$Start.match(Pattern.java:3420)
at java.util.regex.Matcher.search(Matcher.java:1211)
at java.util.regex.Matcher.find(Matcher.java:604)
at java.util.regex.Matcher.replaceAll(Matcher.java:914)
at scala.util.matching.Regex.replaceAllIn(Regex.scala:298)
at org.idio.wikipedia.word2vec.ArticleCleaner$.cleanStyle(ArticleCleaner.scala:69)
at org.idio.wikipedia.word2vec.Word2VecCorpus$$anonfun$cleanArticles$1.apply(Word2VecCorpus.scala:65)
at org.idio.wikipedia.word2vec.Word2VecCorpus$$anonfun$cleanArticles$1.apply(Word2VecCorpus.scala:56)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1060)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1157)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:627)
at java.lang.Thread.run(Thread.java:809)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2016-01-25 16:25:08 WARN TaskSetManager:71 - Lost task 57.0 in stage 0.0 (TID 57, localhost): TaskKilled (killed intentionally)
It would be good to include a proper command-line option parser for the existing and future command-line tools, with adequate help output.
Hi,
Well, it's a question rather than an issue. I'm not a Scala programmer, but I want to know how you handle redirects. What do you mean by "handling redirects"? Do you replace the redirect with the entity it redirects to? What kind of handling is it?
My second question is: how do you deal with non-entity pages (navigation, maintenance, and discussion pages)?
Thanks!
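For what it's worth, "handling redirects" in this kind of pipeline typically means resolving every link to a redirect page to its canonical target before training, so both surface forms contribute to a single entity vector. This is my reading of the approach, not the project's exact code, sketched with a hypothetical redirect table:

```python
# Toy redirect table; in wiki2vec this would be parsed from the
# DBpedia redirects .nt dump passed to the corpus builder.
redirects = {
    "USA": "United_States",
    "U.S.": "United_States",
}

def resolve(entity):
    # Follow at most one hop, mirroring a flat redirect table.
    return redirects.get(entity, entity)

tokens = ["DBPEDIA_ID/USA", "DBPEDIA_ID/Germany"]
prefix = "DBPEDIA_ID/"
resolved = [prefix + resolve(t[len(prefix):]) for t in tokens]
print(resolved)  # ['DBPEDIA_ID/United_States', 'DBPEDIA_ID/Germany']
```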
It would be great to add org.apache.lucene.analysis for smarter tokenization across languages. That way, processing languages such as Chinese would be much more practical with your library.
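The core of the problem is easy to demonstrate: whitespace splitting, which works tolerably for English, produces a single token for an entire Chinese sentence, so no useful vocabulary can be built without a language-aware analyzer.

```python
# Whitespace tokenization works for English but not for Chinese,
# which is written without spaces between words.
english = "Beijing is the capital of China".split()
chinese = "北京是中国的首都".split()

print(len(english), len(chinese))  # 6 1
```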
Hello,
We are currently using the dataset "English Wikipedia (Feb 2015) 1000 dimension - No stemming - 10skipgram". We are looking for some test cases based on http://dbpedia.org/page/Earthquake. We have confirmed that "DBPEDIA_ID/Vanilla_Ice" is available in the dataset, but when we try "dbc:Earthquakes" or "dbr:Vanilla_Ice" we get a KeyError: "dbc:Earthquakes" not in vocabulary and "dbr:Vanilla_Ice" not in vocabulary. We wonder whether the dataset stores entities with the "dbc:" or "dbo:" prefix?
Thank you
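From the report above, the prebuilt vocabulary appears to key entities as `DBPEDIA_ID/<Wikipedia_title>`, not with the `dbr:`/`dbc:`/`dbo:` prefixes used on dbpedia.org (and `dbc:` category pages are likely absent entirely, since only article text is tokenized). A small, hypothetical adapter can translate before lookup:

```python
# Hedged sketch: translate common DBpedia prefixes to the
# "DBPEDIA_ID/<title>" keys the prebuilt model seems to use.
PREFIXES = ("dbr:", "dbc:", "dbo:", "http://dbpedia.org/resource/")

def to_wiki2vec_key(uri):
    for prefix in PREFIXES:
        if uri.startswith(prefix):
            return "DBPEDIA_ID/" + uri[len(prefix):]
    return uri  # already a plain word or wiki2vec key

print(to_wiki2vec_key("dbr:Vanilla_Ice"))  # DBPEDIA_ID/Vanilla_Ice
```

Whether a translated key exists still needs to be checked against the model's vocabulary, of course.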
I used Python 2.7 and the following code:
model = Word2Vec.load("path/to/word2vec/en.model")
which returns an error:
AttributeError: 'Word2Vec' object has no attribute 'vector_size'
I guess it might be the gensim version, but I am not sure.
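A gensim version mismatch is indeed the usual cause: older gensim models stored the dimensionality under `layer1_size`, while newer versions expect `vector_size`. Besides pinning the gensim version the model was saved with, one hedged workaround is to backfill the attribute after loading; a toy object stands in for the loaded model here:

```python
class OldModel:
    """Stand-in for a Word2Vec model saved by an old gensim release."""
    layer1_size = 1000  # old name for the embedding dimensionality

model = OldModel()

# Backfill the attribute newer gensim code paths look for.
if not hasattr(model, "vector_size") and hasattr(model, "layer1_size"):
    model.vector_size = model.layer1_size

print(model.vector_size)  # 1000
```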