yahoo / caffeonspark

Distributed deep learning on Hadoop and Spark clusters.

License: Apache License 2.0


caffeonspark's Introduction

Note: we're lovingly marking this project as Archived since we're no longer supporting it. You are welcome to read the code and fork your own version of it and continue to use this code under the terms of the project license.

CaffeOnSpark

What's CaffeOnSpark?

CaffeOnSpark brings deep learning to Hadoop and Spark clusters. By combining salient features from deep learning framework Caffe and big-data frameworks Apache Spark and Apache Hadoop, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers.

As a distributed extension of Caffe, CaffeOnSpark supports neural network model training, testing, and feature extraction. Caffe users can now perform distributed learning using their existing LMDB data files and a minimally adjusted network configuration (as illustrated).

CaffeOnSpark is a Spark package for deep learning. It is complementary to non-deep learning libraries MLlib and Spark SQL. CaffeOnSpark's Scala API provides Spark applications with an easy mechanism to invoke deep learning (see sample) over distributed datasets.
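
As a minimal sketch of that mechanism, and assuming the Scala API mirrors the Python binding shown near the end of this page (CaffeOnSpark, Config, DataSource with getSource, train, and features), an application might look roughly like the following; consult the wiki's API reference for the authoritative signatures:

import org.apache.spark.{SparkConf, SparkContext}
import com.yahoo.ml.caffe.{CaffeOnSpark, Config, DataSource}

object CaffeOnSparkSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CaffeOnSparkSketch"))
    val cfg = new Config(sc, args)                     // assumed to parse options such as -conf, -model, -devices
    val cos = new CaffeOnSpark(sc)
    val trainSource = DataSource.getSource(cfg, true)  // training data source
    cos.train(trainSource)                             // distributed training across the executors
    val testSource = DataSource.getSource(cfg, false)  // test data source
    val featureDF = cos.features(testSource)           // feature extraction, assumed to return a DataFrame
    featureDF.show()
    sc.stop()
  }
}

Because the extracted features come back as a DataFrame, they can be handed straight to MLlib or queried with Spark SQL, which is what makes CaffeOnSpark complementary to those libraries.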

CaffeOnSpark was developed by Yahoo for large-scale distributed deep learning on our Hadoop clusters in Yahoo's private cloud. It has been used at Yahoo for image search, content classification, and several other use cases.

Why CaffeOnSpark?

CaffeOnSpark provides some important benefits (see our blog) over alternative deep learning solutions.

  • It enables model training, testing, and feature extraction directly on datasets stored in HDFS on Hadoop clusters.
  • It turns your Hadoop or Spark cluster(s) into a powerful platform for deep learning, without the need to set up a separate dedicated cluster for deep learning.
  • Server-to-server direct communication (Ethernet or InfiniBand) achieves faster learning and eliminates the scalability bottleneck.
  • Caffe users' existing datasets (e.g. LMDB) and configurations can be used for distributed learning without any conversion.
  • A high-level API empowers Spark applications to easily conduct deep learning.
  • Incremental learning is supported to leverage previously trained models or snapshots.
  • Additional data formats and network interfaces can be easily added.
  • It can be easily deployed on a public cloud (e.g. AWS EC2) or a private cloud.

Using CaffeOnSpark

Please check the CaffeOnSpark wiki for detailed documentation, including build instructions, an API reference, and getting-started guides for standalone clusters and AWS EC2 clusters.

  • Batch sizes specified in prototxt files are per device.
  • Memory layers should not be shared among GPUs; "share_in_parallel: false" is therefore required in the layer configuration (see the sketch below).
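
To make these two notes concrete, here is a minimal MemoryData layer sketch, modeled on the lenet_memory_train_test.prototxt that appears in an issue further down this page; the LMDB path is hypothetical and the batch size is only an example:

layer {
  name: "data"
  type: "MemoryData"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  source_class: "com.yahoo.ml.caffe.LMDB"
  memory_data_param {
    source: "file:/path/to/mnist_train_lmdb/"  # hypothetical LMDB location
    batch_size: 64                             # per device: with -devices 2 the effective batch is 128
    channels: 1
    height: 28
    width: 28
    share_in_parallel: false                   # memory layers must not be shared among GPUs
  }
}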

Building for Spark 2.X

CaffeOnSpark supports both Spark 1.x and 2.x. For Spark 2.0, our default settings are:

  • spark-2.0.0
  • hadoop-2.7.1
  • scala-2.11.7

You may want to adjust these versions in caffe-grid/pom.xml.

Mailing List

Please join the CaffeOnSpark user group for discussions and questions.

License

The use and distribution terms for this software are covered by the Apache 2.0 license. See LICENSE file for terms.

caffeonspark's People

Contributors

anfeng, arundasan91, gyehuda, javadba, junshi15, liaocs2008, mriduljain, zhichao-li


caffeonspark's Issues

NullPointerException when Running CaffeOnSpark on EC2

I am running CaffeOnSpark on EC2 following the instructions https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_EC2. I got the following errors:

16/05/06 00:10:35 INFO TaskSetManager: Starting task 1.1 in stage 1.0 (TID 5, ip-10-30-15-17.us-west-2.compute.internal, partition 1,PROCESS_LOCAL, 2197 bytes)
16/05/06 00:10:35 WARN TaskSetManager: Lost task 0.1 in stage 1.0 (TID 4, ip-10-30-15-17.us-west-2.compute.internal): java.lang.NullPointerException
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply$mcVI$sp(CaffeOnSpark.scala:153)
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply(CaffeOnSpark.scala:149)
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply(CaffeOnSpark.scala:149)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

How can I get the training accuracy from the log?

Hi, we can run the cifar10 full training with the command below:
spark-submit --master yarn --deploy-mode cluster --num-executors 1
--executor-memory 6g --executor-cores 5
--files ./data/cifar10_full_solver.prototxt,./data/cifar10_full_train_test.prototxt,./data/mean.binaryproto
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}"
--class com.yahoo.ml.caffe.CaffeOnSpark ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar
-train -features accuracy,loss -label label -conf cifar10_full_solver.prototxt -devices 1 -connection ethernet
-model result/cifar10.full.model.h5 -output result/cifar10_full_features_result

We also get the logs below from the slave node, but we can't get the training accuracy during training. Is there any command parameter that can be used to print the accuracy information? Thank you very much.

I0330 15:37:31.712399 29236 sgd_solver.cpp:106] Iteration 67800, lr = 0.001
16/03/30 15:37:38 INFO executor.Executor: Finished task 0.0 in stage 137.0 (TID 137). 2001 bytes result sent to driver
16/03/30 15:37:38 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 138
16/03/30 15:37:38 INFO executor.Executor: Running task 0.0 in stage 138.0 (TID 138)
16/03/30 15:37:38 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 139
16/03/30 15:37:38 INFO storage.MemoryStore: Block broadcast_139_piece0 stored as bytes in memory (estimated size 1375.0 B, free 32.1 KB)
16/03/30 15:37:38 INFO broadcast.TorrentBroadcast: Reading broadcast variable 139 took 7 ms
16/03/30 15:37:38 INFO storage.MemoryStore: Block broadcast_139 stored as values in memory (estimated size 2.1 KB, free 34.1 KB)
16/03/30 15:37:38 INFO storage.BlockManager: Found block rdd_0_0 locally
I0330 15:37:38.796069 29236 solver.cpp:237] Iteration 68000, loss = 0.362456
I0330 15:37:38.796123 29236 solver.cpp:253] Train net output #0: loss = 0.362456 (* 1 = 0.362456 loss)
I0330 15:37:38.796150 29236 sgd_solver.cpp:106] Iteration 68000, lr = 0.001
I0330 15:37:45.877544 29236 solver.cpp:237] Iteration 68200, loss = 0.490983
I0330 15:37:45.877589 29236 solver.cpp:253] Train net output #0: loss = 0.490983 (* 1 = 0.490983 loss)
I0330 15:37:45.877604 29236 sgd_solver.cpp:106] Iteration 68200, lr = 0.001
I0330 15:37:52.965775 29236 solver.cpp:237] Iteration 68400, loss = 0.379639
I0330 15:37:52.965844 29236 solver.cpp:253] Train net output #0: loss = 0.379639 (* 1 = 0.379639 loss)
I0330 15:37:52.965878 29236 sgd_solver.cpp:106] Iteration 68400, lr = 0.001
16/03/30 15:37:56 INFO executor.Executor: Finished task 0.0 in stage 138.0 (TID 138). 2001 bytes result sent to driver
16/03/30 15:37:56 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 139
16/03/30 15:37:56 INFO executor.Executor: Running task 0.0 in stage 139.0 (TID 139)
16/03/30 15:37:56 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 140
16/03/30 15:37:56 INFO storage.MemoryStore: Block broadcast_140_piece0 stored as bytes in memory (estimated size 1375.0 B, free 35.5 KB)
16/03/30 15:37:56 INFO broadcast.TorrentBroadcast: Reading broadcast variable 140 took 9 ms
16/03/30 15:37:56 INFO storage.MemoryStore: Block broadcast_140 stored as values in memory (estimated size 2.1 KB, free 37.5 KB)
16/03/30 15:37:56 INFO storage.BlockManager: Found block rdd_0_0 locally
I0330 15:38:00.052302 29236 solver.cpp:237] Iteration 68600, loss = 0.358753
I0330 15:38:00.052367 29236 solver.cpp:253] Train net output #0: loss = 0.358753 (* 1 = 0.358753 loss)
I0330 15:38:00.052400 29236 sgd_solver.cpp:106] Iteration 68600, lr = 0.001
I0330 15:38:07.139724 29236 solver.cpp:237] Iteration 68800, loss = 0.395479
I0330 15:38:07.139791 29236 solver.cpp:253] Train net output #0: loss = 0.395479 (* 1 = 0.395479 loss)
I0330 15:38:07.139824 29236 sgd_solver.cpp:106] Iteration 68800, lr = 0.001
16/03/30 15:38:13 INFO executor.Executor: Finished task 0.0 in stage 139.0 (TID 139). 2001 bytes result sent to driver
16/03/30 15:38:13 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 140
16/03/30 15:38:13 INFO executor.Executor: Running task 0.0 in stage 140.0 (TID 140)
16/03/30 15:38:13 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 141
16/03/30 15:38:13 INFO storage.MemoryStore: Block broadcast_141_piece0 stored as bytes in memory (estimated size 1375.0 B, free 38.9 KB)
16/03/30 15:38:13 INFO broadcast.TorrentBroadcast: Reading broadcast variable 141 took 9 ms
16/03/30 15:38:13 INFO storage.MemoryStore: Block broadcast_141 stored as values in memory (estimated size 2.1 KB, free 40.9 KB)
16/03/30 15:38:13 INFO storage.BlockManager: Found block rdd_0_0 locally
I0330 15:38:14.222049 29236 solver.cpp:237] Iteration 69000, loss = 0.361772
I0330 15:38:14.222107 29236 solver.cpp:253] Train net output #0: loss = 0.361772 (* 1 = 0.361772 loss)
I0330 15:38:14.222134 29236 sgd_solver.cpp:106] Iteration 69000, lr = 0.001
I0330 15:38:21.309873 29236 solver.cpp:237] Iteration 69200, loss = 0.48632
I0330 15:38:21.309912 29236 solver.cpp:253] Train net output #0: loss = 0.48632 (* 1 = 0.48632 loss)
I0330 15:38:21.309931 29236 sgd_solver.cpp:106] Iteration 69200, lr = 0.001
I0330 15:38:28.487985 29236 solver.cpp:237] Iteration 69400, loss = 0.377639
I0330 15:38:28.488052 29236 solver.cpp:253] Train net output #0: loss = 0.377639 (* 1 = 0.377639 loss)
I0330 15:38:28.488085 29236 sgd_solver.cpp:106] Iteration 69400, lr = 0.001
16/03/30 15:38:31 INFO executor.Executor: Finished task 0.0 in stage 140.0 (TID 140). 2001 bytes result sent to driver
16/03/30 15:38:31 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 141
16/03/30 15:38:31 INFO executor.Executor: Running task 0.0 in stage 141.0 (TID 141)
16/03/30 15:38:31 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 142
16/03/30 15:38:31 INFO storage.MemoryStore: Block broadcast_142_piece0 stored as bytes in memory (estimated size 1375.0 B, free 42.3 KB)
16/03/30 15:38:31 INFO broadcast.TorrentBroadcast: Reading broadcast variable 142 took 7 ms
16/03/30 15:38:31 INFO storage.MemoryStore: Block broadcast_142 stored as values in memory (estimated size 2.1 KB, free 44.3 KB)
16/03/30 15:38:31 INFO storage.BlockManager: Found block rdd_0_0 locally
I0330 15:38:35.572760 29236 solver.cpp:237] Iteration 69600, loss = 0.35981
I0330 15:38:35.572826 29236 solver.cpp:253] Train net output #0: loss = 0.35981 (* 1 = 0.35981 loss)
I0330 15:38:35.572860 29236 sgd_solver.cpp:106] Iteration 69600, lr = 0.001
I0330 15:38:42.656687 29236 solver.cpp:237] Iteration 69800, loss = 0.390506
I0330 15:38:42.656752 29236 solver.cpp:253] Train net output #0: loss = 0.390506 (* 1 = 0.390506 loss)
I0330 15:38:42.656810 29236 sgd_solver.cpp:106] Iteration 69800, lr = 0.001
16/03/30 15:38:49 INFO executor.Executor: Finished task 0.0 in stage 141.0 (TID 141). 2001 bytes result sent to driver
16/03/30 15:38:49 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 142
16/03/30 15:38:49 INFO executor.Executor: Running task 0.0 in stage 142.0 (TID 142)
16/03/30 15:38:49 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 143
16/03/30 15:38:49 INFO storage.MemoryStore: Block broadcast_143_piece0 stored as bytes in memory (estimated size 1375.0 B, free 45.7 KB)
16/03/30 15:38:49 INFO broadcast.TorrentBroadcast: Reading broadcast variable 143 took 9 ms
16/03/30 15:38:49 INFO storage.MemoryStore: Block broadcast_143 stored as values in memory (estimated size 2.1 KB, free 47.7 KB)
16/03/30 15:38:49 INFO storage.BlockManager: Found block rdd_0_0 locally
16/03/30 15:38:49 INFO caffe.CaffeProcessor: Model saving into file at the end of training:result/cifar10.full.model.h5
I0330 15:38:49.704614 29236 solver.cpp:469] Snapshotting to HDF5 file cifar10_full_iter_70000.caffemodel.h5
I0330 15:38:49.733238 29236 sgd_solver.cpp:283] Snapshotting solver state to HDF5 file cifar10_full_iter_70000.solverstate.h5
16/03/30 15:38:49 INFO caffe.FSUtils$: destination file:result/cifar10.full.model.h5
16/03/30 15:38:49 INFO executor.Executor: Finished task 0.0 in stage 142.0 (TID 142). 2001 bytes result sent to driver
16/03/30 15:38:49 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 143
16/03/30 15:38:49 INFO executor.Executor: Running task 0.0 in stage 143.0 (TID 143)
16/03/30 15:38:49 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 144
16/03/30 15:38:49 INFO storage.MemoryStore: Block broadcast_144_piece0 stored as bytes in memory (estimated size 1259.0 B, free 49.0 KB)
16/03/30 15:38:49 INFO broadcast.TorrentBroadcast: Reading broadcast variable 144 took 7 ms
16/03/30 15:38:49 INFO storage.MemoryStore: Block broadcast_144 stored as values in memory (estimated size 2016.0 B, free 50.9 KB)
16/03/30 15:38:49 INFO executor.Executor: Finished task 0.0 in stage 143.0 (TID 143). 899 bytes result sent to driver
16/03/30 15:38:50 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 144
16/03/30 15:38:50 INFO executor.Executor: Running task 0.0 in stage 144.0 (TID 144)
16/03/30 15:38:50 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 145
16/03/30 15:38:50 INFO storage.MemoryStore: Block broadcast_145_piece0 stored as bytes in memory (estimated size 1973.0 B, free 52.9 KB)
16/03/30 15:38:50 INFO broadcast.TorrentBroadcast: Reading broadcast variable 145 took 9 ms
16/03/30 15:38:50 INFO storage.MemoryStore: Block broadcast_145 stored as values in memory (estimated size 2.9 KB, free 55.8 KB)
16/03/30 15:38:50 INFO caffe.CaffeProcessor: my rank is 0
16/03/30 15:38:50 INFO caffe.LMDB: Batch size:100
I0330 15:38:50.147389 29220 CaffeNet.cpp:78] set root solver device id to 0
I0330 15:38:50.148255 29220 solver.cpp:48] Initializing solver from parameters:
test_iter: 100
test_interval: 70001
base_lr: 0.001
display: 200
max_iter: 70000
lr_policy: "fixed"
momentum: 0.9
weight_decay: 0.004
stepsize: 30000
snapshot: 70001
snapshot_prefix: "cifar10_full"
solver_mode: GPU
device_id: 0
net_param {
name: "CIFAR10_full"
layer {
name: "cifar"
type: "MemoryData"
top: "data"
top: "label"
include {
phase: TRAIN
}
memory_data_param {
batch_size: 100
channels: 3
height: 32
width: 32
share_in_parallel: false
source: "file:///home/atlas/work/caffe_spark/CaffeOnSpark-master/data/cifar10_train_lmdb/"
}
source_class: "com.yahoo.ml.caffe.LMDB"
}
// ... (some logs omitted) ...
I0330 15:38:50.185540 29220 net.cpp:228] label_cifar_1_split does not need backward computation.
I0330 15:38:50.185547 29220 net.cpp:228] cifar does not need backward computation.
I0330 15:38:50.185554 29220 net.cpp:270] This network produces output accuracy
I0330 15:38:50.185561 29220 net.cpp:270] This network produces output loss
I0330 15:38:50.185586 29220 net.cpp:283] Network initialization done.
I0330 15:38:50.185653 29220 solver.cpp:60] Solver scaffolding done.
I0330 15:38:50.186189 29220 CaffeNet.cpp:248] Finetuning from /tmp/hadoop-atlas/nm-local-dir/usercache/atlas/appcache/application_1459254969589_0006/container_1459254969589_0006_01_000002/model.tmp.h5
I0330 15:38:50.186841 29220 hdf5.cpp:32] Datatype class: H5T_FLOAT
I0330 15:38:50.189332 29220 parallel.cpp:392] GPUs pairs
16/03/30 15:38:50 INFO executor.Executor: Finished task 0.0 in stage 144.0 (TID 144). 870 bytes result sent to driver
16/03/30 15:38:50 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 145
16/03/30 15:38:50 INFO executor.Executor: Running task 0.0 in stage 145.0 (TID 145)
16/03/30 15:38:50 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 146
16/03/30 15:38:50 INFO storage.MemoryStore: Block broadcast_146_piece0 stored as bytes in memory (estimated size 6.5 KB, free 62.2 KB)
16/03/30 15:38:50 INFO broadcast.TorrentBroadcast: Reading broadcast variable 146 took 9 ms
16/03/30 15:38:50 INFO storage.MemoryStore: Block broadcast_146 stored as values in memory (estimated size 14.1 KB, free 76.3 KB)
16/03/30 15:38:51 INFO spark.CacheManager: Partition rdd_154_0 not found, computing it
16/03/30 15:38:51 INFO spark.CacheManager: Partition rdd_148_0 not found, computing it
16/03/30 15:38:51 INFO caffe.LmdbRDD: Processing partition 0
16/03/30 15:38:52 INFO caffe.LmdbRDD: Completed partition 0
16/03/30 15:38:52 INFO storage.BlockManager: Found block rdd_148_0 locally
I0330 15:38:52.011971 954 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
I0330 15:38:52.013013 955 data_transformer.cpp:25] Loading mean file from: mean.binaryproto
16/03/30 15:38:52 INFO caffe.LMDB: Completed all files
16/03/30 15:38:53 INFO codegen.GenerateUnsafeProjection: Code generated in 115.18413 ms
16/03/30 15:38:53 INFO storage.BlockManager: Found block rdd_154_0 locally
16/03/30 15:38:53 INFO codegen.GeneratePredicate: Code generated in 3.347353 ms
16/03/30 15:38:53 INFO columnar.GenerateColumnAccessor: Code generated in 16.381273 ms
16/03/30 15:38:53 INFO codegen.GenerateMutableProjection: Code generated in 8.126173 ms
16/03/30 15:38:53 INFO codegen.GenerateUnsafeProjection: Code generated in 6.682903 ms
16/03/30 15:38:53 INFO codegen.GenerateMutableProjection: Code generated in 6.995862 ms
16/03/30 15:38:53 INFO codegen.GenerateUnsafeRowJoiner: Code generated in 5.680261 ms
16/03/30 15:38:53 INFO codegen.GenerateUnsafeProjection: Code generated in 5.733341 ms
16/03/30 15:38:53 INFO executor.Executor: Finished task 0.0 in stage 145.0 (TID 145). 2548 bytes result sent to driver
16/03/30 15:38:53 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 146
16/03/30 15:38:53 INFO executor.Executor: Running task 0.0 in stage 146.0 (TID 146)
16/03/30 15:38:53 INFO spark.MapOutputTrackerWorker: Updating epoch to 1 and clearing cache
16/03/30 15:38:53 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 147
16/03/30 15:38:53 INFO storage.MemoryStore: Block broadcast_147_piece0 stored as bytes in memory (estimated size 4.6 KB, free 80.9 KB)
16/03/30 15:38:53 INFO broadcast.TorrentBroadcast: Reading broadcast variable 147 took 7 ms
16/03/30 15:38:53 INFO storage.MemoryStore: Block broadcast_147 stored as values in memory (estimated size 9.3 KB, free 90.2 KB)
16/03/30 15:38:53 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 0, fetching them
16/03/30 15:38:53 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://[email protected]:38988)
16/03/30 15:38:53 INFO spark.MapOutputTrackerWorker: Got the output locations
16/03/30 15:38:53 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
16/03/30 15:38:53 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 4 ms
16/03/30 15:38:53 INFO codegen.GenerateMutableProjection: Code generated in 7.146091 ms
16/03/30 15:38:53 INFO codegen.GenerateMutableProjection: Code generated in 5.766725 ms
16/03/30 15:38:53 INFO executor.Executor: Finished task 0.0 in stage 146.0 (TID 146). 1663 bytes result sent to driver
16/03/30 15:38:53 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 147
16/03/30 15:38:53 INFO executor.Executor: Running task 0.0 in stage 147.0 (TID 147)
16/03/30 15:38:53 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 148
16/03/30 15:38:53 INFO storage.MemoryStore: Block broadcast_148_piece0 stored as bytes in memory (estimated size 1259.0 B, free 91.4 KB)
16/03/30 15:38:53 INFO broadcast.TorrentBroadcast: Reading broadcast variable 148 took 8 ms
16/03/30 15:38:53 INFO storage.MemoryStore: Block broadcast_148 stored as values in memory (estimated size 2016.0 B, free 93.4 KB)
16/03/30 15:38:53 INFO executor.Executor: Finished task 0.0 in stage 147.0 (TID 147). 899 bytes result sent to driver
16/03/30 15:38:54 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown
16/03/30 15:38:54 INFO storage.MemoryStore: MemoryStore cleared
16/03/30 15:38:54 INFO storage.BlockManager: BlockManager stopped
16/03/30 15:38:54 WARN executor.CoarseGrainedExecutorBackend: An unknown (yuntu2:38988) driver disconnected.
16/03/30 15:38:54 ERROR executor.CoarseGrainedExecutorBackend: Driver 10.110.52.32:38988 disassociated! Shutting down.
16/03/30 15:38:54 INFO util.ShutdownHookManager: Shutdown hook called
16/03/30 15:38:54 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/03/30 15:38:54 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/03/30 15:38:54 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.

Failure in make build complaining about an undefined Python symbol

[exec] /usr/bin/ld: .build_release/src/main/cpp/tools/caffe_mini_cluster.o: undefined reference to symbol '_ZN5boost6python17error_already_setD1Ev'
[exec] //usr/lib/x86_64-linux-gnu/libboost_python-py27.so.1.54.0: error adding symbols: DSO missing from command line
[exec] collect2: error: ld returned 1 exit status
[exec] make[1]: *** [.build_release/src/main/cpp/tools/caffe_mini_cluster.bin] Error 1
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] caffe ............................................. SUCCESS [0.003s]
[INFO] caffe-distri ...................................... FAILURE [47.767s]
[INFO] caffe-grid

Installation Error while building CaffeOnSpark

I have run Caffe and Spark successfully. However, when I build CaffeOnSpark, it shows many warnings and errors. I can't understand what they mean and have no idea what to do. Could someone help me?

cd caffe-public; make proto; make -j4 -e distribute; cd ..
make[1]: Nothing to be done for `proto'.
NVCC src/caffe/layers/absval_layer.cu
NVCC src/caffe/layers/base_data_layer.cu
NVCC src/caffe/layers/batch_norm_layer.cu
NVCC src/caffe/layers/batch_reindex_layer.cu
nvcc fatal : The version ('70300') of the host compiler ('Apple clang') is not supported
nvcc fatal : The version ('70300') of the host compiler ('Apple clang') is not supported
nvcc fatal : The version ('70300') of the host compiler ('Apple clang') is not supported
make[1]: *** [.build_release/cuda/src/caffe/layers/batch_reindex_layer.o] Error 1
make[1]: *** Waiting for unfinished jobs....
make[1]: *** [.build_release/cuda/src/caffe/layers/base_data_layer.o] Error 1
make[1]: *** [.build_release/cuda/src/caffe/layers/absval_layer.o] Error 1
nvcc fatal : The version ('70300') of the host compiler ('Apple clang') is not supported
make[1]: *** [.build_release/cuda/src/caffe/layers/batch_norm_layer.o] Error 1
export LD_LIBRARY_PATH="/home/y/lib64:/home/y/lib64/mkl/intel64:/Users/Red_Hair/Downloads/CaffeOnSpark/caffe-public/distribute/lib:/Users/Red_Hair/Downloads/CaffeOnSpark/caffe-distri/distribute/lib:/usr/lib64:/lib64 "; mvn -B package
[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model for com.yahoo.ml:caffe-grid:jar:0.1-SNAPSHOT
[WARNING] The expression ${version} is deprecated. Please use ${project.version} instead.
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] caffe
[INFO] caffe-distri
[INFO] caffe-grid
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building caffe 0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building caffe-distri 0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-antrun-plugin:1.7:run (proto) @ caffe-distri ---
[INFO] Executing tasks

protoc:
[exec] make[1]: *** No rule to make target `../caffe-public/distribute/proto/caffe.proto', needed by `src/main/java/caffe/Caffe.java'. Stop.
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] caffe .............................................. SUCCESS [ 0.005 s]
[INFO] caffe-distri ....................................... FAILURE [ 3.871 s]
[INFO] caffe-grid ......................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 5.188 s
[INFO] Finished at: 2016-06-08T15:03:14+08:00
[INFO] Final Memory: 8M/123M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.7:run (proto) on project caffe-distri: An Ant BuildException has occured: exec returned: 2
[ERROR] around Ant part ...... @ 5:104 in /Users/Red_Hair/Downloads/CaffeOnSpark/caffe-distri/target/antrun/build-protoc.xml
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn -rf :caffe-distri
make: *** [build] Error 1

I can't package the jar after compiling Caffe...

I can't package the jar after compiling Caffe.
I see these messages:
export LD_LIBRARY_PATH="/usr/local/caffe-parallel-master/build/lib::/home/xiangqiao.lxq/CaffeOnSpark/caffe-public/distribute/lib:/home/xiangqiao.lxq/CaffeOnSpark/caffe-distri/distribute/lib:/usr/lib64:/lib64 "; mvn package
[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model for com.yahoo.ml:caffe-grid:jar:0.1-SNAPSHOT
[WARNING] The expression ${version} is deprecated. Please use ${project.version} instead.
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]

[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] /home/xiangqiao.lxq/CaffeOnSpark/caffe-distri/src/main/java/caffe/Caffe.java:[78,23] caffe.Caffe.BlobShape is not abstract and does not override abstract method newBuilderForType(com.google.protobuf.GeneratedMessage.BuilderParent) in com.google.protobuf.GeneratedMessage
[ERROR] /home/xiangqiao.lxq/CaffeOnSpark/caffe-distri/src/main/java/caffe/Caffe.java:[236,25] caffe.Caffe.BlobShape.Builder is not abstract and does not override abstract method internalGetFieldAccessorTable() in com.google.protobuf.GeneratedMessage.Builder
[ERROR] /home/xiangqiao.lxq/CaffeOnSpark/caffe-distri/src/main/java/caffe/Caffe.java:[414,23] caffe.Caffe.BlobProto is not abstract and does not override abstract method newBuilderForType(com.google.protobuf.GeneratedMessage.BuilderParent) in com.google.protobuf.GeneratedMessage

I don't know how this happened. Can anyone help me?

java.lang.NullPointerException when running in standalone mode

I am getting a NullPointerException when submitting CaffeOnSpark on MNIST data.

The command to submit is:

spark-submit --master ${MASTER_URL} --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt --conf spark.cores.max=${TOTAL_CORES} --conf spark.task.cpus=${CORES_PER_WORKER} --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" --class com.yahoo.ml.caffe.CaffeOnSpark ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar -train -features accuracy,loss -label label -conf lenet_memory_solver.prototxt -clusterSize ${SPARK_WORKER_INSTANCES} -devices 1 -connection ethernet -model file:${CAFFE_ON_SPARK}/mnist_lenet.model -output file:${CAFFE_ON_SPARK}/lenet_features_result.

The generated log is:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/05/30 16:47:42 INFO SparkContext: Running Spark version 1.6.1
16/05/30 16:47:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/05/30 16:47:43 WARN SparkConf:
SPARK_WORKER_INSTANCES was detected (set to '1').
This is deprecated in Spark 1.0+.

Please instead use:

  • ./spark-submit with --num-executors to specify the number of executors
  • Or set SPARK_EXECUTOR_INSTANCES
  • spark.executor.instances to configure the number of instances in the spark config.

16/05/30 16:47:43 WARN Utils: Your hostname, ubuntu-H81M-S resolves to a loopback address: 127.0.0.1; using 192.168.1.29 instead (on interface eth0)
16/05/30 16:47:43 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/05/30 16:47:43 INFO SecurityManager: Changing view acls to: caffe
16/05/30 16:47:43 INFO SecurityManager: Changing modify acls to: caffe
16/05/30 16:47:43 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(caffe); users with modify permissions: Set(caffe)
16/05/30 16:47:43 INFO Utils: Successfully started service 'sparkDriver' on port 44682.
16/05/30 16:47:43 INFO Slf4jLogger: Slf4jLogger started
16/05/30 16:47:43 INFO Remoting: Starting remoting
16/05/30 16:47:43 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:47591]
16/05/30 16:47:43 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 47591.
16/05/30 16:47:43 INFO SparkEnv: Registering MapOutputTracker
16/05/30 16:47:43 INFO SparkEnv: Registering BlockManagerMaster
16/05/30 16:47:43 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-c94ea90e-4190-4605-b380-f147255c9ac3
16/05/30 16:47:43 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
16/05/30 16:47:43 INFO SparkEnv: Registering OutputCommitCoordinator
16/05/30 16:47:44 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/05/30 16:47:44 INFO SparkUI: Started SparkUI at http://192.168.1.29:4040
16/05/30 16:47:44 INFO HttpFileServer: HTTP File server directory is /tmp/spark-888c7abf-7cbd-4936-8d25-d3b8f0b875a0/httpd-93aa28a2-68f1-4414-bbc5-816e350a7466
16/05/30 16:47:44 INFO HttpServer: Starting HTTP Server
16/05/30 16:47:44 INFO Utils: Successfully started service 'HTTP file server' on port 46265.
16/05/30 16:47:44 INFO SparkContext: Added JAR file:/home/caffe/Caffe/CaffeOnSpark/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar at http://192.168.1.29:46265/jars/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar with timestamp 1464607064217
16/05/30 16:47:44 INFO Utils: Copying /home/caffe/Caffe/CaffeOnSpark/data/lenet_memory_solver.prototxt to /tmp/spark-888c7abf-7cbd-4936-8d25-d3b8f0b875a0/userFiles-d2e43ca6-f05c-48d2-8dbb-d71223681689/lenet_memory_solver.prototxt
16/05/30 16:47:44 INFO SparkContext: Added file file:/home/caffe/Caffe/CaffeOnSpark/data/lenet_memory_solver.prototxt at http://192.168.1.29:46265/files/lenet_memory_solver.prototxt with timestamp 1464607064313
16/05/30 16:47:44 INFO Utils: Copying /home/caffe/Caffe/CaffeOnSpark/data/lenet_memory_train_test.prototxt to /tmp/spark-888c7abf-7cbd-4936-8d25-d3b8f0b875a0/userFiles-d2e43ca6-f05c-48d2-8dbb-d71223681689/lenet_memory_train_test.prototxt
16/05/30 16:47:44 INFO SparkContext: Added file file:/home/caffe/Caffe/CaffeOnSpark/data/lenet_memory_train_test.prototxt at http://192.168.1.29:46265/files/lenet_memory_train_test.prototxt with timestamp 1464607064319
16/05/30 16:47:44 INFO AppClient$ClientEndpoint: Connecting to master spark://ubuntu-H81M-S:7077...
16/05/30 16:47:44 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20160530164744-0010
16/05/30 16:47:44 INFO AppClient$ClientEndpoint: Executor added: app-20160530164744-0010/0 on worker-20160530160108-192.168.1.29-36155 (192.168.1.29:36155) with 1 cores
16/05/30 16:47:44 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160530164744-0010/0 on hostPort 192.168.1.29:36155 with 1 cores, 1024.0 MB RAM
16/05/30 16:47:44 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 36036.
16/05/30 16:47:44 INFO NettyBlockTransferService: Server created on 36036
16/05/30 16:47:44 INFO BlockManagerMaster: Trying to register BlockManager
16/05/30 16:47:44 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.29:36036 with 511.1 MB RAM, BlockManagerId(driver, 192.168.1.29, 36036)
16/05/30 16:47:44 INFO BlockManagerMaster: Registered BlockManager
16/05/30 16:47:44 INFO AppClient$ClientEndpoint: Executor updated: app-20160530164744-0010/0 is now RUNNING
16/05/30 16:47:46 INFO SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ubuntu-H81M-S:42453) with ID 0
16/05/30 16:47:46 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 1.0
16/05/30 16:47:46 INFO BlockManagerMasterEndpoint: Registering block manager ubuntu-H81M-S:46625 with 511.1 MB RAM, BlockManagerId(0, ubuntu-H81M-S, 46625)
16/05/30 16:47:48 INFO DataSource$: Source data layer:0
16/05/30 16:47:48 INFO LMDB: Batch size:64
16/05/30 16:47:48 INFO SparkContext: Starting job: collect at CaffeOnSpark.scala:127
16/05/30 16:47:48 INFO DAGScheduler: Got job 0 (collect at CaffeOnSpark.scala:127) with 1 output partitions
16/05/30 16:47:48 INFO DAGScheduler: Final stage: ResultStage 0 (collect at CaffeOnSpark.scala:127)
16/05/30 16:47:48 INFO DAGScheduler: Parents of final stage: List()
16/05/30 16:47:48 INFO DAGScheduler: Missing parents: List()
16/05/30 16:47:48 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at map at CaffeOnSpark.scala:116), which has no missing parents
16/05/30 16:47:48 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.3 KB, free 3.3 KB)
16/05/30 16:47:48 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.1 KB, free 5.4 KB)
16/05/30 16:47:48 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.29:36036 (size: 2.1 KB, free: 511.1 MB)
16/05/30 16:47:48 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
16/05/30 16:47:48 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at map at CaffeOnSpark.scala:116)
16/05/30 16:47:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/05/30 16:47:48 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, ubuntu-H81M-S, partition 0,PROCESS_LOCAL, 2200 bytes)
16/05/30 16:47:48 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ubuntu-H81M-S:46625 (size: 2.1 KB, free: 511.1 MB)
16/05/30 16:47:49 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 795 ms on ubuntu-H81M-S (1/1)
16/05/30 16:47:49 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/05/30 16:47:49 INFO DAGScheduler: ResultStage 0 (collect at CaffeOnSpark.scala:127) finished in 0.800 s
16/05/30 16:47:49 INFO DAGScheduler: Job 0 finished: collect at CaffeOnSpark.scala:127, took 1.052826 s
16/05/30 16:47:49 INFO CaffeOnSpark: rank = 0, address = null, hostname = ubuntu-H81M-S
16/05/30 16:47:49 INFO CaffeOnSpark: rank 0:ubuntu-H81M-S
16/05/30 16:47:49 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 112.0 B, free 5.5 KB)
16/05/30 16:47:49 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 53.0 B, free 5.5 KB)
16/05/30 16:47:49 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.1.29:36036 (size: 53.0 B, free: 511.1 MB)
16/05/30 16:47:49 INFO SparkContext: Created broadcast 1 from broadcast at CaffeOnSpark.scala:146
16/05/30 16:47:49 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 192.168.1.29:36036 in memory (size: 2.1 KB, free: 511.1 MB)
16/05/30 16:47:49 INFO SparkContext: Starting job: collect at CaffeOnSpark.scala:155
16/05/30 16:47:49 INFO BlockManagerInfo: Removed broadcast_0_piece0 on ubuntu-H81M-S:46625 in memory (size: 2.1 KB, free: 511.1 MB)
16/05/30 16:47:49 INFO DAGScheduler: Got job 1 (collect at CaffeOnSpark.scala:155) with 1 output partitions
16/05/30 16:47:49 INFO DAGScheduler: Final stage: ResultStage 1 (collect at CaffeOnSpark.scala:155)
16/05/30 16:47:49 INFO DAGScheduler: Parents of final stage: List()
16/05/30 16:47:49 INFO DAGScheduler: Missing parents: List()
16/05/30 16:47:49 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at map at CaffeOnSpark.scala:149), which has no missing parents
16/05/30 16:47:49 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.6 KB, free 2.8 KB)
16/05/30 16:47:49 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1597.0 B, free 4.3 KB)
16/05/30 16:47:49 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.1.29:36036 (size: 1597.0 B, free: 511.1 MB)
16/05/30 16:47:49 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
16/05/30 16:47:49 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at map at CaffeOnSpark.scala:149)
16/05/30 16:47:49 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
16/05/30 16:47:49 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, ubuntu-H81M-S, partition 0,PROCESS_LOCAL, 2200 bytes)
16/05/30 16:47:49 INFO ContextCleaner: Cleaned accumulator 1
16/05/30 16:47:49 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ubuntu-H81M-S:46625 (size: 1597.0 B, free: 511.1 MB)
16/05/30 16:47:49 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on ubuntu-H81M-S:46625 (size: 53.0 B, free: 511.1 MB)
16/05/30 16:47:49 INFO DAGScheduler: ResultStage 1 (collect at CaffeOnSpark.scala:155) finished in 0.084 s
16/05/30 16:47:49 INFO DAGScheduler: Job 1 finished: collect at CaffeOnSpark.scala:155, took 0.103977 s
16/05/30 16:47:49 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 84 ms on ubuntu-H81M-S (1/1)
16/05/30 16:47:49 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
Exception in thread "main" java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at com.yahoo.ml.caffe.LmdbRDD.localLMDBFile(LmdbRDD.scala:185)
at com.yahoo.ml.caffe.LmdbRDD.com$yahoo$ml$caffe$LmdbRDD$$openDB(LmdbRDD.scala:202)
at com.yahoo.ml.caffe.LmdbRDD.getPartitions(LmdbRDD.scala:46)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at com.yahoo.ml.caffe.CaffeOnSpark.train(CaffeOnSpark.scala:158)
at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:40)
at com.yahoo.ml.caffe.CaffeOnSpark.main(CaffeOnSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/05/30 16:47:49 INFO SparkContext: Invoking stop() from shutdown hook
16/05/30 16:47:49 INFO SparkUI: Stopped Spark web UI at http://192.168.1.29:4040
16/05/30 16:47:49 INFO SparkDeploySchedulerBackend: Shutting down all executors
16/05/30 16:47:49 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
16/05/30 16:47:49 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/05/30 16:47:49 INFO MemoryStore: MemoryStore cleared
16/05/30 16:47:49 INFO BlockManager: BlockManager stopped
16/05/30 16:47:49 INFO BlockManagerMaster: BlockManagerMaster stopped
16/05/30 16:47:49 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/05/30 16:47:49 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/05/30 16:47:49 INFO SparkContext: Successfully stopped SparkContext
16/05/30 16:47:49 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/05/30 16:47:49 INFO ShutdownHookManager: Shutdown hook called
16/05/30 16:47:49 INFO ShutdownHookManager: Deleting directory /tmp/spark-888c7abf-7cbd-4936-8d25-d3b8f0b875a0/httpd-93aa28a2-68f1-4414-bbc5-816e350a7466
16/05/30 16:47:49 INFO ShutdownHookManager: Deleting directory /tmp/spark-888c7abf-7cbd-4936-8d25-d3b8f0b875a0

The lenet_memory_train_test.prototxt is:

name: "LeNet"
layer {
name: "data"
type: "MemoryData"
top: "data"
top: "label"
include {
phase: TRAIN
}
source_class: "com.yahoo.ml.caffe.LMDB"
memory_data_param {
source: "file:/home/caffe/Caffe/CaffeOnSpark/mnist_train_lmdb"
batch_size: 64
channels: 1
height: 28
width: 28
share_in_parallel: false
}
transform_param {
scale: 0.00390625
}
}
layer {
name: "data"
type: "MemoryData"
top: "data"
top: "label"
include {
phase: TEST
}
source_class: "com.yahoo.ml.caffe.LMDB"
memory_data_param {
source: "file:/home/caffe/Caffe/CaffeOnSpark/mnist_test_lmdb/"
batch_size: 100
channels: 1
height: 28
width: 28
share_in_parallel: false
}
transform_param {
scale: 0.00390625
}
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 20
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool1"
type: "Pooling"
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "conv2"
type: "Convolution"
bottom: "pool1"
top: "conv2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 50
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool2"
type: "Pooling"
bottom: "conv2"
top: "pool2"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "ip1"
type: "InnerProduct"
bottom: "pool2"
top: "ip1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 500
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu1"
type: "ReLU"
bottom: "ip1"
top: "ip1"
}
layer {
name: "ip2"
type: "InnerProduct"
bottom: "ip1"
top: "ip2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 10
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "accuracy"
type: "Accuracy"
bottom: "ip2"
bottom: "label"
top: "accuracy"
include {
phase: TEST
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip2"
bottom: "label"
top: "loss"
}

Makefile.config is attached:
Makefile.config.txt

Please help.

caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar does not exist

Hi,

I have followed the steps mentioned in the README file. So, CaffeOnSpark, Caffe, Spark and Hadoop have been installed.

However, the directory $(CAFFE_ON_SPARK)/caffe-grid/target/ does not contain the JAR file, caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar

Has the file been removed from your GitHub repo, or is there a workaround for it?

Here is the error:

rescomp-14-310339:caffe-public vishakhhegde$ spark-submit --master ${MASTER_URL} \

--files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
--conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
--class com.yahoo.ml.caffe.CaffeOnSpark  \
${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train \
    -features accuracy,loss -label label \
    -conf lenet_memory_solver.prototxt \
-clusterSize ${SPARK_WORKER_INSTANCES} \
    -devices 1 \
-connection ethernet \
    -model file:${CAFFE_ON_SPARK}/mnist_lenet.model \
    -output file:${CAFFE_ON_SPARK}/lenet_features_result

Warning: Local jar /Users/vishakhhegde/CaffeOnSpark/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar does not exist, skipping.
java.lang.ClassNotFoundException: com.yahoo.ml.caffe.CaffeOnSpark
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:689)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Thanks a lot!

What's wrong?

After finishing building and configuring the latest code of CaffeOnSpark, I ran the command below:
export SPARK_WORKER_INSTANCES=2
export DEVICES=1
spark-submit --master yarn --deploy-mode cluster --num-executors ${SPARK_WORKER_INSTANCES}
--files ./data/cifar10_quick_solver.prototxt,./data/cifar10_quick_train_test.prototxt,./data/mean.binaryproto
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}"
--class com.yahoo.ml.caffe.CaffeOnSpark ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar
-train -features accuracy,loss -label label -conf cifar10_quick_solver.prototxt -devices ${DEVICES}
-connection ethernet -model result/cifar10.model.h5 -output result/cifar10_features_result

The following errors happened:
16/03/17 18:04:29 INFO Client: Source and destination file systems are the same. Not copying file:/home/atlas/work/caffe_spark/3rdparty/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar
16/03/17 18:04:29 INFO Client: Source and destination file systems are the same. Not copying file:/home/atlas/work/caffe_spark/CaffeOnSpark-master/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar
16/03/17 18:04:29 INFO Client: Source and destination file systems are the same. Not copying file:/home/atlas/work/caffe_spark/CaffeOnSpark-master/data/cifar10_quick_solver.prototxt
16/03/17 18:04:29 INFO Client: Source and destination file systems are the same. Not copying file:/home/atlas/work/caffe_spark/CaffeOnSpark-master/data/cifar10_quick_train_test.prototxt
16/03/17 18:04:29 INFO Client: Source and destination file systems are the same. Not copying file:/home/atlas/work/caffe_spark/CaffeOnSpark-master/data/mean.binaryproto
16/03/17 18:04:29 INFO Client: Source and destination file systems are the same. Not copying file:/tmp/spark-b7150b1f-b8ed-40a8-bef6-b7d10a114e0f/__spark_conf__4312666864148429628.zip

Even after I re-formatted HDFS, re-generated the LMDB files, and restarted the cluster, the problem remains the same.

MemoryData & JAR error

We launched a cluster with your image and ran the "lenet_memory" example with success.

(1) Then we executed the same example, but with another data type (type: "Data" vs. "MemoryData") as shown in Caffe's directory, and an error occurred:

Example:
==> Original data type:

}
data_param {
source: "mnist_train_lmdb/"
batch_size: 64
backend: LMDB
}

==> New data type:

source_class: "com.yahoo.ml.caffe.LMDB"
memory_data_param {
source: "mnist_train_lmdb/"
batch_size: 64
channels: 1
height: 28
width: 28
share_in_parallel: false
}

Execution:

root@ip-172-31-14-118:~/CaffeOnSpark/data# spark-submit --master spark://$(hostname):7077
--files lenet_train_test.prototxt,lenet_solver.prototxt
--conf spark.cores.max=${TOTAL_CORES}
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}"
--conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}"
--class com.yahoo.ml.caffe.CaffeOnSpark
${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar
-train
-features accuracy,loss -label label
-conf lenet_solver.prototxt
-clusterSize ${SPARK_WORKER_INSTANCES}
-devices ${DEVICES}
-connection ethernet
-model /mnist.model
-output /mnist_features_result
....
16/03/29 16:36:25 INFO DataSource$: Source data layer:0
16/03/29 16:36:25 ERROR DataSource$: source_class must be defined for input data layer:Data
Exception in thread "main" java.lang.NullPointerException
...

Does CaffeOnSpark support only the MemoryData type?

(2) We have tested another example from Caffe: "mnist_autoencoder". After changing the data type to MemoryData in the prototxt file, we got an error:

root@ip-172-31-14-118:~/CaffeOnSpark/data# spark-submit --master spark://$(hostname):7077
--files mnist_memory_autoencoder.prototxt, mnist_memory_autoencoder_solver.prototxt
--conf spark.cores.max=${TOTAL_CORES}
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}"
--conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}"
--class com.yahoo.ml.caffe.CaffeOnSpark
${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar
-train -persistent
-features accuracy,loss -label label
-conf mnist_memory_autoencoder_solver.prototxt
-clusterSize ${SPARK_WORKER_INSTANCES}
-devices ${DEVICES}
-connection ethernet
-model /mnist_memory.model
-output /mnist_memory_autoencoder
Error: Cannot load main class from JAR file:/root/CaffeOnSpark/data/mnist_memory_autoencoder_solver.prototxt

These files exist in ~/CaffeOnSpark/data:

root@ip-172-31-14-118:~/CaffeOnSpark/data# ls mni* -l
-rwxr-xr-x 1 root root 5102 Mar 29 16:14 mnist_memory_autoencoder.prototxt
-rwxr-xr-x 1 root root 417 Mar 29 16:06 mnist_memory_autoencoder_solver.prototxt

What are we missing?

Is this a Python library packaging issue?

I'm following the steps in the stand-alone cluster guide, and I ran into this problem at step 5 while building:

cp .build_release/lib/libcaffe.a distribute/lib
install -m 644 .build_release/lib/libcaffe.so.1.0.0-rc3 distribute/lib
cd distribute/lib; rm -f libcaffe.so;   ln -s libcaffe.so.1.0.0-rc3 libcaffe.so
# add python - it's not the standard way, indeed...
cp -r python distribute/python
make[1]: Leaving directory `/home/name/Projects/UserModelEMR/CaffeOnSpark/caffe-public'
export LD_LIBRARY_PATH="/home/name/Projects/UserModelEMR/CaffeOnSpark/caffe-public/distribute/lib:/home/name/Projects/UserModelEMR/CaffeOnSpark/caffe-distri/distribute/lib:/usr/local/cuda-7.0/lib64:/usr/local/mkl/lib/intel64/:/home/name/Projects/UserModelEMR/CaffeOnSpark/caffe-public/distribute/lib:/home/name/Projects/UserModelEMR/CaffeOnSpark/caffe-distri/distribute/lib:/usr/lib64:/lib64 "; mvn package
/bin/sh: 1: mvn: not found
make: *** [build] Error 127

The error seems to indicate a missing library/dependency, but the log does not really say which. I am building the CPU-only version (although I do have CUDA installed). Is it missing the CUDA dependency or Python?

A Py4JJavaError happened when following the Python instructions

Hi, I am following the Python instructions from:
https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_python
and trying to use the Python APIs to train models. But when I use the following example command:
pushd ${CAFFE_ON_SPARK}/data/
unzip ${CAFFE_ON_SPARK}/caffe-grid/target/caffeonsparkpythonapi.zip
IPYTHON=1 pyspark --master yarn
--num-executors 1
--driver-library-path "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar"
--driver-class-path "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar"
--conf spark.cores.max=1
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}"
--conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}"
--py-files ${CAFFE_ON_SPARK}/caffe-grid/target/caffeonsparkpythonapi.zip
--files ${CAFFE_ON_SPARK}/data/caffe/_caffe.so
--jars "${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar"
Then I run the examples below, and an error appears on the last line:
from pyspark import SparkConf,SparkContext
from com.yahoo.ml.caffe.RegisterContext import registerContext,registerSQLContext
from com.yahoo.ml.caffe.CaffeOnSpark import CaffeOnSpark
from com.yahoo.ml.caffe.Config import Config
from com.yahoo.ml.caffe.DataSource import DataSource
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
registerContext(sc)
registerSQLContext(sqlContext)
cos=CaffeOnSpark(sc,sqlContext)
cfg=Config(sc)
cfg.protoFile='/Users/afeng/dev/ml/CaffeOnSpark/data/lenet_memory_solver.prototxt'
cfg.modelPath = 'file:/tmp/lenet.model'
cfg.devices = 1
cfg.isFeature=True
cfg.label='label'
cfg.features=['ip1']
cfg.outputFormat = 'json'
cfg.clusterSize = 1
cfg.lmdb_partitions=cfg.clusterSize

Train

dl_train_source = DataSource(sc).getSource(cfg,True)
cos.train(dl_train_source) <------------------ error happens after calling this.

The error message is:
In [41]: cos.train(dl_train_source)
16/04/27 10:44:34 INFO spark.SparkContext: Starting job: collect at CaffeOnSpark.scala:127
16/04/27 10:44:34 INFO scheduler.DAGScheduler: Got job 4 (collect at CaffeOnSpark.scala:127) with 1 output partitions
16/04/27 10:44:34 INFO scheduler.DAGScheduler: Final stage: ResultStage 4 (collect at CaffeOnSpark.scala:127)
16/04/27 10:44:34 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/04/27 10:44:34 INFO scheduler.DAGScheduler: Missing parents: List()
16/04/27 10:44:34 INFO scheduler.DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[14] at map at CaffeOnSpark.scala:116), which has no missing parents
16/04/27 10:44:34 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 3.2 KB, free 23.9 KB)
16/04/27 10:44:34 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 2.1 KB, free 25.9 KB)
16/04/27 10:44:34 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on 10.110.53.146:59213 (size: 2.1 KB, free: 511.5 MB)
16/04/27 10:44:34 INFO spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1006
16/04/27 10:44:34 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[14] at map at CaffeOnSpark.scala:116)
16/04/27 10:44:34 INFO cluster.YarnScheduler: Adding task set 4.0 with 1 tasks
16/04/27 10:44:34 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 10, sweet, partition 0,PROCESS_LOCAL, 2169 bytes)
16/04/27 10:44:34 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on sweet:46000 (size: 2.1 KB, free: 511.5 MB)
16/04/27 10:44:34 INFO scheduler.DAGScheduler: ResultStage 4 (collect at CaffeOnSpark.scala:127) finished in 0.084 s
16/04/27 10:44:34 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 10) in 84 ms on sweet (1/1)
16/04/27 10:44:34 INFO cluster.YarnScheduler: Removed TaskSet 4.0, whose tasks have all completed, from pool
16/04/27 10:44:34 INFO scheduler.DAGScheduler: Job 4 finished: collect at CaffeOnSpark.scala:127, took 0.092871 s
16/04/27 10:44:34 INFO caffe.CaffeOnSpark: rank = 0, address = null, hostname = sweet
16/04/27 10:44:34 INFO caffe.CaffeOnSpark: rank 0:sweet
16/04/27 10:44:34 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 112.0 B, free 26.0 KB)
16/04/27 10:44:34 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 221.0 B, free 26.3 KB)
16/04/27 10:44:34 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on 10.110.53.146:59213 (size: 221.0 B, free: 511.5 MB)
16/04/27 10:44:34 INFO spark.SparkContext: Created broadcast 6 from broadcast at CaffeOnSpark.scala:146
16/04/27 10:44:34 INFO spark.SparkContext: Starting job: collect at CaffeOnSpark.scala:155
16/04/27 10:44:34 INFO scheduler.DAGScheduler: Got job 5 (collect at CaffeOnSpark.scala:155) with 1 output partitions
16/04/27 10:44:34 INFO scheduler.DAGScheduler: Final stage: ResultStage 5 (collect at CaffeOnSpark.scala:155)
16/04/27 10:44:34 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/04/27 10:44:34 INFO scheduler.DAGScheduler: Missing parents: List()
16/04/27 10:44:34 INFO scheduler.DAGScheduler: Submitting ResultStage 5 (MapPartitionsRDD[16] at map at CaffeOnSpark.scala:149), which has no missing parents
16/04/27 10:44:34 INFO storage.MemoryStore: Block broadcast_7 stored as values in memory (estimated size 2.6 KB, free 28.9 KB)
16/04/27 10:44:34 INFO storage.MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 1597.0 B, free 30.4 KB)
16/04/27 10:44:34 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on 10.110.53.146:59213 (size: 1597.0 B, free: 511.5 MB)
16/04/27 10:44:34 INFO spark.SparkContext: Created broadcast 7 from broadcast at DAGScheduler.scala:1006
16/04/27 10:44:34 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 5 (MapPartitionsRDD[16] at map at CaffeOnSpark.scala:149)
16/04/27 10:44:34 INFO cluster.YarnScheduler: Adding task set 5.0 with 1 tasks
16/04/27 10:44:34 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 5.0 (TID 11, sweet, partition 0,PROCESS_LOCAL, 2169 bytes)
16/04/27 10:44:34 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on sweet:46000 (size: 1597.0 B, free: 511.5 MB)
16/04/27 10:44:34 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on sweet:46000 (size: 221.0 B, free: 511.5 MB)
16/04/27 10:44:34 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 5.0 (TID 11) in 48 ms on sweet (1/1)
16/04/27 10:44:34 INFO scheduler.DAGScheduler: ResultStage 5 (collect at CaffeOnSpark.scala:155) finished in 0.049 s
16/04/27 10:44:34 INFO cluster.YarnScheduler: Removed TaskSet 5.0, whose tasks have all completed, from pool
16/04/27 10:44:34 INFO scheduler.DAGScheduler: Job 5 finished: collect at CaffeOnSpark.scala:155, took 0.058122 s
16/04/27 10:44:34 INFO caffe.LmdbRDD: local LMDB path:/home/atlas/work/caffe_spark/CaffeOnSpark-master/data/mnist_train_lmdb
16/04/27 10:44:34 INFO caffe.LmdbRDD: 1 LMDB RDD partitions
16/04/27 10:44:34 INFO spark.SparkContext: Starting job: reduce at CaffeOnSpark.scala:205
16/04/27 10:44:34 INFO scheduler.DAGScheduler: Got job 6 (reduce at CaffeOnSpark.scala:205) with 1 output partitions
16/04/27 10:44:34 INFO scheduler.DAGScheduler: Final stage: ResultStage 6 (reduce at CaffeOnSpark.scala:205)
16/04/27 10:44:34 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/04/27 10:44:34 INFO scheduler.DAGScheduler: Missing parents: List()
16/04/27 10:44:34 INFO scheduler.DAGScheduler: Submitting ResultStage 6 (MapPartitionsRDD[17] at mapPartitions at CaffeOnSpark.scala:190), which has no missing parents
16/04/27 10:44:34 INFO storage.MemoryStore: Block broadcast_8 stored as values in memory (estimated size 3.4 KB, free 33.8 KB)
16/04/27 10:44:34 INFO storage.MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 2.2 KB, free 35.9 KB)
16/04/27 10:44:34 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on 10.110.53.146:59213 (size: 2.2 KB, free: 511.5 MB)
16/04/27 10:44:34 INFO spark.SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:1006
16/04/27 10:44:34 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 6 (MapPartitionsRDD[17] at mapPartitions at CaffeOnSpark.scala:190)
16/04/27 10:44:34 INFO cluster.YarnScheduler: Adding task set 6.0 with 1 tasks
16/04/27 10:44:34 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 6.0 (TID 12, sweet, partition 0,PROCESS_LOCAL, 1992 bytes)
16/04/27 10:44:34 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on sweet:46000 (size: 2.2 KB, free: 511.5 MB)
16/04/27 10:44:34 INFO storage.BlockManagerInfo: Added rdd_12_0 on disk on sweet:46000 (size: 26.0 B)
16/04/27 10:44:34 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 6.0 (TID 12, sweet): java.lang.UnsupportedOperationException: empty.reduceLeft
at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:167)
at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:195)
at scala.collection.AbstractIterator.reduce(Iterator.scala:1157)
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$7.apply(CaffeOnSpark.scala:199)
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$7.apply(CaffeOnSpark.scala:191)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

16/04/27 10:44:34 INFO scheduler.TaskSetManager: Starting task 0.1 in stage 6.0 (TID 13, sweet, partition 0,PROCESS_LOCAL, 1992 bytes)
16/04/27 10:44:34 INFO scheduler.TaskSetManager: Lost task 0.1 in stage 6.0 (TID 13) on executor sweet: java.lang.UnsupportedOperationException (empty.reduceLeft) [duplicate 1]
16/04/27 10:44:34 INFO scheduler.TaskSetManager: Starting task 0.2 in stage 6.0 (TID 14, sweet, partition 0,PROCESS_LOCAL, 1992 bytes)
16/04/27 10:44:34 INFO scheduler.TaskSetManager: Lost task 0.2 in stage 6.0 (TID 14) on executor sweet: java.lang.UnsupportedOperationException (empty.reduceLeft) [duplicate 2]
16/04/27 10:44:34 INFO scheduler.TaskSetManager: Starting task 0.3 in stage 6.0 (TID 15, sweet, partition 0,PROCESS_LOCAL, 1992 bytes)
16/04/27 10:44:34 INFO scheduler.TaskSetManager: Lost task 0.3 in stage 6.0 (TID 15) on executor sweet: java.lang.UnsupportedOperationException (empty.reduceLeft) [duplicate 3]
16/04/27 10:44:34 ERROR scheduler.TaskSetManager: Task 0 in stage 6.0 failed 4 times; aborting job
16/04/27 10:44:34 INFO cluster.YarnScheduler: Removed TaskSet 6.0, whose tasks have all completed, from pool
16/04/27 10:44:34 INFO cluster.YarnScheduler: Cancelling stage 6
16/04/27 10:44:34 INFO scheduler.DAGScheduler: ResultStage 6 (reduce at CaffeOnSpark.scala:205) failed in 0.117 s

16/04/27 10:44:34 INFO scheduler.DAGScheduler: Job 6 failed: reduce at CaffeOnSpark.scala:205, took 0.124712 s

Py4JJavaError Traceback (most recent call last)
in ()
----> 1 cos.train(dl_train_source)

/home/atlas/work/caffe_spark/CaffeOnSpark-master/data/com/yahoo/ml/caffe/CaffeOnSpark.py in train(self, train_source)
29 :param DataSource: the source for training data
30 """
---> 31 self.dict.get('cos').train(train_source)
32
33 def test(self,test_source):

/home/atlas/work/caffe_spark/CaffeOnSpark-master/data/com/yahoo/ml/caffe/ConversionUtil.py in call(self, _args)
814 for i in self.syms:
815 try:
--> 816 return callJavaMethod(i,self.javaInstance,self._evalDefaults(),self.mirror,_args)
817 except Py4JJavaError:
818 raise

/home/atlas/work/caffe_spark/CaffeOnSpark-master/data/com/yahoo/ml/caffe/ConversionUtil.py in callJavaMethod(sym, javaInstance, defaults, mirror, _args)
617 return javaInstance(__getConvertedTuple(args,sym,defaults,mirror))
618 else:
--> 619 return toPython(javaInstance.getattr(name)(*_getConvertedTuple(args,sym,defaults,mirror)))
620 #It is good for debugging to know whether the argument conversion was successful.
621 #If it was, a Py4JJavaError may be raised from the Java code.

/home/atlas/work/caffe_spark/3rdparty/spark-1.6.0-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in call(self, *args)
811 answer = self.gateway_client.send_command(command)
812 return_value = get_return_value(
--> 813 answer, self.gateway_client, self.target_id, self.name)
814
815 for temp_arg in temp_args:

/home/atlas/work/caffe_spark/3rdparty/spark-1.6.0-bin-hadoop2.6/python/pyspark/sql/utils.pyc in deco(_a, *_kw)
43 def deco(_a, *_kw):
44 try:
---> 45 return f(_a, *_kw)
46 except py4j.protocol.Py4JJavaError as e:
47 s = e.java_exception.toString()

/home/atlas/work/caffe_spark/3rdparty/spark-1.6.0-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
306 raise Py4JJavaError(
307 "An error occurred while calling {0}{1}{2}.\n".
--> 308 format(target_id, ".", name), value)
309 else:
310 raise Py4JError(

Py4JJavaError: An error occurred while calling o2122.train.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 15, sweet): java.lang.UnsupportedOperationException: empty.reduceLeft
at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:167)
at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:195)
at scala.collection.AbstractIterator.reduce(Iterator.scala:1157)
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$7.apply(CaffeOnSpark.scala:199)
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$7.apply(CaffeOnSpark.scala:191)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
at com.yahoo.ml.caffe.CaffeOnSpark.train(CaffeOnSpark.scala:205)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.UnsupportedOperationException: empty.reduceLeft
at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:167)
at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:195)
at scala.collection.AbstractIterator.reduce(Iterator.scala:1157)
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$7.apply(CaffeOnSpark.scala:199)
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$7.apply(CaffeOnSpark.scala:191)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more

Could you please help me check what happened?
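
A hedged note, not a definitive diagnosis: the stack trace shows a reduce over an empty partition iterator (empty.reduceLeft inside the mapPartitions at CaffeOnSpark.scala:190), which usually means the executor produced no training data for its partition. A quick check, using the path taken from the log above:

# The training LMDB that the log resolved locally must exist and be non-empty on the
# executor host; also confirm in the Spark UI that the application really got the one
# executor that cfg.clusterSize=1 expects.
ls -l /home/atlas/work/caffe_spark/CaffeOnSpark-master/data/mnist_train_lmdb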

Wiki typo

For YARN clusters, I think
hadoop fs -rm -r -f ${CAFFE_ON_SPARK}/mnist_features_result

should be
hadoop fs -rm -r -f hdfs:///mnist_features_result

Compiling the latest code with sudo make build fails with the following errors

[ERROR] /home/atlas/work/caffe_spark/CaffeOnSpark-master/caffe-grid/src/main/scala/com/yahoo/ml/caffe/ImageDataFrame.scala:36: error: value getDataframeColumnSelectCount is not a member of caffe.Caffe.MemoryDataParameter
[INFO] if (memdatalayer_param.getDataframeColumnSelectCount() > 0) {
[INFO] ^
[ERROR] /home/atlas/work/caffe_spark/CaffeOnSpark-master/caffe-grid/src/main/scala/com/yahoo/ml/caffe/ImageDataFrame.scala:37: error: value getDataframeColumnSelectList is not a member of caffe.Caffe.MemoryDataParameter
[INFO] val selects = memdatalayer_param.getDataframeColumnSelectList()
[INFO] ^
[ERROR] /home/atlas/work/caffe_spark/CaffeOnSpark-master/caffe-grid/src/main/scala/com/yahoo/ml/caffe/ImageDataFrame.scala:58: error: value getImageEncoded is not a member of caffe.Caffe.MemoryDataParameter
[INFO] val encoded : Boolean = if (!has_encoded) memdatalayer_param.getImageEncoded() else row.getAsBoolean
[INFO] ^
[ERROR] three errors found
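
These symbols (getDataframeColumnSelectCount and friends) come from the Java classes generated out of caffe.proto in the caffe-public submodule, so a plausible cause is a caffe-grid checkout that is newer than its submodules. A hedged sketch of resyncing before rebuilding:

# Bring the caffe-public / caffe-distri submodules to the revisions this caffe-grid expects,
# then rebuild so the caffe.proto bindings are regenerated.
cd /home/atlas/work/caffe_spark/CaffeOnSpark-master
git submodule update --init --recursive
make build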

Do you have any benchmark data?

I am wondering if there is any benchmark data on training time versus the number of machines.
(e.g. ImageNet training time with 2 to N machines)

How to increase CPU usage per executor?

Hi,

I tried to evaluate CaffeOnSpark in CPU mode with Ethernet.

The YARN cluster mode worked fine in my 4-node environment (32 CPU cores per node).

But I observed that each executor only uses about 3 cores, no matter what value of "--executor-cores" is allocated (I've tried 8 and 16).

I also tried setting "-devices NUM" for CaffeOnSpark, but anything other than the default "-devices 1" failed in CPU mode.

Could you provide any tips for using more CPU cores per executor? Thank you.
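
A hedged observation: in CPU mode the heavy math runs inside the BLAS library that Caffe was linked against, so per-executor CPU usage is governed mainly by the BLAS thread count rather than by --executor-cores. If the build links OpenBLAS, adding something like the following to the spark-submit command may raise utilization (these are standard BLAS/OpenMP environment variables, not CaffeOnSpark flags; MKL builds honor MKL_NUM_THREADS / OMP_NUM_THREADS instead):

# Values are examples; match them to the cores actually requested per executor.
--conf spark.executorEnv.OPENBLAS_NUM_THREADS=8
--conf spark.executorEnv.OMP_NUM_THREADS=8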

NullPointerException when Training Imagenet

I trained ImageNet with my own dataset on a Spark standalone cluster, using my own LMDB files. Here is the log output from my run:
16/06/08 02:20:18 INFO spark.SparkContext: Added JAR file:/home/lvhao/CaffeOnSpark/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar at http://10.0.0.201:51320/jars/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar with timestamp 1465323618083
16/06/08 02:20:18 INFO util.Utils: Copying /home/lvhao/CaffeOnSpark/data/imagenet_solver.prototxt to /tmp/spark-b4a9f269-b16b-4170-ae04-18b442cf5dec/userFiles-a9bab616-285d-433d-b634-63e0d554db21/imagenet_solver.prototxt
16/06/08 02:20:18 INFO spark.SparkContext: Added file file:/home/lvhao/CaffeOnSpark/data/imagenet_solver.prototxt at http://10.0.0.201:51320/files/imagenet_solver.prototxt with timestamp 1465323618294
16/06/08 02:20:18 INFO util.Utils: Copying /home/lvhao/CaffeOnSpark/data/imagenet.prototxt to /tmp/spark-b4a9f269-b16b-4170-ae04-18b442cf5dec/userFiles-a9bab616-285d-433d-b634-63e0d554db21/imagenet.prototxt
16/06/08 02:20:18 INFO spark.SparkContext: Added file file:/home/lvhao/CaffeOnSpark/data/imagenet.prototxt at http://10.0.0.201:51320/files/imagenet.prototxt with timestamp 1465323618353
16/06/08 02:20:18 INFO util.Utils: Copying /home/lvhao/CaffeOnSpark/data/test_mean.binaryproto to /tmp/spark-b4a9f269-b16b-4170-ae04-18b442cf5dec/userFiles-a9bab616-285d-433d-b634-63e0d554db21/test_mean.binaryproto
16/06/08 02:20:18 INFO spark.SparkContext: Added file file:/home/lvhao/CaffeOnSpark/data/test_mean.binaryproto at http://10.0.0.201:51320/files/test_mean.binaryproto with timestamp 1465323618360
16/06/08 02:20:18 INFO client.AppClient$ClientEndpoint: Connecting to master spark://10.0.0.201:7077...
16/06/08 02:20:18 INFO cluster.SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20160608022018-0013
16/06/08 02:20:18 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44520.
16/06/08 02:20:18 INFO netty.NettyBlockTransferService: Server created on 44520
16/06/08 02:20:18 INFO client.AppClient$ClientEndpoint: Executor added: app-20160608022018-0013/0 on worker-20160606193041-10.0.0.203-33483 (10.0.0.203:33483) with 8 cores
16/06/08 02:20:18 INFO storage.BlockManagerMaster: Trying to register BlockManager
16/06/08 02:20:18 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20160608022018-0013/0 on hostPort 10.0.0.203:33483 with 8 cores, 1024.0 MB RAM
16/06/08 02:20:18 INFO client.AppClient$ClientEndpoint: Executor added: app-20160608022018-0013/1 on worker-20160606193041-10.0.0.202-48485 (10.0.0.202:48485) with 8 cores
16/06/08 02:20:18 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20160608022018-0013/1 on hostPort 10.0.0.202:48485 with 8 cores, 1024.0 MB RAM
16/06/08 02:20:18 INFO storage.BlockManagerMasterEndpoint: Registering block manager 10.0.0.201:44520 with 511.1 MB RAM, BlockManagerId(driver, 10.0.0.201, 44520)
16/06/08 02:20:18 INFO storage.BlockManagerMaster: Registered BlockManager
16/06/08 02:20:18 INFO client.AppClient$ClientEndpoint: Executor updated: app-20160608022018-0013/0 is now RUNNING
16/06/08 02:20:18 INFO client.AppClient$ClientEndpoint: Executor updated: app-20160608022018-0013/1 is now RUNNING
16/06/08 02:20:18 INFO cluster.SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 1.0
16/06/08 02:20:20 INFO cluster.SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (pc-3:52125) with ID 0
16/06/08 02:20:20 INFO cluster.SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (pc-2:55469) with ID 1
16/06/08 02:20:20 INFO storage.BlockManagerMasterEndpoint: Registering block manager pc-3:51562 with 511.1 MB RAM, BlockManagerId(0, pc-3, 51562)
16/06/08 02:20:20 INFO storage.BlockManagerMasterEndpoint: Registering block manager pc-2:49903 with 511.1 MB RAM, BlockManagerId(1, pc-2, 49903)
16/06/08 02:20:20 INFO caffe.DataSource$: Source data layer:0
16/06/08 02:20:20 INFO caffe.LMDB: Batch size:256
16/06/08 02:20:20 INFO spark.SparkContext: Starting job: collect at CaffeOnSpark.scala:127
16/06/08 02:20:20 INFO scheduler.DAGScheduler: Got job 0 (collect at CaffeOnSpark.scala:127) with 2 output partitions
16/06/08 02:20:20 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (collect at CaffeOnSpark.scala:127)
16/06/08 02:20:20 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/06/08 02:20:20 INFO scheduler.DAGScheduler: Missing parents: List()
16/06/08 02:20:20 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at map at CaffeOnSpark.scala:116), which has no missing parents
16/06/08 02:20:20 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.2 KB, free 3.2 KB)
16/06/08 02:20:20 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.0 KB, free 5.2 KB)
16/06/08 02:20:20 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.0.0.201:44520 (size: 2.0 KB, free: 511.1 MB)
16/06/08 02:20:20 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
16/06/08 02:20:20 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at map at CaffeOnSpark.scala:116)
16/06/08 02:20:20 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
16/06/08 02:20:20 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, pc-3, partition 0,PROCESS_LOCAL, 2236 bytes)
16/06/08 02:20:21 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, pc-2, partition 1,PROCESS_LOCAL, 2236 bytes)
16/06/08 02:20:21 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on pc-2:49903 (size: 2.0 KB, free: 511.1 MB)
16/06/08 02:20:22 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on pc-3:51562 (size: 2.0 KB, free: 511.1 MB)
16/06/08 02:20:24 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 3817 ms on pc-2 (1/2)
16/06/08 02:20:25 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 4479 ms on pc-3 (2/2)
16/06/08 02:20:25 INFO scheduler.DAGScheduler: ResultStage 0 (collect at CaffeOnSpark.scala:127) finished in 4.480 s
16/06/08 02:20:25 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/06/08 02:20:25 INFO scheduler.DAGScheduler: Job 0 finished: collect at CaffeOnSpark.scala:127, took 4.826888 s
16/06/08 02:20:25 INFO caffe.CaffeOnSpark: rank = 0, address = ,pc-3:58886, hostname = pc-3
16/06/08 02:20:25 INFO caffe.CaffeOnSpark: rank = 1, address = pc-2:49130,, hostname = pc-2
16/06/08 02:20:25 INFO caffe.CaffeOnSpark: rank 0:pc-3
16/06/08 02:20:25 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 344.0 B, free 5.5 KB)
16/06/08 02:20:25 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 82.0 B, free 5.6 KB)
16/06/08 02:20:25 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.0.0.201:44520 (size: 82.0 B, free: 511.1 MB)
16/06/08 02:20:25 INFO spark.SparkContext: Created broadcast 1 from broadcast at CaffeOnSpark.scala:146
16/06/08 02:20:25 INFO spark.SparkContext: Starting job: collect at CaffeOnSpark.scala:155
16/06/08 02:20:25 INFO scheduler.DAGScheduler: Got job 1 (collect at CaffeOnSpark.scala:155) with 2 output partitions
16/06/08 02:20:25 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (collect at CaffeOnSpark.scala:155)
16/06/08 02:20:25 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/06/08 02:20:25 INFO scheduler.DAGScheduler: Missing parents: List()
16/06/08 02:20:25 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at map at CaffeOnSpark.scala:149), which has no missing parents
16/06/08 02:20:25 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.6 KB, free 8.2 KB)
16/06/08 02:20:25 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1580.0 B, free 9.7 KB)
16/06/08 02:20:25 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.0.0.201:44520 (size: 1580.0 B, free: 511.1 MB)
16/06/08 02:20:25 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
16/06/08 02:20:25 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at map at CaffeOnSpark.scala:149)
16/06/08 02:20:25 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
16/06/08 02:20:25 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, pc-2, partition 0,PROCESS_LOCAL, 2236 bytes)
16/06/08 02:20:25 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, pc-3, partition 1,PROCESS_LOCAL, 2236 bytes)
16/06/08 02:20:25 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on pc-2:49903 (size: 1580.0 B, free: 511.1 MB)
16/06/08 02:20:25 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on pc-3:51562 (size: 1580.0 B, free: 511.1 MB)
16/06/08 02:20:25 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on pc-2:49903 (size: 82.0 B, free: 511.1 MB)
16/06/08 02:20:25 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on pc-3:51562 (size: 82.0 B, free: 511.1 MB)
16/06/08 02:20:31 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on pc-3:51562 in memory (size: 2.0 KB, free: 511.1 MB)
16/06/08 02:20:31 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on pc-2:49903 in memory (size: 2.0 KB, free: 511.1 MB)
16/06/08 02:20:31 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on 10.0.0.201:44520 in memory (size: 2.0 KB, free: 511.1 MB)
16/06/08 02:20:31 INFO spark.ContextCleaner: Cleaned accumulator 1
16/06/08 02:20:35 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 10096 ms on pc-2 (1/2)
16/06/08 02:20:35 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 10103 ms on pc-3 (2/2)
16/06/08 02:20:35 INFO scheduler.DAGScheduler: ResultStage 1 (collect at CaffeOnSpark.scala:155) finished in 10.108 s
16/06/08 02:20:35 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/06/08 02:20:35 INFO scheduler.DAGScheduler: Job 1 finished: collect at CaffeOnSpark.scala:155, took 10.124813 s
16/06/08 02:20:35 INFO caffe.LmdbRDD: local LMDB path:/home/lvhao/CaffeOnSpark/data/test_train_lmdb
16/06/08 02:20:55 INFO caffe.LmdbRDD: 2 LMDB RDD partitions
16/06/08 02:20:55 INFO spark.SparkContext: Starting job: min at CaffeOnSpark.scala:182
16/06/08 02:20:55 INFO scheduler.DAGScheduler: Got job 2 (min at CaffeOnSpark.scala:182) with 2 output partitions
16/06/08 02:20:55 INFO scheduler.DAGScheduler: Final stage: ResultStage 2 (min at CaffeOnSpark.scala:182)
16/06/08 02:20:55 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/06/08 02:20:55 INFO scheduler.DAGScheduler: Missing parents: List()
16/06/08 02:20:55 INFO scheduler.DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[6] at mapPartitions at CaffeOnSpark.scala:171), which has no missing parents
16/06/08 02:20:55 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 3.4 KB, free 7.9 KB)
16/06/08 02:20:55 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.1 KB, free 10.1 KB)
16/06/08 02:20:55 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 10.0.0.201:44520 (size: 2.1 KB, free: 511.1 MB)
16/06/08 02:20:55 INFO spark.SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006
16/06/08 02:20:55 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 2 (MapPartitionsRDD[6] at mapPartitions at CaffeOnSpark.scala:171)
16/06/08 02:20:55 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 2 tasks
16/06/08 02:20:55 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 4, pc-3, partition 0,PROCESS_LOCAL, 2171 bytes)
16/06/08 02:20:55 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 2.0 (TID 5, pc-2, partition 1,PROCESS_LOCAL, 2200 bytes)
16/06/08 02:20:55 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on pc-2:49903 (size: 2.1 KB, free: 511.1 MB)
16/06/08 02:20:55 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on pc-3:51562 (size: 2.1 KB, free: 511.1 MB)
16/06/08 02:20:55 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 2.0 (TID 5, pc-2): java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at com.yahoo.ml.caffe.LmdbRDD.localLMDBFile(LmdbRDD.scala:185)
at com.yahoo.ml.caffe.LmdbRDD.com$yahoo$ml$caffe$LmdbRDD$$openDB(LmdbRDD.scala:202)
at com.yahoo.ml.caffe.LmdbRDD$$anon$1.(LmdbRDD.scala:102)
at com.yahoo.ml.caffe.LmdbRDD.compute(LmdbRDD.scala:100)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

16/06/08 02:20:55 INFO scheduler.TaskSetManager: Starting task 1.1 in stage 2.0 (TID 6, pc-3, partition 1,PROCESS_LOCAL, 2200 bytes)
16/06/08 02:20:55 INFO scheduler.TaskSetManager: Lost task 0.0 in stage 2.0 (TID 4) on executor pc-3: java.lang.NullPointerException (null) [duplicate 1]
16/06/08 02:20:55 INFO scheduler.TaskSetManager: Starting task 0.1 in stage 2.0 (TID 7, pc-2, partition 0,PROCESS_LOCAL, 2171 bytes)
16/06/08 02:20:55 INFO scheduler.TaskSetManager: Lost task 1.1 in stage 2.0 (TID 6) on executor pc-3: java.lang.NullPointerException (null) [duplicate 2]
16/06/08 02:20:55 INFO scheduler.TaskSetManager: Starting task 1.2 in stage 2.0 (TID 8, pc-2, partition 1,PROCESS_LOCAL, 2200 bytes)
16/06/08 02:20:55 INFO scheduler.TaskSetManager: Lost task 0.1 in stage 2.0 (TID 7) on executor pc-2: java.lang.NullPointerException (null) [duplicate 3]
16/06/08 02:20:55 INFO scheduler.TaskSetManager: Starting task 0.2 in stage 2.0 (TID 9, pc-2, partition 0,PROCESS_LOCAL, 2171 bytes)
16/06/08 02:20:55 INFO scheduler.TaskSetManager: Lost task 1.2 in stage 2.0 (TID 8) on executor pc-2: java.lang.NullPointerException (null) [duplicate 4]
16/06/08 02:20:55 INFO scheduler.TaskSetManager: Starting task 1.3 in stage 2.0 (TID 10, pc-2, partition 1,PROCESS_LOCAL, 2200 bytes)
16/06/08 02:20:55 INFO scheduler.TaskSetManager: Lost task 0.2 in stage 2.0 (TID 9) on executor pc-2: java.lang.NullPointerException (null) [duplicate 5]
16/06/08 02:20:55 INFO scheduler.TaskSetManager: Starting task 0.3 in stage 2.0 (TID 11, pc-2, partition 0,PROCESS_LOCAL, 2171 bytes)
16/06/08 02:20:55 INFO scheduler.TaskSetManager: Lost task 1.3 in stage 2.0 (TID 10) on executor pc-2: java.lang.NullPointerException (null) [duplicate 6]
16/06/08 02:20:55 ERROR scheduler.TaskSetManager: Task 1 in stage 2.0 failed 4 times; aborting job
16/06/08 02:20:55 INFO scheduler.TaskSetManager: Lost task 0.3 in stage 2.0 (TID 11) on executor pc-2: java.lang.NullPointerException (null) [duplicate 7]
16/06/08 02:20:55 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
16/06/08 02:20:55 INFO scheduler.TaskSchedulerImpl: Cancelling stage 2
16/06/08 02:20:55 INFO scheduler.DAGScheduler: ResultStage 2 (min at CaffeOnSpark.scala:182) failed in 0.144 s
16/06/08 02:20:55 INFO scheduler.DAGScheduler: Job 2 failed: min at CaffeOnSpark.scala:182, took 0.166341 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 (TID 10, pc-2): java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at com.yahoo.ml.caffe.LmdbRDD.localLMDBFile(LmdbRDD.scala:185)
at com.yahoo.ml.caffe.LmdbRDD.com$yahoo$ml$caffe$LmdbRDD$$openDB(LmdbRDD.scala:202)
at com.yahoo.ml.caffe.LmdbRDD$$anon$1.(LmdbRDD.scala:102)
at com.yahoo.ml.caffe.LmdbRDD.compute(LmdbRDD.scala:100)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
at org.apache.spark.rdd.RDD$$anonfun$min$1.apply(RDD.scala:1404)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.min(RDD.scala:1403)
at com.yahoo.ml.caffe.CaffeOnSpark.train(CaffeOnSpark.scala:182)
at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:40)
at com.yahoo.ml.caffe.CaffeOnSpark.main(CaffeOnSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at com.yahoo.ml.caffe.LmdbRDD.localLMDBFile(LmdbRDD.scala:185)
at com.yahoo.ml.caffe.LmdbRDD.com$yahoo$ml$caffe$LmdbRDD$$openDB(LmdbRDD.scala:202)
at com.yahoo.ml.caffe.LmdbRDD$$anon$1.(LmdbRDD.scala:102)
at com.yahoo.ml.caffe.LmdbRDD.compute(LmdbRDD.scala:100)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/06/08 02:20:55 INFO spark.SparkContext: Invoking stop() from shutdown hook
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
16/06/08 02:20:55 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
16/06/08 02:20:55 INFO ui.SparkUI: Stopped Spark web UI at http://10.0.0.201:4040
16/06/08 02:20:55 INFO cluster.SparkDeploySchedulerBackend: Shutting down all executors
16/06/08 02:20:55 INFO cluster.SparkDeploySchedulerBackend: Asking each executor to shut down
16/06/08 02:20:55 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/06/08 02:20:55 INFO storage.MemoryStore: MemoryStore cleared
16/06/08 02:20:55 INFO storage.BlockManager: BlockManager stopped
16/06/08 02:20:55 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
16/06/08 02:20:55 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/06/08 02:20:55 INFO spark.SparkContext: Successfully stopped SparkContext
16/06/08 02:20:55 INFO util.ShutdownHookManager: Shutdown hook called
16/06/08 02:20:55 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-b4a9f269-b16b-4170-ae04-18b442cf5dec
16/06/08 02:20:55 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-b4a9f269-b16b-4170-ae04-18b442cf5dec/httpd-ea75132c-c34d-49e5-9f13-505120345feb
16/06/08 02:20:55 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
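
A hedged check based on the log above: the job resolves the LMDB to the local path /home/lvhao/CaffeOnSpark/data/test_train_lmdb, and the NullPointerException is raised while LmdbRDD opens that local file on the executors (pc-2, pc-3). With a file-based source on a standalone cluster, the same LMDB directory must be readable at the same path on every worker; otherwise keep it on HDFS. A sketch (the HDFS destination is only an example):

# Confirm the LMDB exists on every worker at the path the prototxt references.
for host in pc-2 pc-3; do
    ssh "$host" ls -ld /home/lvhao/CaffeOnSpark/data/test_train_lmdb
done

# Or publish it to HDFS and point the data layer's source there, as the bundled
# lenet_memory examples do.
hadoop fs -put -f /home/lvhao/CaffeOnSpark/data/test_train_lmdb hdfs:///projects/test_train_lmdb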

no lmdbjni in java.library.path exception

16/02/26 16:34:34 INFO caffe.DataSource$: Source data layer:0
16/02/26 16:34:34 INFO caffe.LMDB: Batch size:64
Exception in thread "main" java.lang.UnsatisfiedLinkError: no lmdbjni in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
at java.lang.Runtime.loadLibrary0(Runtime.java:870)
at java.lang.System.loadLibrary(System.java:1122)
at com.yahoo.ml.caffe.LMDB$.makeSequence(LMDB.scala:28)
at com.yahoo.ml.caffe.LMDB.makeRDD(LMDB.scala:94)
at com.yahoo.ml.caffe.CaffeOnSpark.train(CaffeOnSpark.scala:113)
at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:44)
at com.yahoo.ml.caffe.CaffeOnSpark.main(CaffeOnSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/02/26 16:34:34 INFO spark.SparkContext: Invoking stop() from shutdown hook
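
A hedged sketch: liblmdbjni is built by the CaffeOnSpark native build (typically landing under caffe-distri/distribute/lib), so that directory, together with caffe-public/distribute/lib, has to reach the JVM via LD_LIBRARY_PATH for both the driver and the executors:

# Adjust CAFFE_ON_SPARK to the actual checkout.
export CAFFE_ON_SPARK=/path/to/CaffeOnSpark
export LD_LIBRARY_PATH=${CAFFE_ON_SPARK}/caffe-public/distribute/lib:${CAFFE_ON_SPARK}/caffe-distri/distribute/lib:${LD_LIBRARY_PATH}

# Then pass the same path through on the spark-submit command line:
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}"
--conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}"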

Poor efficiency when running multiple worker instances

Hi,

I get a quick result when starting just one worker instance node, but processing becomes very slow if I change SPARK_WORKER_INSTANCES to 2 (I have built a 3-node cluster).

For example, the run time for MNIST goes from 45s (1 node) to 3m47s (2 nodes), and for CIFAR-10 from 1m20s (1 node) to 7m10s (2 nodes).

I exported the raw data to HDFS with the following command and set 3 replicas.
hadoop fs -put -f ${CAFFE_ON_SPARK}/data/cifar10_*_lmdb hdfs:/projects/machine_learning/image_dataset/

The following is my training command for CIFAR-10:

export SPARK_WORKER_INSTANCES=2
export DEVICES=1
hadoop fs -rm -r -f hdfs:///cifar10_features_result
hadoop fs -rm hdfs:///cifar10*
spark-submit --master yarn --deploy-mode cluster --num-executors ${SPARK_WORKER_INSTANCES}
--files ./data/cifar10_quick_solver_hdfs.prototxt,./data/cifar10_quick_train_test_hdfs.prototxt,./data/mean.binaryproto
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}"
--conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}"
--class com.yahoo.ml.caffe.CaffeOnSpark ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar
-train -features accuracy,loss -label label -conf cifar10_quick_solver_hdfs.prototxt
-devices ${DEVICES}
-connection ethernet -model hdfs:///cifar10.model.h5 -output hdfs:///cifar10_features_result

Here is some log output from a worker node when running multiple nodes.

16/03/22 11:05:30 INFO storage.BlockManager: Found block rdd_0_0 locally
I0322 11:05:31.616941 24799 solver.cpp:237] Iteration 1000, loss = 0.935601
I0322 11:05:31.617007 24799 solver.cpp:253] Train net output #0: loss = 0.935601 (* 1 = 0.935601 loss)
I0322 11:05:31.657905 24799 sgd_solver.cpp:106] Iteration 1000, lr = 0.001
I0322 11:05:41.364012 24799 solver.cpp:237] Iteration 1100, loss = 0.955443
I0322 11:05:41.364084 24799 solver.cpp:253] Train net output #0: loss = 0.955443 (* 1 = 0.955443 loss)
I0322 11:05:41.411869 24799 sgd_solver.cpp:106] Iteration 1100, lr = 0.001
I0322 11:05:51.151032 24799 solver.cpp:237] Iteration 1200, loss = 0.813748
I0322 11:05:51.151115 24799 solver.cpp:253] Train net output #0: loss = 0.813748 (* 1 = 0.813748 loss)
I0322 11:05:51.197666 24799 sgd_solver.cpp:106] Iteration 1200, lr = 0.001

Here is some log output from the worker node when running a single node.

16/03/22 14:14:30 INFO storage.BlockManager: Found block rdd_0_0 locally
I0322 14:14:30.212993 9250 solver.cpp:237] Iteration 1000, loss = 0.875192
I0322 14:14:30.213027 9250 solver.cpp:253] Train net output #0: loss = 0.875192 (* 1 = 0.875192 loss)
I0322 14:14:30.213042 9250 sgd_solver.cpp:106] Iteration 1000, lr = 0.001
I0322 14:14:31.229243 9250 solver.cpp:237] Iteration 1100, loss = 1.05669
I0322 14:14:31.229284 9250 solver.cpp:253] Train net output #0: loss = 1.05669 (* 1 = 1.05669 loss)
I0322 14:14:31.229298 9250 sgd_solver.cpp:106] Iteration 1100, lr = 0.001
I0322 14:14:32.238924 9250 solver.cpp:237] Iteration 1200, loss = 0.965516
I0322 14:14:32.238962 9250 solver.cpp:253] Train net output #0: loss = 0.965516 (* 1 = 0.965516 loss)
I0322 14:14:32.238976 9250 sgd_solver.cpp:106] Iteration 1200, lr = 0.001
I0322 14:14:33.250813 9250 solver.cpp:237] Iteration 1300, loss = 0.806154
I0322 14:14:33.250855 9250 solver.cpp:253] Train net output #0: loss = 0.806154 (* 1 = 0.806154 loss)
I0322 14:14:33.250871 9250 sgd_solver.cpp:106] Iteration 1300, lr = 0.001
I0322 14:14:34.266059 9250 solver.cpp:237] Iteration 1400, loss = 0.807488
I0322 14:14:34.266103 9250 solver.cpp:253] Train net output #0: loss = 0.807488 (* 1 = 0.807488 loss)
I0322 14:14:34.266119 9250 sgd_solver.cpp:106] Iteration 1400, lr = 0.001

As you can see, the difference is the per-iteration time (about 9s vs. 1s). I know there is extra communication between nodes, but why is it so slow?

Looking forward to your response, thx!
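
A hedged way to narrow this down: with CPU training over Ethernet, each iteration ends with a gradient exchange between the two workers, so a slow or saturated link shows up directly as per-iteration time. iperf (a standard network tool, not part of CaffeOnSpark) can confirm whether the link itself is the bottleneck:

# Measure raw throughput between the two worker hosts
# (run the server on one worker, the client on the other; the hostname is a placeholder).
iperf -s                       # on worker A
iperf -c <worker-A-hostname>   # on worker B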

java.lang.NullPointerException on EC2

Hi

I got a java.lang.NullPointerException when running CaffeOnSpark on EC2, following the steps at "https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_EC2".

Could anybody help me?

CODE:

export SPARK_HOME=/root/spark
export HADOOP_HOME=/root/persistent-hdfs/
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export PATH=${HADOOP_HOME}/bin:${SPARK_HOME}/bin:${PATH}
export CAFFE_ON_SPARK=/root/CaffeOnSpark
export LD_LIBRARY_PATH=${CAFFE_ON_SPARK}/caffe-public/distribute/lib:${CAFFE_ON_SPARK}/caffe-distri/distribute/lib
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda-7.0/lib64:/usr/local/mkl/lib/intel64/
export SPARK_WORKER_INSTANCES=2
export CORES_PER_WORKER=1
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))
export DEVICES=1

pushd ${CAFFE_ON_SPARK}/data

spark-submit --master spark://$(hostname):7077
--files ${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt
--conf spark.cores.max=${TOTAL_CORES}
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}"
--conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}"
--class com.yahoo.ml.caffe.CaffeOnSpark
${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar
-train
-features accuracy,loss -label label
-conf ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt
-clusterSize ${SPARK_WORKER_INSTANCES}
-devices ${DEVICES}
-connection ethernet
-model /mnist.model
-output /mnist_features_result

ERROR:

16/03/22 16:33:31 INFO DAGScheduler: Job 4 failed: reduce at CaffeOnSpark.scala:202, took 19.225960 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 12, ip-172-31-13-171.eu-west-1.compute.internal): java.lang.NullPointerException
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$7.apply(CaffeOnSpark.scala:192)
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$7.apply(CaffeOnSpark.scala:188)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
at com.yahoo.ml.caffe.CaffeOnSpark.train(CaffeOnSpark.scala:202)
at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:44)
at com.yahoo.ml.caffe.CaffeOnSpark.main(CaffeOnSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NullPointerException
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$7.apply(CaffeOnSpark.scala:192)
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$7.apply(CaffeOnSpark.scala:188)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/03/22 16:33:31 INFO SparkContext: Invoking stop() from shutdown hook
16/03/22 16:33:31 INFO SparkUI: Stopped Spark web UI at http://ec2-54-194-178-201.eu-west-1.compute.amazonaws.com:4040
16/03/22 16:33:31 INFO SparkDeploySchedulerBackend: Shutting down all executors
16/03/22 16:33:31 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
16/03/22 16:33:31 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/03/22 16:33:31 INFO MemoryStore: MemoryStore cleared
16/03/22 16:33:31 INFO BlockManager: BlockManager stopped
16/03/22 16:33:31 INFO BlockManagerMaster: BlockManagerMaster stopped
16/03/22 16:33:31 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/03/22 16:33:31 INFO SparkContext: Successfully stopped SparkContext
16/03/22 16:33:31 INFO ShutdownHookManager: Shutdown hook called
16/03/22 16:33:31 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/03/22 16:33:31 INFO ShutdownHookManager: Deleting directory /mnt/spark/spark-424ebe73-b209-4b2f-afbc-ceac1d7eed27
16/03/22 16:33:31 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/03/22 16:33:31 INFO ShutdownHookManager: Deleting directory /mnt/spark/spark-424ebe73-b209-4b2f-afbc-ceac1d7eed27/httpd-2c622fab-2274-4eac-8bda-fde4a70927d6

Does CaffeOnSpark support a fault recovery strategy?

When I run the cifar10 example, the job often hangs if an error appears on one slave node.
E0321 09:20:01.425690 12928 socket.cpp:61] ERROR: Read partial messageheader [4 of 12]

I am not sure whether limited network bandwidth causes the above error (a failure while reading the socket message header).
So my question is: is there an error recovery mechanism that reduces the probability of this kind of error?
Note: I use an Ethernet connection and a 1 Gb/s switch.
Thank you.

Caffe builds successfully, but caffe-distri does not

Hi,

I have the following Maven error:

[exec] g++ src/main/cpp/jni/JniMatVector.cpp -MMD -MP -pthread -fPIC -DNDEBUG -O2 -I/usr/include/python2.7 -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/local/include -I/usr/include/hdf5/serial/ -I/usr/lib/jvm/default-java/include -I/usr/lib/jvm/java-7-oracle/include -I.build_release/src -I./include -I../caffe-public/distribute/include -I../caffe-public/src -I/usr/local/cuda/include -I/usr/lib/jvm/java-7-oracle/include/linux -Wall -Wno-sign-compare -c -o .build_release/src/main/cpp/jni/JniMatVector.o 2> .build_release/src/main/cpp/jni/JniMatVector.o.warnings.txt
[exec] || (cat .build_release/src/main/cpp/jni/JniMatVector.o.warnings.txt; exit 1)
[exec] CXX src/main/cpp/jni/JniFloatDataTransformer.cpp
[exec] g++ src/main/cpp/jni/JniFloatDataTransformer.cpp -MMD -MP -pthread -fPIC -DNDEBUG -O2 -I/usr/include/python2.7 -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/local/include -I/usr/include/hdf5/serial/ -I/usr/lib/jvm/default-java/include -I/usr/lib/jvm/java-7-oracle/include -I.build_release/src -I./include -I../caffe-public/distribute/include -I../caffe-public/src -I/usr/local/cuda/include -I/usr/lib/jvm/java-7-oracle/include/linux -Wall -Wno-sign-compare -c -o .build_release/src/main/cpp/jni/JniFloatDataTransformer.o 2> .build_release/src/main/cpp/jni/JniFloatDataTransformer.o.warnings.txt
[exec] || (cat .build_release/src/main/cpp/jni/JniFloatDataTransformer.o.warnings.txt; exit 1)
[exec] make[1]: *** [.build_release/src/main/cpp/jni/JniFloatDataTransformer.o] Error 1
[exec] src/main/cpp/jni/JniFloatDataTransformer.cpp: In function ‘void Java_com_yahoo_ml_jcaffe_FloatDataTransformer_transform(JNIEnv*, jobject, jobject, jobject)’:
[exec] src/main/cpp/jni/JniFloatDataTransformer.cpp:69:52: error: no matching function for call to ‘caffe::DataTransformer<float>::Transform(std::vector<cv::Mat>&, caffe::Blob<float>*&)’
[exec] xformer->Transform((* mat_vector_ptr), blob_ptr);
[exec] ^
[exec] src/main/cpp/jni/JniFloatDataTransformer.cpp:69:52: note: candidates are:
[exec] In file included from src/main/cpp/jni/JniFloatDataTransformer.cpp:8:0:
[exec] ../caffe-public/distribute/include/caffe/data_transformer.hpp:38:8: note: void caffe::DataTransformer<Dtype>::Transform(const caffe::Datum&, caffe::Blob<Dtype>*) [with Dtype = float]
[exec] void Transform(const Datum& datum, Blob<Dtype>* transformed_blob);
[exec] ^
[exec] ../caffe-public/distribute/include/caffe/data_transformer.hpp:38:8: note: no known conversion for argument 1 from ‘std::vector<cv::Mat>’ to ‘const caffe::Datum&’
[exec] ../caffe-public/distribute/include/caffe/data_transformer.hpp:50:8: note: void caffe::DataTransformer<Dtype>::Transform(const std::vector<caffe::Datum>&, caffe::Blob<Dtype>*) [with Dtype = float]
[exec] void Transform(const vector<Datum> & datum_vector,
[exec] ^
[exec] ../caffe-public/distribute/include/caffe/data_transformer.hpp:50:8: note: no known conversion for argument 1 from ‘std::vector<cv::Mat>’ to ‘const std::vector<caffe::Datum>&’
[exec] ../caffe-public/distribute/include/caffe/data_transformer.hpp:91:8: note: void caffe::DataTransformer<Dtype>::Transform(caffe::Blob<Dtype>*, caffe::Blob<Dtype>*) [with Dtype = float]
[exec] void Transform(Blob<Dtype>* input_blob, Blob<Dtype>* transformed_blob);
[exec] ^
[exec] ../caffe-public/distribute/include/caffe/data_transformer.hpp:91:8: note: no known conversion for argument 1 from ‘std::vector<cv::Mat>’ to ‘caffe::Blob<float>*’
[exec] ../caffe-public/distribute/include/caffe/data_transformer.hpp:141:8: note: void caffe::DataTransformer<Dtype>::Transform(const caffe::Datum&, Dtype*) [with Dtype = float]
[exec] void Transform(const Datum& datum, Dtype* transformed_data);
[exec] ^
[exec] ../caffe-public/distribute/include/caffe/data_transformer.hpp:141:8: note: no known conversion for argument 1 from ‘std::vector<cv::Mat>’ to ‘const caffe::Datum&’
[exec] Makefile:413: recipe for target '.build_release/src/main/cpp/jni/JniFloatDataTransformer.o' failed
[exec] make[1]: Leaving directory '/opt/CaffeOnSpark/caffe-distri'
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] caffe ............................................. SUCCESS [0.001s]
[INFO] caffe-distri ...................................... FAILURE [53.341s]
[INFO] caffe-grid ........................................ SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 53.811s
[INFO] Finished at: Mon May 09 17:21:45 CEST 2016
[INFO] Final Memory: 16M/340M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.7:run (make) on project caffe-distri: An Ant BuildException has occured: exec returned: 2
[ERROR] around Ant part ...... @ 5:83 in /opt/CaffeOnSpark/caffe-distri/target/antrun/build-make.xml
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn -rf :caffe-distri
Makefile:9: recipe for target 'build' failed
make: *** [build] Error 1

I am running Java 1.7.0_80-b15, Scala 2.11.8, and Maven 3.0.5.

RDMA over ethernet

Hi, folks,

Do you have any plan to support RoCE (RDMA over Converged Ethernet) devices?

I tried to set up CaffeOnSpark in my GPU + RoCE environment, and I added a GID index in the source code for the RoCE connection.

But now ibv_reg_mr() fails when it is given the "data_" address. If I replace the "data_" address with a malloc'd buffer of the same size, it works.

It would be great if you could give me any suggestions about this issue.
Thanks.

UnsatisfiedLinkError, Library not loaded on Mac OS X (10.11)

Dear all, I followed the Running CaffeOnSpark Locally tutorial and wanted to get CaffeOnSpark running on my own MacBook Pro Retina (OS X 10.11, 8 cores, 64-bit). Everything goes well until I reach the 8th step -- Train a DNN network using CaffeOnSpark.

In my case, some key configurations are listed as follows:

export MASTER_URL=spark://localhost:7077
export SPARK_WORKER_INSTANCES=1
export CORES_PER_WORKER=8 

DYLD_LIBRARY_PATH:

$ echo $DYLD_LIBRARY_PATH
/Users/sqfan/tools/CaffeOnSpark/caffe-public/distribute/lib:/Users/sqfan/tools/CaffeOnSpark/caffe-distri/distribute/lib:/usr/local/cuda/lib

The spark command:

spark-submit --master spark://localhost:7077 \
    --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
    --conf spark.cores.max=${CORES_PER_WORKER} \
    --conf spark.task.cpus=${CORES_PER_WORKER} \
    --conf spark.driver.extraLibraryPath="${DYLD_LIBRARY_PATH}" \
    --conf spark.executorEnv.DYLD_LIBRARY_PATH="${DYLD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark  \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -train \
        -features accuracy,loss -label label \
        -conf lenet_memory_solver.prototxt \
    -clusterSize ${SPARK_WORKER_INSTANCES} \
        -devices 1 \
    -connection ethernet \
        -model file:${CAFFE_ON_SPARK}/mnist_lenet.model \
        -output file:${CAFFE_ON_SPARK}/lenet_features_result

And the key part of the error log:

...
...
16/05/27 22:04:14 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 114.212.83.64, partition 0,PROCESS_LOCAL, 2203 bytes)
16/05/27 22:04:15 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 114.212.83.64:59985 (size: 2.1 KB, free: 511.5 MB)
16/05/27 22:04:15 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 114.212.83.64): **java.lang.UnsatisfiedLinkError: /Users/sqfan/tools/CaffeOnSpark/caffe-distri/distribute/lib/libcaffedistri.jnilib: dlopen(/Users/sqfan/tools/CaffeOnSpark/caffe-distri/distribute/lib/libcaffedistri.jnilib, 1): Library not loaded: @rpath/./libhdf5_hl.10.dylib**
  Referenced from: /Users/sqfan/tools/CaffeOnSpark/caffe-distri/distribute/lib/libcaffedistri.jnilib
  Reason: image not found
    at java.lang.ClassLoader$NativeLibrary.load(Native Method)
    at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1965)
    at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1894)
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1880)
    at java.lang.Runtime.loadLibrary0(Runtime.java:849)
    at java.lang.System.loadLibrary(System.java:1088)
    at com.yahoo.ml.jcaffe.BaseObject.<clinit>(BaseObject.java:10)
    at com.yahoo.ml.caffe.CaffeProcessor.<init>(CaffeProcessor.scala:55)
    at com.yahoo.ml.caffe.CaffeProcessor$.instance(CaffeProcessor.scala:21)
    at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$4.apply(CaffeOnSpark.scala:118)
    at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$4.apply(CaffeOnSpark.scala:116)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

16/05/27 22:04:15 INFO scheduler.TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1, 114.212.83.64, partition 0,PROCESS_LOCAL, 2203 bytes)
16/05/27 22:04:16 ERROR scheduler.TaskSchedulerImpl: Lost executor 0 on 114.212.83.64: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
16/05/27 22:04:16 WARN scheduler.TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1, 114.212.83.64): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
16/05/27 22:04:16 INFO scheduler.DAGScheduler: Executor lost: 0 (epoch 0)
16/05/27 22:04:16 INFO client.AppClient$ClientEndpoint: Executor updated: app-20160527220410-0000/0 is now EXITED (Command exited with code 50)
...
...

This post in our Google group and this post on Stack Overflow seem to describe very similar problems. As those posts suggest, I enabled the SPARK_PRINT_LAUNCH_COMMAND switch and got the complete Spark command as follows:

Spark Command: /Library/Java/JavaVirtualMachines/jdk1.7.0_80.jdk/Contents/Home/bin/java -cp /Users/sqfan/tools/CaffeOnSpark/scripts/spark-1.6.0-bin-hadoop2.6/conf/:/Users/sqfan/tools/CaffeOnSpark/scripts/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/Users/sqfan/tools/CaffeOnSpark/scripts/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/Users/sqfan/tools/CaffeOnSpark/scripts/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/Users/sqfan/tools/CaffeOnSpark/scripts/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/Users/sqfan/tools/CaffeOnSpark/scripts/hadoop-2.6.4/etc/hadoop/ -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit --master spark://localhost:7077 --conf spark.driver.extraLibraryPath=/Users/sqfan/tools/CaffeOnSpark/caffe-public/distribute/lib:/Users/sqfan/tools/CaffeOnSpark/caffe-distri/distribute/lib:/usr/local/cuda/lib --conf spark.cores.max=8 --conf spark.task.cpus=8 --conf spark.executorEnv.DYLD_LIBRARY_PATH=/Users/sqfan/tools/CaffeOnSpark/caffe-public/distribute/lib:/Users/sqfan/tools/CaffeOnSpark/caffe-distri/distribute/lib:/usr/local/cuda/lib --class com.yahoo.ml.caffe.CaffeOnSpark --files /Users/sqfan/tools/CaffeOnSpark/data/lenet_memory_solver.prototxt,/Users/sqfan/tools/CaffeOnSpark/data/lenet_memory_train_test.prototxt /Users/sqfan/tools/CaffeOnSpark/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar -train -features accuracy,loss -label label -conf lenet_memory_solver.prototxt -clusterSize 1 -devices 1 -connection ethernet -model file:/Users/sqfan/tools/CaffeOnSpark/mnist_lenet.model -output file:/Users/sqfan/tools/CaffeOnSpark/lenet_features_result
========================================
...
...

Yet the errors stay the same. This problem has puzzled me for several days and I have made little progress, so I am posting here; any helpful advice will be appreciated!

Thanks in advance!!!

Connect GPUs from different machines

I have a problem running CaffeOnSpark with the option clusterSize >= 2.

I have the following configuration: server1 is the master node and also a worker; server2 is another worker node. When I run the LeNet example, I encounter the following problem:

I0322 17:35:26.562448 23810 socket.cpp:250] Trying to connect with ...[server1:56799]
E0322 17:35:26.604753 23810 socket.cpp:276] ERROR: No peer by name [server1]
I0322 17:35:36.604908 23810 socket.cpp:250] Trying to connect with ...[server1:56799]
E0322 17:35:36.605989 23810 socket.cpp:276] ERROR: No peer by name [server1]

This is the log from server2.

The same message from server1:

I0322 17:32:55.835394 11905 socket.cpp:250] Trying to connect with ...[server2:59510]
E0322 17:32:55.877120 11905 socket.cpp:276] ERROR: No peer by name [server2]
I0322 17:33:05.877269 11905 socket.cpp:250] Trying to connect with ...[server2:59510]
E0322 17:33:05.878090 11905 socket.cpp:276] ERROR: No peer by name [server2]
I0322 17:33:15.890472 11905 parallel.cpp:392] GPUs pairs
I0322 17:33:15.975946 11959 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
16/03/22 17:33:15 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 905 bytes result sent to driver
16/03/22 17:33:16 INFO CoarseGrainedExecutorBackend: Got assigned task 5
16/03/22 17:33:16 INFO Executor: Running task 1.0 in stage 2.0 (TID 5)
16/03/22 17:33:16 INFO TorrentBroadcast: Started reading broadcast variable 3
16/03/22 17:33:16 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1412.0 B, free 11.4 KB)
16/03/22 17:33:16 INFO TorrentBroadcast: Reading broadcast variable 3 took 12 ms
16/03/22 17:33:16 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.2 KB, free 13.5 KB)
16/03/22 17:33:16 INFO CacheManager: Partition rdd_5_1 not found, computing it
16/03/22 17:33:16 INFO CacheManager: Partition rdd_0_1 not found, computing it
16/03/22 17:33:16 INFO LmdbRDD: Processing partition 1
16/03/22 17:33:17 INFO LmdbRDD: Completed partition 1
16/03/22 17:33:17 INFO BlockManager: Found block rdd_0_1 locally
E0322 17:33:17.626077 11905 socket.cpp:45] ERROR: Sending message header!
E0322 17:33:17.626099 11905 socket.cpp:345] ERROR: Sending data from client

Both servers can connect to each other over SSH without a password, using either full domain names or just the server names.
Both servers run Ubuntu 14.04.

Yarn executor ends at executor.CoarseGrainedExecutorBackend in CPU MODE

I am trying to run the given examples in distributed Spark mode with both CPU and GPU, but both jobs fail.

The CPU-mode job gives the following error:

16/06/10 14:42:29 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/06/10 14:42:29 INFO spark.SecurityManager: Changing view acls to: ubuntu
16/06/10 14:42:29 INFO spark.SecurityManager: Changing modify acls to: ubuntu
16/06/10 14:42:29 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); users with modify permissions: Set(ubuntu)
16/06/10 14:42:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM

Any design details to be exposed about CaffeOnSpark?

I find this project really useful and want to do some interesting things with it, as well as submit some pull requests. However, the related materials are so limited that we have to bite the bullet and read the source code for a systematic understanding. That is not an easy effort, and worse, we may develop misunderstandings along the way.

So my question is: will any design details about this project be published in the near future? Or could you open-source some detailed framework references, like Caffe's?

I'd really appreciate any useful comments!

coding mistake in ImageDataSource.scala

caffe-grid/src/main/scala/com/yahoo/ml/caffe/ImageDataSource.scala:121
if (mat.width() != sample_width || mat.height() != sample_width) {
The second sample_width should be sample_height; see the sketch below.
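
A minimal, self-contained Scala sketch of the corrected comparison. The names (width, height, sampleWidth, sampleHeight) mirror the report above and are illustrative only; this is not the actual ImageDataSource code.

object DimensionCheck {
  // Returns true when an image does NOT match the expected sample size.
  // The buggy version compared both dimensions against sampleWidth;
  // the second comparison must use sampleHeight.
  def needsResize(width: Int, height: Int, sampleWidth: Int, sampleHeight: Int): Boolean =
    width != sampleWidth || height != sampleHeight

  def main(args: Array[String]): Unit = {
    println(needsResize(28, 32, 28, 28)) // true: height differs from sampleHeight
    println(needsResize(28, 32, 28, 32)) // false: both dimensions match
  }
}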

no caffedistri in java.library.path

WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, melodyrain-node2): java.lang.UnsatisfiedLinkError: no caffedistri in java.library.path

Does the WARN matter? I got

WARN server.TransportChannelHandler: Exception in connection from melodyrain-node2/192.168.0.103:33104
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)

later and the program stops.
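
A minimal sketch (not part of CaffeOnSpark) that can help diagnose this on a worker node: it issues the same System.loadLibrary("caffedistri") call that com.yahoo.ml.jcaffe.BaseObject makes (visible in the stack traces elsewhere on this page), so it fails the same way whenever java.library.path / LD_LIBRARY_PATH does not reach libcaffedistri.so or one of its shared-library dependencies.

object NativeLibCheck {
  def main(args: Array[String]): Unit = {
    try {
      // Same call that CaffeOnSpark's BaseObject performs when its class is first loaded.
      System.loadLibrary("caffedistri")
      println("libcaffedistri loaded successfully")
    } catch {
      case e: UnsatisfiedLinkError =>
        println(s"failed to load libcaffedistri: ${e.getMessage}")
        println(s"java.library.path = ${System.getProperty("java.library.path")}")
    }
  }
}

Run it with -Djava.library.path pointing at the caffe-public/distribute/lib and caffe-distri/distribute/lib directories on the node that loses the task.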

Getting error while running in yarn mode

16/06/10 14:32:20 INFO caffe.CaffeProcessor: my rank is 1
16/06/10 14:32:20 INFO caffe.LMDB: Batch size:64
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0610 14:32:20.571682 10978 CaffeNet.cpp:67] Check failed: d >= 0 (-1 vs. 0) cannot grab GPU device

The command used is:

spark-submit --master yarn --deploy-mode cluster --num-executors ${SPARK_WORKER_INSTANCES} --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" --class com.yahoo.ml.caffe.CaffeOnSpark ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar -train -features accuracy,loss -label label -conf lenet_memory_solver.prototxt -devices ${DEVICES} -connection ethernet -model hdfs:///mnist.model -output hdfs:///mnist_features_result

Please advise.

AMI updated to new code

Is the AMI ami-6373ca10 updated with the latest code? If not, what are the steps to bring it up to the latest development?

Remote RPC client disassociated

We executed the MNIST autoencoder from the Caffe examples on two different Spark clusters, each with 2 slaves: g2.2xlarge (1 GPU, 8 vCPUs) and g2.8xlarge (4 GPUs, 32 vCPUs). We had already modified the prototxt files to use MemoryData.
Running the following CLI in both clusters,


spark-submit --master spark://$(hostname):7077 \
--files mnist_memory_autoencoder_solver.prototxt,mnist_memory_autoencoder.prototxt \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
--conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
--class com.yahoo.ml.caffe.CaffeOnSpark \
${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
-train -persistent \
-features accuracy,loss -label label \
-conf mnist_memory_autoencoder_solver.prototxt \
-clusterSize ${SPARK_WORKER_INSTANCES} \
-devices ${DEVICES} \
-connection ethernet \
-model /mnist_memory.model \
-output /mnist_model_autoencoder


we got the following error:

16/03/31 13:57:36 INFO TaskSetManager: Starting task 1.3 in stage 0.0 (TID 6, ip-172-31-6-221.eu-west-1.compute.internal, partition 1,PROCESS_LOCAL, 2216 bytes)
16/03/31 13:57:36 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 7, ip-172-31-6-221.eu-west-1.compute.internal, partition 0,PROCESS_LOCAL, 2216 bytes)
16/03/31 13:57:37 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-6-221.eu-west-1.compute.internal:35232 with 37.9 GB RAM, BlockManagerId(4, ip-172-31-6-221.eu-west-1.compute.internal, 35232)
16/03/31 13:57:37 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-172-31-6-221.eu-west-1.compute.internal:35232 (size: 2.0 KB, free: 37.9 GB)
16/03/31 13:57:39 ERROR TaskSchedulerImpl: Lost executor 4 on ip-172-31-6-221.eu-west-1.compute.internal: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
16/03/31 13:57:39 WARN TaskSetManager: Lost task 0.3 in stage 0.0 (TID 7, ip-172-31-6-221.eu-west-1.compute.internal): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
16/03/31 13:57:39 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
16/03/31 13:57:39 WARN TaskSetManager: Lost task 1.3 in stage 0.0 (TID 6, ip-172-31-6-221.eu-west-1.compute.internal): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
16/03/31 13:57:39 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/03/31 13:57:39 INFO TaskSchedulerImpl: Cancelling stage 0
16/03/31 13:57:39 INFO DAGScheduler: ResultStage 0 (collect at CaffeOnSpark.scala:125) failed in 11.747 s
16/03/31 13:57:39 INFO DAGScheduler: Job 0 failed: collect at CaffeOnSpark.scala:125, took 11.988590 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 7, ip-172-31-6-221.eu-west-1.compute.internal): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

Additionally, we successfully ran this same example (with layer type "Data" instead of "MemoryData") in a single notebook.

Could you help us?

Thanks!

the program has been stuck at: min at CaffeOnSpark.scala

When I set num-executors=5, the program gets stuck there. I have three computers; each computer has eight GPU cards, 20 cores, and 120 GB of memory. How should I set the parameters to use all of these resources?

The configuration is as follows:
export LD_LIBRARY_PATH=/home/hadoop/projects/CaffeOnSpark/caffe-public/distribute/lib:/home/hadoop/projects/CaffeOnSpark/caffe-distri/distribute/lib
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda-7.5/lib64:/opt/OpenBLAS/lib
export SPARK_WORKER_INSTANCES=2
export DEVICES=1

spark-submit --master yarn --deploy-mode cluster \
--num-executors 5 \
--executor-cores 4 \
--driver-memory 35g \
--executor-memory 35g \
--conf spark.scheduler.maxRegisteredResourcesWaitingTime=240s \
--files /home/hadoop/projects/CaffeOnSpark/data/lenet_memory_solver.prototxt,/home/hadoop/projects/CaffeOnSpark/data/lenet_memory_train_test.prototxt \
--conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
--conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
--class com.yahoo.ml.caffe.CaffeOnSpark \
/home/hadoop/projects/CaffeOnSpark/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
-train \
-features accuracy,loss -label label \
-conf lenet_memory_solver.prototxt \
-devices 5 \
-connection ethernet \
-model test1 \
-output testresult1

The log from the executors:

com.yahoo.ml.jcaffe.CaffeNet.sync(Native Method)
com.yahoo.ml.caffe.CaffeProcessor.sync(CaffeProcessor.scala:156)
com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$6.apply(CaffeOnSpark.scala:176)
com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$6.apply(CaffeOnSpark.scala:171)
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
org.apache.spark.scheduler.Task.run(Task.scala:89)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:724)

Training ImageNet on CaffeOnSpark

Hello,

I wanted to study the scalability properties of CaffeOnSpark, and neither MNIST nor CIFAR-10 seems a good example, because increasing the number of GPUs/machines does not translate into a significant speedup.
So I want to try a more complex model with more data, and ImageNet feels like a good candidate. Given that ImageNet has >100 GB of training data, do you see any problems in porting that example to CaffeOnSpark? Any must-know suggestions would be helpful.

If it proves helpful, I am willing to contribute back the run instructions and the necessary files for the ImageNet example.

BUILD SUCCESS, but error follows.

Hello there. First of all thank you guys for developing CaffeOnSpark. It is a great step in the right direction.

I was able to build CaffeOnSpark successfully:

[INFO] caffe ............................................. SUCCESS [0.001s]
[INFO] caffe-distri ...................................... SUCCESS [1:57.821s]
[INFO] caffe-grid ........................................ SUCCESS [3:05.423s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 5:03.547s
[INFO] Finished at: Sat Apr 23 21:36:26 UTC 2016
[INFO] Final Memory: 55M/396M

The Issue:
However, an error follows and the installation stops. Is this something that I should worry about?

jar -xvf caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar META-INF/native/linux64/liblmdbjni.so
 inflated: META-INF/native/linux64/liblmdbjni.so
mv META-INF/native/linux64/liblmdbjni.so /home/cc/CaffeOnSpark/caffe-distri/distribute/lib
cp -r /home/cc/CaffeOnSpark/caffe-public/python/caffe /home/cc/CaffeOnSpark/caffe-grid/src/main/python/
cd /home/cc/CaffeOnSpark/caffe-grid/src/main/python/; zip -r caffeonsparkpythonapi  *; mv caffeonsparkpythonapi.zip /home/cc/CaffeOnSpark/caffe-grid/target/;cd /home/cc/CaffeOnSpark
/bin/sh: 1: zip: not found
mv: cannot stat ‘caffeonsparkpythonapi.zip’: No such file or directory

Thanks !

Spark 2.0 Branch / Support [Enhancement]

I added a PR for Spark 2.0 that uses SparkSession instead of SparkContext. In addition, the libraries were moved to Scala 2.11 and Hadoop 2.7.1 to be more in line with the Spark 2.x direction. The tests were run and pass.

#77
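
For readers unfamiliar with the change, here is a minimal sketch of the SparkSession-based entry point that Spark 2.x encourages (generic Spark API, not code taken from the PR; the .master("local[*]") setting is only for a local smoke test and is normally supplied by spark-submit):

import org.apache.spark.sql.SparkSession

object SessionEntryPoint {
  def main(args: Array[String]): Unit = {
    // Spark 2.x builds a SparkSession; the SparkContext that Spark 1.x code
    // constructed directly is still reachable through it.
    val spark = SparkSession.builder()
      .appName("SparkSession entry point sketch")
      .master("local[*]") // local smoke test only; spark-submit normally sets the master
      .getOrCreate()

    val sc = spark.sparkContext
    println(s"Running Spark ${sc.version}")

    spark.stop()
  }
}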

Java API for CaffeOnSpark

Do you have this feature on your roadmap? It would be useful, since Spark supports both Python and Java APIs.
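
For context, a rough sketch of how a Spark application drives training through the current Scala entry point, which a Java API would presumably wrap. The CaffeOnSpark.train call is visible in the stack traces on this page; the CaffeOnSpark constructor, Config, and DataSource.getSource helpers and their exact signatures are recalled from the project's samples and should be treated as assumptions rather than verified API.

import org.apache.spark.{SparkConf, SparkContext}
import com.yahoo.ml.caffe.{CaffeOnSpark, Config, DataSource}

object ScalaApiSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CaffeOnSpark Scala API sketch"))

    // Config parses the same flags used in the spark-submit examples on this page
    // (-conf, -model, -devices, ...); getSource(conf, true) builds the training source.
    val conf = new Config(sc, args)
    val trainSource = DataSource.getSource(conf, true)

    // Train via the Scala API; a Java API would need an equivalent entry point.
    new CaffeOnSpark(sc).train(trainSource)

    sc.stop()
  }
}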

Fails mnist dataset example in local mode and standalone mode

My computer is a MacBook Pro Retina. I successfully installed CaffeOnSpark and tried to validate it using the MNIST dataset example on the wiki page. However, no matter which mode I use, the execution fails. Below are two different errors, for local and standalone mode. For the sake of simplicity, I only list the error parts; any suggestion would be appreciated.

First, standalone mode, the command I type is:

spark-submit --master ${MASTER_URL} \
>     --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
>     --conf spark.cores.max=${TOTAL_CORES} \
>     --conf spark.task.cpus=${CORES_PER_WORKER} \
>     --conf spark.driver.extraLibraryPath="${DYLD_LIBRARY_PATH}" \
>     --conf spark.executorEnv.DYLD_LIBRARY_PATH="${DYLD_LIBRARY_PATH}" \
>     --class com.yahoo.ml.caffe.CaffeOnSpark  \
>     ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
>         -train \
>         -features accuracy,loss -label label \
>         -conf lenet_memory_solver.prototxt \
>     -clusterSize ${SPARK_WORKER_INSTANCES} \
>         -devices 1 \
>     -connection ethernet \
>         -model file:${CAFFE_ON_SPARK}/mnist_lenet.model \
>         -output file:${CAFFE_ON_SPARK}/lenet_features_result 

error code for standalone mode:

Exception in thread "main" org.apache.spark.SparkException: addFile does not support local directories when not running local mode.
    at org.apache.spark.SparkContext.addFile(SparkContext.scala:1368)
    at com.yahoo.ml.caffe.LmdbRDD.getPartitions(LmdbRDD.scala:44)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at com.yahoo.ml.caffe.CaffeOnSpark.train(CaffeOnSpark.scala:158)
    at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:40)
    at com.yahoo.ml.caffe.CaffeOnSpark.main(CaffeOnSpark.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/06/13 22:40:56 INFO SparkContext: Invoking stop() from shutdown hook
16/06/13 22:40:56 INFO SparkUI: Stopped Spark web UI at http://192.168.38.105:4040
16/06/13 22:40:56 INFO SparkDeploySchedulerBackend: Shutting down all executors
16/06/13 22:40:56 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
16/06/13 22:40:56 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/06/13 22:40:56 INFO MemoryStore: MemoryStore cleared
16/06/13 22:40:56 INFO BlockManager: BlockManager stopped
16/06/13 22:40:56 INFO BlockManagerMaster: BlockManagerMaster stopped
16/06/13 22:40:56 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/06/13 22:40:56 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/06/13 22:40:56 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/06/13 22:40:56 INFO SparkContext: Successfully stopped SparkContext
16/06/13 22:40:56 INFO ShutdownHookManager: Shutdown hook called
16/06/13 22:40:56 INFO ShutdownHookManager: Deleting directory /private/var/folders/cy/7_kx0w3j6ts0w99_cl8y5lnh0000gp/T/spark-83f3d650-9edb-4baa-a50a-9cd8f1f0395b
16/06/13 22:40:56 INFO ShutdownHookManager: Deleting directory /private/var/folders/cy/7_kx0w3j6ts0w99_cl8y5lnh0000gp/T/spark-83f3d650-9edb-4baa-a50a-9cd8f1f0395b/httpd-aea4f61e-bf7e-45d2-abd2-5e42218da9c2

For local mode, the command is:
spark-submit --master local[5] --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt --conf spark.driver.extraLibraryPath="${DYLD_LIBRARY_PATH}" --conf spark.executorEnv.DYLD_LIBRARY_PATH="${DYLD_LIBRARY_PATH}" --class com.yahoo.ml.caffe.CaffeOnSpark ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar -train -features accuracy,loss -label label -conf lenet_memory_solver.prototxt -connection ethernet -model file:${CAFFE_ON_SPARK}/mnist_lenet.model -output file:${CAFFE_ON_SPARK}/lenet_features_result

org.apache.spark.SparkException: File /private/var/folders/cy/7_kx0w3j6ts0w99_cl8y5lnh0000gp/T/spark-b3d79d43-ca35-4e26-abc9-29a1291e102b/userFiles-a7d52b0b-8ce2-4eb0-9498-d6506b5f69b7/mnist_train_lmdb exists and does not match contents of file:/MyInstalledLibraries/caffe/examples/mnist/mnist_train_lmdb/
at org.apache.spark.util.Utils$.copyFile(Utils.scala:489)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:595)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:394)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:393)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:390)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:390)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/06/13 23:12:23 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, localhost): org.apache.spark.SparkException: File /private/var/folders/cy/7_kx0w3j6ts0w99_cl8y5lnh0000gp/T/spark-b3d79d43-ca35-4e26-abc9-29a1291e102b/userFiles-a7d52b0b-8ce2-4eb0-9498-d6506b5f69b7/mnist_train_lmdb exists and does not match contents of file:/MyInstalledLibraries/caffe/examples/mnist/mnist_train_lmdb/
at org.apache.spark.util.Utils$.copyFile(Utils.scala:489)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:595)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:394)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:393)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:390)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:390)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

16/06/13 23:12:23 ERROR TaskSetManager: Task 0 in stage 2.0 failed 1 times; aborting job
16/06/13 23:12:23 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
16/06/13 23:12:23 INFO TaskSchedulerImpl: Cancelling stage 2
16/06/13 23:12:23 INFO DAGScheduler: ResultStage 2 (reduce at CaffeOnSpark.scala:210) failed in 0.227 s
16/06/13 23:12:23 INFO DAGScheduler: Job 2 failed: reduce at CaffeOnSpark.scala:210, took 0.268631 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, localhost): org.apache.spark.SparkException: File /private/var/folders/cy/7_kx0w3j6ts0w99_cl8y5lnh0000gp/T/spark-b3d79d43-ca35-4e26-abc9-29a1291e102b/userFiles-a7d52b0b-8ce2-4eb0-9498-d6506b5f69b7/mnist_train_lmdb exists and does not match contents of file:/MyInstalledLibraries/caffe/examples/mnist/mnist_train_lmdb/
at org.apache.spark.util.Utils$.copyFile(Utils.scala:489)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:595)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:394)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:393)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:390)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:390)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
at com.yahoo.ml.caffe.CaffeOnSpark.train(CaffeOnSpark.scala:210)
at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:40)
at com.yahoo.ml.caffe.CaffeOnSpark.main(CaffeOnSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: File /private/var/folders/cy/7_kx0w3j6ts0w99_cl8y5lnh0000gp/T/spark-b3d79d43-ca35-4e26-abc9-29a1291e102b/userFiles-a7d52b0b-8ce2-4eb0-9498-d6506b5f69b7/mnist_train_lmdb exists and does not match contents of file:/MyInstalledLibraries/caffe/examples/mnist/mnist_train_lmdb/
at org.apache.spark.util.Utils$.copyFile(Utils.scala:489)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:595)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:394)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:393)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:390)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:390)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/06/13 23:12:23 INFO SparkContext: Invoking stop() from shutdown hook
16/06/13 23:12:23 INFO SparkUI: Stopped Spark web UI at http://192.168.38.105:4040
16/06/13 23:12:23 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/06/13 23:12:23 INFO MemoryStore: MemoryStore cleared
16/06/13 23:12:23 INFO BlockManager: BlockManager stopped
16/06/13 23:12:23 INFO BlockManagerMaster: BlockManagerMaster stopped
16/06/13 23:12:23 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/06/13 23:12:23 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/06/13 23:12:23 INFO SparkContext: Successfully stopped SparkContext
16/06/13 23:12:23 INFO ShutdownHookManager: Shutdown hook called
16/06/13 23:12:23 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/06/13 23:12:23 INFO ShutdownHookManager: Deleting directory /private/var/folders/cy/7_kx0w3j6ts0w99_cl8y5lnh0000gp/T/spark-b3d79d43-ca35-4e26-abc9-29a1291e102b/httpd-da474848-66b7-415d-b201-2d5077d90d62
16/06/13 23:12:23 INFO ShutdownHookManager: Deleting directory /private/var/folders/cy/7_kx0w3j6ts0w99_cl8y5lnh0000gp/T/spark-b3d79d43-ca35-4e26-abc9-29a1291e102b

Null Pointer Exception on Running CIFAR-10 example

Hello,

I was able to run the MNIST example on an AWS cluster following the steps in the README. I then tried to run the CIFAR-10 example following these steps:

  1. Downloading the data in LMDB format and the mean image to the master and both slaves
  2. Modifying the data layer to a MemoryData layer, as shown in the attached files:
    cifar10_quick_solver.prototxt.txt
    cifar10_quick_train_test.prototxt.txt

I am getting the below exceptions
16/02/28 00:11:58 WARN TransportChannelHandler: Exception in connection from ip-172-31-29-90.eu-west-1.compute.internal/172.31.29.90:46325
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
16/02/28 00:11:58 ERROR TaskSchedulerImpl: Lost executor 0 on ip-172-31-29-90.eu-west-1.compute.internal: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
AND

16/02/28 00:12:00 WARN TaskSetManager: Lost task 1.1 in stage 1.0 (TID 5, ip-172-31-29-90.eu-west-1.compute.internal): java.lang.NullPointerException
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply$mcVI$sp(CaffeOnSpark.scala:158)
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply(CaffeOnSpark.scala:154)
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply(CaffeOnSpark.scala:154)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

It would be great if you could guide me in solving this.

Does libcaffedistri.so depend on libboost_system.so.1.54.0?

In a cluster environment, after running the command below:
spark-submit --master ${MASTER_URL} --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt --conf spark.cores.max=${TOTAL_CORES} --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" --class com.yahoo.ml.caffe.CaffeOnSpark ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar -train -features accuracy,loss -label label -conf lenet_memory_solver.prototxt -clusterSize ${SPARK_WORKER_INSTANCES} -devices 1 -connection ethernet -model file:${CAFFE_ON_SPARK}/mnist_lenet.model -output file:${CAFFE_ON_SPARK}/lenet_features_result

the following error message appeared in the slave node logs.

16/03/15 19:26:44 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1910.0 B, free 4.8 KB)
16/03/15 19:26:44 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.110.52.32:44175 (size: 1910.0 B, free: 457.9 MB)
16/03/15 19:26:44 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
16/03/15 19:26:44 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at map at CaffeOnSpark.scala:121)
16/03/15 19:26:44 INFO cluster.YarnClusterScheduler: Adding task set 0.0 with 1 tasks
16/03/15 19:26:44 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, yuntu2, partition 0,PROCESS_LOCAL, 1966 bytes)
16/03/15 19:26:44 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on yuntu2:42173 (size: 1910.0 B, free: 511.5 MB)
16/03/15 19:26:45 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, yuntu2): java.lang.UnsatisfiedLinkError: /home/atlas/work/caffe_spark/CaffeOnSpark-master/caffe-distri/distribute/lib/libcaffedistri.so: libboost_system.so.1.54.0: cannot open shared object file: No such file or directory
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary1(ClassLoader.java:1965)
at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1890)
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1880)
at java.lang.Runtime.loadLibrary0(Runtime.java:849)
at java.lang.System.loadLibrary(System.java:1088)
at com.yahoo.ml.jcaffe.BaseObject.<clinit>(BaseObject.java:10)
at com.yahoo.ml.caffe.CaffeProcessor.<init>(CaffeProcessor.scala:55)
at com.yahoo.ml.caffe.CaffeProcessor$.instance(CaffeProcessor.scala:21)
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$4.apply(CaffeOnSpark.scala:123)
at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$4.apply(CaffeOnSpark.scala:121)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

16/03/15 19:26:45 INFO scheduler.TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1, yuntu2, partition 0,PROCESS_LOCAL, 1966 bytes)

I have 3 slave nodes: one runs Ubuntu 14.04.3 LTS, and the other two run Ubuntu 14.10.

The libboost versions on them differ:

Ubuntu 14.04.3 LTS -----> libboost_system.so.1.54
Ubuntu 14.10 -----> libboost_system.so.1.55

I'd like to know which version of Ubuntu you test on.
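One way to confirm the mismatch (a minimal sketch, assuming libcaffedistri.so was built on the 14.04 node and therefore linked against Boost 1.54) is to inspect the native library's dependencies on each slave and align the Boost runtime accordingly:

# On each slave, list the Boost libraries the native library was linked against.
ldd ${CAFFE_ON_SPARK}/caffe-distri/distribute/lib/libcaffedistri.so | grep boost

# On the Ubuntu 14.10 nodes, install the Boost 1.54 runtime the library expects,
# or rebuild CaffeOnSpark against the Boost version those nodes already have.
# (The package name below is an assumption; check "apt-cache search libboost-system".)
sudo apt-get install libboost-system1.54.0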

Job hangs in ACCEPTED state: waiting for AM container to be allocated, launched and registered with RM

16/03/11 16:09:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/11 16:09:25 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/03/11 16:09:26 INFO yarn.Client: Requesting a new application from cluster with 0 NodeManagers
16/03/11 16:09:26 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
16/03/11 16:09:26 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
16/03/11 16:09:26 INFO yarn.Client: Setting up container launch context for our AM
16/03/11 16:09:26 INFO yarn.Client: Setting up the launch environment for our AM container
16/03/11 16:09:26 INFO yarn.Client: Preparing resources for our AM container
16/03/11 16:09:26 INFO yarn.Client: Uploading resource file:/home/atlas/work/caffe_spark/CaffeOnSpark-master/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar -> hdfs://master:9000/user/atlas/.sparkStaging/application_1457683710951_0001/spark-assembly-1.6.0-hadoop2.6.0.jar
16/03/11 16:09:43 INFO yarn.Client: Uploading resource file:/home/atlas/work/caffe_spark/CaffeOnSpark-master/spark-1.6.0-bin-hadoop2.6/lib/spark-examples-1.6.0-hadoop2.6.0.jar -> hdfs://master:9000/user/atlas/.sparkStaging/application_1457683710951_0001/spark-examples-1.6.0-hadoop2.6.0.jar
16/03/11 16:09:53 INFO yarn.Client: Uploading resource file:/tmp/spark-186580ef-7b45-4d23-a810-8329df0d983e/__spark_conf__5049458426184257601.zip -> hdfs://master:9000/user/atlas/.sparkStaging/application_1457683710951_0001/__spark_conf__5049458426184257601.zip
16/03/11 16:09:54 INFO spark.SecurityManager: Changing view acls to: atlas
16/03/11 16:09:54 INFO spark.SecurityManager: Changing modify acls to: atlas
16/03/11 16:09:54 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(atlas); users with modify permissions: Set(atlas)
16/03/11 16:09:54 INFO yarn.Client: Submitting application 1 to ResourceManager
16/03/11 16:09:54 INFO impl.YarnClientImpl: Submitted application application_1457683710951_0001
16/03/11 16:09:55 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:09:55 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1457683794747
final status: UNDEFINED
tracking URL: http://master:8088/proxy/application_1457683710951_0001/
user: atlas
16/03/11 16:09:56 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:09:57 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:09:58 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:09:59 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:10:00 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:10:01 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:10:02 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:10:03 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:10:04 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:10:05 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:10:06 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:10:07 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:10:08 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
16/03/11 16:10:09 INFO yarn.Client: Application report for application_1457683710951_0001 (state: ACCEPTED)
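The line "Requesting a new application from cluster with 0 NodeManagers" indicates that no NodeManager has registered with the ResourceManager, so YARN has nowhere to allocate the ApplicationMaster container and the application stays in ACCEPTED. A quick check with standard YARN commands (a sketch; daemon start scripts depend on your Hadoop install):

# List the NodeManagers the ResourceManager currently knows about.
# An empty list means the NodeManager daemons are down or pointed at the wrong RM address.
yarn node -list

# Start a NodeManager on each worker if it is not running.
$HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager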

ERROR: Read partial messageheader

I run CaffeOnSpark, but the program sometimes fails to complete. The error is:

16/03/21 17:28:22 INFO storage.BlockManager: Found block rdd_0_1 locally
E0321 17:28:24.265156 2261 socket.cpp:61] ERROR: Read partial messageheader [8 of 12]
16/03/21 17:29:22 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
16/03/21 17:29:22 INFO storage.DiskBlockManager: Shutdown hook called
16/03/21 17:29:22 INFO util.ShutdownHookManager: Shutdown hook called

error while building

While building, I am getting the following error:

src/main/cpp/tools/caffe_mini_cluster.cpp: In function ‘int main(int, char**)’:
[exec] src/main/cpp/tools/caffe_mini_cluster.cpp:285:14: error: expected type-specifier before ‘bp’
[exec] } catch (bp::error_already_set) {
[exec] ^
[exec] src/main/cpp/tools/caffe_mini_cluster.cpp:285:16: error: expected ‘)’ before ‘::’ token
[exec] } catch (bp::error_already_set) {
[exec]
[exec] ^
[exec] src/main/cpp/tools/caffe_mini_cluster.cpp:285:16: error: expected ‘{’ before ‘::’ token
[exec] src/main/cpp/tools/caffe_mini_cluster.cpp:285:16: error: ‘::error_already_set’ has not been declared
[exec] src/main/cpp/tools/caffe_mini_cluster.cpp:285:35: error: expected ‘;’ before ‘)’ token
[exec] } catch (bp::error_already_set) {
[exec] ^
[exec] src/main/cpp/tools/caffe_mini_cluster.cpp: In function ‘caffe::SolverAction::Enum GetRequestedAction(const string&)’:
[exec] src/main/cpp/tools/caffe_mini_cluster.cpp:168:1: warning: control reaches end of non-void function [-Wreturn-type]
[exec] }
[exec] ^
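For context, bp in this file is conventionally an alias for the boost::python namespace, so "expected type-specifier before 'bp'" usually means the Boost.Python declarations are not visible where that catch block sits (for example, the Python-related headers were not found, or Python support was compiled out while the catch was not). A couple of checks before rebuilding (a sketch assuming an Ubuntu host; package names may differ):

# Verify the Boost.Python and Python development packages are installed.
dpkg -l | grep -E 'libboost-python|python-dev'

# Confirm the alias the compiler complains about is declared in the file being built.
grep -n "namespace bp" src/main/cpp/tools/caffe_mini_cluster.cpp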

My Makefile.config file looks like:

# Refer to http://caffe.berkeleyvision.org/installation.html
# Contributions simplifying and improving our build system are welcome!

# cuDNN acceleration switch (uncomment to build with cuDNN).
USE_CUDNN := 1

# CPU-only switch (uncomment to build without GPU support).
CPU_ONLY := 1

# Parallelization over InfiniBand or RoCE
INFINIBAND := 1

# uncomment to disable IO dependencies and corresponding data layers
USE_OPENCV := 1
USE_LEVELDB := 0
USE_LMDB := 0

# uncomment to allow MDB_NOLOCK when reading LMDB files (only if necessary)
# You should not set this flag if you will be reading LMDBs with any
# possibility of simultaneous read and write
ALLOW_LMDB_NOLOCK := 1

# Uncomment if you're using OpenCV 3
OPENCV_VERSION := 3

# To customize your choice of compiler, uncomment and set the following.
# N.B. the default for Linux is g++ and the default for OSX is clang++
CUSTOM_CXX := g++

# CUDA directory contains bin/ and lib/ directories that we need.
CUDA_DIR := /usr/local/cuda
# On Ubuntu 14.04, if cuda tools are installed via
# "sudo apt-get install nvidia-cuda-toolkit" then use this instead:
CUDA_DIR := /usr

# CUDA architecture setting: going with all of them.
# For CUDA < 6.0, comment the *_50 lines for compatibility.
CUDA_ARCH := -gencode arch=compute_20,code=sm_20 \
        -gencode arch=compute_20,code=sm_21 \
        -gencode arch=compute_30,code=sm_30 \
        -gencode arch=compute_35,code=sm_35 \
        -gencode arch=compute_50,code=sm_50 \
        -gencode arch=compute_50,code=compute_50

# BLAS choice:
# atlas for ATLAS (default)
# mkl for MKL
# open for OpenBlas
BLAS := atlas

# Custom (MKL/ATLAS/OpenBLAS) include and lib directories.
# Leave commented to accept the defaults for your choice of BLAS
# (which should work)!
BLAS_INCLUDE := /path/to/your/blas
BLAS_LIB := /path/to/your/blas

# Homebrew puts openblas in a directory that is not on the standard search path
BLAS_INCLUDE := $(shell brew --prefix openblas)/include
BLAS_LIB := $(shell brew --prefix openblas)/lib

# This is required only if you will compile the matlab interface.
# MATLAB directory should contain the mex binary in /bin.
MATLAB_DIR := /usr/local
MATLAB_DIR := /Applications/MATLAB_R2012b.app

# NOTE: this is required only if you will compile the python interface.
# We need to be able to find Python.h and numpy/arrayobject.h.
PYTHON_INCLUDE := /usr/include/python2.7 \
        /usr/lib/python2.7/dist-packages/numpy/core/include

# Anaconda Python distribution is quite popular. Include path:
# Verify anaconda location, sometimes it's in root.
ANACONDA_HOME := /opt/anaconda2
PYTHON_INCLUDE := $(ANACONDA_HOME)/include \
        $(ANACONDA_HOME)/include/python2.7 \
        $(ANACONDA_HOME)/lib/python2.7/site-packages/numpy/core/include \

# Uncomment to use Python 3 (default is Python 2)
PYTHON_LIBRARIES := boost_python3 python3.5m
PYTHON_INCLUDE := /usr/include/python3.5m \
        /usr/lib/python3.5/dist-packages/numpy/core/include

# We need to be able to find libpythonX.X.so or .dylib.
PYTHON_LIB := /usr/lib
PYTHON_LIB := $(ANACONDA_HOME)/lib

# Homebrew installs numpy in a non standard path (keg only)
PYTHON_INCLUDE += $(dir $(shell python -c 'import numpy.core; print(numpy.core.__file__)'))/include
PYTHON_LIB += $(shell brew --prefix numpy)/lib

# Uncomment to support layers written in Python (will link against Python libs)
WITH_PYTHON_LAYER := 1

# Whatever else you find you need goes here.
INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include
LIBRARY_DIRS := $(PYTHON_LIB) /usr/local/lib /usr/lib

# If Homebrew is installed at a non standard location (for example your home directory) and you use it for general dependencies
INCLUDE_DIRS += $(shell brew --prefix)/include
LIBRARY_DIRS += $(shell brew --prefix)/lib

# Uncomment to use pkg-config to specify OpenCV library paths.
# (Usually not necessary -- OpenCV libraries are normally installed in one of the above $LIBRARY_DIRS.)
USE_PKG_CONFIG := 1

BUILD_DIR := build
DISTRIBUTE_DIR := distribute

# Uncomment for debugging. Does not work on OSX due to BVLC/caffe#171
DEBUG := 1

# The ID of the GPU that 'make runtest' will use to run unit tests.
TEST_GPUID := 0

# enable pretty build (comment to see full commands)
Q ?= @
INCLUDE_DIRS += /usr/lib/jvm/java-1.7.0-openjdk-amd64/include

Please help

java.lang.IllegalStateException: RpcEnv already stopped.

I have a 3-node YARN cluster; each node has an 8-core CPU, 8 GB of RAM available to the NodeManager, and one GPU.

When I submit the job (the MNIST example given in the wiki page) on the cluster with the following command, I get an error:

spark-submit --master yarn --deploy-mode cluster --num-executors ${SPARK_WORKER_INSTANCES} --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" --class com.yahoo.ml.caffe.CaffeOnSpark ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar -train -features accuracy,loss -label label -conf lenet_memory_solver.prototxt -devices 1 -connection ethernet -model hdfs:///mnist.model -output hdfs:///mnist_features_result

The error is:

Exception in thread "main" java.lang.RuntimeException: java.lang.IllegalStateException: RpcEnv already stopped.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:159)
at org.apache.spark.rpc.netty.Dispatcher.postRemoteMessage(Dispatcher.scala:118)
at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:571)
at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:149)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)

at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:186)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:106)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)

However, when I run it on one machine it runs successfully. Can you please help?
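One difference from the standalone submit command shown earlier in this section is that this invocation never passes -clusterSize, so the number of Caffe peers may not match the number of executors. A sketch of the YARN submission with the two kept consistent (values are taken from this report; this is not a confirmed fix for the RpcEnv error):

export SPARK_WORKER_INSTANCES=3

spark-submit --master yarn --deploy-mode cluster \
    --num-executors ${SPARK_WORKER_INSTANCES} \
    --files ${CAFFE_ON_SPARK}/data/lenet_memory_solver.prototxt,${CAFFE_ON_SPARK}/data/lenet_memory_train_test.prototxt \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train -features accuracy,loss -label label \
    -conf lenet_memory_solver.prototxt \
    -clusterSize ${SPARK_WORKER_INSTANCES} \
    -devices 1 -connection ethernet \
    -model hdfs:///mnist.model -output hdfs:///mnist_features_result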

got error java.lang.NullPointerException

Can someone help me out with this? I've created my own caffemodel with 20,000 images:

my Train lmdb:
root@ip-172-31-15-194:~/CaffeOnSpark/data/faces# ls -l --block-size=M TrainLmdb/
total 2642M
-rwxrwxrwx 1 root root 2642M Apr 4 23:51 data.mdb
-rwxrwxrwx 1 root root 1M Apr 6 11:20 lock.mdb

my Val lmdb:
root@ip-172-31-15-194:~/CaffeOnSpark/data/faces# ls -l --block-size=M ValLmdb/
total 1132M
-rwxrwxrwx 1 root root 1132M Apr 4 23:53 data.mdb
-rwxrwxrwx 1 root root 1M Apr 4 23:53 lock.mdb

2 slaves => g2.2xlarge
1 master => m4.xlarge

I0406 11:09:55.750506 3134 layer_factory.hpp:77] Creating layer data
I0406 11:09:55.750524 3134 net.cpp:106] Creating Layer data
I0406 11:09:55.750530 3134 net.cpp:411] data -> data
I0406 11:09:55.750538 3134 net.cpp:411] data -> label
I0406 11:09:55.751267 3134 net.cpp:150] Setting up data
I0406 11:09:55.751286 3134 net.cpp:157] Top shape: 100 3 32 32 (307200)
I0406 11:09:55.751291 3134 net.cpp:157] Top shape: 100 (100)
I0406 11:09:55.751296 3134 net.cpp:165] Memory required for data: 1229200
I0406 11:09:55.751299 3134 layer_factory.hpp:77] Creating layer label_data_1_split
I0406 11:09:55.751312 3134 net.cpp:106] Creating Layer label_data_1_split
I0406 11:09:55.751322 3134 net.cpp:454] label_data_1_split <- label
I0406 11:09:55.751328 3134 net.cpp:411] label_data_1_split -> label_data_1_split_0
I0406 11:09:55.751337 3134 net.cpp:411] label_data_1_split -> label_data_1_split_1
I0406 11:09:55.751381 3134 net.cpp:150] Setting up label_data_1_split
I0406 11:09:55.751396 3134 net.cpp:157] Top shape: 100 (100)
I0406 11:09:55.751401 3134 net.cpp:157] Top shape: 100 (100)
I0406 11:09:55.751405 3134 net.cpp:165] Memory required for data: 1230000
I0406 11:09:55.751410 3134 layer_factory.hpp:77] Creating layer conv1
I0406 11:09:55.751418 3134 net.cpp:106] Creating Layer conv1
I0406 11:09:55.751422 3134 net.cpp:454] conv1 <- data
I0406 11:09:55.751431 3134 net.cpp:411] conv1 -> conv1
I0406 11:09:55.752434 3134 net.cpp:150] Setting up conv1
I0406 11:09:55.752456 3134 net.cpp:157] Top shape: 100 32 32 32 (3276800)
I0406 11:09:55.752461 3134 net.cpp:165] Memory required for data: 14337200
I0406 11:09:55.752471 3134 layer_factory.hpp:77] Creating layer pool1
I0406 11:09:55.752480 3134 net.cpp:106] Creating Layer pool1
I0406 11:09:55.752485 3134 net.cpp:454] pool1 <- conv1
I0406 11:09:55.752490 3134 net.cpp:411] pool1 -> pool1
I0406 11:09:55.752538 3134 net.cpp:150] Setting up pool1
I0406 11:09:55.752553 3134 net.cpp:157] Top shape: 100 32 16 16 (819200)
I0406 11:09:55.752557 3134 net.cpp:165] Memory required for data: 17614000
I0406 11:09:55.752562 3134 layer_factory.hpp:77] Creating layer relu1
I0406 11:09:55.752568 3134 net.cpp:106] Creating Layer relu1
I0406 11:09:55.752573 3134 net.cpp:454] relu1 <- pool1
I0406 11:09:55.752583 3134 net.cpp:397] relu1 -> pool1 (in-place)
I0406 11:09:55.752825 3134 net.cpp:150] Setting up relu1
I0406 11:09:55.752842 3134 net.cpp:157] Top shape: 100 32 16 16 (819200)
I0406 11:09:55.752846 3134 net.cpp:165] Memory required for data: 20890800
I0406 11:09:55.752851 3134 layer_factory.hpp:77] Creating layer conv2
I0406 11:09:55.752861 3134 net.cpp:106] Creating Layer conv2
I0406 11:09:55.752871 3134 net.cpp:454] conv2 <- pool1
I0406 11:09:55.752882 3134 net.cpp:411] conv2 -> conv2
I0406 11:09:55.754490 3134 net.cpp:150] Setting up conv2
I0406 11:09:55.754510 3134 net.cpp:157] Top shape: 100 32 16 16 (819200)
I0406 11:09:55.754514 3134 net.cpp:165] Memory required for data: 24167600
I0406 11:09:55.754524 3134 layer_factory.hpp:77] Creating layer relu2
I0406 11:09:55.754534 3134 net.cpp:106] Creating Layer relu2
I0406 11:09:55.754539 3134 net.cpp:454] relu2 <- conv2
I0406 11:09:55.754547 3134 net.cpp:397] relu2 -> conv2 (in-place)
I0406 11:09:55.754779 3134 net.cpp:150] Setting up relu2
I0406 11:09:55.754797 3134 net.cpp:157] Top shape: 100 32 16 16 (819200)
I0406 11:09:55.754802 3134 net.cpp:165] Memory required for data: 27444400
I0406 11:09:55.754806 3134 layer_factory.hpp:77] Creating layer pool2
I0406 11:09:55.754815 3134 net.cpp:106] Creating Layer pool2
I0406 11:09:55.754828 3134 net.cpp:454] pool2 <- conv2
I0406 11:09:55.754837 3134 net.cpp:411] pool2 -> pool2
I0406 11:09:55.755035 3134 net.cpp:150] Setting up pool2
I0406 11:09:55.755053 3134 net.cpp:157] Top shape: 100 32 8 8 (204800)
I0406 11:09:55.755056 3134 net.cpp:165] Memory required for data: 28263600
I0406 11:09:55.755060 3134 layer_factory.hpp:77] Creating layer conv3
I0406 11:09:55.755074 3134 net.cpp:106] Creating Layer conv3
I0406 11:09:55.755084 3134 net.cpp:454] conv3 <- pool2
I0406 11:09:55.755095 3134 net.cpp:411] conv3 -> conv3
I0406 11:09:55.757617 3134 net.cpp:150] Setting up conv3
I0406 11:09:55.757640 3134 net.cpp:157] Top shape: 100 64 8 8 (409600)
I0406 11:09:55.757644 3134 net.cpp:165] Memory required for data: 29902000
I0406 11:09:55.757654 3134 layer_factory.hpp:77] Creating layer relu3
I0406 11:09:55.757663 3134 net.cpp:106] Creating Layer relu3
I0406 11:09:55.757666 3134 net.cpp:454] relu3 <- conv3
I0406 11:09:55.757671 3134 net.cpp:397] relu3 -> conv3 (in-place)
I0406 11:09:55.757904 3134 net.cpp:150] Setting up relu3
I0406 11:09:55.757921 3134 net.cpp:157] Top shape: 100 64 8 8 (409600)
I0406 11:09:55.757925 3134 net.cpp:165] Memory required for data: 31540400
I0406 11:09:55.757930 3134 layer_factory.hpp:77] Creating layer pool3
I0406 11:09:55.757936 3134 net.cpp:106] Creating Layer pool3
I0406 11:09:55.757941 3134 net.cpp:454] pool3 <- conv3
I0406 11:09:55.757951 3134 net.cpp:411] pool3 -> pool3
I0406 11:09:55.758177 3134 net.cpp:150] Setting up pool3
I0406 11:09:55.758194 3134 net.cpp:157] Top shape: 100 64 4 4 (102400)
I0406 11:09:55.758198 3134 net.cpp:165] Memory required for data: 31950000
I0406 11:09:55.758203 3134 layer_factory.hpp:77] Creating layer ip1
I0406 11:09:55.758209 3134 net.cpp:106] Creating Layer ip1
I0406 11:09:55.758213 3134 net.cpp:454] ip1 <- pool3
I0406 11:09:55.758221 3134 net.cpp:411] ip1 -> ip1
I0406 11:09:55.760910 3134 net.cpp:150] Setting up ip1
I0406 11:09:55.760928 3134 net.cpp:157] Top shape: 100 64 (6400)
I0406 11:09:55.760933 3134 net.cpp:165] Memory required for data: 31975600
I0406 11:09:55.760941 3134 layer_factory.hpp:77] Creating layer ip2
I0406 11:09:55.760951 3134 net.cpp:106] Creating Layer ip2
I0406 11:09:55.760957 3134 net.cpp:454] ip2 <- ip1
I0406 11:09:55.760967 3134 net.cpp:411] ip2 -> ip2
I0406 11:09:55.761073 3134 net.cpp:150] Setting up ip2
I0406 11:09:55.761090 3134 net.cpp:157] Top shape: 100 2 (200)
I0406 11:09:55.761093 3134 net.cpp:165] Memory required for data: 31976400
I0406 11:09:55.761102 3134 layer_factory.hpp:77] Creating layer ip2_ip2_0_split
I0406 11:09:55.761111 3134 net.cpp:106] Creating Layer ip2_ip2_0_split
I0406 11:09:55.761113 3134 net.cpp:454] ip2_ip2_0_split <- ip2
I0406 11:09:55.761119 3134 net.cpp:411] ip2_ip2_0_split -> ip2_ip2_0_split_0
I0406 11:09:55.761126 3134 net.cpp:411] ip2_ip2_0_split -> ip2_ip2_0_split_1
I0406 11:09:55.761178 3134 net.cpp:150] Setting up ip2_ip2_0_split
I0406 11:09:55.761191 3134 net.cpp:157] Top shape: 100 2 (200)
I0406 11:09:55.761196 3134 net.cpp:157] Top shape: 100 2 (200)
I0406 11:09:55.761199 3134 net.cpp:165] Memory required for data: 31978000
I0406 11:09:55.761204 3134 layer_factory.hpp:77] Creating layer accuracy
I0406 11:09:55.761212 3134 net.cpp:106] Creating Layer accuracy
I0406 11:09:55.761221 3134 net.cpp:454] accuracy <- ip2_ip2_0_split_0
I0406 11:09:55.761231 3134 net.cpp:454] accuracy <- label_data_1_split_0
I0406 11:09:55.761240 3134 net.cpp:411] accuracy -> accuracy
I0406 11:09:55.761253 3134 net.cpp:150] Setting up accuracy
I0406 11:09:55.761260 3134 net.cpp:157] Top shape: (1)
I0406 11:09:55.761262 3134 net.cpp:165] Memory required for data: 31978004
I0406 11:09:55.761266 3134 layer_factory.hpp:77] Creating layer loss
I0406 11:09:55.761273 3134 net.cpp:106] Creating Layer loss
I0406 11:09:55.761277 3134 net.cpp:454] loss <- ip2_ip2_0_split_1
I0406 11:09:55.761281 3134 net.cpp:454] loss <- label_data_1_split_1
I0406 11:09:55.761287 3134 net.cpp:411] loss -> loss
I0406 11:09:55.761294 3134 layer_factory.hpp:77] Creating layer loss
I0406 11:09:55.761638 3134 net.cpp:150] Setting up loss
I0406 11:09:55.761658 3134 net.cpp:157] Top shape: (1)
I0406 11:09:55.761662 3134 net.cpp:160] with loss weight 1
I0406 11:09:55.761670 3134 net.cpp:165] Memory required for data: 31978008
I0406 11:09:55.761674 3134 net.cpp:226] loss needs backward computation.
I0406 11:09:55.761678 3134 net.cpp:228] accuracy does not need backward computation.
I0406 11:09:55.761682 3134 net.cpp:226] ip2_ip2_0_split needs backward computation.
I0406 11:09:55.761685 3134 net.cpp:226] ip2 needs backward computation.
I0406 11:09:55.761689 3134 net.cpp:226] ip1 needs backward computation.
I0406 11:09:55.761693 3134 net.cpp:226] pool3 needs backward computation.
I0406 11:09:55.761695 3134 net.cpp:226] relu3 needs backward computation.
I0406 11:09:55.761698 3134 net.cpp:226] conv3 needs backward computation.
I0406 11:09:55.761703 3134 net.cpp:226] pool2 needs backward computation.
I0406 11:09:55.761705 3134 net.cpp:226] relu2 needs backward computation.
I0406 11:09:55.761708 3134 net.cpp:226] conv2 needs backward computation.
I0406 11:09:55.761711 3134 net.cpp:226] relu1 needs backward computation.
I0406 11:09:55.761714 3134 net.cpp:226] pool1 needs backward computation.
I0406 11:09:55.761718 3134 net.cpp:226] conv1 needs backward computation.
I0406 11:09:55.761721 3134 net.cpp:228] label_data_1_split does not need backward computation.
I0406 11:09:55.761725 3134 net.cpp:228] data does not need backward computation.
I0406 11:09:55.761729 3134 net.cpp:270] This network produces output accuracy
I0406 11:09:55.761732 3134 net.cpp:270] This network produces output loss
I0406 11:09:55.761747 3134 net.cpp:283] Network initialization done.
I0406 11:09:55.761801 3134 solver.cpp:60] Solver scaffolding done.
I0406 11:09:55.762250 3134 socket.cpp:219] Waiting for valid port [0]
I0406 11:09:55.762285 3146 socket.cpp:158] Assigned socket server port [49578]
I0406 11:09:55.763015 3146 socket.cpp:171] Socket Server ready [0.0.0.0]
I0406 11:09:55.772425 3134 socket.cpp:219] Waiting for valid port [49578]
I0406 11:09:55.772439 3134 socket.cpp:227] Valid port found [49578]
I0406 11:09:55.772455 3134 CaffeNet.cpp:186] Socket adapter: ip-172-31-39-229:49578
I0406 11:09:55.772732 3134 CaffeNet.cpp:325] 0-th Socket addr: ip-172-31-39-229:49578
I0406 11:09:55.772750 3134 CaffeNet.cpp:325] 1-th Socket addr:
I0406 11:09:55.772763 3134 JniCaffeNet.cpp:110] 0-th local addr: ip-172-31-39-229:49578
I0406 11:09:55.772774 3134 JniCaffeNet.cpp:110] 1-th local addr:
16/04/06 11:09:55 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 976 bytes result sent to driver
16/04/06 11:09:56 INFO CoarseGrainedExecutorBackend: Got assigned task 2
16/04/06 11:09:56 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
16/04/06 11:09:56 INFO TorrentBroadcast: Started reading broadcast variable 2
16/04/06 11:09:56 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1580.0 B, free 6.7 KB)
16/04/06 11:09:56 INFO TorrentBroadcast: Reading broadcast variable 2 took 10 ms
16/04/06 11:09:56 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.6 KB, free 9.2 KB)
16/04/06 11:09:56 INFO TorrentBroadcast: Started reading broadcast variable 1
16/04/06 11:09:56 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 97.0 B, free 9.3 KB)
16/04/06 11:09:56 INFO TorrentBroadcast: Reading broadcast variable 1 took 10 ms
16/04/06 11:09:56 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 392.0 B, free 9.7 KB)
I0406 11:09:56.094302 3134 common.cpp:61] 1-th string is NULL
I0406 11:09:56.094362 3134 socket.cpp:250] Trying to connect with ...[ip-172-31-39-230:50843]
I0406 11:09:56.096086 3134 socket.cpp:309] Connected to server [ip-172-31-39-230.us-west-2.compute.internal:50843] with client_fd [41]
I0406 11:09:56.097497 3146 socket.cpp:184] Accepted the connection from client [ip-172-31-39-230.us-west-2.compute.internal]
I0406 11:10:06.097164 3134 parallel.cpp:392] GPUs pairs
I0406 11:10:06.100571 3148 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
I0406 11:10:06.103435 3149 data_transformer.cpp:25] Loading mean file from: mean.binaryproto
16/04/06 11:10:06 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 936 bytes result sent to driver
16/04/06 11:10:06 INFO CoarseGrainedExecutorBackend: Got assigned task 5
16/04/06 11:10:06 INFO Executor: Running task 1.0 in stage 2.0 (TID 5)
16/04/06 11:10:06 INFO TorrentBroadcast: Started reading broadcast variable 3
16/04/06 11:10:06 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.1 KB, free 11.8 KB)
16/04/06 11:10:06 INFO TorrentBroadcast: Reading broadcast variable 3 took 11 ms
16/04/06 11:10:06 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 3.3 KB, free 15.2 KB)
16/04/06 11:10:06 INFO CacheManager: Partition rdd_6_1 not found, computing it
16/04/06 11:10:06 INFO CacheManager: Partition rdd_1_1 not found, computing it
16/04/06 11:10:06 INFO LmdbRDD: Processing partition 1
16/04/06 11:10:06 ERROR Executor: Exception in task 1.0 in stage 2.0 (TID 5)
java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at com.yahoo.ml.caffe.LmdbRDD.localLMDBFile(LmdbRDD.scala:180)
at com.yahoo.ml.caffe.LmdbRDD.com$yahoo$ml$caffe$LmdbRDD$$openDB(LmdbRDD.scala:197)
at com.yahoo.ml.caffe.LmdbRDD$$anon$1.(LmdbRDD.scala:98)
at com.yahoo.ml.caffe.LmdbRDD.compute(LmdbRDD.scala:96)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/04/06 11:10:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7
16/04/06 11:10:06 INFO Executor: Running task 0.1 in stage 2.0 (TID 7)
16/04/06 11:10:06 INFO CacheManager: Partition rdd_6_0 not found, computing it
16/04/06 11:10:06 INFO CacheManager: Partition rdd_1_0 not found, computing it
16/04/06 11:10:06 INFO LmdbRDD: Processing partition 0
16/04/06 11:10:06 ERROR Executor: Exception in task 0.1 in stage 2.0 (TID 7)
java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at com.yahoo.ml.caffe.LmdbRDD.localLMDBFile(LmdbRDD.scala:180)
at com.yahoo.ml.caffe.LmdbRDD.com$yahoo$ml$caffe$LmdbRDD$$openDB(LmdbRDD.scala:197)
at com.yahoo.ml.caffe.LmdbRDD$$anon$1.(LmdbRDD.scala:98)
at com.yahoo.ml.caffe.LmdbRDD.compute(LmdbRDD.scala:96)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/04/06 11:10:06 INFO CoarseGrainedExecutorBackend: Got assigned task 8
16/04/06 11:10:06 INFO Executor: Running task 1.2 in stage 2.0 (TID 8)
16/04/06 11:10:06 INFO CacheManager: Partition rdd_6_1 not found, computing it
16/04/06 11:10:06 INFO CacheManager: Partition rdd_1_1 not found, computing it
16/04/06 11:10:06 INFO LmdbRDD: Processing partition 1
16/04/06 11:10:06 ERROR Executor: Exception in task 1.2 in stage 2.0 (TID 8)
java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at com.yahoo.ml.caffe.LmdbRDD.localLMDBFile(LmdbRDD.scala:180)
at com.yahoo.ml.caffe.LmdbRDD.com$yahoo$ml$caffe$LmdbRDD$$openDB(LmdbRDD.scala:197)
at com.yahoo.ml.caffe.LmdbRDD$$anon$1.(LmdbRDD.scala:98)
at com.yahoo.ml.caffe.LmdbRDD.compute(LmdbRDD.scala:96)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/04/06 11:10:06 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
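The NullPointerException is thrown while LmdbRDD resolves a local copy of the LMDB directory, which points at the training data source rather than the model. A quick sanity check (a sketch; the local path is the one from this report, the HDFS path is hypothetical, and the source URI is typically set in the data layer of the train/test prototxt):

# If the LMDB is referenced with a local file: URI, the directory must exist
# and be readable on every executor node, not only on the master.
ls -l ~/CaffeOnSpark/data/faces/TrainLmdb

# Placing the LMDB on HDFS avoids per-node copies; verify it is visible to the cluster.
hadoop fs -ls hdfs:///faces/TrainLmdb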

socket.cpp:61] ERROR: Read partial messageheader [8 of 12]

When I run the LeNet example, the job hangs if this error appears on a slave node. The error shows up whenever I set --num-executors >= 2. This is my start shell:
hadoop fs -rm -f -r hdfs:///yudaoming/mnist
hadoop fs -rm -r -f hdfs:///yudaoming/mnist/feature_result

export LD_LIBRARY_PATH=/home/hadoop/projects/CaffeOnSpark/caffe-public/distribute/lib:/home/hadoop/projects/CaffeOnSpark/caffe-distri/distribute/lib
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda-7.5/lib64:/opt/OpenBLAS/lib
export SPARK_WORKER_INSTANCES=2
export DEVICES=1

spark-submit --master yarn --deploy-mode cluster \
    --num-executors 2 \
    --executor-cores 1 \
    --files /home/hadoop/projects/CaffeOnSpark/data/lenet_memory_solver.prototxt,/home/hadoop/projects/CaffeOnSpark/data/lenet_memory_train_test.prototxt \
    --driver-memory 10g \
    --executor-memory 30g \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    /home/hadoop/projects/CaffeOnSpark/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train \
    -features accuracy,loss -label label \
    -conf lenet_memory_solver.prototxt \
    -devices 4 \
    -connection ethernet \
    -model hdfs:///yudaoming/mnist/mnist.model \
    -output hdfs:///yudaoming/mnist/feature_result

My cluster has three servers: one master node and two slave nodes. Each server has 8 NVIDIA K80s, 128 GB of memory, and 16 Intel E2630 cores.
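Note that the start shell exports DEVICES=1 but never uses it, and the job is submitted with -devices 4 and without -clusterSize. A sketch of the same submission with the GPU count per executor and the cluster size made explicit and consistent (not a confirmed fix for the partial-message-header error):

export SPARK_WORKER_INSTANCES=2
export DEVICES=4   # GPUs used per executor; must not exceed the GPUs available on a node

spark-submit --master yarn --deploy-mode cluster \
    --num-executors ${SPARK_WORKER_INSTANCES} \
    --executor-cores 1 \
    --files /home/hadoop/projects/CaffeOnSpark/data/lenet_memory_solver.prototxt,/home/hadoop/projects/CaffeOnSpark/data/lenet_memory_train_test.prototxt \
    --driver-memory 10g \
    --executor-memory 30g \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    /home/hadoop/projects/CaffeOnSpark/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train -features accuracy,loss -label label \
    -conf lenet_memory_solver.prototxt \
    -devices ${DEVICES} \
    -clusterSize ${SPARK_WORKER_INSTANCES} \
    -connection ethernet \
    -model hdfs:///yudaoming/mnist/mnist.model \
    -output hdfs:///yudaoming/mnist/feature_result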

org.apache.maven.plugins:maven-antrun-plugin:1.7

root@ubuntu:~/GitProgram/Spark-Program/CaffeOnSpark# make build
cd caffe-public; make proto; make -j4 -e distribute; cd ..
make[1]: Entering directory '/root/GitProgram/Spark-Program/CaffeOnSpark/caffe-public'
  PROTOC src/caffe/proto/caffe.proto
make[1]: protoc: Command not found
make[1]: *** [.build_release/src/caffe/proto/caffe.pb.cc] Error 127
make[1]: Leaving directory '/root/GitProgram/Spark-Program/CaffeOnSpark/caffe-public'
make[1]: Entering directory '/root/GitProgram/Spark-Program/CaffeOnSpark/caffe-public'
  PROTOC src/caffe/proto/caffe.proto
make[1]: protoc: Command not found
make[1]: *** [.build_release/src/caffe/proto/caffe.pb.h] Error 127
make[1]: *** Waiting for unfinished jobs....
make[1]: Leaving directory '/root/GitProgram/Spark-Program/CaffeOnSpark/caffe-public'
export LD_LIBRARY_PATH="/usr/local/cuda/lib64::/root/GitProgram/Spark-Program/CaffeOnSpark/caffe-public/distribute/lib:/root/GitProgram/Spark-Program/CaffeOnSpark/caffe-distri/distribute/lib:/usr/lib64:/lib64 "; mvn -B package
[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model for com.yahoo.ml:caffe-grid:jar:0.1-SNAPSHOT
[WARNING] The expression ${version} is deprecated. Please use ${project.version} instead.
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] caffe
[INFO] caffe-distri
[INFO] caffe-grid
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building caffe 0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building caffe-distri 0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-antrun-plugin:1.7:run (proto) @ caffe-distri ---
[INFO] Executing tasks

protoc:
[exec] make[1]: Entering directory '/root/GitProgram/Spark-Program/CaffeOnSpark/caffe-distri'
[exec] make[1]: *** No rule to make target '../caffe-public/distribute/proto/caffe.proto', needed by 'src/main/java/caffe/Caffe.java'.  Stop.
[exec] make[1]: Leaving directory '/root/GitProgram/Spark-Program/CaffeOnSpark/caffe-distri'
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] caffe .............................................. SUCCESS [ 0.003 s]
[INFO] caffe-distri ....................................... FAILURE [ 1.268 s]
[INFO] caffe-grid ......................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.591 s
[INFO] Finished at: 2016-05-16T18:20:36+08:00
[INFO] Final Memory: 9M/176M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.7:run (proto) on project caffe-distri: An Ant BuildException has occured: exec returned: 2
[ERROR] around Ant part ...... @ 5:109 in /root/GitProgram/Spark-Program/CaffeOnSpark/caffe-distri/target/antrun/build-protoc.xml
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn -rf :caffe-distri
make: *** [build] Error 1
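The root cause is the very first failure: protoc is not on the PATH, so caffe.proto is never compiled and the distribute step never produces ../caffe-public/distribute/proto/caffe.proto, which is what the later caffe-distri antrun step then fails to find. A minimal sketch for an Ubuntu host (package names may vary by distribution):

# Install the Protocol Buffers compiler and confirm it is reachable.
sudo apt-get install -y protobuf-compiler libprotobuf-dev
protoc --version

# Then rebuild from the CaffeOnSpark root.
make build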
