
sparknet's Introduction

SparkNet

Distributed Neural Networks for Spark. Details are available in the paper. Ask questions on the sparknet-users mailing list!

Quick Start

Start a Spark cluster using our AMI

  1. Create an AWS secret key and access key. Instructions here.

  2. Run export AWS_SECRET_ACCESS_KEY= and export AWS_ACCESS_KEY_ID= with the relevant values filled in (see the example after this list).

  3. Clone our repository locally.

  4. Start a 5-worker Spark cluster on EC2 by running

     SparkNet/ec2/spark-ec2 --key-pair=key \
                            --identity-file=key.pem \
                            --region=eu-west-1 \
                            --zone=eu-west-1c \
                            --instance-type=g2.8xlarge \
                            --ami=ami-d0833da3 \
                            --copy-aws-credentials \
                            --spark-version=1.5.0 \
                            --spot-price=1.5 \
                            --no-ganglia \
                            --user-data SparkNet/ec2/cloud-config.txt \
                            --slaves=5 \
                            launch sparknet
    

You will probably have to change several fields in this command. For example, the flags --key-pair and --identity-file specify the key pair you will use to connect to the cluster. The flag --slaves specifies the number of Spark workers.
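As a concrete sketch of step 2, the exports might look like the following (the values shown are the placeholder keys from the AWS documentation; substitute your own):

    export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
    export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY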

Train Cifar using SparkNet

  1. SSH to the Spark master as root (for example with the spark-ec2 login command sketched after this list).

  2. Run bash /root/SparkNet/data/cifar10/get_cifar10.sh to download the Cifar data.

  3. Train Cifar on 5 workers using

     /root/spark/bin/spark-submit --class apps.CifarApp /root/SparkNet/target/scala-2.10/sparknet-assembly-0.1-SNAPSHOT.jar 5
    
  4. That's all! Information is logged on the master in /root/SparkNet/training_log*.txt.
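For step 1 above, one way to reach the master is the login action of the same spark-ec2 script used to launch the cluster; this is only a sketch, and the flag values must match the ones you used at launch:

     SparkNet/ec2/spark-ec2 --key-pair=key \
                            --identity-file=key.pem \
                            --region=eu-west-1 \
                            login sparknet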

Train ImageNet using SparkNet

  1. Obtain the ImageNet data by following the instructions here with

    wget http://.../ILSVRC2012_img_train.tar
    wget http://.../ILSVRC2012_img_val.tar
    

    This involves creating an account and submitting a request.

  2. On the Spark master, create ~/.aws/credentials with the following content:

    [default]
    aws_access_key_id=
    aws_secret_access_key=
    

    and fill in the two fields.

  3. Copy this to the workers with ~/spark-ec2/copy-dir ~/.aws (copy this command exactly, because it is somewhat sensitive to trailing slashes and the like).

  4. Create an Amazon S3 bucket; we refer to its name as S3_BUCKET below (see the AWS CLI sketch after this list).

  5. Upload the ImageNet data in the appropriate format to S3 with the command

    python $SPARKNET_HOME/scripts/put_imagenet_on_s3.py $S3_BUCKET \
        --train_tar_file=/path/to/ILSVRC2012_img_train.tar \
        --val_tar_file=/path/to/ILSVRC2012_img_val.tar \
        --new_width=256 \
        --new_height=256
    

    This command resizes the images to 256x256, shuffles the training data, and tars the validation files into chunks.

  6. Train ImageNet on 5 workers using

    /root/spark/bin/spark-submit --class apps.ImageNetApp /root/SparkNet/target/scala-2.10/sparknet-assembly-0.1-SNAPSHOT.jar 5 $S3_BUCKET
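For steps 4 and 5, a minimal sketch using the AWS CLI (assuming the awscli package is installed and configured with your credentials) to create the bucket and spot-check the upload:

    aws s3 mb s3://$S3_BUCKET
    aws s3 ls s3://$S3_BUCKET --recursive | head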
    

Installing SparkNet on an existing Spark cluster

The specific instructions may depend on your cluster configuration. If you run into problems, make sure to share your experience on the mailing list.

  1. If you are going to use GPUs, make sure that CUDA-7.0 is installed on all the nodes.

  2. Depending on your configuration, you might have to add the following to your ~/.bashrc, and run source ~/.bashrc.

    export LD_LIBRARY_PATH=/usr/local/cuda-7.0/targets/x86_64-linux/lib/
    export _JAVA_OPTIONS=-Xmx8g
    export SPARKNET_HOME=/root/SparkNet/
    

    Remember to substitute the correct directories (the first one should contain the file libcudart.so.7.0).

  3. Clone the SparkNet repository into your home directory with git clone https://github.com/amplab/SparkNet.git.

  4. Copy the SparkNet directory to all the nodes using

    ~/spark-ec2/copy-dir ~/SparkNet
    
  5. Build SparkNet with

    cd ~/SparkNet
    git pull
    sbt assembly
    
  6. Now you can, for example, run the Cifar app as shown above (a spark-submit sketch for an existing cluster follows below).
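As a sketch, the submit command might look like the following on a standalone cluster; the master URL, spark-submit path, and worker count are assumptions that depend on your setup:

    $SPARK_HOME/bin/spark-submit --master spark://<master-hostname>:7077 \
        --class apps.CifarApp \
        ~/SparkNet/target/scala-2.10/sparknet-assembly-0.1-SNAPSHOT.jar 5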

Building your own AMI

  1. Start an EC2 instance with Ubuntu 14.04 and a GPU instance type (e.g., g2.8xlarge). Suppose it has IP address xxx.xx.xx.xxx.

  2. Connect to the node as ubuntu:

    ssh -i ~/.ssh/key.pem ubuntu@xxx.xx.xx.xxx
    
  3. Install an editor

    sudo apt-get update
    sudo apt-get install emacs
    
  4. Open the file

    sudo emacs /root/.ssh/authorized_keys
    

    and delete everything before ssh-rsa ... so that you can connect to the node as root.

  5. Close the connection with exit.

  6. Connect to the node as root:

    ssh -i ~/.ssh/key.pem root@xxx.xx.xx.xxx
    
  7. Install CUDA-7.0.

    wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/cuda-repo-ubuntu1404_7.0-28_amd64.deb
    dpkg -i cuda-repo-ubuntu1404_7.0-28_amd64.deb
    apt-get update
    apt-get upgrade -y
    apt-get install -y linux-image-extra-`uname -r` linux-headers-`uname -r` linux-image-`uname -r`
    apt-get install cuda-7-0 -y
    
  8. Install sbt. Instructions here.

  9. Run apt-get update.

  10. Install the AWS command line tools with apt-get install awscli s3cmd.

  11. Install Java with apt-get install openjdk-7-jdk.

  12. Clone the SparkNet repository into your home directory with git clone https://github.com/amplab/SparkNet.git.

  13. Add the following to your ~/.bashrc, and run source ~/.bashrc.

    export LD_LIBRARY_PATH=/usr/local/cuda-7.0/targets/x86_64-linux/lib/
    export _JAVA_OPTIONS=-Xmx8g
    export SPARKNET_HOME=/root/SparkNet/
    

    Some of these paths may need to be adapted, but the LD_LIBRARY_PATH directory should contain libcudart.so.7.0 (this file can be found with locate libcudart.so.7.0 after running updatedb); a quick sanity check is sketched after this list.

  14. Build SparkNet with

    cd ~/SparkNet
    git pull
    sbt assembly
    
  15. Create the file ~/.bash_profile and add the following:

    if [ "$BASH" ]; then
      if [ -f ~/.bashrc ]; then
        . ~/.bashrc
      fi
    fi
    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
    

    Spark expects JAVA_HOME to be set in your ~/.bash_profile and the launch script SparkNet/ec2/spark-ec2 will give an error if it isn't there.

  16. Clear your bash history with cat /dev/null > ~/.bash_history && history -c && exit.

  17. Now you can create an image of your instance, and you're all set! This is the procedure that we used to create our AMI.
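As an optional sanity check after steps 7 and 13 (this assumes the NVIDIA driver and the mlocate package are installed), you can confirm that the GPU and the CUDA runtime library are visible:

    nvidia-smi                            # should list the GPU(s)
    updatedb && locate libcudart.so.7.0   # should print the path used in LD_LIBRARY_PATH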

JavaCPP Binaries

We have built the JavaCPP binaries for a couple of platforms. They are stored at the following locations:

  1. Ubuntu with GPUs: http://www.eecs.berkeley.edu/~rkn/snapshot-2016-03-05/
  2. Ubuntu with CPUs: http://www.eecs.berkeley.edu/~rkn/snapshot-2016-03-16-CPU/
  3. CentOS 6 with CPUs: http://www.eecs.berkeley.edu/~rkn/snapshot-2016-03-23-CENTOS6-CPU/

sparknet's People

Contributors

javadba, pcmoritz, rahulbhalerao001, robertnishihara


sparknet's Issues

Is the current Ubuntu library GPU/cuDNN enabled?

I have tried to run with GoogleNet (more computation). The forward-backward pass took a really long time. Is the provided .so library GPU/cuDNN enabled? There is a new rc3-1.2 release of the JavaCPP Caffe presets. What is the procedure to adopt the rc3-1.2 release? The logs are included below:

workerId = 0
getWeights took 0.05 s
transformInto took 1.942 s
ForwardBackward took 99.302 s

transformInto took 2.67 s
ForwardBackward took 109.891 s

transformInto took 1.676 s
ForwardBackward took 122.732 s

transformInto took 1.571 s
ForwardBackward took 131.627 s

transformInto took 1.463 s
ForwardBackward took 138.765 s

transformInto took 1.564 s
ForwardBackward took 140.662 s

transformInto took 1.66 s
ForwardBackward took 141.29 s

Make WeightCollection not a class

Caffe weights should just be a Map[String, List[NDArray]]. Having a class around that obscures what is going on. It would be better to have WeightCollection be an object that just contains a bunch of helpful methods (for adding two weight collections, testing equality, etc.); a sketch follows below.
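A minimal sketch of what this could look like (the helper methods used here, such as NDArray.plus and NDArray equality, are assumptions for illustration, not necessarily the existing SparkNet API):

// Hypothetical sketch: weights as a plain Map, with helpers collected in an object.
object WeightCollection {
  type Weights = Map[String, List[NDArray]]

  // Element-wise sum of two weight collections with the same layout.
  // Assumes an NDArray.plus helper exists.
  def add(a: Weights, b: Weights): Weights =
    a.map { case (layer, arrays) =>
      layer -> arrays.zip(b(layer)).map { case (x, y) => NDArray.plus(x, y) }
    }

  // Structural equality: same layers and equal arrays.
  // Assumes NDArray defines structural equality.
  def equal(a: Weights, b: Weights): Boolean =
    a.keySet == b.keySet && a.forall { case (layer, arrays) => arrays == b(layer) }
}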

exception trying to run Mnist example

Hi!

I'm trying to run SparkNet on a MapR cluster running Spark 1.5.2.
I can get Caffe to run locally, including the Python bindings, and the SparkNet assembly is using the SPARKNETCPU artefacts (with JavaCPP on the 03-16 version, as indicated in another post).

The job starts up and completes Stage 3 successfully, but then throws an exception:
16/04/10 10:18:52 WARN TaskSetManager: Lost task 3.0 in stage 14.0 (TID 41, 10.0.0.217): java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at libs.JavaNDArray.baseFlatInto(JavaNDArray.java:67)
at libs.JavaNDArray.recursiveFlatInto(JavaNDArray.java:79)
at libs.JavaNDArray.recursiveFlatInto(JavaNDArray.java:82)
at libs.JavaNDArray.flatCopy(JavaNDArray.java:93)
at libs.JavaNDArray.toFlat(JavaNDArray.java:111)
at libs.NDArray.toFlat(NDArray.scala:32)
at libs.TensorFlowUtils$.tensorFromNDArray(TensorFlowUtils.scala:71)
at libs.TensorFlowNet$$anonfun$setWeights$1.apply(TensorFlowNet.scala:114)
at libs.TensorFlowNet$$anonfun$setWeights$1.apply(TensorFlowNet.scala:112)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:102)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:102)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:102)
at libs.TensorFlowNet.setWeights(TensorFlowNet.scala:112)
at apps.MnistApp$$anonfun$main$4.apply$mcVI$sp(MnistApp.scala:96)
at apps.MnistApp$$anonfun$main$4.apply(MnistApp.scala:96)
at apps.MnistApp$$anonfun$main$4.apply(MnistApp.scala:96)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:894)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:894)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Any help would be greatly appreciated.

Note: the Cifar example also fails with what seems to be the exact same error.

Clean up tests

Most of the tests are outdated. We should:

  • Remove deprecated tests.
  • Clean up tests that are still relevant and make them compile.
  • Add new tests.

Multiple GPUs do not run on the SparkNet cluster

Multiple GPUs do not run on the SparkNet cluster. Our big data development platform is Ubuntu 15.10, Hadoop 2.6, Spark 1.5, SparkNet, Caffe 1.2, and CUDA 7.5, with 4 GPUs (2 on each of 2 nodes). Currently the GPU code runs on the local node and uses only one GPU, not two.

SparkNet distributed CPU with Caffe

Hi,
I am currently trying to use SparkNet with distributed CPUs instead of GPUs (for some comparisons). I set Caffe to CPU_ONLY in the Caffe Makefile configuration, compiled SparkNet, and then submitted the Cifar app locally:
spark-submit --master local[*] --class apps.CifarApp SparkNet/target/scala-2.10/sparknet-assembly-0.1-SNAPSHOT.jar 1
I am getting the following: F0108 11:18:43.727069 13912 split_layer.cpp:53] Cannot use GPU in CPU-only Caffe: check mode.

What other modifications need to be done so that SparkNet performs the training on distributed CPUs smoothly?

What happens to the executor JVM thread (the one that originally owns the RDD partition data) when control is handed to the GPU for processing? What if another thread is scheduled on that executor while a previous thread is doing computations on the GPU?

How does your design integrate with Spark's original DAG and task scheduler? Is the cluster manager aware of the existence of GPUs, and does it make scheduling decisions accordingly? What if the cluster manager schedules the task (JVM thread) on a non-GPU-enabled node?

Thank you.

Can I compile SparkNet without using CUDA? (I want to run with CPU)

Hi
I compiled SparkNet successfully with CUDA 7.0, but when I tried to run the "Train Cifar using SparkNet" application it shows me:

F0628 17:53:57.325634 29332 cudnn_conv_layer.cpp:52] Check failed: error == cudaSuccess (38 vs. 0) no CUDA-capable device is detected
*** Check failure stack trace: ***

My OS is Ubuntu 14.04 running in a virtual machine, and I don't have GPU support at the moment. So can I test the application with CPU only, without using CUDA?

If possible, please give me the steps to compile and run the application with CPU only.

Regards
Prateek

Error while running CifarApp

When I run the CifarApp on a Spark cluster, the following error comes up:

16/06/08 12:50:04 INFO DAGScheduler: ResultStage 14 (foreach at CifarApp.scala:105) failed in 0.040 s 16/06/08 12:50:04 INFO DAGScheduler: Job 8 failed: foreach at CifarApp.scala:105, took 0.049292 s Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 14.0 failed 1 times, most recent failure: Lost task 1.0 in stage 14.0 (TID 43, localhost): java.lang.ArrayIndexOutOfBoundsException

How to run on the ppc64le platform?

I tried to build this project on the ppc64le platform but unfortunately got the following error:

java.lang.ExceptionInInitializerError
at TensorFlowNetSpec$$anonfun$1.apply$mcV$sp(TensorFlowNetSpec.scala:15)
at TensorFlowNetSpec$$anonfun$1.apply(TensorFlowNetSpec.scala:14)

Line 15 of TensorFlowNetSpec creates an object from the class GraphDef, which appears to come from the TensorFlow package.

Do you think this could be a problem with javacpp-presets, i.e., that it does not support ppc64le? Any help will be appreciated. Thanks!

Issue regarding building SparkNet for existing Spark Cluster

Hey, I am facing a problem building SparkNet. I have followed the steps written under building it for an existing Spark cluster and, since I don't plan to use a GPU, I skipped the CUDA step (step 1). After I run 'sbt assembly', the error thrown is

[info] Run completed in 10 seconds, 779 milliseconds.
[info] Total number of tests run: 6
[info] Suites: completed 5, aborted 0
[info] Tests: succeeded 6, failed 0, canceled 0, ignored 9, pending 0
[info] All tests passed.
[error] Error during tests:
[error]     TensorFlowNetSpec
[error] (test:test) sbt.TestsFailedException: Tests unsuccessful

Is there any dependency that is not mentioned in the README.md?

ImageNet running on YARN, NodeManager memory keeps increasing

I have run the ImageNet example in YARN cluster mode and noticed that the NodeManager memory keeps increasing. There seems to be a memory leak in the C++/JNI code, since the CoarseGrainedExecutorBackend memory is very stable.

See the two processes (1127 keeps growing, while 1130 is very stable):

****0 S yarn 1127 1125 0 80 0 - 2910 wait 13:15 ? 00:00:00 /bin/bash -c LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop/../../../CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop/lib/native:/opt/gpu/cuda/lib64:/data02/nhe/SparkNet/lib:/data02/nhe/cuda-7.0::/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop/lib/native /usr/lib/jvm/java-7-oracle-cloudera/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms22528m -Xmx22528m -Djava.io.tmpdir=/data02/yarn/nm/usercache/hdfs/appcache/application_1461609406099_0001/container_1461609406099_0001_02_000002/tmp '-Dspark.authenticate=false' '-Dspark.driver.port=56487' '-Dspark.shuffle.service.port=7337' '-Dspark.ui.port=0' -Dspark.yarn.app.container.log.dir=/data02/yarn/container-logs/application_1461609406099_0001/container_1461609406099_0001_02_000002 -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://[email protected]:56487 --executor-id 1 --hostname bdalab12.samsungsdsra.com --cores 16 --app-id application_1461609406099_0001 --user-class-path file:/data02/yarn/nm/usercache/hdfs/appcache/application_1461609406099_0001/container_1461609406099_0001_02_000002/app.jar 1> /data02/yarn/container-logs/application_1461609406099_0001/container_1461609406099_0001_02_000002/stdout 2> /data02/yarn/container-logs/application_1461609406099_0001/container_1461609406099_0001_02_000002/stderr


0 S yarn 1130 1127 99 80 0 - 56878287 futex_ 13:15 ? 01:25:40 /usr/lib/jvm/java-7-oracle-cloudera/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms22528m -Xmx22528m -Djava.io.tmpdir=/data02/yarn/nm/usercache/hdfs/appcache/application_1461609406099_0001/container_1461609406099_0001_02_000002/tmp -Dspark.authenticate=false -Dspark.driver.port=56487 -Dspark.shuffle.service.port=7337 -Dspark.ui.port=0 -Dspark.yarn.app.container.log.dir=/data02/yarn/container-logs/application_1461609406099_0001/container_1461609406099_0001_02_000002 -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://[email protected]:56487 --executor-id 1 --hostname bdalab12.samsungsdsra.com --cores 16 --app-id application_1461609406099_0001 --user-class-path file:/data02/yarn/nm/usercache/hdfs/appcache/application_1461609406099_0001/container_1461609406099_0001_02_000002/app.jar

Error while running CAFFE Cifar-10

I am sorry for opening another error thread, but I am trying the new AMI d0833da3 and got the following error (after pulling and rebuilding) while running CIFAR-10:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 12.0 failed 1 times, most recent failure: Lost task 3.0 in stage 12.0 (TID 355, 172.31.20.36): ExecutorLostFailure (executor 2 lost) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1280) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1268) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1267) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1267) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1493) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1455) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1444) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1813) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1826) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1839) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1910) at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:890) at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:888) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) at org.apache.spark.rdd.RDD.foreach(RDD.scala:888) at apps.CifarApp$.main(CifarApp.scala:82) at apps.CifarApp.main(CifarApp.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

In the stdout of the executors I got this:

A fatal error has been detected by the Java Runtime Environment:
SIGSEGV (0xb) at pc=0x00007f9b11147be0, pid=9265, tid=140304540440320

JRE version: OpenJDK Runtime Environment (7.0_95) (build 1.7.0_95-b00)
Java VM: OpenJDK 64-Bit Server VM (24.95-b01 mixed mode linux-amd64 )
Derivative: IcedTea 2.6.4
Distribution: Ubuntu 14.04.3 LTS, package 7u95-2.6.4-0ubuntu0.14.04.1
Problematic frame:
C [libcaffe.so.1.0.0-rc3+0x2e2be0] caffe::Caffe::RNG::generator()+0x0

Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

An error report file with more information is saved as:
/root/spark/work/app-20160308211614-0002/3/hs_err_pid9265.log

If you would like to submit a bug report, please include
instructions on how to reproduce the bug and visit:
http://icedtea.classpath.org/bugzilla


The dump file is attached in case useful.
hs_err_pid9541.log.txt

More of a question than an issue: what IDE/environment do the committers use for SparkNet development?

(Note: please move this "issue" to an appropriate thread/location/etc. I can't find any Gitter channel or mailing list, so please email me at javadba at gmail.com.)

I am joining a project performing distributed GPU based work on SparkNet. I am a former contributor to Spark (sql and mllib) so have strong underpinnings in the core spark stack.

It is crucial to productivity to have an IDE that at the least permits traversing the codebase. Respecting/ handling breakpoints would be a big plus.

Intellij is my preferred IDE but is not playing well with SparkNet. Here is one of the issues I am battling:

http://stackoverflow.com/questions/35385304/unable-to-import-sbt-project-into-intellij-using-custom-sbt-launch-jar

So then my question here is: what is the development environment and process for the core SparkNet committers? I want to be in sync with that in order to be able to be fully invested as a likely near-future contributor.

Thanks
Stephen Boesch

UnsatisfiedLinkError when running Caffe Test

Hi,

I am trying to build SparkNet on a CentOS 6.6 server with CPU only and install it on my Spark cluster.
I followed #112 and commented out the @Ignore line in src/test/scala/libs/CaffeNetSpec.scala. After building successfully, I ran sbt "test-only CaffeNetSpec", but it failed.

java.lang.UnsatisfiedLinkError: no jnicaffe in java.library.path
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1865)
    at java.lang.Runtime.loadLibrary0(Runtime.java:870)
    at java.lang.System.loadLibrary(System.java:1122)
    at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:632)
    at org.bytedeco.javacpp.Loader.load(Loader.java:470)
    at org.bytedeco.javacpp.Loader.load(Loader.java:407)
    at org.bytedeco.javacpp.caffe.<clinit>(caffe.java:16)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.bytedeco.javacpp.Loader.load(Loader.java:442)
    at org.bytedeco.javacpp.Loader.load(Loader.java:407)
    at org.bytedeco.javacpp.caffe$NetParameter.<clinit>(caffe.java:1946)
    at CaffeNetSpec$$anonfun$1.apply$mcV$sp(CaffeNetSpec.scala:16)
    at CaffeNetSpec$$anonfun$1.apply(CaffeNetSpec.scala:15)
    at CaffeNetSpec$$anonfun$1.apply(CaffeNetSpec.scala:15)
    ...
Caused by: java.lang.UnsatisfiedLinkError: /tmp/javacpp21624919454281404/libjnicaffe.so: /tmp/javacpp21624919454281404/libjnicaffe.so: undefined symbol: _ZN5caffe15WindowDataLayerIdED1Ev
    at java.lang.ClassLoader$NativeLibrary.load(Native Method)
    at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1937)
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1822)
    at java.lang.Runtime.load0(Runtime.java:809)
    at java.lang.System.load(System.java:1086)
    at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:615)
    at org.bytedeco.javacpp.Loader.load(Loader.java:470)
    at org.bytedeco.javacpp.Loader.load(Loader.java:407)
    at org.bytedeco.javacpp.caffe.<clinit>(caffe.java:16)

Could anyone give me some suggestions about what might be wrong in my build process?
Many thanks!

Out of Memory

Hi,

I have spark cluster with:
Workers: 5
Cores: 5 Total
Memory: 10.9 GB Total

When I run CifarApp, I get the following error:
ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-3] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space

Many thanks for any help.

Re-enable Caffe tests

At the moment, the Caffe and TensorFlow tests can't be run together because they both seem to lock the GPU. For now we work around this problem by running only the TensorFlow tests when sbt assembly is run; this should be changed.

Wrap TensorFlow using JavaCPP

See preliminary discussion in #70.

JavaCPP-Presets supports TensorFlow. We should create unified Net and Solver interfaces that are implemented by both Caffe and TensorFlow; a rough sketch follows below.
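As a rough sketch of the kind of interface this could mean (the trait and method signatures here are illustrative assumptions, not the actual SparkNet API):

// Illustrative sketch only: a backend-agnostic interface that both a
// Caffe-backed and a TensorFlow-backed implementation could provide.
trait Net {
  def forward(data: Iterator[Array[Float]]): Unit
  def forwardBackward(data: Iterator[Array[Float]]): Unit
  def getWeights(): Map[String, List[NDArray]]
  def setWeights(weights: Map[String, List[NDArray]]): Unit
}

trait Solver {
  // Run numSteps optimization steps on the given net.
  def step(net: Net, numSteps: Int): Unit
}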

Error : Check failed: error == cudaSuccess (30 vs. 0) unknown error

I had created a private AMI from a working setup (after the cache changes), and the ImageNet example was running correctly on this AMI.

However, today I created a new cluster from this AMI and got the error - "Check failed: error == cudaSuccess (30 vs. 0) unknown error".

Curious about the 'hey' message

Good morning!

While I was running the CifarApp example, I found an interesting message. I would call it the 'hey' message. It's like:

I0202 15:56:30.392312 16899 solver.cpp:236] Iteration 5000, loss = 1.12068
I0202 15:56:30.392349 16899 solver.cpp:252] Train net output #0: loss = 1.12068 (* 1 = 1.12068 loss)
I0202 15:56:30.392357 16899 sgd_solver.cpp:106] Iteration 5000, lr = 0.001
hey
I0202 15:56:35.119498 16899 net.cpp:751] Copying source layer data
I0202 15:56:35.119524 16899 net.cpp:751] Copying source layer label
I0202 15:56:35.119531 16899 net.cpp:751] Copying source layer conv1

It seems it is harmless, but may I ask what this message is and why it appears? Thank you for reading!

Build issue

Does anybody else have a build issue? We downloaded the source code and one Maven dependency is not available: imageio is not available and only imageio-core is. When trying to resolve the dependencies one by one, we get stuck at org.bytedeco.javacpp-presets:caffe. We are not sure exactly which version is used in this project.

I think this is a very basic requirement for letting other users leverage this contribution.

Number of iterations

Good afternoon!

Thank you for your kind answers about my previous questions. I really appreciate your help.

May I ask about the number of iterations? I ran CifarApp.scala with two different settings: cifar10_full and cifar10_quick. The variable max_iter is set to 60000 and 4000 in the solver.prototxt files, respectively.

But when I ran CifarApp.scala, it continued learning to more than 206100 iterations. So I cancelled the job and read the source code. It seemed that the only way to quit from while(true) in CifarApp.scala is the failure of assert statements in the infinite loop.

May I ask how to control the number of iterations? Is the value of max_iter in the solver.prototxt files ignored in CifarApp.scala? It seems that when the prototxt files are loaded, the parameters go through some transformation via ProtoLoader.replaceDataLayers, loadSolverPrototxtWithNet, and CaffeNet(), but I am not sure whether I understand the code correctly.

If I am asking a basic question, please forgive me. Thank you so much for your invaluable help!

Saving caffemodel and solverstate

Hi! Always thank you for this great program. May I ask about saving caffemodel and solverstate?

I tried to save the trained model as .caffemodel and .solverstate files. It seemed that the snapshot setting in cifar10_quick_solver.prototxt did not work in SparkNet.

I found commented lines in CifarDBApp.scala:

if (i % 10 == 0) {
  net.setWeights(netWeights)
  net.saveWeightsToFile("/root/weights/" + i.toString + ".caffemodel")
}

But I could not understand how to use those lines. At the same time, it seems that many things have changed with the recent update, so I guess that even if I had figured out how to use those lines, it would not be applicable to the new version of SparkNet.

I feel sorry to ask a question because I can see that you are very busy. A lot of code update! Thank you for always responding to my issue writings and questions.

Issue with ImageNet Example

Hello,

I followed all the steps of the example, including those in the issue, but my execution gets stuck at one point. It is able to preprocess the training data but gets stuck preprocessing the validation data. For the validation data, I downloaded the tar, untarred it, repacked the images into tars of around 250 MB each, and then uploaded them to S3. However, I am afraid I may have shuffled the order of the images in this process.

Due to this or some other reason, the preprocessing step for the validation data remains stuck indefinitely; however, preprocessing of all the training data completes in 20 minutes. I am using 2 g2.8xlarge machines as slaves, which remain at high usage during the time the execution is stuck. Any tips on debugging or resolving this issue would be greatly appreciated. I have tried with both persist() and persist(StorageLevel.MEMORY_AND_DISK).

As shown below, it gets stuck after launching 26 tasks for the 25 tars of val data.

16/02/13 09:48:27 INFO scheduler.DAGScheduler: Submitting 26 missing tasks from ResultStage 0 (MapPartitionsRDD[4] at mapPartitions at ScaleAndConvert.scala:31)
16/02/13 09:48:27 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 26 tasks
16/02/13 09:48:27 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 172.31.16.22, PROCESS_LOCAL, 2221 bytes)
16/02/13 09:48:27 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 172.31.25.24, PROCESS_LOCAL, 2228 bytes)
..
..
16/02/13 09:48:27 INFO scheduler.TaskSetManager: Starting task 25.0 in stage 0.0 (TID 25, 172.31.25.24, PROCESS_LOCAL, 2228 bytes)
16/02/13 09:48:27 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 172.31.16.22:43854 (size: 1742.0 B, free: 27.4 GB)
16/02/13 09:48:27 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 172.31.25.24:38520 (size: 1742.0 B, free: 27.4 GB)
16/02/13 09:48:28 INFO storage.BlockManagerInfo: Added rdd_4_0 in memory on 172.31.16.22:43854 (size: 24.0 B, free: 27.4 GB)
16/02/13 09:48:28 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1503 ms on 172.31.16.22 (1/26)

Also, on a related note, I would like to point out a few points that were not present in the README and which lead to errors when running the example.

  1. Creation of a /tmp/spark-events/ directory is required; otherwise it leads to an error.
  2. The README states that ~/.aws/credentials is required only on the master, not on the slaves. However, when running the example it is required on all the machines.
  3. After changing the S3 bucket name, the SparkNet project needs to be rebuilt using 'sbt assembly'. This was not specified in the README. Further, the issue states that the name of the validation folders is actually test, but in the version of the project present in the AMI the names are actually val.txt and IL..._va

Not able to find AMI - ami-c0dd7db3

Hello,

I am trying to set up SparkNet on a cluster of Amazon machines. However, I am not able to find the AMI. It would be great if you could help me locate it.

Issue with TFImageNet Example

Hello,
Nice to see the integration with TensorFlow and GPUs back 👍
I set up the cluster with the new AMI and was able to run the MNIST example. It ran successfully, but in the Spark web UI I could see a lot of skipped jobs.
(screenshot of the Spark web UI)
I hope that is not a problem. Also, can the MNIST example use all the GPUs?

Further, for the TFImageNetApp, I ran into the following error. My Caffe ImageNetApp used to work correctly with my S3 bucket.
Command
/root/spark/bin/spark-submit --class apps.TFImageNetApp /root/SparkNet/target/scala-2.10/sparknet-assembly-0.1-SNAPSHOT.jar 2 sparknetdivideo
Error
java.lang.IllegalArgumentException: The data and shape arguments are not compatible, data.length = 196608 and shape = Array(227, 256, 256).
    at libs.NDArray$.apply(NDArray.scala:55)
    at libs.ImageNetTensorFlowPreprocessor$$anonfun$convert$16.apply(Preprocessor.scala:131)
    at libs.ImageNetTensorFlowPreprocessor$$anonfun$convert$16.apply(Preprocessor.scala:122)
    at libs.TensorFlowNet$$anonfun$loadFrom$1.apply$mcVI$sp(TensorFlowNet.scala:64)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at libs.TensorFlowNet.loadFrom(TensorFlowNet.scala:63)
    at libs.TensorFlowNet.forward(TensorFlowNet.scala:74)
    at apps.TFImageNetApp$$anonfun$7$$anonfun$apply$2.apply$mcVI$sp(TFImageNetApp.scala:106)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at apps.TFImageNetApp$$anonfun$7.apply(TFImageNetApp.scala:105)
    at apps.TFImageNetApp$$anonfun$7.apply(TFImageNetApp.scala:102)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:706)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Machine Learning Provider Abstraction Layer

I am a coder on a team considering using SparkNet with another ML library besides Caffe. The intent of this issue is to capture discussion on an ML Provider Abstraction Layer (MLPAL?) that would permit pluggable use of Caffe vs. some other ML library.

To the core committers: do you already have thoughts and/or a Roadmap for this? In any case our thoughts will start appearing here.

Limiting SparkNet to only 1 physical CPU core

Hi! May I ask about the relationship between multi CPU cores and SparkNet?

I wanted to find out the relationship between the number of CPUs and the running time, so I ran the CifarApp with the spark-submit --master local command. I thought that it would run CifarApp on only one physical CPU core, because the official Spark website says that --master local will "Run Spark locally with one worker thread (i.e. no parallelism at all)."

However, when I checked with the top command, all of my 8 physical cores were being used. At first I thought I might have given the wrong options, so I tried various options, but every time Spark used all 8 physical cores.

I also tried making a Spark standalone cluster on my desktop, starting 8 worker instances with SPARK_WORKER_INSTANCES=8 and SPARK_WORKER_CORES=1, and then running SparkNet on only 1 worker. But Spark still used all of my 8 physical cores.

In comparison, when I tried the Pi example from the official Spark site, Spark used only 1 physical core. I use the term 'physical cores' here because I found that Spark can have a lot of virtual cores; in my case, # physical cores = 8 and # virtual cores = 1, and all physical cores are being used by one virtual core.

So far I have two possible explanations for this phenomenon: (1) even if Spark allocates only 1 thread to SparkNet, SparkNet will use all available physical cores; (2) Spark will use all available physical cores whenever it can, and SparkNet was in this situation but the Pi example was not.

Would you help me? Thank you for your help!

Unable to install on Mac OS X

The javacpp-presets are not installing cleanly on OS X. This is a prerequisite for using SparkNet on a Mac.

The page https://github.com/amplab/SparkNet/blob/master/doc/creating-jars.md is for ubuntu and centos only.

The following page is apparently the one to follow:

https://github.com/bytedeco/javacpp-presets/wiki/Build-Environments

However the command

bash cppbuild.sh install

ended up with a tar corruption error after an hour of compilation.

So if anyone successfully builds on OS X, it would be appreciated if you could update the above documentation.

Multi GPU for Caffe

This is an issue for tracking the status of implementing b73ed97 on top of the current JavaCPP architecture.

TODOs for completing this:

  • Integrate the patch for caffe/include/caffe/parallel.hpp and caffe/src/caffe/parallel.cpp into the repo https://github.com/amplab/caffe
  • Build up to date JARs for this version of caffe
  • Create a MultiGPU example

Deep feature extraction and deployment

Hi there,
We're trying to use Caffe with Spark for a video analytics task; Caffe is mainly used as a feature extractor. Currently, our implementation is based on the Caffe JavaCPP preset. We have a lot of memory issues with this approach because of the way we manage Caffe network allocation and initialization. We'd like to test SparkNet as a feature extractor, but it seems that SparkNet doesn't have this feature yet; as far as I know, it is only used for training and testing a neural network. Do you plan to also use SparkNet to deploy a Caffe network on Spark, and how can we add just a forward method to the Net class?

Improve error message for CifarApp if the dataset has not been downloaded

At the moment the error message is

Exception in thread "main" java.util.NoSuchElementException: head of empty list
        at scala.collection.immutable.Nil$.head(List.scala:337)
        at scala.collection.immutable.Nil$.head(List.scala:334)
        at loaders.CifarLoader.<init>(CifarLoader.scala:48)
        at apps.CifarApp$.main(CifarApp.scala:52)
        at apps.CifarApp.main(CifarApp.scala)

which is a little non-informative; a sketch of a clearer check follows below.
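A minimal sketch of the kind of guard that could produce a clearer message (the path handling here is illustrative, not the actual CifarLoader code):

import java.io.File

// Illustrative sketch: fail fast with an actionable message when the
// CIFAR-10 binaries are missing, instead of dying on an empty list.
def checkCifarData(dataDir: String): Unit = {
  val files = Option(new File(dataDir).listFiles).getOrElse(Array.empty[File])
  if (!files.exists(_.getName.endsWith(".bin"))) {
    sys.error(s"No CIFAR-10 .bin files found in $dataDir. " +
      "Did you run data/cifar10/get_cifar10.sh on the master first?")
  }
}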

Small typo in the code

Good afternoon!
Probably log("testing, i") in CifarApp.scala (line 102) is a typo for log("testing", i)?

TODO steps in the Readme for Imagenet example.

The steps for running the ImageNet example have the line "Tar the validation files by running" followed by a TODO note. However, the ILSVRC2012_img_val.tar file is directly available at the ImageNet site.

Does any more preprocessing need to be done, or can it be uploaded directly to S3?

Similarly, for the training data there are two tars available at the ImageNet site:
Training images (Task 1 & 2). 138GB.
Training images (Task 3). 728MB.

It is not clear from the README exactly which tar to use for this example.

Could you please shed some light on the above two points? If it seems fair, then once I have the clarification I would be interested in making this change to the README as a small contribution.

Question about disabling GPU

Good afternoon!

May I ask how to disable GPU in SparkNet? To be specific, I want to try CifarApp.scala under various conditions. Will setting solver_mode=CPU in cifar10_full_solver.prototxt be enough for turning GPU off in SparkNet? Or should I change more settings?

Thanks for your help. I am actively learning deep learning with SparkNet!
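For reference, the Caffe solver setting in question is written as follows in the solver prototxt; whether this alone is sufficient depends on how the SparkNet/Caffe binaries were built (for example, whether a CPU-only build is used):

solver_mode: CPU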

sbt assembly error

[error] Could not run test TensorFlowNetSpec: java.util.NoSuchElementException: key not found: SPARKNET_HOME
...
...
...
[error] Could not run test CaffeNetSpec: java.util.NoSuchElementException: key not found: SPARKNET_HOME
[info] Run completed in 9 seconds, 564 milliseconds.
[info] Total number of tests run: 6
[info] Suites: completed 4, aborted 0
[info] Tests: succeeded 6, failed 0, canceled 0, ignored 1, pending 0
[info] All tests passed.
[error] Error during tests:
[error] TensorFlowNetSpec
[error] CaffeNetSpec
16/04/22 10:37:22 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
[error] sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 133 s, completed 2016-4-22 10:37:22

I got the errors shown above when running sbt assembly, and I have set the variable SPARKNET_HOME in my ~/.bashrc. I don't know why this happened.

Any help would be greatly appreciated. Thanks!

Performance issue due to spark

  1. The following code in training is the most time-consuming part (roughly an 8:1 ratio of drop time to the rest of the computation):

     trainDF.foreachPartition { trainIt =>
       val it = trainIt.drop(startIdx)
       // other computation
     }

     This is due to an inherent issue in the Spark framework. Should someone on the Spark side take a look at this issue?
