
What's wrong about caffeonspark [CLOSED]

yahoo commented on August 22, 2024
What's wrong

from caffeonspark.

Comments (4)

anfeng commented on August 22, 2024

@dejunzhang Are you following the steps in GetStarted_yarn? I suspect there is a problem with your HDFS environment.

Please try a simple hadoop fs operation, for example:
hadoop fs -put README.md /

dejunzhang commented on August 22, 2024

@anfeng Thank you for your help. I have solved the problem.
The cause was that other users were also running Spark on the same cluster.
The problem disappeared once I closed the Spark process.

dejunzhang commented on August 22, 2024

@anfeng I am now following the instructions at https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_yarn
and have successfully submitted the cifar10 training job to the cluster:
export SPARK_WORKER_INSTANCES=2
export DEVICES=1
spark-submit --master yarn --deploy-mode cluster --num-executors ${SPARK_WORKER_INSTANCES} \
    --files ./data/cifar10_quick_solver.prototxt,./data/cifar10_quick_train_test.prototxt,./data/mean.binaryproto \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train -features accuracy,loss -label label -conf cifar10_quick_solver.prototxt -devices ${DEVICES} \
    -connection ethernet -model result/cifar10.model.h5 -output result/cifar10_features_result

Seven hours later I still have no results or output, and the job is still running. The only Caffe-related log lines I can see are the following, from the Spark executor container log on one slave node:

16/03/18 10:52:53 INFO storage.BlockManager: Found block rdd_0_0 locally
16/03/18 10:52:53 INFO storage.MemoryStore: Block rdd_5_0 stored as values in memory (estimated size 40.0 B, free 13.3 KB)
16/03/18 10:52:53 INFO executor.Executor: Finished task 0.0 in stage 2.0 (TID 4). 1549 bytes result sent to driver
16/03/18 10:52:53 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 7
16/03/18 10:52:53 INFO executor.Executor: Running task 0.0 in stage 3.0 (TID 7)
16/03/18 10:52:53 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 4
16/03/18 10:52:53 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 1411.0 B, free 14.7 KB)
16/03/18 10:52:53 INFO broadcast.TorrentBroadcast: Reading broadcast variable 4 took 18 ms
16/03/18 10:52:53 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 2.2 KB, free 16.8 KB)
16/03/18 10:52:53 INFO storage.BlockManager: Found block rdd_5_0 locally
16/03/18 10:52:53 INFO executor.Executor: Finished task 0.0 in stage 3.0 (TID 7). 2003 bytes result sent to driver
16/03/18 10:52:53 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 8
16/03/18 10:52:53 INFO executor.Executor: Running task 0.0 in stage 4.0 (TID 8)
16/03/18 10:52:53 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 5
16/03/18 10:52:53 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 1380.0 B, free 18.2 KB)
16/03/18 10:52:53 INFO broadcast.TorrentBroadcast: Reading broadcast variable 5 took 17 ms
16/03/18 10:52:53 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 2.1 KB, free 20.2 KB)
16/03/18 10:52:53 INFO storage.BlockManager: Found block rdd_0_0 locally
I0318 10:52:54.014348 32222 solver.cpp:237] Iteration 0, loss = 2.30203
I0318 10:52:54.014411 32222 solver.cpp:253] Train net output #0: loss = 2.30203 (* 1 = 2.30203 loss)
I0318 10:52:54.054888 32222 sgd_solver.cpp:106] Iteration 0, lr = 0.001
E0318 10:53:01.022980 32210 socket.cpp:61] ERROR: Read partial messageheader [4 of 12] <--------
16/03/18 11:22:15 INFO storage.BlockManager: Removing RDD 5 <------------------
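
In a container log like this the Caffe (glog) lines are easy to lose among the Spark INFO lines. A minimal sketch for filtering them, assuming the usual glog prefix of a severity letter (I, W, E, F) followed by MMDD and a timestamp; the function name caffe_glog_errors is made up for illustration:

```shell
# Hypothetical helper: keep only glog WARNING/ERROR/FATAL lines read from stdin.
# Spark log4j lines (which start with a yy/mm/dd date) and glog INFO lines are dropped.
caffe_glog_errors() {
  grep -E '^[WEF][0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]+ '
}

# Example on three lines taken from the log above.
printf '%s\n' \
  '16/03/18 10:52:53 INFO storage.BlockManager: Found block rdd_0_0 locally' \
  'I0318 10:52:54.054888 32222 sgd_solver.cpp:106] Iteration 0, lr = 0.001' \
  'E0318 10:53:01.022980 32210 socket.cpp:61] ERROR: Read partial messageheader [4 of 12]' \
  | caffe_glog_errors
```

On these three lines only the socket.cpp error is printed. If YARN log aggregation is enabled on the cluster (an assumption), the same filter could be applied to the output of yarn logs -applicationId with the job's application id.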

I also trained cifar10 with the quick and full solvers on a single node with 1 GPU.
The time cost was as follows:
quick solver: 66006 ms for 5000 iterations, reaching 0.7529 accuracy.
full solver: 2529184 ms for 7000 iterations, reaching 0.8171 accuracy.

So single-node training does not take long. Why, then, does the distributed version take 7+ hours? I think there might be something wrong with the cluster. I am new to Spark and Hadoop. Do you know what is wrong? Thank you very much.
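
To put those numbers side by side, here is a rough back-of-the-envelope conversion using plain shell integer arithmetic. The 7 hours is treated as a lower bound, since the job was still running, and the variable names are only illustrative:

```shell
# Single-node timings reported above, in milliseconds.
quick_ms=66006
full_ms=2529184
# "7 hours" from the post, converted to milliseconds (a lower bound).
distributed_ms=$((7 * 60 * 60 * 1000))

# Shell arithmetic is integer-only, so every result is rounded down.
echo "quick solver: $((quick_ms / 1000)) s"
echo "full solver: $((full_ms / 60000)) min"
echo "distributed run: more than $((distributed_ms / quick_ms))x the single-node quick solver time"
```

This prints 66 s, 42 min, and a factor of 381. The submitted job uses cifar10_quick_solver.prototxt, so the quick solver is the relevant comparison, and a slowdown of that size looks more like a stalled job than slow training.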

My configuration is:
1 CPU-only master node and 3 slave nodes with GPUs. Each slave node is allocated 8 CPU cores and 8 GB of memory.

dejunzhang commented on August 22, 2024

One slave node uses 4 GB of memory and the other uses 2 GB for the job.
