
Comments (5)

anfeng avatar anfeng commented on August 22, 2024

Please share your logs. We should address the root cause.

Andy Feng


On Mar 20, 2016, at 6:31 PM, dejunzhang [email protected] wrote:

When I run the cifar10 example, the job often hangs if an error appears on one slave node:
E0321 09:20:01.425690 12928 socket.cpp:61] ERROR: Read partial messageheader [4 of 12]

Could a network bandwidth problem be causing this error (a failure while reading the socket message header)?
So my question is: is there an error-recovery mechanism that reduces the probability of this kind of error?
Note: I use an Ethernet connection and a 1 Gb/s switch.
Thank you.



dejunzhang avatar dejunzhang commented on August 22, 2024

My command is:
export SPARK_WORKER_INSTANCES=2
export DEVICES=1
spark-submit --master yarn --deploy-mode cluster \
  --num-executors ${SPARK_WORKER_INSTANCES} \
  --files ./data/cifar10_quick_solver.prototxt,./data/cifar10_quick_train_test.prototxt,./data/mean.binaryproto \
  --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
  --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
  --class com.yahoo.ml.caffe.CaffeOnSpark \
  ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
  -train -features accuracy,loss -label label \
  -conf cifar10_quick_solver.prototxt \
  -devices ${DEVICES} -connection ethernet \
  -model result/cifar10.model.h5 \
  -output result/cifar10_features_result

Below are the logs:
layer {
  name: "ip2"
  type: "InnerProduct"
  bottom: "ip1"
  top: "ip2"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  inner_product_param {
    num_output: 10
    weight_filler {
      type: "gaussian"
      std: 0.1
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "ip2"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip2"
  bottom: "label"
  top: "loss"
}
I0318 10:52:39.912713 32199 layer_factory.hpp:77] Creating layer data
I0318 10:52:39.912732 32199 net.cpp:106] Creating Layer data
I0318 10:52:39.912742 32199 net.cpp:411] data -> data
I0318 10:52:39.912756 32199 net.cpp:411] data -> label
I0318 10:52:39.913648 32199 net.cpp:150] Setting up data
I0318 10:52:39.913666 32199 net.cpp:157] Top shape: 100 3 32 32 (307200)
I0318 10:52:39.913677 32199 net.cpp:157] Top shape: 100 (100)
I0318 10:52:39.913683 32199 net.cpp:165] Memory required for data: 1229200
I0318 10:52:39.913691 32199 layer_factory.hpp:77] Creating layer label_data_1_split
I0318 10:52:39.913709 32199 net.cpp:106] Creating Layer label_data_1_split
I0318 10:52:39.913717 32199 net.cpp:454] label_data_1_split <- label
I0318 10:52:39.913732 32199 net.cpp:411] label_data_1_split -> label_data_1_split_0
I0318 10:52:39.913745 32199 net.cpp:411] label_data_1_split -> label_data_1_split_1
I0318 10:52:39.913802 32199 net.cpp:150] Setting up label_data_1_split
I0318 10:52:39.913822 32199 net.cpp:157] Top shape: 100 (100)
I0318 10:52:39.913830 32199 net.cpp:157] Top shape: 100 (100)
I0318 10:52:39.913837 32199 net.cpp:165] Memory required for data: 1230000
I0318 10:52:39.913844 32199 layer_factory.hpp:77] Creating layer conv1
I0318 10:52:39.913861 32199 net.cpp:106] Creating Layer conv1
I0318 10:52:39.913868 32199 net.cpp:454] conv1 <- data
I0318 10:52:39.913879 32199 net.cpp:411] conv1 -> conv1
I0318 10:52:39.915364 32199 net.cpp:150] Setting up conv1
I0318 10:52:39.915385 32199 net.cpp:157] Top shape: 100 32 32 32 (3276800)
I0318 10:52:39.915393 32199 net.cpp:165] Memory required for data: 14337200
I0318 10:52:39.915408 32199 layer_factory.hpp:77] Creating layer pool1
I0318 10:52:39.915423 32199 net.cpp:106] Creating Layer pool1
I0318 10:52:39.915432 32199 net.cpp:454] pool1 <- conv1
I0318 10:52:39.915441 32199 net.cpp:411] pool1 -> pool1
I0318 10:52:39.915503 32199 net.cpp:150] Setting up pool1
I0318 10:52:39.915518 32199 net.cpp:157] Top shape: 100 32 16 16 (819200)
I0318 10:52:39.915525 32199 net.cpp:165] Memory required for data: 17614000
I0318 10:52:39.915532 32199 layer_factory.hpp:77] Creating layer relu1
I0318 10:52:39.915544 32199 net.cpp:106] Creating Layer relu1
I0318 10:52:39.915550 32199 net.cpp:454] relu1 <- pool1
I0318 10:52:39.915561 32199 net.cpp:397] relu1 -> pool1 (in-place)
I0318 10:52:39.915945 32199 net.cpp:150] Setting up relu1
I0318 10:52:39.915966 32199 net.cpp:157] Top shape: 100 32 16 16 (819200)
I0318 10:52:39.915974 32199 net.cpp:165] Memory required for data: 20890800
I0318 10:52:39.915982 32199 layer_factory.hpp:77] Creating layer conv2
I0318 10:52:39.915998 32199 net.cpp:106] Creating Layer conv2
I0318 10:52:39.916007 32199 net.cpp:454] conv2 <- pool1
I0318 10:52:39.916018 32199 net.cpp:411] conv2 -> conv2
I0318 10:52:39.918570 32199 net.cpp:150] Setting up conv2
I0318 10:52:39.918589 32199 net.cpp:157] Top shape: 100 32 16 16 (819200)
I0318 10:52:39.918598 32199 net.cpp:165] Memory required for data: 24167600
I0318 10:52:39.918612 32199 layer_factory.hpp:77] Creating layer relu2
I0318 10:52:39.918623 32199 net.cpp:106] Creating Layer relu2
I0318 10:52:39.918632 32199 net.cpp:454] relu2 <- conv2
I0318 10:52:39.918655 32199 net.cpp:397] relu2 -> conv2 (in-place)
I0318 10:52:39.919062 32199 net.cpp:150] Setting up relu2
I0318 10:52:39.919083 32199 net.cpp:157] Top shape: 100 32 16 16 (819200)
I0318 10:52:39.919091 32199 net.cpp:165] Memory required for data: 27444400
I0318 10:52:39.919098 32199 layer_factory.hpp:77] Creating layer pool2
I0318 10:52:39.919109 32199 net.cpp:106] Creating Layer pool2
I0318 10:52:39.919116 32199 net.cpp:454] pool2 <- conv2
I0318 10:52:39.919126 32199 net.cpp:411] pool2 -> pool2
I0318 10:52:39.919373 32199 net.cpp:150] Setting up pool2
I0318 10:52:39.919390 32199 net.cpp:157] Top shape: 100 32 8 8 (204800)
I0318 10:52:39.919397 32199 net.cpp:165] Memory required for data: 28263600
I0318 10:52:39.919404 32199 layer_factory.hpp:77] Creating layer conv3
I0318 10:52:39.919423 32199 net.cpp:106] Creating Layer conv3
I0318 10:52:39.919432 32199 net.cpp:454] conv3 <- pool2
I0318 10:52:39.919445 32199 net.cpp:411] conv3 -> conv3
I0318 10:52:39.922948 32199 net.cpp:150] Setting up conv3
I0318 10:52:39.922972 32199 net.cpp:157] Top shape: 100 64 8 8 (409600)
I0318 10:52:39.922979 32199 net.cpp:165] Memory required for data: 29902000
I0318 10:52:39.922994 32199 layer_factory.hpp:77] Creating layer relu3
I0318 10:52:39.923005 32199 net.cpp:106] Creating Layer relu3
I0318 10:52:39.923012 32199 net.cpp:454] relu3 <- conv3
I0318 10:52:39.923022 32199 net.cpp:397] relu3 -> conv3 (in-place)
I0318 10:52:39.923421 32199 net.cpp:150] Setting up relu3
I0318 10:52:39.923441 32199 net.cpp:157] Top shape: 100 64 8 8 (409600)
I0318 10:52:39.923449 32199 net.cpp:165] Memory required for data: 31540400
I0318 10:52:39.923456 32199 layer_factory.hpp:77] Creating layer pool3
I0318 10:52:39.923466 32199 net.cpp:106] Creating Layer pool3
I0318 10:52:39.923473 32199 net.cpp:454] pool3 <- conv3
I0318 10:52:39.923482 32199 net.cpp:411] pool3 -> pool3
I0318 10:52:39.923745 32199 net.cpp:150] Setting up pool3
I0318 10:52:39.923763 32199 net.cpp:157] Top shape: 100 64 4 4 (102400)
I0318 10:52:39.923770 32199 net.cpp:165] Memory required for data: 31950000
I0318 10:52:39.923777 32199 layer_factory.hpp:77] Creating layer ip1
I0318 10:52:39.923789 32199 net.cpp:106] Creating Layer ip1
I0318 10:52:39.923795 32199 net.cpp:454] ip1 <- pool3
I0318 10:52:39.923809 32199 net.cpp:411] ip1 -> ip1
I0318 10:52:39.927144 32199 net.cpp:150] Setting up ip1
I0318 10:52:39.927163 32199 net.cpp:157] Top shape: 100 64 (6400)
I0318 10:52:39.927171 32199 net.cpp:165] Memory required for data: 31975600
I0318 10:52:39.927182 32199 layer_factory.hpp:77] Creating layer ip2
I0318 10:52:39.927194 32199 net.cpp:106] Creating Layer ip2
I0318 10:52:39.927202 32199 net.cpp:454] ip2 <- ip1
I0318 10:52:39.927217 32199 net.cpp:411] ip2 -> ip2
I0318 10:52:39.927392 32199 net.cpp:150] Setting up ip2
I0318 10:52:39.927407 32199 net.cpp:157] Top shape: 100 10 (1000)
I0318 10:52:39.927414 32199 net.cpp:165] Memory required for data: 31979600
I0318 10:52:39.927429 32199 layer_factory.hpp:77] Creating layer ip2_ip2_0_split
I0318 10:52:39.927443 32199 net.cpp:106] Creating Layer ip2_ip2_0_split
I0318 10:52:39.927451 32199 net.cpp:454] ip2_ip2_0_split <- ip2
I0318 10:52:39.927460 32199 net.cpp:411] ip2_ip2_0_split -> ip2_ip2_0_split_0
I0318 10:52:39.927472 32199 net.cpp:411] ip2_ip2_0_split -> ip2_ip2_0_split_1
I0318 10:52:39.927526 32199 net.cpp:150] Setting up ip2_ip2_0_split
I0318 10:52:39.927539 32199 net.cpp:157] Top shape: 100 10 (1000)
I0318 10:52:39.927547 32199 net.cpp:157] Top shape: 100 10 (1000)
I0318 10:52:39.927553 32199 net.cpp:165] Memory required for data: 31987600
I0318 10:52:39.927561 32199 layer_factory.hpp:77] Creating layer accuracy
I0318 10:52:39.927574 32199 net.cpp:106] Creating Layer accuracy
I0318 10:52:39.927582 32199 net.cpp:454] accuracy <- ip2_ip2_0_split_0
I0318 10:52:39.927590 32199 net.cpp:454] accuracy <- label_data_1_split_0
I0318 10:52:39.927600 32199 net.cpp:411] accuracy -> accuracy
I0318 10:52:39.927614 32199 net.cpp:150] Setting up accuracy
I0318 10:52:39.927624 32199 net.cpp:157] Top shape: (1)
I0318 10:52:39.927630 32199 net.cpp:165] Memory required for data: 31987604
I0318 10:52:39.927636 32199 layer_factory.hpp:77] Creating layer loss
I0318 10:52:39.927654 32199 net.cpp:106] Creating Layer loss
I0318 10:52:39.927664 32199 net.cpp:454] loss <- ip2_ip2_0_split_1
I0318 10:52:39.927671 32199 net.cpp:454] loss <- label_data_1_split_1
I0318 10:52:39.927680 32199 net.cpp:411] loss -> loss
I0318 10:52:39.927692 32199 layer_factory.hpp:77] Creating layer loss
I0318 10:52:39.928221 32199 net.cpp:150] Setting up loss
I0318 10:52:39.928239 32199 net.cpp:157] Top shape: (1)
I0318 10:52:39.928247 32199 net.cpp:160] with loss weight 1
I0318 10:52:39.928258 32199 net.cpp:165] Memory required for data: 31987608
I0318 10:52:39.928266 32199 net.cpp:226] loss needs backward computation.
I0318 10:52:39.928273 32199 net.cpp:228] accuracy does not need backward computation.
I0318 10:52:39.928280 32199 net.cpp:226] ip2_ip2_0_split needs backward computation.
I0318 10:52:39.928287 32199 net.cpp:226] ip2 needs backward computation.
I0318 10:52:39.928293 32199 net.cpp:226] ip1 needs backward computation.
I0318 10:52:39.928299 32199 net.cpp:226] pool3 needs backward computation.
I0318 10:52:39.928306 32199 net.cpp:226] relu3 needs backward computation.
I0318 10:52:39.928313 32199 net.cpp:226] conv3 needs backward computation.
I0318 10:52:39.928318 32199 net.cpp:226] pool2 needs backward computation.
I0318 10:52:39.928325 32199 net.cpp:226] relu2 needs backward computation.
I0318 10:52:39.928330 32199 net.cpp:226] conv2 needs backward computation.
I0318 10:52:39.928338 32199 net.cpp:226] relu1 needs backward computation.
I0318 10:52:39.928344 32199 net.cpp:226] pool1 needs backward computation.
I0318 10:52:39.928349 32199 net.cpp:226] conv1 needs backward computation.
I0318 10:52:39.928356 32199 net.cpp:228] label_data_1_split does not need backward computation.
I0318 10:52:39.928364 32199 net.cpp:228] data does not need backward computation.
I0318 10:52:39.928369 32199 net.cpp:270] This network produces output accuracy
I0318 10:52:39.928377 32199 net.cpp:270] This network produces output loss
I0318 10:52:39.928400 32199 net.cpp:283] Network initialization done.
I0318 10:52:39.928475 32199 solver.cpp:60] Solver scaffolding done.
I0318 10:52:39.929064 32199 socket.cpp:219] Waiting for valid port [0]
I0318 10:52:39.929131 32209 socket.cpp:158] Assigned socket server port [55211]
I0318 10:52:39.929682 32209 socket.cpp:171] Socket Server ready []
I0318 10:52:39.939147 32199 socket.cpp:219] Waiting for valid port [55211]
I0318 10:52:39.939160 32199 socket.cpp:227] Valid port found [55211]
I0318 10:52:39.939175 32199 CaffeNet.cpp:186] Socket adapter: yuntu2:55211
I0318 10:52:39.939337 32199 CaffeNet.cpp:325] 0-th Socket addr:
I0318 10:52:39.939352 32199 CaffeNet.cpp:325] 1-th Socket addr: yuntu2:55211
I0318 10:52:39.939363 32199 JniCaffeNet.cpp:110] 0-th local addr:
I0318 10:52:39.939368 32199 JniCaffeNet.cpp:110] 1-th local addr: yuntu2:55211
16/03/18 10:52:39 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 919 bytes result sent to driver
16/03/18 10:52:40 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 3
16/03/18 10:52:40 INFO executor.Executor: Running task 1.0 in stage 1.0 (TID 3)
16/03/18 10:52:40 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 2
16/03/18 10:52:40 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1582.0 B, free 6.7 KB)
16/03/18 10:52:40 INFO broadcast.TorrentBroadcast: Reading broadcast variable 2 took 17 ms
16/03/18 10:52:40 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.6 KB, free 9.3 KB)
16/03/18 10:52:40 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1
16/03/18 10:52:40 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 87.0 B, free 9.4 KB)
16/03/18 10:52:40 INFO broadcast.TorrentBroadcast: Reading broadcast variable 1 took 15 ms
16/03/18 10:52:40 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 344.0 B, free 9.7 KB)
I0318 10:52:40.148423 32199 common.cpp:61] 0-th string is NULL
I0318 10:52:40.148521 32199 socket.cpp:250] Trying to connect with ...[yuntu1:54029]
I0318 10:52:40.148941 32199 socket.cpp:309] Connected to server [yuntu1:54029] with client_fd [282]
I0318 10:52:40.183872 32209 socket.cpp:184] Accepted the connection from client [yuntu1]
I0318 10:52:50.150034 32199 parallel.cpp:392] GPUs pairs
I0318 10:52:50.157075 32222 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
I0318 10:52:50.160940 32223 data_transformer.cpp:25] Loading mean file from: /home/atlas/work/caffe_spark/CaffeOnSpark-master/data/mean.binaryproto
16/03/18 10:52:50 INFO executor.Executor: Finished task 1.0 in stage 1.0 (TID 3). 899 bytes result sent to driver
16/03/18 10:52:51 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 4
16/03/18 10:52:51 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 4)
16/03/18 10:52:51 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
16/03/18 10:52:51 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1412.0 B, free 11.1 KB)
16/03/18 10:52:51 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 20 ms
16/03/18 10:52:51 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.2 KB, free 13.3 KB)
16/03/18 10:52:51 INFO spark.CacheManager: Partition rdd_5_0 not found, computing it
16/03/18 10:52:51 INFO spark.CacheManager: Partition rdd_0_0 not found, computing it
16/03/18 10:52:51 INFO caffe.LmdbRDD: Processing partition 0
16/03/18 10:52:53 INFO caffe.LmdbRDD: Completed partition 0
16/03/18 10:52:53 INFO storage.BlockManager: Found block rdd_0_0 locally
16/03/18 10:52:53 INFO storage.MemoryStore: Block rdd_5_0 stored as values in memory (estimated size 40.0 B, free 13.3 KB)
16/03/18 10:52:53 INFO executor.Executor: Finished task 0.0 in stage 2.0 (TID 4). 1549 bytes result sent to driver
16/03/18 10:52:53 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 7
16/03/18 10:52:53 INFO executor.Executor: Running task 0.0 in stage 3.0 (TID 7)
16/03/18 10:52:53 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 4
16/03/18 10:52:53 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 1411.0 B, free 14.7 KB)
16/03/18 10:52:53 INFO broadcast.TorrentBroadcast: Reading broadcast variable 4 took 18 ms
16/03/18 10:52:53 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 2.2 KB, free 16.8 KB)
16/03/18 10:52:53 INFO storage.BlockManager: Found block rdd_5_0 locally
16/03/18 10:52:53 INFO executor.Executor: Finished task 0.0 in stage 3.0 (TID 7). 2003 bytes result sent to driver
16/03/18 10:52:53 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 8
16/03/18 10:52:53 INFO executor.Executor: Running task 0.0 in stage 4.0 (TID 8)
16/03/18 10:52:53 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 5
16/03/18 10:52:53 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 1380.0 B, free 18.2 KB)
16/03/18 10:52:53 INFO broadcast.TorrentBroadcast: Reading broadcast variable 5 took 17 ms
16/03/18 10:52:53 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 2.1 KB, free 20.2 KB)
16/03/18 10:52:53 INFO storage.BlockManager: Found block rdd_0_0 locally
I0318 10:52:54.014348 32222 solver.cpp:237] Iteration 0, loss = 2.30203
I0318 10:52:54.014411 32222 solver.cpp:253] Train net output #0: loss = 2.30203 (* 1 = 2.30203 loss)
I0318 10:52:54.054888 32222 sgd_solver.cpp:106] Iteration 0, lr = 0.001
E0318 10:53:01.022980 32210 socket.cpp:61] ERROR: Read partial messageheader [4 of 12]
16/03/18 11:22:15 INFO storage.BlockManager: Removing RDD 5
16/03/21 08:26:08 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
16/03/21 08:26:08 INFO storage.DiskBlockManager: Shutdown hook called
16/03/21 08:26:08 INFO util.ShutdownHookManager: Shutdown hook called


junshi15 avatar junshi15 commented on August 22, 2024

RECEIVED SIGNAL 15: SIGTERM
It looks like memory is the limiting factor.
Please add the following configs to your launch command:
--driver-memory Xg --conf spark.yarn.driver.memoryOverhead=Y
--executor-memory Xg --conf spark.yarn.executor.memoryOverhead=Y
Make X and Y as large as you can; X is in GB and Y is in MB. For example, to set the driver memory to 16 GB and the overhead to 8 GB, use the following:
--driver-memory 16g --conf spark.yarn.driver.memoryOverhead=8192
Do the same for the executor memory.
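Put together with the launch command shown earlier in this thread, the result might look like the sketch below. The 16g and 8192 values are illustrative placeholders, not recommendations; size them to what your nodes can actually provide:

spark-submit --master yarn --deploy-mode cluster \
  --num-executors ${SPARK_WORKER_INSTANCES} \
  --driver-memory 16g --conf spark.yarn.driver.memoryOverhead=8192 \
  --executor-memory 16g --conf spark.yarn.executor.memoryOverhead=8192 \
  --files ./data/cifar10_quick_solver.prototxt,./data/cifar10_quick_train_test.prototxt,./data/mean.binaryproto \
  --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
  --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
  --class com.yahoo.ml.caffe.CaffeOnSpark \
  ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
  -train -features accuracy,loss -label label \
  -conf cifar10_quick_solver.prototxt \
  -devices ${DEVICES} -connection ethernet \
  -model result/cifar10.model.h5 \
  -output result/cifar10_features_result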


mriduljain avatar mriduljain commented on August 22, 2024

This is happening because one of your processes is misbehaving, dead, or stuck, which causes communication problems for the other.
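To confirm which container died first (for example, killed by YARN for exceeding its memory limits), one could search the aggregated application logs. A minimal sketch, assuming log aggregation is enabled; the application ID is a placeholder to replace with your own:

# Look for container kills and memory-limit messages in the YARN logs
yarn logs -applicationId application_1458123456789_0001 \
  | grep -iE 'killing container|beyond physical memory|sigterm'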


dejunzhang avatar dejunzhang commented on August 22, 2024

@junshi15, you are right. Thank you for your help. :)

