
Comments (7)

junshi15 commented on August 22, 2024

It's hard to say what went wrong. If you have the Spark web UI, you can look at the executor thread dump (or log in to your executors and run "jstack your_spark_pid"). It may show why the program was stuck.
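For reference, a minimal sketch of capturing and inspecting such a dump (the PID-discovery command and file name are assumptions; the grep pattern matches the stack trace posted later in this thread):

```shell
# Sketch only: take a thread dump of a stuck Spark executor.
# On the executor host, the executor JVM's PID can be found with jps
# (an assumption; "ps" or the web UI's executor page also work):
#
#   pid=$(jps | awk '/CoarseGrainedExecutorBackend/ {print $1}')
#   jstack "$pid" > executor.tdump
#
# The interesting frames are task threads blocked in native code. Below we
# write a stand-in dump inline (for illustration) and grep it the same way
# you would grep a real jstack output:
printf '%s\n' \
  '"Executor task launch worker-0" daemon prio=10 RUNNABLE' \
  '   at com.yahoo.ml.jcaffe.CaffeNet.sync(Native Method)' \
  '   at com.yahoo.ml.caffe.CaffeProcessor.sync(CaffeProcessor.scala:156)' \
  > executor.tdump
grep -n 'CaffeNet.sync' executor.tdump
```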

from caffeonspark.

anfeng commented on August 22, 2024

Please attach the log file to this issue. We need more context to debug.


chengdianxuezi commented on August 22, 2024

com.yahoo.ml.jcaffe.CaffeNet.sync(Native Method)
com.yahoo.ml.caffe.CaffeProcessor.sync(CaffeProcessor.scala:156)
com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$6.apply(CaffeOnSpark.scala:176)
com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$6.apply(CaffeOnSpark.scala:171)
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
org.apache.spark.scheduler.Task.run(Task.scala:89)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:724)


anfeng commented on August 22, 2024

@chengdianxuezi Can you post the WHOLE log (not just that small section)? You can get the complete log via "yarn logs -applicationId ..."
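For reference, the full aggregated log could be pulled like this once the application finishes (the application id is copied from the container names posted below; treat the exact flags as an assumption against your Hadoop version):

```shell
# Sketch: fetch the complete aggregated YARN logs for this run.
# Application id taken from container_1458635993280_0010_* in the logs below.
yarn logs -applicationId application_1458635993280_0010 > full_app.log
```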


chengdianxuezi commented on August 22, 2024

Container: container_1458635993280_0010_01_000006 on dlgpu10.ai.bjcc.qihoo.net_54591

LogType:stderr
Log Upload Time:23-Mar-2016 16:53:58
LogLength:31209
Log Contents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/dev/hadoop/nm-local-dir/usercache/hadoop/filecache/11/spark-assembly-1.6.0-hadoop2.6.4-U4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/software/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/03/23 16:51:14 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
16/03/23 16:51:14 INFO yarn.YarnSparkHadoopUtil: Set ugi in YarnConf to: hadoop,hadoop
16/03/23 16:51:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/23 16:51:14 INFO yarn.YarnSparkHadoopUtil: running as user: hadoop
16/03/23 16:51:14 INFO spark.SecurityManager: Changing view acls to: hadoop
16/03/23 16:51:14 INFO spark.SecurityManager: Changing modify acls to: hadoop
16/03/23 16:51:14 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
16/03/23 16:51:15 INFO spark.SecurityManager: Changing view acls to: hadoop
16/03/23 16:51:15 INFO spark.SecurityManager: Changing modify acls to: hadoop
16/03/23 16:51:15 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
16/03/23 16:51:16 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/03/23 16:51:16 INFO Remoting: Starting remoting
16/03/23 16:51:16 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:54395]
16/03/23 16:51:16 INFO util.Utils: Successfully started service 'sparkExecutorActorSystem' on port 54395.
16/03/23 16:51:16 INFO storage.DiskBlockManager: Created local directory at /home/hadoop/yarn/nm-local-dir/usercache/hadoop/appcache/application_1458635993280_0010/blockmgr-263aa769-42b6-40d7-8246-276962488419
16/03/23 16:51:16 INFO storage.DiskBlockManager: Created local directory at /dev/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1458635993280_0010/blockmgr-3db9d9f0-d30e-4dc4-99d7-4699802069b3
16/03/23 16:51:16 INFO storage.DiskBlockManager: Created local directory at /run/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1458635993280_0010/blockmgr-79eb8e7e-19b9-448f-bfbe-b2ad11e47af2
16/03/23 16:51:16 INFO storage.MemoryStore: MemoryStore started with capacity 24.9 GB
16/03/23 16:51:16 INFO yarn.YarnSparkHadoopUtil: Set ugi in YarnConf to: hadoop,hadoop
16/03/23 16:51:16 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://[email protected]:45855
16/03/23 16:51:16 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
16/03/23 16:51:16 INFO executor.Executor: Starting executor ID 4 on host dlgpu10.ai.bjcc.qihoo.net
16/03/23 16:51:16 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 46213.
16/03/23 16:51:16 INFO netty.NettyBlockTransferService: Server created on 46213
16/03/23 16:51:16 INFO storage.BlockManager: external shuffle service port = 7337
16/03/23 16:51:16 INFO storage.BlockManagerMaster: Trying to register BlockManager
16/03/23 16:51:16 INFO storage.BlockManagerMaster: Registered BlockManager
16/03/23 16:51:16 INFO storage.BlockManager: Registering executor with local external shuffle service.
16/03/23 16:51:19 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 4
16/03/23 16:51:19 INFO executor.Executor: Running task 4.0 in stage 0.0 (TID 4)
16/03/23 16:51:19 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
16/03/23 16:51:20 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.1 KB, free 2.1 KB)
16/03/23 16:51:20 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 296 ms
16/03/23 16:51:20 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.2 KB, free 5.3 KB)
16/03/23 16:51:20 INFO caffe.CaffeProcessor: my rank is 4
16/03/23 16:51:20 INFO caffe.LMDB: Batch size:64
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0323 16:51:27.078397 26145 CaffeNet.cpp:78] set root solver device id to 0
I0323 16:51:27.259912 26145 solver.cpp:48] Initializing solver from parameters:
test_iter: 1
test_interval: 10001
base_lr: 0.01
display: 100
max_iter: 10000
lr_policy: "inv"
gamma: 0.0001
power: 0.75
momentum: 0.9
weight_decay: 0.0005
snapshot: 10001
snapshot_prefix: "mnist_lenet"
solver_mode: GPU
device_id: 0
net_param {
name: "LeNet"
layer {
name: "data"
type: "MemoryData"
top: "data"
top: "label"
include {
phase: TRAIN
}
memory_data_param {
batch_size: 64
channels: 1
height: 28
width: 28
share_in_parallel: false
source: "hdfs:///projects/machine_learning/image_dataset/mnist_train_lmdb/"
}
source_class: "com.yahoo.ml.caffe.LMDB"
}
layer {
name: "data"
type: "MemoryData"
top: "data"
top: "label"
include {
phase: TEST
}
memory_data_param {
batch_size: 100
channels: 1
height: 28
width: 28
share_in_parallel: false
source: "hdfs:///projects/machine_learning/image_dataset/mnist_test_lmdb/"
}
source_class: "com.yahoo.ml.caffe.LMDB"
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 20
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool1"
type: "Pooling"
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "conv2"
type: "Convolution"
bottom: "pool1"
top: "conv2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 50
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool2"
type: "Pooling"
bottom: "conv2"
top: "pool2"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "ip1"
type: "InnerProduct"
bottom: "pool2"
top: "ip1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 500
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu1"
type: "ReLU"
bottom: "ip1"
top: "ip1"
}
layer {
name: "ip2"
type: "InnerProduct"
bottom: "ip1"
top: "ip2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 10
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "accuracy"
type: "Accuracy"
bottom: "ip2"
bottom: "label"
top: "accuracy"
include {
phase: TEST
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip2"
bottom: "label"
top: "loss"
}
}
test_initialization: false
I0323 16:51:27.260068 26145 solver.cpp:86] Creating training net specified in net_param.
I0323 16:51:27.260135 26145 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer data
I0323 16:51:27.260150 26145 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0323 16:51:27.260231 26145 net.cpp:49] Initializing net from parameters:
name: "LeNet"
state {
phase: TRAIN
}
layer {
name: "data"
type: "MemoryData"
top: "data"
top: "label"
include {
phase: TRAIN
}
memory_data_param {
batch_size: 64
channels: 1
height: 28
width: 28
share_in_parallel: false
source: "hdfs:///projects/machine_learning/image_dataset/mnist_train_lmdb/"
}
source_class: "com.yahoo.ml.caffe.LMDB"
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 20
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool1"
type: "Pooling"
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "conv2"
type: "Convolution"
bottom: "pool1"
top: "conv2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 50
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool2"
type: "Pooling"
bottom: "conv2"
top: "pool2"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "ip1"
type: "InnerProduct"
bottom: "pool2"
top: "ip1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 500
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu1"
type: "ReLU"
bottom: "ip1"
top: "ip1"
}
layer {
name: "ip2"
type: "InnerProduct"
bottom: "ip1"
top: "ip2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 10
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip2"
bottom: "label"
top: "loss"
}
I0323 16:51:27.260318 26145 layer_factory.hpp:77] Creating layer data
I0323 16:51:27.260342 26145 net.cpp:106] Creating Layer data
I0323 16:51:27.260352 26145 net.cpp:411] data -> data
I0323 16:51:27.260378 26145 net.cpp:411] data -> label
I0323 16:51:27.264107 26145 net.cpp:150] Setting up data
I0323 16:51:27.264132 26145 net.cpp:157] Top shape: 64 1 28 28 (50176)
I0323 16:51:27.264139 26145 net.cpp:157] Top shape: 64 (64)
I0323 16:51:27.264143 26145 net.cpp:165] Memory required for data: 200960
I0323 16:51:27.264153 26145 layer_factory.hpp:77] Creating layer conv1
I0323 16:51:27.264171 26145 net.cpp:106] Creating Layer conv1
I0323 16:51:27.264179 26145 net.cpp:454] conv1 <- data
I0323 16:51:27.264192 26145 net.cpp:411] conv1 -> conv1
I0323 16:51:27.266789 26145 net.cpp:150] Setting up conv1
I0323 16:51:27.266804 26145 net.cpp:157] Top shape: 64 20 24 24 (737280)
I0323 16:51:27.266808 26145 net.cpp:165] Memory required for data: 3150080
I0323 16:51:27.266825 26145 layer_factory.hpp:77] Creating layer pool1
I0323 16:51:27.266836 26145 net.cpp:106] Creating Layer pool1
I0323 16:51:27.266841 26145 net.cpp:454] pool1 <- conv1
I0323 16:51:27.266847 26145 net.cpp:411] pool1 -> pool1
I0323 16:51:27.266974 26145 net.cpp:150] Setting up pool1
I0323 16:51:27.266983 26145 net.cpp:157] Top shape: 64 20 12 12 (184320)
I0323 16:51:27.266988 26145 net.cpp:165] Memory required for data: 3887360
I0323 16:51:27.266991 26145 layer_factory.hpp:77] Creating layer conv2
I0323 16:51:27.267002 26145 net.cpp:106] Creating Layer conv2
I0323 16:51:27.267006 26145 net.cpp:454] conv2 <- pool1
I0323 16:51:27.267016 26145 net.cpp:411] conv2 -> conv2
I0323 16:51:27.267971 26145 net.cpp:150] Setting up conv2
I0323 16:51:27.267982 26145 net.cpp:157] Top shape: 64 50 8 8 (204800)
I0323 16:51:27.267985 26145 net.cpp:165] Memory required for data: 4706560
I0323 16:51:27.267994 26145 layer_factory.hpp:77] Creating layer pool2
I0323 16:51:27.268008 26145 net.cpp:106] Creating Layer pool2
I0323 16:51:27.268013 26145 net.cpp:454] pool2 <- conv2
I0323 16:51:27.268018 26145 net.cpp:411] pool2 -> pool2
I0323 16:51:27.268118 26145 net.cpp:150] Setting up pool2
I0323 16:51:27.268129 26145 net.cpp:157] Top shape: 64 50 4 4 (51200)
I0323 16:51:27.268133 26145 net.cpp:165] Memory required for data: 4911360
I0323 16:51:27.268137 26145 layer_factory.hpp:77] Creating layer ip1
I0323 16:51:27.268149 26145 net.cpp:106] Creating Layer ip1
I0323 16:51:27.268154 26145 net.cpp:454] ip1 <- pool2
I0323 16:51:27.268159 26145 net.cpp:411] ip1 -> ip1
I0323 16:51:27.274052 26145 net.cpp:150] Setting up ip1
I0323 16:51:27.274067 26145 net.cpp:157] Top shape: 64 500 (32000)
I0323 16:51:27.274071 26145 net.cpp:165] Memory required for data: 5039360
I0323 16:51:27.274081 26145 layer_factory.hpp:77] Creating layer relu1
I0323 16:51:27.274091 26145 net.cpp:106] Creating Layer relu1
I0323 16:51:27.274096 26145 net.cpp:454] relu1 <- ip1
I0323 16:51:27.274103 26145 net.cpp:397] relu1 -> ip1 (in-place)
I0323 16:51:27.274114 26145 net.cpp:150] Setting up relu1
I0323 16:51:27.274121 26145 net.cpp:157] Top shape: 64 500 (32000)
I0323 16:51:27.274123 26145 net.cpp:165] Memory required for data: 5167360
I0323 16:51:27.274127 26145 layer_factory.hpp:77] Creating layer ip2
I0323 16:51:27.274133 26145 net.cpp:106] Creating Layer ip2
I0323 16:51:27.274137 26145 net.cpp:454] ip2 <- ip1
I0323 16:51:27.274145 26145 net.cpp:411] ip2 -> ip2
I0323 16:51:27.275876 26145 net.cpp:150] Setting up ip2
I0323 16:51:27.275890 26145 net.cpp:157] Top shape: 64 10 (640)
I0323 16:51:27.275894 26145 net.cpp:165] Memory required for data: 5169920
I0323 16:51:27.275902 26145 layer_factory.hpp:77] Creating layer loss
I0323 16:51:27.275916 26145 net.cpp:106] Creating Layer loss
I0323 16:51:27.275920 26145 net.cpp:454] loss <- ip2
I0323 16:51:27.275930 26145 net.cpp:454] loss <- label
I0323 16:51:27.275941 26145 net.cpp:411] loss -> loss
I0323 16:51:27.275959 26145 layer_factory.hpp:77] Creating layer loss
I0323 16:51:27.276264 26145 net.cpp:150] Setting up loss
I0323 16:51:27.276273 26145 net.cpp:157] Top shape: (1)
I0323 16:51:27.276278 26145 net.cpp:160] with loss weight 1
I0323 16:51:27.276295 26145 net.cpp:165] Memory required for data: 5169924
I0323 16:51:27.276299 26145 net.cpp:226] loss needs backward computation.
I0323 16:51:27.276304 26145 net.cpp:226] ip2 needs backward computation.
I0323 16:51:27.276307 26145 net.cpp:226] relu1 needs backward computation.
I0323 16:51:27.276311 26145 net.cpp:226] ip1 needs backward computation.
I0323 16:51:27.276314 26145 net.cpp:226] pool2 needs backward computation.
I0323 16:51:27.276319 26145 net.cpp:226] conv2 needs backward computation.
I0323 16:51:27.276324 26145 net.cpp:226] pool1 needs backward computation.
I0323 16:51:27.276329 26145 net.cpp:226] conv1 needs backward computation.
I0323 16:51:27.276332 26145 net.cpp:228] data does not need backward computation.
I0323 16:51:27.276336 26145 net.cpp:270] This network produces output loss
I0323 16:51:27.276346 26145 net.cpp:283] Network initialization done.
I0323 16:51:27.276401 26145 solver.cpp:181] Creating test net (#0) specified by net_param
I0323 16:51:27.276422 26145 net.cpp:322] The NetState phase (1) differed from the phase (0) specified by a rule in layer data
I0323 16:51:27.276518 26145 net.cpp:49] Initializing net from parameters:
name: "LeNet"
state {
phase: TEST
}
layer {
name: "data"
type: "MemoryData"
top: "data"
top: "label"
include {
phase: TEST
}
memory_data_param {
batch_size: 100
channels: 1
height: 28
width: 28
share_in_parallel: false
source: "hdfs:///projects/machine_learning/image_dataset/mnist_test_lmdb/"
}
source_class: "com.yahoo.ml.caffe.LMDB"
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 20
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool1"
type: "Pooling"
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "conv2"
type: "Convolution"
bottom: "pool1"
top: "conv2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 50
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool2"
type: "Pooling"
bottom: "conv2"
top: "pool2"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "ip1"
type: "InnerProduct"
bottom: "pool2"
top: "ip1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 500
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu1"
type: "ReLU"
bottom: "ip1"
top: "ip1"
}
layer {
name: "ip2"
type: "InnerProduct"
bottom: "ip1"
top: "ip2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 10
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "accuracy"
type: "Accuracy"
bottom: "ip2"
bottom: "label"
top: "accuracy"
include {
phase: TEST
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip2"
bottom: "label"
top: "loss"
}
I0323 16:51:27.276599 26145 layer_factory.hpp:77] Creating layer data
I0323 16:51:27.276609 26145 net.cpp:106] Creating Layer data
I0323 16:51:27.276615 26145 net.cpp:411] data -> data
I0323 16:51:27.276623 26145 net.cpp:411] data -> label
I0323 16:51:27.278290 26145 net.cpp:150] Setting up data
I0323 16:51:27.278304 26145 net.cpp:157] Top shape: 100 1 28 28 (78400)
I0323 16:51:27.278311 26145 net.cpp:157] Top shape: 100 (100)
I0323 16:51:27.278313 26145 net.cpp:165] Memory required for data: 314000
I0323 16:51:27.278318 26145 layer_factory.hpp:77] Creating layer label_data_1_split
I0323 16:51:27.278328 26145 net.cpp:106] Creating Layer label_data_1_split
I0323 16:51:27.278332 26145 net.cpp:454] label_data_1_split <- label
I0323 16:51:27.278342 26145 net.cpp:411] label_data_1_split -> label_data_1_split_0
I0323 16:51:27.278349 26145 net.cpp:411] label_data_1_split -> label_data_1_split_1
I0323 16:51:27.278456 26145 net.cpp:150] Setting up label_data_1_split
I0323 16:51:27.278465 26145 net.cpp:157] Top shape: 100 (100)
I0323 16:51:27.278470 26145 net.cpp:157] Top shape: 100 (100)
I0323 16:51:27.278472 26145 net.cpp:165] Memory required for data: 314800
I0323 16:51:27.278476 26145 layer_factory.hpp:77] Creating layer conv1
I0323 16:51:27.278488 26145 net.cpp:106] Creating Layer conv1
I0323 16:51:27.278492 26145 net.cpp:454] conv1 <- data
I0323 16:51:27.278501 26145 net.cpp:411] conv1 -> conv1
I0323 16:51:27.279278 26145 net.cpp:150] Setting up conv1
I0323 16:51:27.279289 26145 net.cpp:157] Top shape: 100 20 24 24 (1152000)
I0323 16:51:27.279294 26145 net.cpp:165] Memory required for data: 4922800
I0323 16:51:27.279305 26145 layer_factory.hpp:77] Creating layer pool1
I0323 16:51:27.279311 26145 net.cpp:106] Creating Layer pool1
I0323 16:51:27.279315 26145 net.cpp:454] pool1 <- conv1
I0323 16:51:27.279321 26145 net.cpp:411] pool1 -> pool1
I0323 16:51:27.279425 26145 net.cpp:150] Setting up pool1
I0323 16:51:27.279433 26145 net.cpp:157] Top shape: 100 20 12 12 (288000)
I0323 16:51:27.279436 26145 net.cpp:165] Memory required for data: 6074800
I0323 16:51:27.279440 26145 layer_factory.hpp:77] Creating layer conv2
I0323 16:51:27.279450 26145 net.cpp:106] Creating Layer conv2
I0323 16:51:27.279454 26145 net.cpp:454] conv2 <- pool1
I0323 16:51:27.279464 26145 net.cpp:411] conv2 -> conv2
I0323 16:51:27.280467 26145 net.cpp:150] Setting up conv2
I0323 16:51:27.280479 26145 net.cpp:157] Top shape: 100 50 8 8 (320000)
I0323 16:51:27.280483 26145 net.cpp:165] Memory required for data: 7354800
I0323 16:51:27.280491 26145 layer_factory.hpp:77] Creating layer pool2
I0323 16:51:27.280498 26145 net.cpp:106] Creating Layer pool2
I0323 16:51:27.280503 26145 net.cpp:454] pool2 <- conv2
I0323 16:51:27.280508 26145 net.cpp:411] pool2 -> pool2
I0323 16:51:27.280607 26145 net.cpp:150] Setting up pool2
I0323 16:51:27.280616 26145 net.cpp:157] Top shape: 100 50 4 4 (80000)
I0323 16:51:27.280622 26145 net.cpp:165] Memory required for data: 7674800
I0323 16:51:27.280624 26145 layer_factory.hpp:77] Creating layer ip1
I0323 16:51:27.280634 26145 net.cpp:106] Creating Layer ip1
I0323 16:51:27.280638 26145 net.cpp:454] ip1 <- pool2
I0323 16:51:27.280643 26145 net.cpp:411] ip1 -> ip1
I0323 16:51:27.285825 26145 net.cpp:150] Setting up ip1
I0323 16:51:27.285840 26145 net.cpp:157] Top shape: 100 500 (50000)
I0323 16:51:27.285845 26145 net.cpp:165] Memory required for data: 7874800
I0323 16:51:27.285856 26145 layer_factory.hpp:77] Creating layer relu1
I0323 16:51:27.285863 26145 net.cpp:106] Creating Layer relu1
I0323 16:51:27.285867 26145 net.cpp:454] relu1 <- ip1
I0323 16:51:27.285873 26145 net.cpp:397] relu1 -> ip1 (in-place)
I0323 16:51:27.285881 26145 net.cpp:150] Setting up relu1
I0323 16:51:27.285887 26145 net.cpp:157] Top shape: 100 500 (50000)
I0323 16:51:27.285889 26145 net.cpp:165] Memory required for data: 8074800
I0323 16:51:27.285892 26145 layer_factory.hpp:77] Creating layer ip2
I0323 16:51:27.285903 26145 net.cpp:106] Creating Layer ip2
I0323 16:51:27.285907 26145 net.cpp:454] ip2 <- ip1
I0323 16:51:27.285912 26145 net.cpp:411] ip2 -> ip2
I0323 16:51:27.286298 26145 net.cpp:150] Setting up ip2
I0323 16:51:27.286308 26145 net.cpp:157] Top shape: 100 10 (1000)
I0323 16:51:27.286311 26145 net.cpp:165] Memory required for data: 8078800
I0323 16:51:27.286319 26145 layer_factory.hpp:77] Creating layer ip2_ip2_0_split
I0323 16:51:27.286324 26145 net.cpp:106] Creating Layer ip2_ip2_0_split
I0323 16:51:27.286327 26145 net.cpp:454] ip2_ip2_0_split <- ip2
I0323 16:51:27.286335 26145 net.cpp:411] ip2_ip2_0_split -> ip2_ip2_0_split_0
I0323 16:51:27.286342 26145 net.cpp:411] ip2_ip2_0_split -> ip2_ip2_0_split_1
I0323 16:51:27.286450 26145 net.cpp:150] Setting up ip2_ip2_0_split
I0323 16:51:27.286458 26145 net.cpp:157] Top shape: 100 10 (1000)
I0323 16:51:27.286463 26145 net.cpp:157] Top shape: 100 10 (1000)
I0323 16:51:27.286466 26145 net.cpp:165] Memory required for data: 8086800
I0323 16:51:27.286469 26145 layer_factory.hpp:77] Creating layer accuracy
I0323 16:51:27.286481 26145 net.cpp:106] Creating Layer accuracy
I0323 16:51:27.286485 26145 net.cpp:454] accuracy <- ip2_ip2_0_split_0
I0323 16:51:27.286490 26145 net.cpp:454] accuracy <- label_data_1_split_0
I0323 16:51:27.286497 26145 net.cpp:411] accuracy -> accuracy
I0323 16:51:27.286509 26145 net.cpp:150] Setting up accuracy
I0323 16:51:27.286514 26145 net.cpp:157] Top shape: (1)
I0323 16:51:27.286516 26145 net.cpp:165] Memory required for data: 8086804
I0323 16:51:27.286520 26145 layer_factory.hpp:77] Creating layer loss
I0323 16:51:27.286525 26145 net.cpp:106] Creating Layer loss
I0323 16:51:27.286530 26145 net.cpp:454] loss <- ip2_ip2_0_split_1
I0323 16:51:27.286533 26145 net.cpp:454] loss <- label_data_1_split_1
I0323 16:51:27.286540 26145 net.cpp:411] loss -> loss
I0323 16:51:27.286546 26145 layer_factory.hpp:77] Creating layer loss
I0323 16:51:27.286836 26145 net.cpp:150] Setting up loss
I0323 16:51:27.286844 26145 net.cpp:157] Top shape: (1)
I0323 16:51:27.286849 26145 net.cpp:160] with loss weight 1
I0323 16:51:27.286855 26145 net.cpp:165] Memory required for data: 8086808
I0323 16:51:27.286859 26145 net.cpp:226] loss needs backward computation.
I0323 16:51:27.286862 26145 net.cpp:228] accuracy does not need backward computation.
I0323 16:51:27.286867 26145 net.cpp:226] ip2_ip2_0_split needs backward computation.
I0323 16:51:27.286870 26145 net.cpp:226] ip2 needs backward computation.
I0323 16:51:27.286875 26145 net.cpp:226] relu1 needs backward computation.
I0323 16:51:27.286877 26145 net.cpp:226] ip1 needs backward computation.
I0323 16:51:27.286881 26145 net.cpp:226] pool2 needs backward computation.
I0323 16:51:27.286885 26145 net.cpp:226] conv2 needs backward computation.
I0323 16:51:27.286888 26145 net.cpp:226] pool1 needs backward computation.
I0323 16:51:27.286892 26145 net.cpp:226] conv1 needs backward computation.
I0323 16:51:27.286898 26145 net.cpp:228] label_data_1_split does not need backward computation.
I0323 16:51:27.286906 26145 net.cpp:228] data does not need backward computation.
I0323 16:51:27.286909 26145 net.cpp:270] This network produces output accuracy
I0323 16:51:27.286912 26145 net.cpp:270] This network produces output loss
I0323 16:51:27.286924 26145 net.cpp:283] Network initialization done.
I0323 16:51:27.286973 26145 solver.cpp:60] Solver scaffolding done.
I0323 16:51:27.288064 26145 socket.cpp:219] Waiting for valid port [0]
I0323 16:51:27.288095 26165 socket.cpp:158] Assigned socket server port [46110]
I0323 16:51:27.290241 26165 socket.cpp:171] Socket Server ready [0.0.0.0]
I0323 16:51:27.298146 26145 socket.cpp:219] Waiting for valid port [46110]
I0323 16:51:27.298151 26145 socket.cpp:227] Valid port found [46110]
I0323 16:51:27.298162 26145 CaffeNet.cpp:186] Socket adapter: dlgpu10.ai.bjcc.qihoo.net:46110
I0323 16:51:27.298370 26145 CaffeNet.cpp:325] 0-th Socket addr: dlgpu10.ai.bjcc.qihoo.net:46110
I0323 16:51:27.298380 26145 CaffeNet.cpp:325] 1-th Socket addr: dlgpu10.ai.bjcc.qihoo.net:46110
I0323 16:51:27.298384 26145 CaffeNet.cpp:325] 2-th Socket addr: dlgpu10.ai.bjcc.qihoo.net:46110
I0323 16:51:27.298388 26145 CaffeNet.cpp:325] 3-th Socket addr: dlgpu10.ai.bjcc.qihoo.net:46110
I0323 16:51:27.298393 26145 CaffeNet.cpp:325] 4-th Socket addr:
I0323 16:51:27.298401 26145 JniCaffeNet.cpp:110] 0-th local addr: dlgpu10.ai.bjcc.qihoo.net:46110
I0323 16:51:27.298406 26145 JniCaffeNet.cpp:110] 1-th local addr: dlgpu10.ai.bjcc.qihoo.net:46110
I0323 16:51:27.298410 26145 JniCaffeNet.cpp:110] 2-th local addr: dlgpu10.ai.bjcc.qihoo.net:46110
I0323 16:51:27.298413 26145 JniCaffeNet.cpp:110] 3-th local addr: dlgpu10.ai.bjcc.qihoo.net:46110
I0323 16:51:27.298418 26145 JniCaffeNet.cpp:110] 4-th local addr:
16/03/23 16:51:27 INFO executor.Executor: Finished task 4.0 in stage 0.0 (TID 4). 1069 bytes result sent to driver
16/03/23 16:51:27 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 7
16/03/23 16:51:27 INFO executor.Executor: Running task 2.0 in stage 1.0 (TID 7)
16/03/23 16:51:27 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 2
16/03/23 16:51:27 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1622.0 B, free 6.9 KB)
16/03/23 16:51:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 2 took 15 ms
16/03/23 16:51:27 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.6 KB, free 9.5 KB)
16/03/23 16:51:27 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1
16/03/23 16:51:27 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 235.0 B, free 9.7 KB)
16/03/23 16:51:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 1 took 12 ms
16/03/23 16:51:27 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 12.8 KB)
I0323 16:51:27.552399 26145 common.cpp:61] 4-th string is NULL
I0323 16:51:27.552476 26145 socket.cpp:250] Trying to connect with ...[dlgpu19.ai.bjcc.qihoo.net:57402]
I0323 16:51:27.552788 26145 socket.cpp:309] Connected to server [dlgpu19.ai.bjcc.qihoo.net:57402] with client_fd [315]
I0323 16:51:37.552959 26145 socket.cpp:250] Trying to connect with ...[dlgpu19.ai.bjcc.qihoo.net:41630]
I0323 16:51:37.553390 26145 socket.cpp:309] Connected to server [dlgpu19.ai.bjcc.qihoo.net:41630] with client_fd [316]
I0323 16:51:47.553521 26145 socket.cpp:250] Trying to connect with ...[dlgpu10.ai.bjcc.qihoo.net:58067]
I0323 16:51:47.553771 26145 socket.cpp:309] Connected to server [dlgpu10.ai.bjcc.qihoo.net:58067] with client_fd [317]
I0323 16:51:57.553901 26145 socket.cpp:250] Trying to connect with ...[dlgpu20.ai.bjcc.qihoo.net:47155]
I0323 16:51:57.554342 26145 socket.cpp:309] Connected to server [dlgpu20.ai.bjcc.qihoo.net:47155] with client_fd [318]
I0323 16:51:57.555569 26165 socket.cpp:184] Accepted the connection from client [dlgpu10.ai.bjcc.qihoo.net]
I0323 16:51:57.558529 26165 socket.cpp:184] Accepted the connection from client [dlgpu20.ai.bjcc.qihoo.net]
I0323 16:51:57.560233 26165 socket.cpp:184] Accepted the connection from client [dlgpu19.ai.bjcc.qihoo.net]
I0323 16:51:57.562348 26165 socket.cpp:184] Accepted the connection from client [dlgpu19.ai.bjcc.qihoo.net]
I0323 16:52:07.600042 26145 parallel.cpp:392] GPUs pairs 0:1, 2:3, 0:2, 0:4
I0323 16:52:07.676470 26145 parallel.cpp:234] GPU 4 does not have p2p access to GPU 0
I0323 16:52:07.710218 26292 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
16/03/23 16:52:07 INFO executor.Executor: Finished task 2.0 in stage 1.0 (TID 7). 918 bytes result sent to driver
I0323 16:52:08.245970 26298 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
I0323 16:52:08.270828 26299 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
I0323 16:52:08.277041 26304 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
I0323 16:52:08.279109 26310 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
16/03/23 16:52:08 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 10
16/03/23 16:52:08 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 10)
16/03/23 16:52:08 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
16/03/23 16:52:08 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1450.0 B, free 4.7 KB)
16/03/23 16:52:08 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 12 ms
16/03/23 16:52:08 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.2 KB, free 6.9 KB)
16/03/23 16:52:08 INFO spark.CacheManager: Partition rdd_5_0 not found, computing it
16/03/23 16:52:08 INFO spark.CacheManager: Partition rdd_0_0 not found, computing it
16/03/23 16:52:08 INFO caffe.LmdbRDD: Processing partition 0
16/03/23 16:52:10 INFO caffe.LmdbRDD: Completed partition 0
16/03/23 16:52:10 INFO storage.BlockManager: Found block rdd_0_0 locally
16/03/23 16:53:58 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown
16/03/23 16:53:58 INFO storage.MemoryStore: MemoryStore cleared
16/03/23 16:53:58 INFO storage.BlockManager: BlockManager stopped
16/03/23 16:53:58 WARN executor.CoarseGrainedExecutorBackend: An unknown (dlgpu20.ai.bjcc.qihoo.net:45855) driver disconnected.
16/03/23 16:53:58 ERROR executor.CoarseGrainedExecutorBackend: Driver 10.142.118.172:45855 disassociated! Shutting down.
16/03/23 16:53:58 INFO util.ShutdownHookManager: Shutdown hook called
16/03/23 16:53:58 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.

LogType:stdout
Log Upload Time:23-Mar-2016 16:53:58
LogLength:0
Log Contents:

Container: container_1458635993280_0010_01_000003 on dlgpu10.ai.bjcc.qihoo.net_54591

LogType:stderr
Log Upload Time:23-Mar-2016 16:53:58
LogLength:31209
Log Contents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/dev/hadoop/nm-local-dir/usercache/hadoop/filecache/11/spark-assembly-1.6.0-hadoop2.6.4-U4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/software/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/03/23 16:51:12 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
16/03/23 16:51:13 INFO yarn.YarnSparkHadoopUtil: Set ugi in YarnConf to: hadoop,hadoop
16/03/23 16:51:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/23 16:51:13 INFO yarn.YarnSparkHadoopUtil: running as user: hadoop
16/03/23 16:51:13 INFO spark.SecurityManager: Changing view acls to: hadoop
16/03/23 16:51:13 INFO spark.SecurityManager: Changing modify acls to: hadoop
16/03/23 16:51:13 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
16/03/23 16:51:14 INFO spark.SecurityManager: Changing view acls to: hadoop
16/03/23 16:51:14 INFO spark.SecurityManager: Changing modify acls to: hadoop
16/03/23 16:51:14 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
16/03/23 16:51:14 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/03/23 16:51:14 INFO Remoting: Starting remoting
16/03/23 16:51:14 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:39671]
16/03/23 16:51:14 INFO util.Utils: Successfully started service 'sparkExecutorActorSystem' on port 39671.
16/03/23 16:51:14 INFO storage.DiskBlockManager: Created local directory at /home/hadoop/yarn/nm-local-dir/usercache/hadoop/appcache/application_1458635993280_0010/blockmgr-2691b3ef-44ad-456c-8e25-6ca9a207b8d8
16/03/23 16:51:14 INFO storage.DiskBlockManager: Created local directory at /dev/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1458635993280_0010/blockmgr-8983bf2d-3fed-4721-8ca6-f9ff555b5b7b
16/03/23 16:51:14 INFO storage.DiskBlockManager: Created local directory at /run/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1458635993280_0010/blockmgr-88fabbbb-2ab7-4079-bfe2-8f7ff6bdb746
16/03/23 16:51:14 INFO storage.MemoryStore: MemoryStore started with capacity 24.9 GB
16/03/23 16:51:14 INFO yarn.YarnSparkHadoopUtil: Set ugi in YarnConf to: hadoop,hadoop
16/03/23 16:51:15 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://[email protected]:45855
16/03/23 16:51:15 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
16/03/23 16:51:15 INFO executor.Executor: Starting executor ID 2 on host dlgpu10.ai.bjcc.qihoo.net
16/03/23 16:51:15 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33341.
16/03/23 16:51:15 INFO netty.NettyBlockTransferService: Server created on 33341
16/03/23 16:51:15 INFO storage.BlockManager: external shuffle service port = 7337
16/03/23 16:51:15 INFO storage.BlockManagerMaster: Trying to register BlockManager
16/03/23 16:51:15 INFO storage.BlockManagerMaster: Registered BlockManager
16/03/23 16:51:15 INFO storage.BlockManager: Registering executor with local external shuffle service.
16/03/23 16:51:19 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 2
16/03/23 16:51:19 INFO executor.Executor: Running task 2.0 in stage 0.0 (TID 2)
16/03/23 16:51:19 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
16/03/23 16:51:20 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.1 KB, free 2.1 KB)
16/03/23 16:51:20 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 277 ms
16/03/23 16:51:20 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.2 KB, free 5.3 KB)
16/03/23 16:51:20 INFO caffe.CaffeProcessor: my rank is 2
16/03/23 16:51:20 INFO caffe.LMDB: Batch size:64
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0323 16:51:27.065438 26144 CaffeNet.cpp:78] set root solver device id to 0
I0323 16:51:27.236641 26144 solver.cpp:48] Initializing solver from parameters:
test_iter: 1
test_interval: 10001
base_lr: 0.01
display: 100
max_iter: 10000
lr_policy: "inv"
gamma: 0.0001
power: 0.75
momentum: 0.9
weight_decay: 0.0005
snapshot: 10001
snapshot_prefix: "mnist_lenet"
solver_mode: GPU
device_id: 0
net_param {
  name: "LeNet"
  layer {
    name: "data"
    type: "MemoryData"
    top: "data"
    top: "label"
    include {
      phase: TRAIN
    }
    memory_data_param {
      batch_size: 64
      channels: 1
      height: 28
      width: 28
      share_in_parallel: false
      source: "hdfs:///projects/machine_learning/image_dataset/mnist_train_lmdb/"
    }
    source_class: "com.yahoo.ml.caffe.LMDB"
  }
  layer {
    name: "data"
    type: "MemoryData"
    top: "data"
    top: "label"
    include {
      phase: TEST
    }
    memory_data_param {
      batch_size: 100
      channels: 1
      height: 28
      width: 28
      share_in_parallel: false
      source: "hdfs:///projects/machine_learning/image_dataset/mnist_test_lmdb/"
    }
    source_class: "com.yahoo.ml.caffe.LMDB"
  }
  layer {
    name: "conv1"
    type: "Convolution"
    bottom: "data"
    top: "conv1"
    param {
      lr_mult: 1
    }
    param {
      lr_mult: 2
    }
    convolution_param {
      num_output: 20
      kernel_size: 5
      stride: 1
      weight_filler {
        type: "xavier"
      }
      bias_filler {
        type: "constant"
      }
    }
  }
  layer {
    name: "pool1"
    type: "Pooling"
    bottom: "conv1"
    top: "pool1"
    pooling_param {
      pool: MAX
      kernel_size: 2
      stride: 2
    }
  }
  layer {
    name: "conv2"
    type: "Convolution"
    bottom: "pool1"
    top: "conv2"
    param {
      lr_mult: 1
    }
    param {
      lr_mult: 2
    }
    convolution_param {
      num_output: 50
      kernel_size: 5
      stride: 1
      weight_filler {
        type: "xavier"
      }
      bias_filler {
        type: "constant"
      }
    }
  }
  layer {
    name: "pool2"
    type: "Pooling"
    bottom: "conv2"
    top: "pool2"
    pooling_param {
      pool: MAX
      kernel_size: 2
      stride: 2
    }
  }
  layer {
    name: "ip1"
    type: "InnerProduct"
    bottom: "pool2"
    top: "ip1"
    param {
      lr_mult: 1
    }
    param {
      lr_mult: 2
    }
    inner_product_param {
      num_output: 500
      weight_filler {
        type: "xavier"
      }
      bias_filler {
        type: "constant"
      }
    }
  }
  layer {
    name: "relu1"
    type: "ReLU"
    bottom: "ip1"
    top: "ip1"
  }
  layer {
    name: "ip2"
    type: "InnerProduct"
    bottom: "ip1"
    top: "ip2"
    param {
      lr_mult: 1
    }
    param {
      lr_mult: 2
    }
    inner_product_param {
      num_output: 10
      weight_filler {
        type: "xavier"
      }
      bias_filler {
        type: "constant"
      }
    }
  }
  layer {
    name: "accuracy"
    type: "Accuracy"
    bottom: "ip2"
    bottom: "label"
    top: "accuracy"
    include {
      phase: TEST
    }
  }
  layer {
    name: "loss"
    type: "SoftmaxWithLoss"
    bottom: "ip2"
    bottom: "label"
    top: "loss"
  }
}
test_initialization: false
I0323 16:51:27.236922 26144 solver.cpp:86] Creating training net specified in net_param.
I0323 16:51:27.237049 26144 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer data
I0323 16:51:27.237072 26144 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0323 16:51:27.237207 26144 net.cpp:49] Initializing net from parameters:
name: "LeNet"
state {
  phase: TRAIN
}
layer {
  name: "data"
  type: "MemoryData"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  memory_data_param {
    batch_size: 64
    channels: 1
    height: 28
    width: 28
    share_in_parallel: false
    source: "hdfs:///projects/machine_learning/image_dataset/mnist_train_lmdb/"
  }
  source_class: "com.yahoo.ml.caffe.LMDB"
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 20
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 50
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "pool2"
  type: "Pooling"
  bottom: "conv2"
  top: "pool2"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}
layer {
  name: "ip1"
  type: "InnerProduct"
  bottom: "pool2"
  top: "ip1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  inner_product_param {
    num_output: 500
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "ip1"
  top: "ip1"
}
layer {
  name: "ip2"
  type: "InnerProduct"
  bottom: "ip1"
  top: "ip2"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  inner_product_param {
    num_output: 10
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip2"
  bottom: "label"
  top: "loss"
}
I0323 16:51:27.237365 26144 layer_factory.hpp:77] Creating layer data
I0323 16:51:27.237407 26144 net.cpp:106] Creating Layer data
I0323 16:51:27.237423 26144 net.cpp:411] data -> data
I0323 16:51:27.237469 26144 net.cpp:411] data -> label
I0323 16:51:27.243285 26144 net.cpp:150] Setting up data
I0323 16:51:27.243335 26144 net.cpp:157] Top shape: 64 1 28 28 (50176)
I0323 16:51:27.243348 26144 net.cpp:157] Top shape: 64 (64)
I0323 16:51:27.243355 26144 net.cpp:165] Memory required for data: 200960
I0323 16:51:27.243369 26144 layer_factory.hpp:77] Creating layer conv1
I0323 16:51:27.243397 26144 net.cpp:106] Creating Layer conv1
I0323 16:51:27.243408 26144 net.cpp:454] conv1 <- data
I0323 16:51:27.243434 26144 net.cpp:411] conv1 -> conv1
I0323 16:51:27.248071 26144 net.cpp:150] Setting up conv1
I0323 16:51:27.248101 26144 net.cpp:157] Top shape: 64 20 24 24 (737280)
I0323 16:51:27.248111 26144 net.cpp:165] Memory required for data: 3150080
I0323 16:51:27.248141 26144 layer_factory.hpp:77] Creating layer pool1
I0323 16:51:27.248162 26144 net.cpp:106] Creating Layer pool1
I0323 16:51:27.248170 26144 net.cpp:454] pool1 <- conv1
I0323 16:51:27.248180 26144 net.cpp:411] pool1 -> pool1
I0323 16:51:27.248386 26144 net.cpp:150] Setting up pool1
I0323 16:51:27.248400 26144 net.cpp:157] Top shape: 64 20 12 12 (184320)
I0323 16:51:27.248407 26144 net.cpp:165] Memory required for data: 3887360
I0323 16:51:27.248414 26144 layer_factory.hpp:77] Creating layer conv2
I0323 16:51:27.248431 26144 net.cpp:106] Creating Layer conv2
I0323 16:51:27.248440 26144 net.cpp:454] conv2 <- pool1
I0323 16:51:27.248452 26144 net.cpp:411] conv2 -> conv2
I0323 16:51:27.250042 26144 net.cpp:150] Setting up conv2
I0323 16:51:27.250069 26144 net.cpp:157] Top shape: 64 50 8 8 (204800)
I0323 16:51:27.250077 26144 net.cpp:165] Memory required for data: 4706560
I0323 16:51:27.250092 26144 layer_factory.hpp:77] Creating layer pool2
I0323 16:51:27.250104 26144 net.cpp:106] Creating Layer pool2
I0323 16:51:27.250113 26144 net.cpp:454] pool2 <- conv2
I0323 16:51:27.250121 26144 net.cpp:411] pool2 -> pool2
I0323 16:51:27.250284 26144 net.cpp:150] Setting up pool2
I0323 16:51:27.250305 26144 net.cpp:157] Top shape: 64 50 4 4 (51200)
I0323 16:51:27.250313 26144 net.cpp:165] Memory required for data: 4911360
I0323 16:51:27.250319 26144 layer_factory.hpp:77] Creating layer ip1
I0323 16:51:27.250339 26144 net.cpp:106] Creating Layer ip1
I0323 16:51:27.250345 26144 net.cpp:454] ip1 <- pool2
I0323 16:51:27.250355 26144 net.cpp:411] ip1 -> ip1
I0323 16:51:27.258404 26144 net.cpp:150] Setting up ip1
I0323 16:51:27.258431 26144 net.cpp:157] Top shape: 64 500 (32000)
I0323 16:51:27.258440 26144 net.cpp:165] Memory required for data: 5039360
I0323 16:51:27.258455 26144 layer_factory.hpp:77] Creating layer relu1
I0323 16:51:27.258473 26144 net.cpp:106] Creating Layer relu1
I0323 16:51:27.258481 26144 net.cpp:454] relu1 <- ip1
I0323 16:51:27.258491 26144 net.cpp:397] relu1 -> ip1 (in-place)
I0323 16:51:27.258507 26144 net.cpp:150] Setting up relu1
I0323 16:51:27.258515 26144 net.cpp:157] Top shape: 64 500 (32000)
I0323 16:51:27.258522 26144 net.cpp:165] Memory required for data: 5167360
I0323 16:51:27.258527 26144 layer_factory.hpp:77] Creating layer ip2
I0323 16:51:27.258538 26144 net.cpp:106] Creating Layer ip2
I0323 16:51:27.258544 26144 net.cpp:454] ip2 <- ip1
I0323 16:51:27.258558 26144 net.cpp:411] ip2 -> ip2
I0323 16:51:27.260877 26144 net.cpp:150] Setting up ip2
I0323 16:51:27.260903 26144 net.cpp:157] Top shape: 64 10 (640)
I0323 16:51:27.260911 26144 net.cpp:165] Memory required for data: 5169920
I0323 16:51:27.260923 26144 layer_factory.hpp:77] Creating layer loss
I0323 16:51:27.260956 26144 net.cpp:106] Creating Layer loss
I0323 16:51:27.260963 26144 net.cpp:454] loss <- ip2
I0323 16:51:27.260972 26144 net.cpp:454] loss <- label
I0323 16:51:27.260984 26144 net.cpp:411] loss -> loss
I0323 16:51:27.261020 26144 layer_factory.hpp:77] Creating layer loss
I0323 16:51:27.261478 26144 net.cpp:150] Setting up loss
I0323 16:51:27.261492 26144 net.cpp:157] Top shape: (1)
I0323 16:51:27.261499 26144 net.cpp:160] with loss weight 1
I0323 16:51:27.261528 26144 net.cpp:165] Memory required for data: 5169924
I0323 16:51:27.261534 26144 net.cpp:226] loss needs backward computation.
I0323 16:51:27.261543 26144 net.cpp:226] ip2 needs backward computation.
I0323 16:51:27.261548 26144 net.cpp:226] relu1 needs backward computation.
I0323 16:51:27.261554 26144 net.cpp:226] ip1 needs backward computation.
I0323 16:51:27.261559 26144 net.cpp:226] pool2 needs backward computation.
I0323 16:51:27.261565 26144 net.cpp:226] conv2 needs backward computation.
I0323 16:51:27.261571 26144 net.cpp:226] pool1 needs backward computation.
I0323 16:51:27.261577 26144 net.cpp:226] conv1 needs backward computation.
I0323 16:51:27.261584 26144 net.cpp:228] data does not need backward computation.
I0323 16:51:27.261590 26144 net.cpp:270] This network produces output loss
I0323 16:51:27.261605 26144 net.cpp:283] Network initialization done.
I0323 16:51:27.261695 26144 solver.cpp:181] Creating test net (#0) specified by net_param
I0323 16:51:27.261728 26144 net.cpp:322] The NetState phase (1) differed from the phase (0) specified by a rule in layer data
I0323 16:51:27.261884 26144 net.cpp:49] Initializing net from parameters:
name: "LeNet"
state {
  phase: TEST
}
layer {
  name: "data"
  type: "MemoryData"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  memory_data_param {
    batch_size: 100
    channels: 1
    height: 28
    width: 28
    share_in_parallel: false
    source: "hdfs:///projects/machine_learning/image_dataset/mnist_test_lmdb/"
  }
  source_class: "com.yahoo.ml.caffe.LMDB"
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 20
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 50
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "pool2"
  type: "Pooling"
  bottom: "conv2"
  top: "pool2"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}
layer {
  name: "ip1"
  type: "InnerProduct"
  bottom: "pool2"
  top: "ip1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  inner_product_param {
    num_output: 500
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "ip1"
  top: "ip1"
}
layer {
  name: "ip2"
  type: "InnerProduct"
  bottom: "ip1"
  top: "ip2"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  inner_product_param {
    num_output: 10
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "ip2"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip2"
  bottom: "label"
  top: "loss"
}
I0323 16:51:27.262022 26144 layer_factory.hpp:77] Creating layer data
I0323 16:51:27.262045 26144 net.cpp:106] Creating Layer data
I0323 16:51:27.262053 26144 net.cpp:411] data -> data
I0323 16:51:27.262068 26144 net.cpp:411] data -> label
I0323 16:51:27.264725 26144 net.cpp:150] Setting up data
I0323 16:51:27.264752 26144 net.cpp:157] Top shape: 100 1 28 28 (78400)
I0323 16:51:27.264765 26144 net.cpp:157] Top shape: 100 (100)
I0323 16:51:27.264772 26144 net.cpp:165] Memory required for data: 314000
I0323 16:51:27.264780 26144 layer_factory.hpp:77] Creating layer label_data_1_split
I0323 16:51:27.264796 26144 net.cpp:106] Creating Layer label_data_1_split
I0323 16:51:27.264802 26144 net.cpp:454] label_data_1_split <- label
I0323 16:51:27.264811 26144 net.cpp:411] label_data_1_split -> label_data_1_split_0
I0323 16:51:27.264823 26144 net.cpp:411] label_data_1_split -> label_data_1_split_1
I0323 16:51:27.264999 26144 net.cpp:150] Setting up label_data_1_split
I0323 16:51:27.265017 26144 net.cpp:157] Top shape: 100 (100)
I0323 16:51:27.265025 26144 net.cpp:157] Top shape: 100 (100)
I0323 16:51:27.265032 26144 net.cpp:165] Memory required for data: 314800
I0323 16:51:27.265038 26144 layer_factory.hpp:77] Creating layer conv1
I0323 16:51:27.265058 26144 net.cpp:106] Creating Layer conv1
I0323 16:51:27.265064 26144 net.cpp:454] conv1 <- data
I0323 16:51:27.265079 26144 net.cpp:411] conv1 -> conv1
I0323 16:51:27.266289 26144 net.cpp:150] Setting up conv1
I0323 16:51:27.266309 26144 net.cpp:157] Top shape: 100 20 24 24 (1152000)
I0323 16:51:27.266315 26144 net.cpp:165] Memory required for data: 4922800
I0323 16:51:27.266330 26144 layer_factory.hpp:77] Creating layer pool1
I0323 16:51:27.266345 26144 net.cpp:106] Creating Layer pool1
I0323 16:51:27.266351 26144 net.cpp:454] pool1 <- conv1
I0323 16:51:27.266362 26144 net.cpp:411] pool1 -> pool1
I0323 16:51:27.266513 26144 net.cpp:150] Setting up pool1
I0323 16:51:27.266525 26144 net.cpp:157] Top shape: 100 20 12 12 (288000)
I0323 16:51:27.266530 26144 net.cpp:165] Memory required for data: 6074800
I0323 16:51:27.266535 26144 layer_factory.hpp:77] Creating layer conv2
I0323 16:51:27.266552 26144 net.cpp:106] Creating Layer conv2
I0323 16:51:27.266558 26144 net.cpp:454] conv2 <- pool1
I0323 16:51:27.266571 26144 net.cpp:411] conv2 -> conv2
I0323 16:51:27.268051 26144 net.cpp:150] Setting up conv2
I0323 16:51:27.268070 26144 net.cpp:157] Top shape: 100 50 8 8 (320000)
I0323 16:51:27.268076 26144 net.cpp:165] Memory required for data: 7354800
I0323 16:51:27.268093 26144 layer_factory.hpp:77] Creating layer pool2
I0323 16:51:27.268105 26144 net.cpp:106] Creating Layer pool2
I0323 16:51:27.268110 26144 net.cpp:454] pool2 <- conv2
I0323 16:51:27.268121 26144 net.cpp:411] pool2 -> pool2
I0323 16:51:27.268266 26144 net.cpp:150] Setting up pool2
I0323 16:51:27.268278 26144 net.cpp:157] Top shape: 100 50 4 4 (80000)
I0323 16:51:27.268291 26144 net.cpp:165] Memory required for data: 7674800
I0323 16:51:27.268297 26144 layer_factory.hpp:77] Creating layer ip1
I0323 16:51:27.268311 26144 net.cpp:106] Creating Layer ip1
I0323 16:51:27.268317 26144 net.cpp:454] ip1 <- pool2
I0323 16:51:27.268328 26144 net.cpp:411] ip1 -> ip1
I0323 16:51:27.276182 26144 net.cpp:150] Setting up ip1
I0323 16:51:27.276206 26144 net.cpp:157] Top shape: 100 500 (50000)
I0323 16:51:27.276213 26144 net.cpp:165] Memory required for data: 7874800
I0323 16:51:27.276228 26144 layer_factory.hpp:77] Creating layer relu1
I0323 16:51:27.276242 26144 net.cpp:106] Creating Layer relu1
I0323 16:51:27.276249 26144 net.cpp:454] relu1 <- ip1
I0323 16:51:27.276257 26144 net.cpp:397] relu1 -> ip1 (in-place)
I0323 16:51:27.276268 26144 net.cpp:150] Setting up relu1
I0323 16:51:27.276275 26144 net.cpp:157] Top shape: 100 500 (50000)
I0323 16:51:27.276280 26144 net.cpp:165] Memory required for data: 8074800
I0323 16:51:27.276285 26144 layer_factory.hpp:77] Creating layer ip2
I0323 16:51:27.276299 26144 net.cpp:106] Creating Layer ip2
I0323 16:51:27.276304 26144 net.cpp:454] ip2 <- ip1
I0323 16:51:27.276312 26144 net.cpp:411] ip2 -> ip2
I0323 16:51:27.276834 26144 net.cpp:150] Setting up ip2
I0323 16:51:27.276847 26144 net.cpp:157] Top shape: 100 10 (1000)
I0323 16:51:27.276854 26144 net.cpp:165] Memory required for data: 8078800
I0323 16:51:27.276862 26144 layer_factory.hpp:77] Creating layer ip2_ip2_0_split
I0323 16:51:27.276870 26144 net.cpp:106] Creating Layer ip2_ip2_0_split
I0323 16:51:27.276876 26144 net.cpp:454] ip2_ip2_0_split <- ip2
I0323 16:51:27.276892 26144 net.cpp:411] ip2_ip2_0_split -> ip2_ip2_0_split_0
I0323 16:51:27.276902 26144 net.cpp:411] ip2_ip2_0_split -> ip2_ip2_0_split_1
I0323 16:51:27.277052 26144 net.cpp:150] Setting up ip2_ip2_0_split
I0323 16:51:27.277067 26144 net.cpp:157] Top shape: 100 10 (1000)
I0323 16:51:27.277075 26144 net.cpp:157] Top shape: 100 10 (1000)
I0323 16:51:27.277079 26144 net.cpp:165] Memory required for data: 8086800
I0323 16:51:27.277084 26144 layer_factory.hpp:77] Creating layer accuracy
I0323 16:51:27.277096 26144 net.cpp:106] Creating Layer accuracy
I0323 16:51:27.277102 26144 net.cpp:454] accuracy <- ip2_ip2_0_split_0
I0323 16:51:27.277109 26144 net.cpp:454] accuracy <- label_data_1_split_0
I0323 16:51:27.277120 26144 net.cpp:411] accuracy -> accuracy
I0323 16:51:27.277137 26144 net.cpp:150] Setting up accuracy
I0323 16:51:27.277143 26144 net.cpp:157] Top shape: (1)
I0323 16:51:27.277149 26144 net.cpp:165] Memory required for data: 8086804
I0323 16:51:27.277153 26144 layer_factory.hpp:77] Creating layer loss
I0323 16:51:27.277161 26144 net.cpp:106] Creating Layer loss
I0323 16:51:27.277166 26144 net.cpp:454] loss <- ip2_ip2_0_split_1
I0323 16:51:27.277173 26144 net.cpp:454] loss <- label_data_1_split_1
I0323 16:51:27.277184 26144 net.cpp:411] loss -> loss
I0323 16:51:27.277195 26144 layer_factory.hpp:77] Creating layer loss
I0323 16:51:27.277585 26144 net.cpp:150] Setting up loss
I0323 16:51:27.277595 26144 net.cpp:157] Top shape: (1)
I0323 16:51:27.277601 26144 net.cpp:160] with loss weight 1
I0323 16:51:27.277611 26144 net.cpp:165] Memory required for data: 8086808
I0323 16:51:27.277616 26144 net.cpp:226] loss needs backward computation.
I0323 16:51:27.277622 26144 net.cpp:228] accuracy does not need backward computation.
I0323 16:51:27.277629 26144 net.cpp:226] ip2_ip2_0_split needs backward computation.
I0323 16:51:27.277634 26144 net.cpp:226] ip2 needs backward computation.
I0323 16:51:27.277639 26144 net.cpp:226] relu1 needs backward computation.
I0323 16:51:27.277644 26144 net.cpp:226] ip1 needs backward computation.
I0323 16:51:27.277649 26144 net.cpp:226] pool2 needs backward computation.
I0323 16:51:27.277654 26144 net.cpp:226] conv2 needs backward computation.
I0323 16:51:27.277660 26144 net.cpp:226] pool1 needs backward computation.
I0323 16:51:27.277665 26144 net.cpp:226] conv1 needs backward computation.
I0323 16:51:27.277672 26144 net.cpp:228] label_data_1_split does not need backward computation.
I0323 16:51:27.277688 26144 net.cpp:228] data does not need backward computation.
I0323 16:51:27.277693 26144 net.cpp:270] This network produces output accuracy
I0323 16:51:27.277699 26144 net.cpp:270] This network produces output loss
I0323 16:51:27.277719 26144 net.cpp:283] Network initialization done.
I0323 16:51:27.277781 26144 solver.cpp:60] Solver scaffolding done.
I0323 16:51:27.279260 26144 socket.cpp:219] Waiting for valid port [0]
I0323 16:51:27.279287 26164 socket.cpp:158] Assigned socket server port [58067]
I0323 16:51:27.281308 26164 socket.cpp:171] Socket Server ready [0.0.0.0]
I0323 16:51:27.289355 26144 socket.cpp:219] Waiting for valid port [58067]
I0323 16:51:27.289378 26144 socket.cpp:227] Valid port found [58067]
I0323 16:51:27.289403 26144 CaffeNet.cpp:186] Socket adapter: dlgpu10.ai.bjcc.qihoo.net:58067
I0323 16:51:27.289767 26144 CaffeNet.cpp:325] 0-th Socket addr: dlgpu10.ai.bjcc.qihoo.net:58067
I0323 16:51:27.289790 26144 CaffeNet.cpp:325] 1-th Socket addr: dlgpu10.ai.bjcc.qihoo.net:58067
I0323 16:51:27.289804 26144 CaffeNet.cpp:325] 2-th Socket addr:
I0323 16:51:27.289815 26144 CaffeNet.cpp:325] 3-th Socket addr: dlgpu10.ai.bjcc.qihoo.net:58067
I0323 16:51:27.289824 26144 CaffeNet.cpp:325] 4-th Socket addr: dlgpu10.ai.bjcc.qihoo.net:58067
I0323 16:51:27.289845 26144 JniCaffeNet.cpp:110] 0-th local addr: dlgpu10.ai.bjcc.qihoo.net:58067
I0323 16:51:27.289860 26144 JniCaffeNet.cpp:110] 1-th local addr: dlgpu10.ai.bjcc.qihoo.net:58067
I0323 16:51:27.289870 26144 JniCaffeNet.cpp:110] 2-th local addr:
I0323 16:51:27.289878 26144 JniCaffeNet.cpp:110] 3-th local addr: dlgpu10.ai.bjcc.qihoo.net:58067
I0323 16:51:27.289887 26144 JniCaffeNet.cpp:110] 4-th local addr: dlgpu10.ai.bjcc.qihoo.net:58067
16/03/23 16:51:27 INFO executor.Executor: Finished task 2.0 in stage 0.0 (TID 2). 1069 bytes result sent to driver
16/03/23 16:51:27 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 6
16/03/23 16:51:27 INFO executor.Executor: Running task 1.0 in stage 1.0 (TID 6)
16/03/23 16:51:27 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 2
16/03/23 16:51:27 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1622.0 B, free 6.9 KB)
16/03/23 16:51:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 2 took 15 ms
16/03/23 16:51:27 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.6 KB, free 9.5 KB)
16/03/23 16:51:27 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1
16/03/23 16:51:27 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 235.0 B, free 9.7 KB)
16/03/23 16:51:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 1 took 13 ms
16/03/23 16:51:27 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 12.8 KB)
I0323 16:51:27.553778 26144 common.cpp:61] 2-th string is NULL
I0323 16:51:27.553853 26144 socket.cpp:250] Trying to connect with ...[dlgpu19.ai.bjcc.qihoo.net:57402]
I0323 16:51:27.554154 26144 socket.cpp:309] Connected to server [dlgpu19.ai.bjcc.qihoo.net:57402] with client_fd [315]
I0323 16:51:37.554280 26144 socket.cpp:250] Trying to connect with ...[dlgpu19.ai.bjcc.qihoo.net:41630]
I0323 16:51:37.554639 26144 socket.cpp:309] Connected to server [dlgpu19.ai.bjcc.qihoo.net:41630] with client_fd [316]
I0323 16:51:37.559141 26164 socket.cpp:184] Accepted the connection from client [dlgpu19.ai.bjcc.qihoo.net]
I0323 16:51:37.561334 26164 socket.cpp:184] Accepted the connection from client [dlgpu19.ai.bjcc.qihoo.net]
I0323 16:51:47.553865 26164 socket.cpp:184] Accepted the connection from client [dlgpu10.ai.bjcc.qihoo.net]
I0323 16:51:47.554746 26144 socket.cpp:250] Trying to connect with ...[dlgpu20.ai.bjcc.qihoo.net:47155]
I0323 16:51:47.555070 26144 socket.cpp:309] Connected to server [dlgpu20.ai.bjcc.qihoo.net:47155] with client_fd [320]
I0323 16:51:47.558022 26164 socket.cpp:184] Accepted the connection from client [dlgpu20.ai.bjcc.qihoo.net]
I0323 16:51:57.555244 26144 socket.cpp:250] Trying to connect with ...[dlgpu10.ai.bjcc.qihoo.net:46110]
I0323 16:51:57.555477 26144 socket.cpp:309] Connected to server [dlgpu10.ai.bjcc.qihoo.net:46110] with client_fd [322]
I0323 16:52:07.601250 26144 parallel.cpp:392] GPUs pairs 0:1, 2:3, 0:2, 0:4
I0323 16:52:07.675889 26144 parallel.cpp:234] GPU 4 does not have p2p access to GPU 0
I0323 16:52:07.709733 26291 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
16/03/23 16:52:07 INFO executor.Executor: Finished task 1.0 in stage 1.0 (TID 6). 918 bytes result sent to driver
I0323 16:52:08.261286 26294 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
I0323 16:52:08.296454 26303 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
I0323 16:52:08.299464 26308 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
I0323 16:52:08.301334 26301 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
16/03/23 16:52:08 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 14
16/03/23 16:52:08 INFO executor.Executor: Running task 4.0 in stage 2.0 (TID 14)
16/03/23 16:52:08 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
16/03/23 16:52:08 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1450.0 B, free 4.7 KB)
16/03/23 16:52:08 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 14 ms
16/03/23 16:52:08 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.2 KB, free 6.9 KB)
16/03/23 16:52:08 INFO spark.CacheManager: Partition rdd_5_4 not found, computing it
16/03/23 16:52:08 INFO spark.CacheManager: Partition rdd_0_4 not found, computing it
16/03/23 16:52:08 INFO caffe.LmdbRDD: Processing partition 4
16/03/23 16:52:10 INFO caffe.LmdbRDD: Completed partition 4
16/03/23 16:52:10 INFO storage.BlockManager: Found block rdd_0_4 locally
16/03/23 16:53:58 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown
16/03/23 16:53:58 INFO storage.MemoryStore: MemoryStore cleared
16/03/23 16:53:58 INFO storage.BlockManager: BlockManager stopped
16/03/23 16:53:58 WARN executor.CoarseGrainedExecutorBackend: An unknown (dlgpu20.ai.bjcc.qihoo.net:45855) driver disconnected.
16/03/23 16:53:58 ERROR executor.CoarseGrainedExecutorBackend: Driver 10.142.118.172:45855 disassociated! Shutting down.
16/03/23 16:53:58 INFO util.ShutdownHookManager: Shutdown hook called
16/03/23 16:53:58 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.

LogType:stdout
Log Upload Time:23-Mar-2016 16:53:58
LogLength:0
Log Contents:

Container: container_1458635993280_0010_01_000007 on dlgpu19.ai.bjcc.qihoo.net_33935

LogType:stderr
Log Upload Time:23-Mar-2016 16:53:59
LogLength:32045
Log Contents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/dev/hadoop/nm-local-dir/usercache/hadoop/filecache/11/spark-assembly-1.6.0-hadoop2.6.4-U4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/software/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/03/23 16:51:14 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
16/03/23 16:51:14 INFO yarn.YarnSparkHadoopUtil: Set ugi in YarnConf to: hadoop,hadoop
16/03/23 16:51:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/23 16:51:14 INFO yarn.YarnSparkHadoopUtil: running as user: hadoop
16/03/23 16:51:14 INFO spark.SecurityManager: Changing view acls to: hadoop
16/03/23 16:51:14 INFO spark.SecurityManager: Changing modify acls to: hadoop
16/03/23 16:51:14 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
16/03/23 16:51:15 INFO spark.SecurityManager: Changing view acls to: hadoop
16/03/23 16:51:15 INFO spark.SecurityManager: Changing modify acls to: hadoop
16/03/23 16:51:15 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
16/03/23 16:51:15 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/03/23 16:51:15 INFO Remoting: Starting remoting
16/03/23 16:51:16 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:42763]
16/03/23 16:51:16 INFO util.Utils: Successfully started service 'sparkExecutorActorSystem' on port 42763.
16/03/23 16:51:16 INFO storage.DiskBlockManager: Created local directory at /home/hadoop/yarn/nm-local-dir/usercache/hadoop/appcache/application_1458635993280_0010/blockmgr-9024c2f5-2a3f-4a1a-99ab-76d15f01278b
16/03/23 16:51:16 INFO storage.DiskBlockManager: Created local directory at /dev/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1458635993280_0010/blockmgr-ac7b2147-3d49-41d9-9c23-7d95ee0cdd1c
16/03/23 16:51:16 INFO storage.DiskBlockManager: Created local directory at /run/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1458635993280_0010/blockmgr-2cbded0d-1350-44a2-9a31-f91a452b59f7
16/03/23 16:51:16 INFO storage.MemoryStore: MemoryStore started with capacity 24.9 GB
16/03/23 16:51:16 INFO yarn.YarnSparkHadoopUtil: Set ugi in YarnConf to: hadoop,hadoop
16/03/23 16:51:16 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@10.142.118.172:45855
16/03/23 16:51:16 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
16/03/23 16:51:16 INFO executor.Executor: Starting executor ID 5 on host dlgpu19.ai.bjcc.qihoo.net
16/03/23 16:51:16 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40461.
16/03/23 16:51:16 INFO netty.NettyBlockTransferService: Server created on 40461
16/03/23 16:51:16 INFO storage.BlockManager: external shuffle service port = 7337
16/03/23 16:51:16 INFO storage.BlockManagerMaster: Trying to register BlockManager
16/03/23 16:51:16 INFO storage.BlockManagerMaster: Registered BlockManager
16/03/23 16:51:16 INFO storage.BlockManager: Registering executor with local external shuffle service.
16/03/23 16:51:19 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 0
16/03/23 16:51:19 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
16/03/23 16:51:19 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
16/03/23 16:51:20 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.1 KB, free 2.1 KB)
16/03/23 16:51:20 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 292 ms
16/03/23 16:51:20 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.2 KB, free 5.3 KB)
16/03/23 16:51:20 INFO caffe.CaffeProcessor: my rank is 0
16/03/23 16:51:20 INFO caffe.LMDB: Batch size:64
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0323 16:51:22.875790 1378 CaffeNet.cpp:78] set root solver device id to 0
I0323 16:51:23.038626 1378 solver.cpp:48] Initializing solver from parameters:
test_iter: 1
test_interval: 10001
base_lr: 0.01
display: 100
max_iter: 10000
lr_policy: "inv"
gamma: 0.0001
power: 0.75
momentum: 0.9
weight_decay: 0.0005
snapshot: 10001
snapshot_prefix: "mnist_lenet"
solver_mode: GPU
device_id: 0
net_param {
name: "LeNet"
layer {
name: "data"
type: "MemoryData"
top: "data"
top: "label"
include {
phase: TRAIN
}
memory_data_param {
batch_size: 64
channels: 1
height: 28
width: 28
share_in_parallel: false
source: "hdfs:///projects/machine_learning/image_dataset/mnist_train_lmdb/"
}
source_class: "com.yahoo.ml.caffe.LMDB"
}
layer {
name: "data"
type: "MemoryData"
top: "data"
top: "label"
include {
phase: TEST
}
memory_data_param {
batch_size: 100
channels: 1
height: 28
width: 28
share_in_parallel: false
source: "hdfs:///projects/machine_learning/image_dataset/mnist_test_lmdb/"
}
source_class: "com.yahoo.ml.caffe.LMDB"
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 20
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool1"
type: "Pooling"
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "conv2"
type: "Convolution"
bottom: "pool1"
top: "conv2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 50
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool2"
type: "Pooling"
bottom: "conv2"
top: "pool2"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "ip1"
type: "InnerProduct"
bottom: "pool2"
top: "ip1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 500
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu1"
type: "ReLU"
bottom: "ip1"
top: "ip1"
}
layer {
name: "ip2"
type: "InnerProduct"
bottom: "ip1"
top: "ip2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 10
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "accuracy"
type: "Accuracy"
bottom: "ip2"
bottom: "label"
top: "accuracy"
include {
phase: TEST
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip2"
bottom: "label"
top: "loss"
}
}
test_initialization: false
I0323 16:51:23.038815 1378 solver.cpp:86] Creating training net specified in net_param.
I0323 16:51:23.038894 1378 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer data
I0323 16:51:23.038911 1378 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0323 16:51:23.039005 1378 net.cpp:49] Initializing net from parameters:
name: "LeNet"
state {
phase: TRAIN
}
layer {
name: "data"
type: "MemoryData"
top: "data"
top: "label"
include {
phase: TRAIN
}
memory_data_param {
batch_size: 64
channels: 1
height: 28
width: 28
share_in_parallel: false
source: "hdfs:///projects/machine_learning/image_dataset/mnist_train_lmdb/"
}
source_class: "com.yahoo.ml.caffe.LMDB"
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 20
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool1"
type: "Pooling"
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "conv2"
type: "Convolution"
bottom: "pool1"
top: "conv2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 50
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool2"
type: "Pooling"
bottom: "conv2"
top: "pool2"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "ip1"
type: "InnerProduct"
bottom: "pool2"
top: "ip1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 500
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu1"
type: "ReLU"
bottom: "ip1"
top: "ip1"
}
layer {
name: "ip2"
type: "InnerProduct"
bottom: "ip1"
top: "ip2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 10
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip2"
bottom: "label"
top: "loss"
}
I0323 16:51:23.039119 1378 layer_factory.hpp:77] Creating layer data
I0323 16:51:23.039150 1378 net.cpp:106] Creating Layer data
I0323 16:51:23.039162 1378 net.cpp:411] data -> data
I0323 16:51:23.039197 1378 net.cpp:411] data -> label
I0323 16:51:23.042510 1378 net.cpp:150] Setting up data
I0323 16:51:23.042541 1378 net.cpp:157] Top shape: 64 1 28 28 (50176)
I0323 16:51:23.042549 1378 net.cpp:157] Top shape: 64 (64)
I0323 16:51:23.042553 1378 net.cpp:165] Memory required for data: 200960
I0323 16:51:23.042563 1378 layer_factory.hpp:77] Creating layer conv1
I0323 16:51:23.042587 1378 net.cpp:106] Creating Layer conv1
I0323 16:51:23.042595 1378 net.cpp:454] conv1 <- data
I0323 16:51:23.042613 1378 net.cpp:411] conv1 -> conv1
I0323 16:51:23.045305 1378 net.cpp:150] Setting up conv1
I0323 16:51:23.045325 1378 net.cpp:157] Top shape: 64 20 24 24 (737280)
I0323 16:51:23.045331 1378 net.cpp:165] Memory required for data: 3150080
I0323 16:51:23.045349 1378 layer_factory.hpp:77] Creating layer pool1
I0323 16:51:23.045363 1378 net.cpp:106] Creating Layer pool1
I0323 16:51:23.045368 1378 net.cpp:454] pool1 <- conv1
I0323 16:51:23.045374 1378 net.cpp:411] pool1 -> pool1
I0323 16:51:23.045497 1378 net.cpp:150] Setting up pool1
I0323 16:51:23.045508 1378 net.cpp:157] Top shape: 64 20 12 12 (184320)
I0323 16:51:23.045512 1378 net.cpp:165] Memory required for data: 3887360
I0323 16:51:23.045516 1378 layer_factory.hpp:77] Creating layer conv2
I0323 16:51:23.045531 1378 net.cpp:106] Creating Layer conv2
I0323 16:51:23.045536 1378 net.cpp:454] conv2 <- pool1
I0323 16:51:23.045543 1378 net.cpp:411] conv2 -> conv2
I0323 16:51:23.046510 1378 net.cpp:150] Setting up conv2
I0323 16:51:23.046527 1378 net.cpp:157] Top shape: 64 50 8 8 (204800)
I0323 16:51:23.046532 1378 net.cpp:165] Memory required for data: 4706560
I0323 16:51:23.046542 1378 layer_factory.hpp:77] Creating layer pool2
I0323 16:51:23.046553 1378 net.cpp:106] Creating Layer pool2
I0323 16:51:23.046558 1378 net.cpp:454] pool2 <- conv2
I0323 16:51:23.046564 1378 net.cpp:411] pool2 -> pool2
I0323 16:51:23.046670 1378 net.cpp:150] Setting up pool2
I0323 16:51:23.046687 1378 net.cpp:157] Top shape: 64 50 4 4 (51200)
I0323 16:51:23.046692 1378 net.cpp:165] Memory required for data: 4911360
I0323 16:51:23.046696 1378 layer_factory.hpp:77] Creating layer ip1
I0323 16:51:23.046708 1378 net.cpp:106] Creating Layer ip1
I0323 16:51:23.046713 1378 net.cpp:454] ip1 <- pool2
I0323 16:51:23.046722 1378 net.cpp:411] ip1 -> ip1
I0323 16:51:23.051942 1378 net.cpp:150] Setting up ip1
I0323 16:51:23.051961 1378 net.cpp:157] Top shape: 64 500 (32000)
I0323 16:51:23.051966 1378 net.cpp:165] Memory required for data: 5039360
I0323 16:51:23.051977 1378 layer_factory.hpp:77] Creating layer relu1
I0323 16:51:23.051988 1378 net.cpp:106] Creating Layer relu1
I0323 16:51:23.051995 1378 net.cpp:454] relu1 <- ip1
I0323 16:51:23.052000 1378 net.cpp:397] relu1 -> ip1 (in-place)
I0323 16:51:23.052012 1378 net.cpp:150] Setting up relu1
I0323 16:51:23.052018 1378 net.cpp:157] Top shape: 64 500 (32000)
I0323 16:51:23.052022 1378 net.cpp:165] Memory required for data: 5167360
I0323 16:51:23.052026 1378 layer_factory.hpp:77] Creating layer ip2
I0323 16:51:23.052037 1378 net.cpp:106] Creating Layer ip2
I0323 16:51:23.052042 1378 net.cpp:454] ip2 <- ip1
I0323 16:51:23.052050 1378 net.cpp:411] ip2 -> ip2
I0323 16:51:23.053828 1378 net.cpp:150] Setting up ip2
I0323 16:51:23.053848 1378 net.cpp:157] Top shape: 64 10 (640)
I0323 16:51:23.053853 1378 net.cpp:165] Memory required for data: 5169920
I0323 16:51:23.053860 1378 layer_factory.hpp:77] Creating layer loss
I0323 16:51:23.053875 1378 net.cpp:106] Creating Layer loss
I0323 16:51:23.053881 1378 net.cpp:454] loss <- ip2
I0323 16:51:23.053886 1378 net.cpp:454] loss <- label
I0323 16:51:23.053896 1378 net.cpp:411] loss -> loss
I0323 16:51:23.053915 1378 layer_factory.hpp:77] Creating layer loss
I0323 16:51:23.054229 1378 net.cpp:150] Setting up loss
I0323 16:51:23.054244 1378 net.cpp:157] Top shape: (1)
I0323 16:51:23.054250 1378 net.cpp:160] with loss weight 1
I0323 16:51:23.054270 1378 net.cpp:165] Memory required for data: 5169924
I0323 16:51:23.054275 1378 net.cpp:226] loss needs backward computation.
I0323 16:51:23.054280 1378 net.cpp:226] ip2 needs backward computation.
I0323 16:51:23.054283 1378 net.cpp:226] relu1 needs backward computation.
I0323 16:51:23.054287 1378 net.cpp:226] ip1 needs backward computation.
I0323 16:51:23.054291 1378 net.cpp:226] pool2 needs backward computation.
I0323 16:51:23.054296 1378 net.cpp:226] conv2 needs backward computation.
I0323 16:51:23.054299 1378 net.cpp:226] pool1 needs backward computation.
I0323 16:51:23.054303 1378 net.cpp:226] conv1 needs backward computation.
I0323 16:51:23.054307 1378 net.cpp:228] data does not need backward computation.
I0323 16:51:23.054311 1378 net.cpp:270] This network produces output loss
I0323 16:51:23.054325 1378 net.cpp:283] Network initialization done.
I0323 16:51:23.054383 1378 solver.cpp:181] Creating test net (#0) specified by net_param
I0323 16:51:23.054406 1378 net.cpp:322] The NetState phase (1) differed from the phase (0) specified by a rule in layer data
I0323 16:51:23.054507 1378 net.cpp:49] Initializing net from parameters:
name: "LeNet"
state {
phase: TEST
}
layer {
name: "data"
type: "MemoryData"
top: "data"
top: "label"
include {
phase: TEST
}
memory_data_param {
batch_size: 100
channels: 1
height: 28
width: 28
share_in_parallel: false
source: "hdfs:///projects/machine_learning/image_dataset/mnist_test_lmdb/"
}
source_class: "com.yahoo.ml.caffe.LMDB"
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 20
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool1"
type: "Pooling"
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "conv2"
type: "Convolution"
bottom: "pool1"
top: "conv2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 50
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool2"
type: "Pooling"
bottom: "conv2"
top: "pool2"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "ip1"
type: "InnerProduct"
bottom: "pool2"
top: "ip1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 500
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu1"
type: "ReLU"
bottom: "ip1"
top: "ip1"
}
layer {
name: "ip2"
type: "InnerProduct"
bottom: "ip1"
top: "ip2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 10
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "accuracy"
type: "Accuracy"
bottom: "ip2"
bottom: "label"
top: "accuracy"
include {
phase: TEST
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip2"
bottom: "label"
top: "loss"
}
I0323 16:51:23.054594 1378 layer_factory.hpp:77] Creating layer data
I0323 16:51:23.054605 1378 net.cpp:106] Creating Layer data
I0323 16:51:23.054611 1378 net.cpp:411] data -> data
I0323 16:51:23.054621 1378 net.cpp:411] data -> label
I0323 16:51:23.056326 1378 net.cpp:150] Setting up data
I0323 16:51:23.056345 1378 net.cpp:157] Top shape: 100 1 28 28 (78400)
I0323 16:51:23.056352 1378 net.cpp:157] Top shape: 100 (100)
I0323 16:51:23.056356 1378 net.cpp:165] Memory required for data: 314000
I0323 16:51:23.056361 1378 layer_factory.hpp:77] Creating layer label_data_1_split
I0323 16:51:23.056372 1378 net.cpp:106] Creating Layer label_data_1_split
I0323 16:51:23.056377 1378 net.cpp:454] label_data_1_split <- label
I0323 16:51:23.056386 1378 net.cpp:411] label_data_1_split -> label_data_1_split_0
I0323 16:51:23.056396 1378 net.cpp:411] label_data_1_split -> label_data_1_split_1
I0323 16:51:23.056507 1378 net.cpp:150] Setting up label_data_1_split
I0323 16:51:23.056517 1378 net.cpp:157] Top shape: 100 (100)
I0323 16:51:23.056522 1378 net.cpp:157] Top shape: 100 (100)
I0323 16:51:23.056526 1378 net.cpp:165] Memory required for data: 314800
I0323 16:51:23.056530 1378 layer_factory.hpp:77] Creating layer conv1
I0323 16:51:23.056543 1378 net.cpp:106] Creating Layer conv1
I0323 16:51:23.056548 1378 net.cpp:454] conv1 <- data
I0323 16:51:23.056556 1378 net.cpp:411] conv1 -> conv1
I0323 16:51:23.057343 1378 net.cpp:150] Setting up conv1
I0323 16:51:23.057359 1378 net.cpp:157] Top shape: 100 20 24 24 (1152000)
I0323 16:51:23.057364 1378 net.cpp:165] Memory required for data: 4922800
I0323 16:51:23.057375 1378 layer_factory.hpp:77] Creating layer pool1
I0323 16:51:23.057385 1378 net.cpp:106] Creating Layer pool1
I0323 16:51:23.057389 1378 net.cpp:454] pool1 <- conv1
I0323 16:51:23.057396 1378 net.cpp:411] pool1 -> pool1
I0323 16:51:23.057507 1378 net.cpp:150] Setting up pool1
I0323 16:51:23.057517 1378 net.cpp:157] Top shape: 100 20 12 12 (288000)
I0323 16:51:23.057521 1378 net.cpp:165] Memory required for data: 6074800
I0323 16:51:23.057525 1378 layer_factory.hpp:77] Creating layer conv2
I0323 16:51:23.057538 1378 net.cpp:106] Creating Layer conv2
I0323 16:51:23.057543 1378 net.cpp:454] conv2 <- pool1
I0323 16:51:23.057549 1378 net.cpp:411] conv2 -> conv2
I0323 16:51:23.058568 1378 net.cpp:150] Setting up conv2
I0323 16:51:23.058583 1378 net.cpp:157] Top shape: 100 50 8 8 (320000)
I0323 16:51:23.058588 1378 net.cpp:165] Memory required for data: 7354800
I0323 16:51:23.058601 1378 layer_factory.hpp:77] Creating layer pool2
I0323 16:51:23.058609 1378 net.cpp:106] Creating Layer pool2
I0323 16:51:23.058614 1378 net.cpp:454] pool2 <- conv2
I0323 16:51:23.058619 1378 net.cpp:411] pool2 -> pool2
I0323 16:51:23.058724 1378 net.cpp:150] Setting up pool2
I0323 16:51:23.058733 1378 net.cpp:157] Top shape: 100 50 4 4 (80000)
I0323 16:51:23.058743 1378 net.cpp:165] Memory required for data: 7674800
I0323 16:51:23.058748 1378 layer_factory.hpp:77] Creating layer ip1
I0323 16:51:23.058759 1378 net.cpp:106] Creating Layer ip1
I0323 16:51:23.058764 1378 net.cpp:454] ip1 <- pool2
I0323 16:51:23.058770 1378 net.cpp:411] ip1 -> ip1
I0323 16:51:23.064004 1378 net.cpp:150] Setting up ip1
I0323 16:51:23.064024 1378 net.cpp:157] Top shape: 100 500 (50000)
I0323 16:51:23.064029 1378 net.cpp:165] Memory required for data: 7874800
I0323 16:51:23.064040 1378 layer_factory.hpp:77] Creating layer relu1
I0323 16:51:23.064051 1378 net.cpp:106] Creating Layer relu1
I0323 16:51:23.064056 1378 net.cpp:454] relu1 <- ip1
I0323 16:51:23.064062 1378 net.cpp:397] relu1 -> ip1 (in-place)
I0323 16:51:23.064070 1378 net.cpp:150] Setting up relu1
I0323 16:51:23.064076 1378 net.cpp:157] Top shape: 100 500 (50000)
I0323 16:51:23.064079 1378 net.cpp:165] Memory required for data: 8074800
I0323 16:51:23.064084 1378 layer_factory.hpp:77] Creating layer ip2
I0323 16:51:23.064092 1378 net.cpp:106] Creating Layer ip2
I0323 16:51:23.064096 1378 net.cpp:454] ip2 <- ip1
I0323 16:51:23.064106 1378 net.cpp:411] ip2 -> ip2
I0323 16:51:23.064492 1378 net.cpp:150] Setting up ip2
I0323 16:51:23.064503 1378 net.cpp:157] Top shape: 100 10 (1000)
I0323 16:51:23.064508 1378 net.cpp:165] Memory required for data: 8078800
I0323 16:51:23.064515 1378 layer_factory.hpp:77] Creating layer ip2_ip2_0_split
I0323 16:51:23.064525 1378 net.cpp:106] Creating Layer ip2_ip2_0_split
I0323 16:51:23.064530 1378 net.cpp:454] ip2_ip2_0_split <- ip2
I0323 16:51:23.064537 1378 net.cpp:411] ip2_ip2_0_split -> ip2_ip2_0_split_0
I0323 16:51:23.064546 1378 net.cpp:411] ip2_ip2_0_split -> ip2_ip2_0_split_1
I0323 16:51:23.064656 1378 net.cpp:150] Setting up ip2_ip2_0_split
I0323 16:51:23.064664 1378 net.cpp:157] Top shape: 100 10 (1000)
I0323 16:51:23.064671 1378 net.cpp:157] Top shape: 100 10 (1000)
I0323 16:51:23.064673 1378 net.cpp:165] Memory required for data: 8086800
I0323 16:51:23.064678 1378 layer_factory.hpp:77] Creating layer accuracy
I0323 16:51:23.064690 1378 net.cpp:106] Creating Layer accuracy
I0323 16:51:23.064697 1378 net.cpp:454] accuracy <- ip2_ip2_0_split_0
I0323 16:51:23.064702 1378 net.cpp:454] accuracy <- label_data_1_split_0
I0323 16:51:23.064707 1378 net.cpp:411] accuracy -> accuracy
I0323 16:51:23.064721 1378 net.cpp:150] Setting up accuracy
I0323 16:51:23.064728 1378 net.cpp:157] Top shape: (1)
I0323 16:51:23.064733 1378 net.cpp:165] Memory required for data: 8086804
I0323 16:51:23.064736 1378 layer_factory.hpp:77] Creating layer loss
I0323 16:51:23.064744 1378 net.cpp:106] Creating Layer loss
I0323 16:51:23.064749 1378 net.cpp:454] loss <- ip2_ip2_0_split_1
I0323 16:51:23.064754 1378 net.cpp:454] loss <- label_data_1_split_1
I0323 16:51:23.064759 1378 net.cpp:411] loss -> loss
I0323 16:51:23.064769 1378 layer_factory.hpp:77] Creating layer loss
I0323 16:51:23.065067 1378 net.cpp:150] Setting up loss
I0323 16:51:23.065081 1378 net.cpp:157] Top shape: (1)
I0323 16:51:23.065085 1378 net.cpp:160] with loss weight 1
I0323 16:51:23.065093 1378 net.cpp:165] Memory required for data: 8086808
I0323 16:51:23.065098 1378 net.cpp:226] loss needs backward computation.
I0323 16:51:23.065102 1378 net.cpp:228] accuracy does not need backward computation.
I0323 16:51:23.065107 1378 net.cpp:226] ip2_ip2_0_split needs backward computation.
I0323 16:51:23.065111 1378 net.cpp:226] ip2 needs backward computation.
I0323 16:51:23.065115 1378 net.cpp:226] relu1 needs backward computation.
I0323 16:51:23.065119 1378 net.cpp:226] ip1 needs backward computation.
I0323 16:51:23.065122 1378 net.cpp:226] pool2 needs backward computation.
I0323 16:51:23.065127 1378 net.cpp:226] conv2 needs backward computation.
I0323 16:51:23.065131 1378 net.cpp:226] pool1 needs backward computation.
I0323 16:51:23.065135 1378 net.cpp:226] conv1 needs backward computation.
I0323 16:51:23.065140 1378 net.cpp:228] label_data_1_split does not need backward computation.
I0323 16:51:23.065150 1378 net.cpp:228] data does not need backward computation.
I0323 16:51:23.065155 1378 net.cpp:270] This network produces output accuracy
I0323 16:51:23.065160 1378 net.cpp:270] This network produces output loss
I0323 16:51:23.065173 1378 net.cpp:283] Network initialization done.
I0323 16:51:23.065218 1378 solver.cpp:60] Solver scaffolding done.
I0323 16:51:23.066324 1378 socket.cpp:224] Waiting for valid port [0]
I0323 16:51:23.066359 1414 socket.cpp:163] Assigned socket server port [57402]
I0323 16:51:23.068367 1414 socket.cpp:176] Socket Server ready [0.0.0.0]
I0323 16:51:23.076409 1378 socket.cpp:224] Waiting for valid port [57402]
I0323 16:51:23.076423 1378 socket.cpp:232] Valid port found [57402]
I0323 16:51:23.076437 1378 CaffeNet.cpp:186] Socket adapter: dlgpu19.ai.bjcc.qihoo.net:57402
I0323 16:51:23.076686 1378 CaffeNet.cpp:325] 0-th Socket addr:
I0323 16:51:23.076701 1378 CaffeNet.cpp:325] 1-th Socket addr: dlgpu19.ai.bjcc.qihoo.net:57402
I0323 16:51:23.076707 1378 CaffeNet.cpp:325] 2-th Socket addr: dlgpu19.ai.bjcc.qihoo.net:57402
I0323 16:51:23.076711 1378 CaffeNet.cpp:325] 3-th Socket addr: dlgpu19.ai.bjcc.qihoo.net:57402
I0323 16:51:23.076715 1378 CaffeNet.cpp:325] 4-th Socket addr: dlgpu19.ai.bjcc.qihoo.net:57402
I0323 16:51:23.076730 1378 JniCaffeNet.cpp:110] 0-th local addr:
I0323 16:51:23.076736 1378 JniCaffeNet.cpp:110] 1-th local addr: dlgpu19.ai.bjcc.qihoo.net:57402
I0323 16:51:23.076740 1378 JniCaffeNet.cpp:110] 2-th local addr: dlgpu19.ai.bjcc.qihoo.net:57402
I0323 16:51:23.076745 1378 JniCaffeNet.cpp:110] 3-th local addr: dlgpu19.ai.bjcc.qihoo.net:57402
I0323 16:51:23.076748 1378 JniCaffeNet.cpp:110] 4-th local addr: dlgpu19.ai.bjcc.qihoo.net:57402
16/03/23 16:51:23 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 1069 bytes result sent to driver
16/03/23 16:51:27 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 5
16/03/23 16:51:27 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 5)
16/03/23 16:51:27 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 2
16/03/23 16:51:27 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1622.0 B, free 6.9 KB)
16/03/23 16:51:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 2 took 17 ms
16/03/23 16:51:27 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.6 KB, free 9.5 KB)
16/03/23 16:51:27 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1
16/03/23 16:51:27 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 235.0 B, free 9.7 KB)
16/03/23 16:51:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 1 took 13 ms
I0323 16:51:27.553076 1414 socket.cpp:189] Accepted the connection from client [dlgpu10.ai.bjcc.qihoo.net]
I0323 16:51:27.554393 1414 socket.cpp:189] Accepted the connection from client [dlgpu10.ai.bjcc.qihoo.net]
16/03/23 16:51:27 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 12.8 KB)
I0323 16:51:27.556990 1414 socket.cpp:189] Accepted the connection from client [dlgpu20.ai.bjcc.qihoo.net]
I0323 16:51:27.558328 1378 common.cpp:61] 0-th string is NULL
I0323 16:51:27.558416 1378 socket.cpp:255] Trying to connect with ...[dlgpu19.ai.bjcc.qihoo.net:41630]
I0323 16:51:27.558573 1378 socket.cpp:314] Connected to server [dlgpu19.ai.bjcc.qihoo.net:41630] with client_fd [318]
I0323 16:51:27.560863 1414 socket.cpp:189] Accepted the connection from client [dlgpu19.ai.bjcc.qihoo.net]
I0323 16:51:37.558701 1378 socket.cpp:255] Trying to connect with ...[dlgpu10.ai.bjcc.qihoo.net:58067]
I0323 16:51:37.559094 1378 socket.cpp:314] Connected to server [dlgpu10.ai.bjcc.qihoo.net:58067] with client_fd [320]
I0323 16:51:47.559253 1378 socket.cpp:255] Trying to connect with ...[dlgpu20.ai.bjcc.qihoo.net:47155]
I0323 16:51:47.559638 1378 socket.cpp:314] Connected to server [dlgpu20.ai.bjcc.qihoo.net:47155] with client_fd [321]
I0323 16:51:57.559800 1378 socket.cpp:255] Trying to connect with ...[dlgpu10.ai.bjcc.qihoo.net:46110]
I0323 16:51:57.560225 1378 socket.cpp:314] Connected to server [dlgpu10.ai.bjcc.qihoo.net:46110] with client_fd [322]
I0323 16:52:07.604482 1378 parallel.cpp:392] GPUs pairs 0:1, 2:3, 0:2, 0:4
I0323 16:52:07.675818 1378 parallel.cpp:234] GPU 4 does not have p2p access to GPU 0
I0323 16:52:07.740978 1756 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
16/03/23 16:52:08 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 5). 918 bytes result sent to driver
I0323 16:52:08.425240 1758 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
I0323 16:52:08.427775 1763 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
I0323 16:52:08.447262 1772 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
I0323 16:52:08.454452 1774 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
16/03/23 16:52:08 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 12
16/03/23 16:52:08 INFO executor.Executor: Running task 2.0 in stage 2.0 (TID 12)
16/03/23 16:52:08 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
16/03/23 16:52:08 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1450.0 B, free 4.7 KB)
16/03/23 16:52:08 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 18 ms
16/03/23 16:52:08 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.2 KB, free 6.9 KB)
16/03/23 16:52:08 INFO spark.CacheManager: Partition rdd_5_2 not found, computing it
16/03/23 16:52:08 INFO spark.CacheManager: Partition rdd_0_2 not found, computing it
16/03/23 16:52:08 INFO caffe.LmdbRDD: Processing partition 2
16/03/23 16:52:09 INFO caffe.LmdbRDD: Completed partition 2
16/03/23 16:52:09 INFO storage.BlockManager: Found block rdd_0_2 locally
I0323 16:52:10.079346 1424 socket.cpp:58] receive message: 3 0
I0323 16:52:10.079407 1424 socket.cpp:59] socket: 12 message header size: 12
I0323 16:52:10.159688 1378 socket.cpp:44] send message: 0 0 344864
I0323 16:52:10.161568 1378 socket.cpp:44] send message: 0 0 344864
I0323 16:52:10.162847 1427 socket.cpp:58] receive message: 1 0
I0323 16:52:10.162863 1427 socket.cpp:59] socket: 12 message header size: 12
I0323 16:52:10.169373 1378 socket.cpp:44] send message: 0 0 344864
I0323 16:52:10.214555 1378 socket.cpp:44] send message: 0 0 344864
I0323 16:52:10.938413 1423 socket.cpp:58] receive message: 2 0
I0323 16:52:10.938458 1423 socket.cpp:59] socket: 12 message header size: 12
I0323 16:52:11.040642 1422 socket.cpp:58] receive message: 4 0
I0323 16:52:11.040699 1422 socket.cpp:59] socket: 12 message header size: 12
16/03/23 16:53:58 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown
16/03/23 16:53:58 INFO storage.MemoryStore: MemoryStore cleared
16/03/23 16:53:58 INFO storage.BlockManager: BlockManager stopped
16/03/23 16:53:58 WARN executor.CoarseGrainedExecutorBackend: An unknown (dlgpu20.ai.bjcc.qihoo.net:45855) driver disconnected.
16/03/23 16:53:58 ERROR executor.CoarseGrainedExecutorBackend: Driver 10.142.118.172:45855 disassociated! Shutting down.
16/03/23 16:53:58 INFO util.ShutdownHookManager: Shutdown hook called
16/03/23 16:53:58 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.

LogType:stdout
Log Upload Time:23-Mar-2016 16:53:59
LogLength:0
Log Contents:

Container: container_1458635993280_0010_01_000004 on dlgpu19.ai.bjcc.qihoo.net_33935

LogType:stderr
Log Upload Time:23-Mar-2016 16:53:59
LogLength:32045
Log Contents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/dev/hadoop/nm-local-dir/usercache/hadoop/filecache/11/spark-assembly-1.6.0-hadoop2.6.4-U4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/software/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/03/23 16:51:13 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
16/03/23 16:51:13 INFO yarn.YarnSparkHadoopUtil: Set ugi in YarnConf to: hadoop,hadoop
16/03/23 16:51:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/23 16:51:14 INFO yarn.YarnSparkHadoopUtil: running as user: hadoop
16/03/23 16:51:14 INFO spark.SecurityManager: Changing view acls to: hadoop
16/03/23 16:51:14 INFO spark.SecurityManager: Changing modify acls to: hadoop
16/03/23 16:51:14 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
16/03/23 16:51:14 INFO spark.SecurityManager: Changing view acls to: hadoop
16/03/23 16:51:14 INFO spark.SecurityManager: Changing modify acls to: hadoop
16/03/23 16:51:14 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
16/03/23 16:51:15 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/03/23 16:51:15 INFO Remoting: Starting remoting
16/03/23 16:51:15 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutorActorSystem@dlgpu19.ai.bjcc.qihoo.net:35824]
16/03/23 16:51:15 INFO util.Utils: Successfully started service 'sparkExecutorActorSystem' on port 35824.
16/03/23 16:51:15 INFO storage.DiskBlockManager: Created local directory at /home/hadoop/yarn/nm-local-dir/usercache/hadoop/appcache/application_1458635993280_0010/blockmgr-b5a31fe0-8b93-47bd-b631-81a11b0a4124
16/03/23 16:51:15 INFO storage.DiskBlockManager: Created local directory at /dev/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1458635993280_0010/blockmgr-159de318-affd-49f0-839f-46ea110a371b
16/03/23 16:51:15 INFO storage.DiskBlockManager: Created local directory at /run/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1458635993280_0010/blockmgr-1801ddca-5d6e-4b58-913e-cf72dd8cb4d8
16/03/23 16:51:15 INFO storage.MemoryStore: MemoryStore started with capacity 24.9 GB
16/03/23 16:51:15 INFO yarn.YarnSparkHadoopUtil: Set ugi in YarnConf to: hadoop,hadoop
16/03/23 16:51:15 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@10.142.118.172:45855
16/03/23 16:51:15 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
16/03/23 16:51:15 INFO executor.Executor: Starting executor ID 3 on host dlgpu19.ai.bjcc.qihoo.net
16/03/23 16:51:15 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41146.
16/03/23 16:51:15 INFO netty.NettyBlockTransferService: Server created on 41146
16/03/23 16:51:15 INFO storage.BlockManager: external shuffle service port = 7337
16/03/23 16:51:15 INFO storage.BlockManagerMaster: Trying to register BlockManager
16/03/23 16:51:15 INFO storage.BlockManagerMaster: Registered BlockManager
16/03/23 16:51:15 INFO storage.BlockManager: Registering executor with local external shuffle service.
16/03/23 16:51:19 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1
16/03/23 16:51:19 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
16/03/23 16:51:19 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
16/03/23 16:51:20 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.1 KB, free 2.1 KB)
16/03/23 16:51:20 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 283 ms
16/03/23 16:51:20 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.2 KB, free 5.3 KB)
16/03/23 16:51:20 INFO caffe.CaffeProcessor: my rank is 1
16/03/23 16:51:20 INFO caffe.LMDB: Batch size:64
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0323 16:51:22.752178 1377 CaffeNet.cpp:78] set root solver device id to 0
I0323 16:51:22.980456 1377 solver.cpp:48] Initializing solver from parameters:
test_iter: 1
test_interval: 10001
base_lr: 0.01
display: 100
max_iter: 10000
lr_policy: "inv"
gamma: 0.0001
power: 0.75
momentum: 0.9
weight_decay: 0.0005
snapshot: 10001
snapshot_prefix: "mnist_lenet"
solver_mode: GPU
device_id: 0
net_param {
name: "LeNet"
layer {
name: "data"
type: "MemoryData"
top: "data"
top: "label"
include {
phase: TRAIN
}
memory_data_param {
batch_size: 64
channels: 1
height: 28
width: 28
share_in_parallel: false
source: "hdfs:///projects/machine_learning/image_dataset/mnist_train_lmdb/"
}
source_class: "com.yahoo.ml.caffe.LMDB"
}
layer {
name: "data"
type: "MemoryData"
top: "data"
top: "label"
include {
phase: TEST
}
memory_data_param {
batch_size: 100
channels: 1
height: 28
width: 28
share_in_parallel: false
source: "hdfs:///projects/machine_learning/image_dataset/mnist_test_lmdb/"
}
source_class: "com.yahoo.ml.caffe.LMDB"
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 20
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool1"
type: "Pooling"
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "conv2"
type: "Convolution"
bottom: "pool1"
top: "conv2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 50
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool2"
type: "Pooling"
bottom: "conv2"
top: "pool2"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "ip1"
type: "InnerProduct"
bottom: "pool2"
top: "ip1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 500
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu1"
type: "ReLU"
bottom: "ip1"
top: "ip1"
}
layer {
name: "ip2"
type: "InnerProduct"
bottom: "ip1"
top: "ip2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 10
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "accuracy"
type: "Accuracy"
bottom: "ip2"
bottom: "label"
top: "accuracy"
include {
phase: TEST
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip2"
bottom: "label"
top: "loss"
}
}
test_initialization: false
I0323 16:51:22.980608 1377 solver.cpp:86] Creating training net specified in net_param.
I0323 16:51:22.980680 1377 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer data
I0323 16:51:22.980692 1377 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0323 16:51:22.980774 1377 net.cpp:49] Initializing net from parameters:
name: "LeNet"
state {
phase: TRAIN
}
layer {
name: "data"
type: "MemoryData"
top: "data"
top: "label"
include {
phase: TRAIN
}
memory_data_param {
batch_size: 64
channels: 1
height: 28
width: 28
share_in_parallel: false
source: "hdfs:///projects/machine_learning/image_dataset/mnist_train_lmdb/"
}
source_class: "com.yahoo.ml.caffe.LMDB"
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 20
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool1"
type: "Pooling"
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "conv2"
type: "Convolution"
bottom: "pool1"
top: "conv2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 50
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool2"
type: "Pooling"
bottom: "conv2"
top: "pool2"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "ip1"
type: "InnerProduct"
bottom: "pool2"
top: "ip1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 500
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu1"
type: "ReLU"
bottom: "ip1"
top: "ip1"
}
layer {
name: "ip2"
type: "InnerProduct"
bottom: "ip1"
top: "ip2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 10
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip2"
bottom: "label"
top: "loss"
}
I0323 16:51:22.980877 1377 layer_factory.hpp:77] Creating layer data
I0323 16:51:22.980906 1377 net.cpp:106] Creating Layer data
I0323 16:51:22.980917 1377 net.cpp:411] data -> data
I0323 16:51:22.980954 1377 net.cpp:411] data -> label
I0323 16:51:22.984134 1377 net.cpp:150] Setting up data
I0323 16:51:22.984165 1377 net.cpp:157] Top shape: 64 1 28 28 (50176)
I0323 16:51:22.984174 1377 net.cpp:157] Top shape: 64 (64)
I0323 16:51:22.984177 1377 net.cpp:165] Memory required for data: 200960
I0323 16:51:22.984187 1377 layer_factory.hpp:77] Creating layer conv1
I0323 16:51:22.984207 1377 net.cpp:106] Creating Layer conv1
I0323 16:51:22.984215 1377 net.cpp:454] conv1 <- data
I0323 16:51:22.984228 1377 net.cpp:411] conv1 -> conv1
I0323 16:51:22.986796 1377 net.cpp:150] Setting up conv1
I0323 16:51:22.986815 1377 net.cpp:157] Top shape: 64 20 24 24 (737280)
I0323 16:51:22.986821 1377 net.cpp:165] Memory required for data: 3150080
I0323 16:51:22.986840 1377 layer_factory.hpp:77] Creating layer pool1
I0323 16:51:22.986851 1377 net.cpp:106] Creating Layer pool1
I0323 16:51:22.986856 1377 net.cpp:454] pool1 <- conv1
I0323 16:51:22.986863 1377 net.cpp:411] pool1 -> pool1
I0323 16:51:22.986991 1377 net.cpp:150] Setting up pool1
I0323 16:51:22.987004 1377 net.cpp:157] Top shape: 64 20 12 12 (184320)
I0323 16:51:22.987009 1377 net.cpp:165] Memory required for data: 3887360
I0323 16:51:22.987013 1377 layer_factory.hpp:77] Creating layer conv2
I0323 16:51:22.987023 1377 net.cpp:106] Creating Layer conv2
I0323 16:51:22.987027 1377 net.cpp:454] conv2 <- pool1
I0323 16:51:22.987035 1377 net.cpp:411] conv2 -> conv2
I0323 16:51:22.987975 1377 net.cpp:150] Setting up conv2
I0323 16:51:22.987990 1377 net.cpp:157] Top shape: 64 50 8 8 (204800)
I0323 16:51:22.987995 1377 net.cpp:165] Memory required for data: 4706560
I0323 16:51:22.988005 1377 layer_factory.hpp:77] Creating layer pool2
I0323 16:51:22.988013 1377 net.cpp:106] Creating Layer pool2
I0323 16:51:22.988018 1377 net.cpp:454] pool2 <- conv2
I0323 16:51:22.988024 1377 net.cpp:411] pool2 -> pool2
I0323 16:51:22.988122 1377 net.cpp:150] Setting up pool2
I0323 16:51:22.988136 1377 net.cpp:157] Top shape: 64 50 4 4 (51200)
I0323 16:51:22.988140 1377 net.cpp:165] Memory required for data: 4911360
I0323 16:51:22.988144 1377 layer_factory.hpp:77] Creating layer ip1
I0323 16:51:22.988157 1377 net.cpp:106] Creating Layer ip1
I0323 16:51:22.988160 1377 net.cpp:454] ip1 <- pool2
I0323 16:51:22.988168 1377 net.cpp:411] ip1 -> ip1
I0323 16:51:22.993342 1377 net.cpp:150] Setting up ip1
I0323 16:51:22.993361 1377 net.cpp:157] Top shape: 64 500 (32000)
I0323 16:51:22.993366 1377 net.cpp:165] Memory required for data: 5039360
I0323 16:51:22.993377 1377 layer_factory.hpp:77] Creating layer relu1
I0323 16:51:22.993387 1377 net.cpp:106] Creating Layer relu1
I0323 16:51:22.993393 1377 net.cpp:454] relu1 <- ip1
I0323 16:51:22.993399 1377 net.cpp:397] relu1 -> ip1 (in-place)
I0323 16:51:22.993410 1377 net.cpp:150] Setting up relu1
I0323 16:51:22.993415 1377 net.cpp:157] Top shape: 64 500 (32000)
I0323 16:51:22.993419 1377 net.cpp:165] Memory required for data: 5167360
I0323 16:51:22.993423 1377 layer_factory.hpp:77] Creating layer ip2
I0323 16:51:22.993430 1377 net.cpp:106] Creating Layer ip2
I0323 16:51:22.993434 1377 net.cpp:454] ip2 <- ip1
I0323 16:51:22.993441 1377 net.cpp:411] ip2 -> ip2
I0323 16:51:22.995192 1377 net.cpp:150] Setting up ip2
I0323 16:51:22.995210 1377 net.cpp:157] Top shape: 64 10 (640)
I0323 16:51:22.995216 1377 net.cpp:165] Memory required for data: 5169920
I0323 16:51:22.995225 1377 layer_factory.hpp:77] Creating layer loss
I0323 16:51:22.995237 1377 net.cpp:106] Creating Layer loss
I0323 16:51:22.995242 1377 net.cpp:454] loss <- ip2
I0323 16:51:22.995247 1377 net.cpp:454] loss <- label
I0323 16:51:22.995256 1377 net.cpp:411] loss -> loss
I0323 16:51:22.995275 1377 layer_factory.hpp:77] Creating layer loss
I0323 16:51:22.995570 1377 net.cpp:150] Setting up loss
I0323 16:51:22.995580 1377 net.cpp:157] Top shape: (1)
I0323 16:51:22.995584 1377 net.cpp:160] with loss weight 1
I0323 16:51:22.995605 1377 net.cpp:165] Memory required for data: 5169924
I0323 16:51:22.995610 1377 net.cpp:226] loss needs backward computation.
I0323 16:51:22.995615 1377 net.cpp:226] ip2 needs backward computation.
I0323 16:51:22.995620 1377 net.cpp:226] relu1 needs backward computation.
I0323 16:51:22.995623 1377 net.cpp:226] ip1 needs backward computation.
I0323 16:51:22.995626 1377 net.cpp:226] pool2 needs backward computation.
I0323 16:51:22.995630 1377 net.cpp:226] conv2 needs backward computation.
I0323 16:51:22.995635 1377 net.cpp:226] pool1 needs backward computation.
I0323 16:51:22.995638 1377 net.cpp:226] conv1 needs backward computation.
I0323 16:51:22.995643 1377 net.cpp:228] data does not need backward computation.
I0323 16:51:22.995646 1377 net.cpp:270] This network produces output loss
I0323 16:51:22.995657 1377 net.cpp:283] Network initialization done.
I0323 16:51:22.995712 1377 solver.cpp:181] Creating test net (#0) specified by net_param
I0323 16:51:22.995733 1377 net.cpp:322] The NetState phase (1) differed from the phase (0) specified by a rule in layer data
I0323 16:51:22.995823 1377 net.cpp:49] Initializing net from parameters:
name: "LeNet"
state {
phase: TEST
}
layer {
name: "data"
type: "MemoryData"
top: "data"
top: "label"
include {
phase: TEST
}
memory_data_param {
batch_size: 100
channels: 1
height: 28
width: 28
share_in_parallel: false
source: "hdfs:///projects/machine_learning/image_dataset/mnist_test_lmdb/"
}
source_class: "com.yahoo.ml.caffe.LMDB"
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 20
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool1"
type: "Pooling"
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "conv2"
type: "Convolution"
bottom: "pool1"
top: "conv2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 50
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool2"
type: "Pooling"
bottom: "conv2"
top: "pool2"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "ip1"
type: "InnerProduct"
bottom: "pool2"
top: "ip1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 500
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu1"
type: "ReLU"
bottom: "ip1"
top: "ip1"
}
layer {
name: "ip2"
type: "InnerProduct"
bottom: "ip1"
top: "ip2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 10
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "accuracy"
type: "Accuracy"
bottom: "ip2"
bottom: "label"
top: "accuracy"
include {
phase: TEST
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip2"
bottom: "label"
top: "loss"
}
I0323 16:51:22.995904 1377 layer_factory.hpp:77] Creating layer data
I0323 16:51:22.995915 1377 net.cpp:106] Creating Layer data
I0323 16:51:22.995921 1377 net.cpp:411] data -> data
I0323 16:51:22.995930 1377 net.cpp:411] data -> label
I0323 16:51:22.997612 1377 net.cpp:150] Setting up data
I0323 16:51:22.997632 1377 net.cpp:157] Top shape: 100 1 28 28 (78400)
I0323 16:51:22.997638 1377 net.cpp:157] Top shape: 100 (100)
I0323 16:51:22.997643 1377 net.cpp:165] Memory required for data: 314000
I0323 16:51:22.997648 1377 layer_factory.hpp:77] Creating layer label_data_1_split
I0323 16:51:22.997659 1377 net.cpp:106] Creating Layer label_data_1_split
I0323 16:51:22.997664 1377 net.cpp:454] label_data_1_split <- label
I0323 16:51:22.997670 1377 net.cpp:411] label_data_1_split -> label_data_1_split_0
I0323 16:51:22.997679 1377 net.cpp:411] label_data_1_split -> label_data_1_split_1
I0323 16:51:22.997789 1377 net.cpp:150] Setting up label_data_1_split
I0323 16:51:22.997797 1377 net.cpp:157] Top shape: 100 (100)
I0323 16:51:22.997802 1377 net.cpp:157] Top shape: 100 (100)
I0323 16:51:22.997807 1377 net.cpp:165] Memory required for data: 314800
I0323 16:51:22.997810 1377 layer_factory.hpp:77] Creating layer conv1
I0323 16:51:22.997822 1377 net.cpp:106] Creating Layer conv1
I0323 16:51:22.997827 1377 net.cpp:454] conv1 <- data
I0323 16:51:22.997833 1377 net.cpp:411] conv1 -> conv1
I0323 16:51:22.998605 1377 net.cpp:150] Setting up conv1
I0323 16:51:22.998620 1377 net.cpp:157] Top shape: 100 20 24 24 (1152000)
I0323 16:51:22.998625 1377 net.cpp:165] Memory required for data: 4922800
I0323 16:51:22.998636 1377 layer_factory.hpp:77] Creating layer pool1
I0323 16:51:22.998642 1377 net.cpp:106] Creating Layer pool1
I0323 16:51:22.998647 1377 net.cpp:454] pool1 <- conv1
I0323 16:51:22.998653 1377 net.cpp:411] pool1 -> pool1
I0323 16:51:22.998760 1377 net.cpp:150] Setting up pool1
I0323 16:51:22.998769 1377 net.cpp:157] Top shape: 100 20 12 12 (288000)
I0323 16:51:22.998772 1377 net.cpp:165] Memory required for data: 6074800
I0323 16:51:22.998777 1377 layer_factory.hpp:77] Creating layer conv2
I0323 16:51:22.998787 1377 net.cpp:106] Creating Layer conv2
I0323 16:51:22.998792 1377 net.cpp:454] conv2 <- pool1
I0323 16:51:22.998800 1377 net.cpp:411] conv2 -> conv2
I0323 16:51:22.999790 1377 net.cpp:150] Setting up conv2
I0323 16:51:22.999804 1377 net.cpp:157] Top shape: 100 50 8 8 (320000)
I0323 16:51:22.999809 1377 net.cpp:165] Memory required for data: 7354800
I0323 16:51:22.999819 1377 layer_factory.hpp:77] Creating layer pool2
I0323 16:51:22.999825 1377 net.cpp:106] Creating Layer pool2
I0323 16:51:22.999830 1377 net.cpp:454] pool2 <- conv2
I0323 16:51:22.999836 1377 net.cpp:411] pool2 -> pool2
I0323 16:51:22.999944 1377 net.cpp:150] Setting up pool2
I0323 16:51:22.999956 1377 net.cpp:157] Top shape: 100 50 4 4 (80000)
I0323 16:51:22.999968 1377 net.cpp:165] Memory required for data: 7674800
I0323 16:51:22.999972 1377 layer_factory.hpp:77] Creating layer ip1
I0323 16:51:22.999980 1377 net.cpp:106] Creating Layer ip1
I0323 16:51:22.999985 1377 net.cpp:454] ip1 <- pool2
I0323 16:51:22.999992 1377 net.cpp:411] ip1 -> ip1
I0323 16:51:23.005182 1377 net.cpp:150] Setting up ip1
I0323 16:51:23.005201 1377 net.cpp:157] Top shape: 100 500 (50000)
I0323 16:51:23.005208 1377 net.cpp:165] Memory required for data: 7874800
I0323 16:51:23.005218 1377 layer_factory.hpp:77] Creating layer relu1
I0323 16:51:23.005225 1377 net.cpp:106] Creating Layer relu1
I0323 16:51:23.005229 1377 net.cpp:454] relu1 <- ip1
I0323 16:51:23.005235 1377 net.cpp:397] relu1 -> ip1 (in-place)
I0323 16:51:23.005244 1377 net.cpp:150] Setting up relu1
I0323 16:51:23.005249 1377 net.cpp:157] Top shape: 100 500 (50000)
I0323 16:51:23.005252 1377 net.cpp:165] Memory required for data: 8074800
I0323 16:51:23.005256 1377 layer_factory.hpp:77] Creating layer ip2
I0323 16:51:23.005264 1377 net.cpp:106] Creating Layer ip2
I0323 16:51:23.005269 1377 net.cpp:454] ip2 <- ip1
I0323 16:51:23.005275 1377 net.cpp:411] ip2 -> ip2
I0323 16:51:23.005650 1377 net.cpp:150] Setting up ip2
I0323 16:51:23.005661 1377 net.cpp:157] Top shape: 100 10 (1000)
I0323 16:51:23.005664 1377 net.cpp:165] Memory required for data: 8078800
I0323 16:51:23.005671 1377 layer_factory.hpp:77] Creating layer ip2_ip2_0_split
I0323 16:51:23.005678 1377 net.cpp:106] Creating Layer ip2_ip2_0_split
I0323 16:51:23.005683 1377 net.cpp:454] ip2_ip2_0_split <- ip2
I0323 16:51:23.005689 1377 net.cpp:411] ip2_ip2_0_split -> ip2_ip2_0_split_0
I0323 16:51:23.005697 1377 net.cpp:411] ip2_ip2_0_split -> ip2_ip2_0_split_1
I0323 16:51:23.005798 1377 net.cpp:150] Setting up ip2_ip2_0_split
I0323 16:51:23.005806 1377 net.cpp:157] Top shape: 100 10 (1000)
I0323 16:51:23.005811 1377 net.cpp:157] Top shape: 100 10 (1000)
I0323 16:51:23.005815 1377 net.cpp:165] Memory required for data: 8086800
I0323 16:51:23.005820 1377 layer_factory.hpp:77] Creating layer accuracy
I0323 16:51:23.005830 1377 net.cpp:106] Creating Layer accuracy
I0323 16:51:23.005833 1377 net.cpp:454] accuracy <- ip2_ip2_0_split_0
I0323 16:51:23.005838 1377 net.cpp:454] accuracy <- label_data_1_split_0
I0323 16:51:23.005846 1377 net.cpp:411] accuracy -> accuracy
I0323 16:51:23.005856 1377 net.cpp:150] Setting up accuracy
I0323 16:51:23.005862 1377 net.cpp:157] Top shape: (1)
I0323 16:51:23.005867 1377 net.cpp:165] Memory required for data: 8086804
I0323 16:51:23.005870 1377 layer_factory.hpp:77] Creating layer loss
I0323 16:51:23.005877 1377 net.cpp:106] Creating Layer loss
I0323 16:51:23.005880 1377 net.cpp:454] loss <- ip2_ip2_0_split_1
I0323 16:51:23.005885 1377 net.cpp:454] loss <- label_data_1_split_1
I0323 16:51:23.005892 1377 net.cpp:411] loss -> loss
I0323 16:51:23.005899 1377 layer_factory.hpp:77] Creating layer loss
I0323 16:51:23.006204 1377 net.cpp:150] Setting up loss
I0323 16:51:23.006217 1377 net.cpp:157] Top shape: (1)
I0323 16:51:23.006222 1377 net.cpp:160] with loss weight 1
I0323 16:51:23.006229 1377 net.cpp:165] Memory required for data: 8086808
I0323 16:51:23.006233 1377 net.cpp:226] loss needs backward computation.
I0323 16:51:23.006238 1377 net.cpp:228] accuracy does not need backward computation.
I0323 16:51:23.006243 1377 net.cpp:226] ip2_ip2_0_split needs backward computation.
I0323 16:51:23.006247 1377 net.cpp:226] ip2 needs backward computation.
I0323 16:51:23.006252 1377 net.cpp:226] relu1 needs backward computation.
I0323 16:51:23.006255 1377 net.cpp:226] ip1 needs backward computation.
I0323 16:51:23.006259 1377 net.cpp:226] pool2 needs backward computation.
I0323 16:51:23.006263 1377 net.cpp:226] conv2 needs backward computation.
I0323 16:51:23.006268 1377 net.cpp:226] pool1 needs backward computation.
I0323 16:51:23.006271 1377 net.cpp:226] conv1 needs backward computation.
I0323 16:51:23.006276 1377 net.cpp:228] label_data_1_split does not need backward computation.
I0323 16:51:23.006288 1377 net.cpp:228] data does not need backward computation.
I0323 16:51:23.006291 1377 net.cpp:270] This network produces output accuracy
I0323 16:51:23.006295 1377 net.cpp:270] This network produces output loss
I0323 16:51:23.006310 1377 net.cpp:283] Network initialization done.
I0323 16:51:23.006355 1377 solver.cpp:60] Solver scaffolding done.
I0323 16:51:23.007391 1377 socket.cpp:224] Waiting for valid port [0]
I0323 16:51:23.007429 1413 socket.cpp:163] Assigned socket server port [41630]
I0323 16:51:23.009763 1413 socket.cpp:176] Socket Server ready [0.0.0.0]
I0323 16:51:23.017469 1377 socket.cpp:224] Waiting for valid port [41630]
I0323 16:51:23.017478 1377 socket.cpp:232] Valid port found [41630]
I0323 16:51:23.017490 1377 CaffeNet.cpp:186] Socket adapter: dlgpu19.ai.bjcc.qihoo.net:41630
I0323 16:51:23.017716 1377 CaffeNet.cpp:325] 0-th Socket addr: dlgpu19.ai.bjcc.qihoo.net:41630
I0323 16:51:23.017729 1377 CaffeNet.cpp:325] 1-th Socket addr:
I0323 16:51:23.017734 1377 CaffeNet.cpp:325] 2-th Socket addr: dlgpu19.ai.bjcc.qihoo.net:41630
I0323 16:51:23.017737 1377 CaffeNet.cpp:325] 3-th Socket addr: dlgpu19.ai.bjcc.qihoo.net:41630
I0323 16:51:23.017742 1377 CaffeNet.cpp:325] 4-th Socket addr: dlgpu19.ai.bjcc.qihoo.net:41630
I0323 16:51:23.017755 1377 JniCaffeNet.cpp:110] 0-th local addr: dlgpu19.ai.bjcc.qihoo.net:41630
I0323 16:51:23.017761 1377 JniCaffeNet.cpp:110] 1-th local addr:
I0323 16:51:23.017765 1377 JniCaffeNet.cpp:110] 2-th local addr: dlgpu19.ai.bjcc.qihoo.net:41630
I0323 16:51:23.017768 1377 JniCaffeNet.cpp:110] 3-th local addr: dlgpu19.ai.bjcc.qihoo.net:41630
I0323 16:51:23.017772 1377 JniCaffeNet.cpp:110] 4-th local addr: dlgpu19.ai.bjcc.qihoo.net:41630
16/03/23 16:51:23 INFO executor.Executor: Finished task 1.0 in stage 0.0 (TID 1). 1069 bytes result sent to driver
16/03/23 16:51:27 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 9
16/03/23 16:51:27 INFO executor.Executor: Running task 4.0 in stage 1.0 (TID 9)
16/03/23 16:51:27 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 2
16/03/23 16:51:27 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1622.0 B, free 6.9 KB)
16/03/23 16:51:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 2 took 16 ms
16/03/23 16:51:27 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.6 KB, free 9.5 KB)
16/03/23 16:51:27 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1
16/03/23 16:51:27 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 235.0 B, free 9.7 KB)
16/03/23 16:51:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 1 took 13 ms
16/03/23 16:51:27 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 12.8 KB)
I0323 16:51:27.558655 1413 socket.cpp:189] Accepted the connection from client [dlgpu19.ai.bjcc.qihoo.net]
I0323 16:51:27.560506 1377 common.cpp:61] 1-th string is NULL
I0323 16:51:27.560600 1377 socket.cpp:255] Trying to connect with ...[dlgpu19.ai.bjcc.qihoo.net:57402]
I0323 16:51:27.560781 1377 socket.cpp:314] Connected to server [dlgpu19.ai.bjcc.qihoo.net:57402] with client_fd [316]
I0323 16:51:37.553697 1413 socket.cpp:189] Accepted the connection from client [dlgpu10.ai.bjcc.qihoo.net]
I0323 16:51:37.554890 1413 socket.cpp:189] Accepted the connection from client [dlgpu10.ai.bjcc.qihoo.net]
I0323 16:51:37.557639 1413 socket.cpp:189] Accepted the connection from client [dlgpu20.ai.bjcc.qihoo.net]
I0323 16:51:37.560904 1377 socket.cpp:255] Trying to connect with ...[dlgpu10.ai.bjcc.qihoo.net:58067]
I0323 16:51:37.561306 1377 socket.cpp:314] Connected to server [dlgpu10.ai.bjcc.qihoo.net:58067] with client_fd [320]
I0323 16:51:47.561446 1377 socket.cpp:255] Trying to connect with ...[dlgpu20.ai.bjcc.qihoo.net:47155]
I0323 16:51:47.561790 1377 socket.cpp:314] Connected to server [dlgpu20.ai.bjcc.qihoo.net:47155] with client_fd [321]
I0323 16:51:57.561954 1377 socket.cpp:255] Trying to connect with ...[dlgpu10.ai.bjcc.qihoo.net:46110]
I0323 16:51:57.562314 1377 socket.cpp:314] Connected to server [dlgpu10.ai.bjcc.qihoo.net:46110] with client_fd [322]
I0323 16:52:07.606834 1377 parallel.cpp:392] GPUs pairs 0:1, 2:3, 0:2, 0:4
I0323 16:52:07.678092 1377 parallel.cpp:234] GPU 4 does not have p2p access to GPU 0
I0323 16:52:07.715546 1759 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
16/03/23 16:52:07 INFO executor.Executor: Finished task 4.0 in stage 1.0 (TID 9). 918 bytes result sent to driver
I0323 16:52:08.365959 1761 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
I0323 16:52:08.395136 1765 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
I0323 16:52:08.398850 1767 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
I0323 16:52:08.408880 1769 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
16/03/23 16:52:08 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 13
16/03/23 16:52:08 INFO executor.Executor: Running task 3.0 in stage 2.0 (TID 13)
16/03/23 16:52:08 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
16/03/23 16:52:08 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1450.0 B, free 4.7 KB)
16/03/23 16:52:08 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 17 ms
16/03/23 16:52:08 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.2 KB, free 6.9 KB)
16/03/23 16:52:08 INFO spark.CacheManager: Partition rdd_5_3 not found, computing it
16/03/23 16:52:08 INFO spark.CacheManager: Partition rdd_0_3 not found, computing it
16/03/23 16:52:08 INFO caffe.LmdbRDD: Processing partition 3
16/03/23 16:52:09 INFO caffe.LmdbRDD: Completed partition 3
16/03/23 16:52:09 INFO storage.BlockManager: Found block rdd_0_3 locally
I0323 16:52:10.081959 1452 socket.cpp:58] receive message: 3 0
I0323 16:52:10.081997 1452 socket.cpp:59] socket: 12 message header size: 12
I0323 16:52:10.128891  1377 socket.cpp:44] send message: 1 0 344864
I0323 16:52:10.140643  1377 socket.cpp:44] send message: 1 0 344864
I0323 16:52:10.149338  1377 socket.cpp:44] send message: 1 0 344864
I0323 16:52:10.159688 1425 socket.cpp:58] receive message: 0 0
I0323 16:52:10.159714 1425 socket.cpp:59] socket: 12 message header size: 12
I0323 16:52:10.162822  1377 socket.cpp:44] send message: 1 0 344864
I0323 16:52:10.940675 1451 socket.cpp:58] receive message: 2 0
I0323 16:52:10.940747 1451 socket.cpp:59] socket: 12 message header size: 12
I0323 16:52:11.121742 1450 socket.cpp:58] receive message: 4 0
I0323 16:52:11.121800 1450 socket.cpp:59] socket: 12 message header size: 12
16/03/23 16:53:58 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown
16/03/23 16:53:58 INFO storage.MemoryStore: MemoryStore cleared
16/03/23 16:53:58 INFO storage.BlockManager: BlockManager stopped
16/03/23 16:53:58 WARN executor.CoarseGrainedExecutorBackend: An unknown (dlgpu20.ai.bjcc.qihoo.net:45855) driver disconnected.
16/03/23 16:53:58 ERROR executor.CoarseGrainedExecutorBackend: Driver 10.142.118.172:45855 disassociated! Shutting down.
16/03/23 16:53:58 INFO util.ShutdownHookManager: Shutdown hook called
16/03/23 16:53:58 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.

LogType:stdout
Log Upload Time:23-Mar-2016 16:53:59
LogLength:0
Log Contents:

Container: container_1458635993280_0010_01_000002 on dlgpu20.ai.bjcc.qihoo.net_35190

LogType:stderr
Log Upload Time:23-Mar-2016 16:53:58
LogLength:30799
Log Contents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/run/hadoop/nm-local-dir/usercache/hadoop/filecache/12/spark-assembly-1.6.0-hadoop2.6.4-U4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/software/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/03/23 16:51:12 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
16/03/23 16:51:12 INFO yarn.YarnSparkHadoopUtil: Set ugi in YarnConf to: hadoop,hadoop
16/03/23 16:51:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/23 16:51:13 INFO yarn.YarnSparkHadoopUtil: running as user: hadoop
16/03/23 16:51:13 INFO spark.SecurityManager: Changing view acls to: hadoop
16/03/23 16:51:13 INFO spark.SecurityManager: Changing modify acls to: hadoop
16/03/23 16:51:13 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
16/03/23 16:51:14 INFO spark.SecurityManager: Changing view acls to: hadoop
16/03/23 16:51:14 INFO spark.SecurityManager: Changing modify acls to: hadoop
16/03/23 16:51:14 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
16/03/23 16:51:14 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/03/23 16:51:14 INFO Remoting: Starting remoting
16/03/23 16:51:14 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:45365]
16/03/23 16:51:14 INFO util.Utils: Successfully started service 'sparkExecutorActorSystem' on port 45365.
16/03/23 16:51:14 INFO storage.DiskBlockManager: Created local directory at /home/hadoop/yarn/nm-local-dir/usercache/hadoop/appcache/application_1458635993280_0010/blockmgr-3159017c-e654-4541-8292-47d1e6039345
16/03/23 16:51:14 INFO storage.DiskBlockManager: Created local directory at /dev/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1458635993280_0010/blockmgr-4cf722e0-fd5e-42e3-af3a-05ff17d88dc1
16/03/23 16:51:14 INFO storage.DiskBlockManager: Created local directory at /run/hadoop/nm-local-dir/usercache/hadoop/appcache/application_1458635993280_0010/blockmgr-1ed10eb6-8a21-4b13-b983-fea3b40c7120
16/03/23 16:51:14 INFO storage.MemoryStore: MemoryStore started with capacity 24.9 GB
16/03/23 16:51:14 INFO yarn.YarnSparkHadoopUtil: Set ugi in YarnConf to: hadoop,hadoop
16/03/23 16:51:15 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://[email protected]:45855
16/03/23 16:51:15 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
16/03/23 16:51:15 INFO executor.Executor: Starting executor ID 1 on host dlgpu20.ai.bjcc.qihoo.net
16/03/23 16:51:15 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 60205.
16/03/23 16:51:15 INFO netty.NettyBlockTransferService: Server created on 60205
16/03/23 16:51:15 INFO storage.BlockManager: external shuffle service port = 7337
16/03/23 16:51:15 INFO storage.BlockManagerMaster: Trying to register BlockManager
16/03/23 16:51:15 INFO storage.BlockManagerMaster: Registered BlockManager
16/03/23 16:51:15 INFO storage.BlockManager: Registering executor with local external shuffle service.
16/03/23 16:51:19 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 3
16/03/23 16:51:19 INFO executor.Executor: Running task 3.0 in stage 0.0 (TID 3)
16/03/23 16:51:19 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
16/03/23 16:51:20 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.1 KB, free 2.1 KB)
16/03/23 16:51:20 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 310 ms
16/03/23 16:51:20 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.2 KB, free 5.3 KB)
16/03/23 16:51:20 INFO caffe.CaffeProcessor: my rank is 3
16/03/23 16:51:20 INFO caffe.LMDB: Batch size:64
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0323 16:51:26.091801 29967 CaffeNet.cpp:78] set root solver device id to 0
I0323 16:51:26.250738 29967 solver.cpp:48] Initializing solver from parameters:
test_iter: 1
test_interval: 10001
base_lr: 0.01
display: 100
max_iter: 10000
lr_policy: "inv"
gamma: 0.0001
power: 0.75
momentum: 0.9
weight_decay: 0.0005
snapshot: 10001
snapshot_prefix: "mnist_lenet"
solver_mode: GPU
device_id: 0
net_param {
name: "LeNet"
layer {
name: "data"
type: "MemoryData"
top: "data"
top: "label"
include {
phase: TRAIN
}
memory_data_param {
batch_size: 64
channels: 1
height: 28
width: 28
share_in_parallel: false
source: "hdfs:///projects/machine_learning/image_dataset/mnist_train_lmdb/"
}
source_class: "com.yahoo.ml.caffe.LMDB"
}
layer {
name: "data"
type: "MemoryData"
top: "data"
top: "label"
include {
phase: TEST
}
memory_data_param {
batch_size: 100
channels: 1
height: 28
width: 28
share_in_parallel: false
source: "hdfs:///projects/machine_learning/image_dataset/mnist_test_lmdb/"
}
source_class: "com.yahoo.ml.caffe.LMDB"
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 20
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool1"
type: "Pooling"
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "conv2"
type: "Convolution"
bottom: "pool1"
top: "conv2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 50
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool2"
type: "Pooling"
bottom: "conv2"
top: "pool2"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "ip1"
type: "InnerProduct"
bottom: "pool2"
top: "ip1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 500
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu1"
type: "ReLU"
bottom: "ip1"
top: "ip1"
}
layer {
name: "ip2"
type: "InnerProduct"
bottom: "ip1"
top: "ip2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 10
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "accuracy"
type: "Accuracy"
bottom: "ip2"
bottom: "label"
top: "accuracy"
include {
phase: TEST
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip2"
bottom: "label"
top: "loss"
}
}
test_initialization: false
I0323 16:51:26.250965 29967 solver.cpp:86] Creating training net specified in net_param.
I0323 16:51:26.251049 29967 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer data
I0323 16:51:26.251065 29967 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0323 16:51:26.251157 29967 net.cpp:49] Initializing net from parameters:
name: "LeNet"
state {
phase: TRAIN
}
layer {
name: "data"
type: "MemoryData"
top: "data"
top: "label"
include {
phase: TRAIN
}
memory_data_param {
batch_size: 64
channels: 1
height: 28
width: 28
share_in_parallel: false
source: "hdfs:///projects/machine_learning/image_dataset/mnist_train_lmdb/"
}
source_class: "com.yahoo.ml.caffe.LMDB"
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 20
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool1"
type: "Pooling"
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "conv2"
type: "Convolution"
bottom: "pool1"
top: "conv2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 50
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool2"
type: "Pooling"
bottom: "conv2"
top: "pool2"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "ip1"
type: "InnerProduct"
bottom: "pool2"
top: "ip1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 500
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu1"
type: "ReLU"
bottom: "ip1"
top: "ip1"
}
layer {
name: "ip2"
type: "InnerProduct"
bottom: "ip1"
top: "ip2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 10
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip2"
bottom: "label"
top: "loss"
}
I0323 16:51:26.251276 29967 layer_factory.hpp:77] Creating layer data
I0323 16:51:26.251309 29967 net.cpp:106] Creating Layer data
I0323 16:51:26.251322 29967 net.cpp:411] data -> data
I0323 16:51:26.251359 29967 net.cpp:411] data -> label
I0323 16:51:26.254645 29967 net.cpp:150] Setting up data
I0323 16:51:26.254678 29967 net.cpp:157] Top shape: 64 1 28 28 (50176)
I0323 16:51:26.254685 29967 net.cpp:157] Top shape: 64 (64)
I0323 16:51:26.254689 29967 net.cpp:165] Memory required for data: 200960
I0323 16:51:26.254700 29967 layer_factory.hpp:77] Creating layer conv1
I0323 16:51:26.254724 29967 net.cpp:106] Creating Layer conv1
I0323 16:51:26.254732 29967 net.cpp:454] conv1 <- data
I0323 16:51:26.254747 29967 net.cpp:411] conv1 -> conv1
I0323 16:51:26.257411 29967 net.cpp:150] Setting up conv1
I0323 16:51:26.257429 29967 net.cpp:157] Top shape: 64 20 24 24 (737280)
I0323 16:51:26.257434 29967 net.cpp:165] Memory required for data: 3150080
I0323 16:51:26.257454 29967 layer_factory.hpp:77] Creating layer pool1
I0323 16:51:26.257468 29967 net.cpp:106] Creating Layer pool1
I0323 16:51:26.257473 29967 net.cpp:454] pool1 <- conv1
I0323 16:51:26.257479 29967 net.cpp:411] pool1 -> pool1
I0323 16:51:26.257601 29967 net.cpp:150] Setting up pool1
I0323 16:51:26.257608 29967 net.cpp:157] Top shape: 64 20 12 12 (184320)
I0323 16:51:26.257612 29967 net.cpp:165] Memory required for data: 3887360
I0323 16:51:26.257616 29967 layer_factory.hpp:77] Creating layer conv2
I0323 16:51:26.257628 29967 net.cpp:106] Creating Layer conv2
I0323 16:51:26.257632 29967 net.cpp:454] conv2 <- pool1
I0323 16:51:26.257642 29967 net.cpp:411] conv2 -> conv2
I0323 16:51:26.258602 29967 net.cpp:150] Setting up conv2
I0323 16:51:26.258618 29967 net.cpp:157] Top shape: 64 50 8 8 (204800)
I0323 16:51:26.258622 29967 net.cpp:165] Memory required for data: 4706560
I0323 16:51:26.258632 29967 layer_factory.hpp:77] Creating layer pool2
I0323 16:51:26.258641 29967 net.cpp:106] Creating Layer pool2
I0323 16:51:26.258644 29967 net.cpp:454] pool2 <- conv2
I0323 16:51:26.258651 29967 net.cpp:411] pool2 -> pool2
I0323 16:51:26.258749 29967 net.cpp:150] Setting up pool2
I0323 16:51:26.258764 29967 net.cpp:157] Top shape: 64 50 4 4 (51200)
I0323 16:51:26.258769 29967 net.cpp:165] Memory required for data: 4911360
I0323 16:51:26.258772 29967 layer_factory.hpp:77] Creating layer ip1
I0323 16:51:26.258790 29967 net.cpp:106] Creating Layer ip1
I0323 16:51:26.258795 29967 net.cpp:454] ip1 <- pool2
I0323 16:51:26.258800 29967 net.cpp:411] ip1 -> ip1
I0323 16:51:26.264073 29967 net.cpp:150] Setting up ip1
I0323 16:51:26.264094 29967 net.cpp:157] Top shape: 64 500 (32000)
I0323 16:51:26.264099 29967 net.cpp:165] Memory required for data: 5039360
I0323 16:51:26.264112 29967 layer_factory.hpp:77] Creating layer relu1
I0323 16:51:26.264122 29967 net.cpp:106] Creating Layer relu1
I0323 16:51:26.264127 29967 net.cpp:454] relu1 <- ip1
I0323 16:51:26.264133 29967 net.cpp:397] relu1 -> ip1 (in-place)
I0323 16:51:26.264145 29967 net.cpp:150] Setting up relu1
I0323 16:51:26.264152 29967 net.cpp:157] Top shape: 64 500 (32000)
I0323 16:51:26.264154 29967 net.cpp:165] Memory required for data: 5167360
I0323 16:51:26.264158 29967 layer_factory.hpp:77] Creating layer ip2
I0323 16:51:26.264165 29967 net.cpp:106] Creating Layer ip2
I0323 16:51:26.264169 29967 net.cpp:454] ip2 <- ip1
I0323 16:51:26.264178 29967 net.cpp:411] ip2 -> ip2
I0323 16:51:26.265939 29967 net.cpp:150] Setting up ip2
I0323 16:51:26.265956 29967 net.cpp:157] Top shape: 64 10 (640)
I0323 16:51:26.265961 29967 net.cpp:165] Memory required for data: 5169920
I0323 16:51:26.265969 29967 layer_factory.hpp:77] Creating layer loss
I0323 16:51:26.265980 29967 net.cpp:106] Creating Layer loss
I0323 16:51:26.265985 29967 net.cpp:454] loss <- ip2
I0323 16:51:26.265990 29967 net.cpp:454] loss <- label
I0323 16:51:26.266002 29967 net.cpp:411] loss -> loss
I0323 16:51:26.266022 29967 layer_factory.hpp:77] Creating layer loss
I0323 16:51:26.266337 29967 net.cpp:150] Setting up loss
I0323 16:51:26.266350 29967 net.cpp:157] Top shape: (1)
I0323 16:51:26.266355 29967 net.cpp:160] with loss weight 1
I0323 16:51:26.266376 29967 net.cpp:165] Memory required for data: 5169924
I0323 16:51:26.266381 29967 net.cpp:226] loss needs backward computation.
I0323 16:51:26.266386 29967 net.cpp:226] ip2 needs backward computation.
I0323 16:51:26.266389 29967 net.cpp:226] relu1 needs backward computation.
I0323 16:51:26.266392 29967 net.cpp:226] ip1 needs backward computation.
I0323 16:51:26.266396 29967 net.cpp:226] pool2 needs backward computation.
I0323 16:51:26.266401 29967 net.cpp:226] conv2 needs backward computation.
I0323 16:51:26.266404 29967 net.cpp:226] pool1 needs backward computation.
I0323 16:51:26.266407 29967 net.cpp:226] conv1 needs backward computation.
I0323 16:51:26.266412 29967 net.cpp:228] data does not need backward computation.
I0323 16:51:26.266415 29967 net.cpp:270] This network produces output loss
I0323 16:51:26.266427 29967 net.cpp:283] Network initialization done.
I0323 16:51:26.266485 29967 solver.cpp:181] Creating test net (#0) specified by net_param
I0323 16:51:26.266506 29967 net.cpp:322] The NetState phase (1) differed from the phase (0) specified by a rule in layer data
I0323 16:51:26.266605 29967 net.cpp:49] Initializing net from parameters:
name: "LeNet"
state {
phase: TEST
}
layer {
name: "data"
type: "MemoryData"
top: "data"
top: "label"
include {
phase: TEST
}
memory_data_param {
batch_size: 100
channels: 1
height: 28
width: 28
share_in_parallel: false
source: "hdfs:///projects/machine_learning/image_dataset/mnist_test_lmdb/"
}
source_class: "com.yahoo.ml.caffe.LMDB"
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 20
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool1"
type: "Pooling"
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "conv2"
type: "Convolution"
bottom: "pool1"
top: "conv2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 50
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool2"
type: "Pooling"
bottom: "conv2"
top: "pool2"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "ip1"
type: "InnerProduct"
bottom: "pool2"
top: "ip1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 500
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu1"
type: "ReLU"
bottom: "ip1"
top: "ip1"
}
layer {
name: "ip2"
type: "InnerProduct"
bottom: "ip1"
top: "ip2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 10
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "accuracy"
type: "Accuracy"
bottom: "ip2"
bottom: "label"
top: "accuracy"
include {
phase: TEST
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip2"
bottom: "label"
top: "loss"
}
I0323 16:51:26.266691 29967 layer_factory.hpp:77] Creating layer data
I0323 16:51:26.266703 29967 net.cpp:106] Creating Layer data
I0323 16:51:26.266710 29967 net.cpp:411] data -> data
I0323 16:51:26.266718 29967 net.cpp:411] data -> label
I0323 16:51:26.268417 29967 net.cpp:150] Setting up data
I0323 16:51:26.268438 29967 net.cpp:157] Top shape: 100 1 28 28 (78400)
I0323 16:51:26.268445 29967 net.cpp:157] Top shape: 100 (100)
I0323 16:51:26.268448 29967 net.cpp:165] Memory required for data: 314000
I0323 16:51:26.268453 29967 layer_factory.hpp:77] Creating layer label_data_1_split
I0323 16:51:26.268463 29967 net.cpp:106] Creating Layer label_data_1_split
I0323 16:51:26.268468 29967 net.cpp:454] label_data_1_split <- label
I0323 16:51:26.268474 29967 net.cpp:411] label_data_1_split -> label_data_1_split_0
I0323 16:51:26.268482 29967 net.cpp:411] label_data_1_split -> label_data_1_split_1
I0323 16:51:26.268594 29967 net.cpp:150] Setting up label_data_1_split
I0323 16:51:26.268602 29967 net.cpp:157] Top shape: 100 (100)
I0323 16:51:26.268607 29967 net.cpp:157] Top shape: 100 (100)
I0323 16:51:26.268610 29967 net.cpp:165] Memory required for data: 314800
I0323 16:51:26.268615 29967 layer_factory.hpp:77] Creating layer conv1
I0323 16:51:26.268626 29967 net.cpp:106] Creating Layer conv1
I0323 16:51:26.268630 29967 net.cpp:454] conv1 <- data
I0323 16:51:26.268641 29967 net.cpp:411] conv1 -> conv1
I0323 16:51:26.269433 29967 net.cpp:150] Setting up conv1
I0323 16:51:26.269448 29967 net.cpp:157] Top shape: 100 20 24 24 (1152000)
I0323 16:51:26.269451 29967 net.cpp:165] Memory required for data: 4922800
I0323 16:51:26.269461 29967 layer_factory.hpp:77] Creating layer pool1
I0323 16:51:26.269476 29967 net.cpp:106] Creating Layer pool1
I0323 16:51:26.269481 29967 net.cpp:454] pool1 <- conv1
I0323 16:51:26.269489 29967 net.cpp:411] pool1 -> pool1
I0323 16:51:26.269598 29967 net.cpp:150] Setting up pool1
I0323 16:51:26.269605 29967 net.cpp:157] Top shape: 100 20 12 12 (288000)
I0323 16:51:26.269609 29967 net.cpp:165] Memory required for data: 6074800
I0323 16:51:26.269613 29967 layer_factory.hpp:77] Creating layer conv2
I0323 16:51:26.269626 29967 net.cpp:106] Creating Layer conv2
I0323 16:51:26.269631 29967 net.cpp:454] conv2 <- pool1
I0323 16:51:26.269639 29967 net.cpp:411] conv2 -> conv2
I0323 16:51:26.270656 29967 net.cpp:150] Setting up conv2
I0323 16:51:26.270670 29967 net.cpp:157] Top shape: 100 50 8 8 (320000)
I0323 16:51:26.270674 29967 net.cpp:165] Memory required for data: 7354800
I0323 16:51:26.270684 29967 layer_factory.hpp:77] Creating layer pool2
I0323 16:51:26.270690 29967 net.cpp:106] Creating Layer pool2
I0323 16:51:26.270695 29967 net.cpp:454] pool2 <- conv2
I0323 16:51:26.270702 29967 net.cpp:411] pool2 -> pool2
I0323 16:51:26.270805 29967 net.cpp:150] Setting up pool2
I0323 16:51:26.270812 29967 net.cpp:157] Top shape: 100 50 4 4 (80000)
I0323 16:51:26.270823 29967 net.cpp:165] Memory required for data: 7674800
I0323 16:51:26.270828 29967 layer_factory.hpp:77] Creating layer ip1
I0323 16:51:26.270838 29967 net.cpp:106] Creating Layer ip1
I0323 16:51:26.270841 29967 net.cpp:454] ip1 <- pool2
I0323 16:51:26.270849 29967 net.cpp:411] ip1 -> ip1
I0323 16:51:26.276105 29967 net.cpp:150] Setting up ip1
I0323 16:51:26.276124 29967 net.cpp:157] Top shape: 100 500 (50000)
I0323 16:51:26.276127 29967 net.cpp:165] Memory required for data: 7874800
I0323 16:51:26.276139 29967 layer_factory.hpp:77] Creating layer relu1
I0323 16:51:26.276149 29967 net.cpp:106] Creating Layer relu1
I0323 16:51:26.276154 29967 net.cpp:454] relu1 <- ip1
I0323 16:51:26.276159 29967 net.cpp:397] relu1 -> ip1 (in-place)
I0323 16:51:26.276166 29967 net.cpp:150] Setting up relu1
I0323 16:51:26.276171 29967 net.cpp:157] Top shape: 100 500 (50000)
I0323 16:51:26.276175 29967 net.cpp:165] Memory required for data: 8074800
I0323 16:51:26.276178 29967 layer_factory.hpp:77] Creating layer ip2
I0323 16:51:26.276188 29967 net.cpp:106] Creating Layer ip2
I0323 16:51:26.276192 29967 net.cpp:454] ip2 <- ip1
I0323 16:51:26.276198 29967 net.cpp:411] ip2 -> ip2
I0323 16:51:26.276582 29967 net.cpp:150] Setting up ip2
I0323 16:51:26.276592 29967 net.cpp:157] Top shape: 100 10 (1000)
I0323 16:51:26.276595 29967 net.cpp:165] Memory required for data: 8078800
I0323 16:51:26.276602 29967 layer_factory.hpp:77] Creating layer ip2_ip2_0_split
I0323 16:51:26.276608 29967 net.cpp:106] Creating Layer ip2_ip2_0_split
I0323 16:51:26.276612 29967 net.cpp:454] ip2_ip2_0_split <- ip2
I0323 16:51:26.276621 29967 net.cpp:411] ip2_ip2_0_split -> ip2_ip2_0_split_0
I0323 16:51:26.276628 29967 net.cpp:411] ip2_ip2_0_split -> ip2_ip2_0_split_1
I0323 16:51:26.276732 29967 net.cpp:150] Setting up ip2_ip2_0_split
I0323 16:51:26.276739 29967 net.cpp:157] Top shape: 100 10 (1000)
I0323 16:51:26.276744 29967 net.cpp:157] Top shape: 100 10 (1000)
I0323 16:51:26.276748 29967 net.cpp:165] Memory required for data: 8086800
I0323 16:51:26.276751 29967 layer_factory.hpp:77] Creating layer accuracy
I0323 16:51:26.276762 29967 net.cpp:106] Creating Layer accuracy
I0323 16:51:26.276765 29967 net.cpp:454] accuracy <- ip2_ip2_0_split_0
I0323 16:51:26.276770 29967 net.cpp:454] accuracy <- label_data_1_split_0
I0323 16:51:26.276778 29967 net.cpp:411] accuracy -> accuracy
I0323 16:51:26.276792 29967 net.cpp:150] Setting up accuracy
I0323 16:51:26.276796 29967 net.cpp:157] Top shape: (1)
I0323 16:51:26.276800 29967 net.cpp:165] Memory required for data: 8086804
I0323 16:51:26.276803 29967 layer_factory.hpp:77] Creating layer loss
I0323 16:51:26.276809 29967 net.cpp:106] Creating Layer loss
I0323 16:51:26.276813 29967 net.cpp:454] loss <- ip2_ip2_0_split_1
I0323 16:51:26.276818 29967 net.cpp:454] loss <- label_data_1_split_1
I0323 16:51:26.276830 29967 net.cpp:411] loss -> loss
I0323 16:51:26.276839 29967 layer_factory.hpp:77] Creating layer loss
I0323 16:51:26.277124 29967 net.cpp:150] Setting up loss
I0323 16:51:26.277137 29967 net.cpp:157] Top shape: (1)
I0323 16:51:26.277140 29967 net.cpp:160] with loss weight 1
I0323 16:51:26.277148 29967 net.cpp:165] Memory required for data: 8086808
I0323 16:51:26.277153 29967 net.cpp:226] loss needs backward computation.
I0323 16:51:26.277156 29967 net.cpp:228] accuracy does not need backward computation.
I0323 16:51:26.277161 29967 net.cpp:226] ip2_ip2_0_split needs backward computation.
I0323 16:51:26.277165 29967 net.cpp:226] ip2 needs backward computation.
I0323 16:51:26.277169 29967 net.cpp:226] relu1 needs backward computation.
I0323 16:51:26.277173 29967 net.cpp:226] ip1 needs backward computation.
I0323 16:51:26.277176 29967 net.cpp:226] pool2 needs backward computation.
I0323 16:51:26.277181 29967 net.cpp:226] conv2 needs backward computation.
I0323 16:51:26.277187 29967 net.cpp:226] pool1 needs backward computation.
I0323 16:51:26.277191 29967 net.cpp:226] conv1 needs backward computation.
I0323 16:51:26.277195 29967 net.cpp:228] label_data_1_split does not need backward computation.
I0323 16:51:26.277205 29967 net.cpp:228] data does not need backward computation.
I0323 16:51:26.277209 29967 net.cpp:270] This network produces output accuracy
I0323 16:51:26.277212 29967 net.cpp:270] This network produces output loss
I0323 16:51:26.277226 29967 net.cpp:283] Network initialization done.
I0323 16:51:26.277271 29967 solver.cpp:60] Solver scaffolding done.
I0323 16:51:26.278825 29967 socket.cpp:219] Waiting for valid port [0]
I0323 16:51:26.278879 29991 socket.cpp:158] Assigned socket server port [47155]
I0323 16:51:26.281419 29991 socket.cpp:171] Socket Server ready [0.0.0.0]
I0323 16:51:26.288908 29967 socket.cpp:219] Waiting for valid port [47155]
I0323 16:51:26.288918 29967 socket.cpp:227] Valid port found [47155]
I0323 16:51:26.288931 29967 CaffeNet.cpp:186] Socket adapter: dlgpu20.ai.bjcc.qihoo.net:47155
I0323 16:51:26.289212 29967 CaffeNet.cpp:325] 0-th Socket addr: dlgpu20.ai.bjcc.qihoo.net:47155
I0323 16:51:26.289224 29967 CaffeNet.cpp:325] 1-th Socket addr: dlgpu20.ai.bjcc.qihoo.net:47155
I0323 16:51:26.289228 29967 CaffeNet.cpp:325] 2-th Socket addr: dlgpu20.ai.bjcc.qihoo.net:47155
I0323 16:51:26.289233 29967 CaffeNet.cpp:325] 3-th Socket addr:
I0323 16:51:26.289237 29967 CaffeNet.cpp:325] 4-th Socket addr: dlgpu20.ai.bjcc.qihoo.net:47155
I0323 16:51:26.289249 29967 JniCaffeNet.cpp:110] 0-th local addr: dlgpu20.ai.bjcc.qihoo.net:47155
I0323 16:51:26.289255 29967 JniCaffeNet.cpp:110] 1-th local addr: dlgpu20.ai.bjcc.qihoo.net:47155
I0323 16:51:26.289259 29967 JniCaffeNet.cpp:110] 2-th local addr: dlgpu20.ai.bjcc.qihoo.net:47155
I0323 16:51:26.289263 29967 JniCaffeNet.cpp:110] 3-th local addr:
I0323 16:51:26.289266 29967 JniCaffeNet.cpp:110] 4-th local addr: dlgpu20.ai.bjcc.qihoo.net:47155
16/03/23 16:51:26 INFO executor.Executor: Finished task 3.0 in stage 0.0 (TID 3). 1069 bytes result sent to driver
16/03/23 16:51:27 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 8
16/03/23 16:51:27 INFO executor.Executor: Running task 3.0 in stage 1.0 (TID 8)
16/03/23 16:51:27 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 2
16/03/23 16:51:27 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1622.0 B, free 6.9 KB)
16/03/23 16:51:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 2 took 16 ms
16/03/23 16:51:27 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.6 KB, free 9.5 KB)
16/03/23 16:51:27 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1
16/03/23 16:51:27 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 235.0 B, free 9.7 KB)
16/03/23 16:51:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 1 took 12 ms
16/03/23 16:51:27 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 12.8 KB)
I0323 16:51:27.556499 29967 common.cpp:61] 3-th string is NULL
I0323 16:51:27.556599 29967 socket.cpp:250] Trying to connect with ...[dlgpu19.ai.bjcc.qihoo.net:57402]
I0323 16:51:27.556901 29967 socket.cpp:309] Connected to server [dlgpu19.ai.bjcc.qihoo.net:57402] with client_fd [315]
I0323 16:51:37.557061 29967 socket.cpp:250] Trying to connect with ...[dlgpu19.ai.bjcc.qihoo.net:41630]
I0323 16:51:37.557479 29967 socket.cpp:309] Connected to server [dlgpu19.ai.bjcc.qihoo.net:41630] with client_fd [316]
I0323 16:51:47.555366 29991 socket.cpp:184] Accepted the connection from client [dlgpu10.ai.bjcc.qihoo.net]
I0323 16:51:47.557629 29967 socket.cpp:250] Trying to connect with ...[dlgpu10.ai.bjcc.qihoo.net:58067]
I0323 16:51:47.557984 29967 socket.cpp:309] Connected to server [dlgpu10.ai.bjcc.qihoo.net:58067] with client_fd [318]
I0323 16:51:47.559761 29991 socket.cpp:184] Accepted the connection from client [dlgpu19.ai.bjcc.qihoo.net]
I0323 16:51:47.561936 29991 socket.cpp:184] Accepted the connection from client [dlgpu19.ai.bjcc.qihoo.net]
I0323 16:51:57.554638 29991 socket.cpp:184] Accepted the connection from client [dlgpu10.ai.bjcc.qihoo.net]
I0323 16:51:57.558133 29967 socket.cpp:250] Trying to connect with ...[dlgpu10.ai.bjcc.qihoo.net:46110]
I0323 16:51:57.558480 29967 socket.cpp:309] Connected to server [dlgpu10.ai.bjcc.qihoo.net:46110] with client_fd [322]
I0323 16:52:07.586596 29967 parallel.cpp:392] GPUs pairs 0:1, 2:3, 0:2, 0:4
I0323 16:52:07.660895 29967 parallel.cpp:234] GPU 4 does not have p2p access to GPU 0
I0323 16:52:07.694207 30111 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
16/03/23 16:52:07 INFO executor.Executor: Finished task 3.0 in stage 1.0 (TID 8). 918 bytes result sent to driver
I0323 16:52:08.215100 30113 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
I0323 16:52:08.250818 30115 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
I0323 16:52:08.287663 30117 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
I0323 16:52:08.299569 30119 MemoryInputAdapter.cpp:15] MemoryInputAdapter is used
16/03/23 16:52:08 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 11
16/03/23 16:52:08 INFO executor.Executor: Running task 1.0 in stage 2.0 (TID 11)
16/03/23 16:52:08 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
16/03/23 16:52:08 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1450.0 B, free 4.7 KB)
16/03/23 16:52:08 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 11 ms
16/03/23 16:52:08 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.2 KB, free 6.9 KB)
16/03/23 16:52:08 INFO spark.CacheManager: Partition rdd_5_1 not found, computing it
16/03/23 16:52:08 INFO spark.CacheManager: Partition rdd_0_1 not found, computing it
16/03/23 16:52:08 INFO caffe.LmdbRDD: Processing partition 1
16/03/23 16:52:09 INFO caffe.LmdbRDD: Completed partition 1
16/03/23 16:52:09 INFO storage.BlockManager: Found block rdd_0_1 locally
16/03/23 16:53:58 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
16/03/23 16:53:58 INFO storage.DiskBlockManager: Shutdown hook called
16/03/23 16:53:58 INFO util.ShutdownHookManager: Shutdown hook called

LogType:stdout
Log Upload Time:23-Mar-2016 16:53:58
LogLength:0
Log Contents:

Container: container_1458635993280_0010_01_000001 on dlgpu20.ai.bjcc.qihoo.net_35190

LogType:stderr
Log Upload Time:23-Mar-2016 16:53:58
LogLength:1577
Log Contents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/run/hadoop/nm-local-dir/usercache/hadoop/filecache/12/spark-assembly-1.6.0-hadoop2.6.4-U4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/software/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

[Stage 0:> (0 + 0) / 5]
[Stage 0:> (0 + 5) / 5]
[Stage 0:===========> (1 + 4) / 5]
[Stage 0:=======================> (2 + 3) / 5]
[Stage 0:===================================> (3 + 2) / 5]
[Stage 0:===========================================================(5 + 0) / 5]

[Stage 1:> (0 + 5) / 5]
[Stage 1:===================================> (3 + 2) / 5]

[Stage 2:> (0 + 5) / 5]
[Stage 2:> (0 + 5) / 5]16/03/23 16:53:58 ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM

LogType:stdout
Log Upload Time:23-Mar-2016 16:53:58
LogLength:0
Log Contents:

anfeng avatar anfeng commented on August 22, 2024

Your CLI asks for 5 executors, each needing 5 GPUs, but each of your servers has only 8 GPUs. Two executors were allocated to one server, and together they need 10 GPUs. As a result, we see error messages such as:
I0323 16:52:07.676470 26145 parallel.cpp:234] GPU 4 does not have p2p access to GPU 0
That disables peer-to-peer communication among the GPUs.

Please adjust your settings.
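The arithmetic behind this can be sketched as follows (the function name and numbers are illustrative only, not part of CaffeOnSpark):

```python
# Hypothetical sanity check for the GPU placement described above:
# 5 executors x 5 GPUs each, scheduled onto 8-GPU nodes, means any node
# that receives two executors needs 10 GPUs but only has 8.

def placement_fits(executors_on_node: int, gpus_per_executor: int,
                   gpus_per_node: int) -> bool:
    """Return True if the executors co-located on one node fit its GPUs."""
    return executors_on_node * gpus_per_executor <= gpus_per_node

# The failing placement from this issue: 2 executors x 5 GPUs > 8 GPUs.
print(placement_fits(2, 5, 8))   # False
# A placement that fits: one 5-GPU executor per 8-GPU node.
print(placement_fits(1, 5, 8))   # True
```

In practice this means choosing the executor count and devices-per-executor so that YARN can never co-locate executors whose combined GPU demand exceeds a single node's capacity.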

ydm2011 avatar ydm2011 commented on August 22, 2024

@anfeng Thanks, this problem has been solved. I changed the settings as you described.
