
Comments (16)

leewyang commented on August 20, 2024

Yes, we require that each executor run only one task at a time (and no dynamic allocation). The exact configuration depends on your Spark version/setup, but you might be able to try --executor-cores 1.
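
For example, something along these lines (illustrative values; your_app.py is a placeholder for your own driver script):

# one Spark task per executor, dynamic allocation off
spark-submit \
  --master yarn --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 1 \
  --conf spark.dynamicAllocation.enabled=false \
  your_app.py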


leewyang commented on August 20, 2024

@jzhusc that line just installs pip into the Python distro that we're trying to package up for Spark. If Google Dataproc already has Python installed, can you just try something like pip install pydoop to see if that works?

Unfortunately, I don't have any experience with this environment, so it's hard to say if your Python dependencies will be available on the executors. But if they are, you may not need to supply a Python.zip to the Spark command line.
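
In other words, roughly one of two routes (a sketch, not specific to Dataproc; paths are illustrative):

# route 1: the nodes already have a usable Python, so just add the packages the job needs
pip install pydoop tensorflow

# route 2: build a self-contained Python distro, zip it, and push it to HDFS so that
# spark-submit can distribute it via --archives hdfs:///user/${USER}/Python.zip#Python
pushd ~/Python && zip -r ~/Python.zip * && popd
hdfs dfs -put -f ~/Python.zip /user/${USER}/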


jzhusc commented on August 20, 2024

@leewyang hi, installing libssl-dev works.
BTW, in "Convert the MNIST zip files into HDFS files", I find that the mnist.zip in this line:
--archives hdfs:///user/${USER}/Python.zip#Python,mnist/mnist.zip#mnist
doesn't appear anywhere earlier in the instructions.
Is it just a zip file containing all 4 MNIST data files?


leewyang commented on August 20, 2024

Yes, sorry, that line was missing in the wiki. I've corrected it now.
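
For reference, mnist.zip is just the four standard MNIST downloads zipped together, e.g. (assuming the usual filenames from yann.lecun.com; adjust if your copies differ):

mkdir -p mnist && cd mnist
curl -O http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
curl -O http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
curl -O http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
curl -O http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
zip mnist.zip *.gz   # this mnist/mnist.zip is what --archives ...,mnist/mnist.zip#mnist points at
cd ..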


jzhusc commented on August 20, 2024

@leewyang another problem: I think that in "Convert the MNIST zip files into HDFS files",
export PYTHON_ROOT=~/Python
should be
export PYTHON_ROOT=Python
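
i.e., if I'm reading the wiki correctly (just my interpretation), the two values belong to different contexts:

# on the gateway, where the Python distro is built in the earlier steps:
export PYTHON_ROOT=~/Python

# for the Spark jobs that ship Python.zip via "--archives hdfs://...#Python", the archive
# is unpacked as ./Python inside each YARN container, so the relative name is what applies:
export PYTHON_ROOT=Python
export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/python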


jzhusc commented on August 20, 2024

@leewyang
Also, I ran into this problem during training:

17/04/08 00:50:20 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-04-08 00:50:22,233 INFO (MainThread-19895) connected to server at ('snaprec-deep-w-1', 60302)
2017-04-08 00:50:22,234 INFO (MainThread-19895) TFSparkNode.reserve: {'authkey': "\x13\x01=\x08#\xe5@J\xba'\xe4\xebh\x81n\xa1", 'worker_num': 1, 'host': 'snaprec-deep-w-1', 'tb_port': 0, 'addr': '/tmp/pymp-pKz80z/listener-nGc7sH', 'ppid': 19889, 'task_index': 0, 'job_name': 'worker', 'tb_pid': 0, 'port': 49855}
2017-04-08 00:50:22,389 INFO (MainThread-19896) connected to server at ('snaprec-deep-w-1', 60302)
2017-04-08 00:50:22,390 INFO (MainThread-19896) node: {'addr': '/tmp/pymp-pKz80z/listener-nGc7sH', 'task_index': 0, 'job_name': 'worker', 'authkey': "\x13\x01=\x08#\xe5@J\xba'\xe4\xebh\x81n\xa1", 'worker_num': 1, 'host': 'snaprec-deep-w-1', 'ppid': 19889, 'port': 49855, 'tb_pid': 0, 'tb_port': 0}
2017-04-08 00:50:22,524 INFO (MainThread-19896) Starting TensorFlow ps:0 on cluster node 0 on background process
2017-04-08 00:50:28,374 INFO (MainThread-19943) 0: ======== ps:0 ========
2017-04-08 00:50:28,374 INFO (MainThread-19943) 0: Cluster spec: {'worker': ['snaprec-deep-w-1:49855']}
2017-04-08 00:50:28,375 INFO (MainThread-19943) 0: Using CPU
Process Process-2:
Traceback (most recent call last):
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0026/container_1491523243553_0026_01_000003/Python/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0026/container_1491523243553_0026_01_000003/Python/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0026/container_1491523243553_0026_01_000003/__pyfiles__/mnist_dist.py", line 42, in map_fun
    cluster, server = TFNode.start_cluster_server(ctx, 1, args.rdma)
  File "./tfspark.zip/com/yahoo/ml/tf/TFNode.py", line 88, in start_cluster_server
    server = tf.train.Server(cluster, ctx.job_name, ctx.task_index)
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0026/container_1491523243553_0026_01_000003/Python/lib/python2.7/site-packages/tensorflow/python/training/server_lib.py", line 144, in __init__
    self._server_def.SerializeToString(), status)
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0026/container_1491523243553_0026_01_000003/Python/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0026/container_1491523243553_0026_01_000003/Python/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
InternalError: Job "ps" was not defined in cluster

Do you have any suggestions?
Thanks!


leewyang commented on August 20, 2024

@jzhusc according to this line, your cluster_spec only defined one worker (and no PS):

2017-04-08 00:50:28,374 INFO (MainThread-19943) 0: Cluster spec: {'worker': ['snaprec-deep-w-1:49855']}
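
For a working run you'd expect the reserved spec to contain a ps entry as well, i.e. something shaped like (hosts and ports purely illustrative):

{'ps': ['some-host:port1'], 'worker': ['some-host:port2', 'other-host:port3']}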

Can you provide the command line that you used to launch this job?


jzhusc commented on August 20, 2024

@leewyang

spark-submit --master yarn --deploy-mode cluster --queue ${QUEUE} --num-executors 4 --executor-memory 5G --py-files TensorFlowOnSpark/tfspark.zip,TensorFlowOnSpark/examples/mnist/tf/mnist_dist.py --conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --archives hdfs:///user/${USER}/Python.zip#Python --conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" TensorFlowOnSpark/examples/mnist/tf/mnist_spark.py --images mnist/tfr/train --format tfr --mode train --model mnist_model
17/04/10 17:02:31 INFO com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase: GHFS version: 1.6.0-hadoop2


leewyang commented on August 20, 2024


jzhusc commented on August 20, 2024

@leewyang

args: Namespace(cluster_size=10, epochs=0, format='tfr', images='mnist/tfr/train', labels=None, mode='train', model='mnist_model', output='predictions', rdma=False, readers=1, steps=1000, tensorboard=False)
2017-04-10T17:49:35.496284 ===== Start
2017-04-10 17:49:35,496 INFO (MainThread-2086) Reserving TFSparkNodes 
2017-04-10 17:49:35,500 INFO (MainThread-2086) listening for reservations at ('snaprec-deep-w-4', 39247)
2017-04-10 17:49:35,501 INFO (MainThread-2086) Starting TensorFlow on executors
2017-04-10 17:49:35,859 INFO (MainThread-2086) Waiting for TFSparkNodes to start
2017-04-10 17:49:35,859 INFO (MainThread-2086) waiting for 10 reservations
2017-04-10 17:49:36,860 INFO (MainThread-2086) waiting for 10 reservations
2017-04-10 17:49:37,862 INFO (MainThread-2086) waiting for 10 reservations
2017-04-10 17:49:38,863 INFO (MainThread-2086) waiting for 10 reservations
2017-04-10 17:49:39,865 INFO (MainThread-2086) waiting for 8 reservations
2017-04-10 17:49:40,866 INFO (MainThread-2086) waiting for 6 reservations
2017-04-10 17:49:41,867 INFO (MainThread-2086) waiting for 3 reservations
2017-04-10 17:49:42,869 INFO (MainThread-2086) waiting for 2 reservations
2017-04-10 17:49:43,870 INFO (MainThread-2086) waiting for 2 reservations
2017-04-10 17:49:44,871 INFO (MainThread-2086) waiting for 2 reservations
2017-04-10 17:49:45,873 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:49:46,874 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:49:47,875 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:49:48,877 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:49:49,878 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:49:50,879 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:49:51,881 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:49:52,882 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:49:53,883 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:49:54,885 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:49:55,886 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:49:56,887 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:49:57,889 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:49:58,890 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:49:59,891 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:50:00,893 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:50:01,893 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:50:02,895 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:50:03,896 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:50:04,897 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:50:05,899 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:50:06,900 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:50:07,901 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:50:08,903 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:50:09,904 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:50:10,905 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:50:11,907 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:50:12,908 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:50:13,909 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:50:14,911 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:50:15,912 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:50:16,913 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:50:17,915 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:50:18,916 INFO (MainThread-2086) waiting for 1 reservations
2017-04-10 17:50:19,917 INFO (MainThread-2086) waiting for 1 reservations

Also, I found that an error occurred in the driver:

                           (0 + 10) / 10]17/04/10 17:49:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, snaprec-deep-w-9.c.snap-brain.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000004/pyspark.zip/pyspark/worker.py", line 172, in main
    process()
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000004/pyspark.zip/pyspark/worker.py", line 167, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000001/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000001/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000001/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000001/pyspark.zip/pyspark/rdd.py", line 317, in func
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000001/pyspark.zip/pyspark/rdd.py", line 762, in func
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000001/tfspark.zip/com/yahoo/ml/tf/TFSparkNode.py", line 411, in _mapfn
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000004/__pyfiles__/mnist_dist.py", line 140, in map_fun
    x, y_ = read_tfr_examples(images, 100, num_epochs, index, workers)
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000004/__pyfiles__/mnist_dist.py", line 83, in read_tfr_examples
    files = tf.gfile.Glob(tf_record_pattern)
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000004/Python/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 269, in get_matching_files
    compat.as_bytes(filename), status)]
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000004/Python/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0034/container_1491523243553_0034_01_000004/Python/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
UnimplementedError: File system scheme hdfs not implemented

	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
	at org.apache.spark.scheduler.Task.run(Task.scala:86)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)


[Stage 0:>                                                        (0 + 10) / 10]

But sometimes a different error occurs instead:

a.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

17/04/10 18:01:46 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 7.0 in stage 0.0 (TID 7, snaprec-deep-w-8.c.snap-brain.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0035/container_1491523243553_0035_01_000003/pyspark.zip/pyspark/worker.py", line 172, in main
    process()
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0035/container_1491523243553_0035_01_000003/pyspark.zip/pyspark/worker.py", line 167, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0035/container_1491523243553_0035_01_000001/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0035/container_1491523243553_0035_01_000001/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0035/container_1491523243553_0035_01_000001/pyspark.zip/pyspark/rdd.py", line 2371, in pipeline_func
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0035/container_1491523243553_0035_01_000001/pyspark.zip/pyspark/rdd.py", line 317, in func
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0035/container_1491523243553_0035_01_000001/pyspark.zip/pyspark/rdd.py", line 762, in func
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0035/container_1491523243553_0035_01_000001/tfspark.zip/com/yahoo/ml/tf/TFSparkNode.py", line 411, in _mapfn
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0035/container_1491523243553_0035_01_000003/__pyfiles__/mnist_dist.py", line 42, in map_fun
    cluster, server = TFNode.start_cluster_server(ctx, 1, args.rdma)
  File "./tfspark.zip/com/yahoo/ml/tf/TFNode.py", line 88, in start_cluster_server
    server = tf.train.Server(cluster, ctx.job_name, ctx.task_index)
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0035/container_1491523243553_0035_01_000003/Python/lib/python2.7/site-packages/tensorflow/python/training/server_lib.py", line 144, in __init__
    self._server_def.SerializeToString(), status)
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0035/container_1491523243553_0035_01_000003/Python/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0035/container_1491523243553_0035_01_000003/Python/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
InvalidArgumentError: Task 6 was not defined in job "worker"

	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
	at org.apache.spark.scheduler.Task.run(Task.scala:86)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)


leewyang commented on August 20, 2024


jzhusc commented on August 20, 2024

@leewyang I expanded the cluster size to 10 but still have the same problem.
It seems like an issue related to HDFS, but setting logdir=None doesn't work for me.

Also, I found that there are two tasks running in each executor, so the job ends up using only 5 executors instead of 10. Is that the reason?


jzhusc commented on August 20, 2024

@leewyang
Now my job is stuck at 2/10 or 4/10 for a long time. I have set logdir=None.
There are three types of errors in the executors:

17/04/10 21:23:37 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-04-10 21:23:39,672 INFO (MainThread-8481) connected to server at ('snaprec-deep-w-9', 58463)
2017-04-10 21:23:39,674 INFO (MainThread-8481) TFSparkNode.reserve: {'authkey': '\x8a\x15\xab4\xcb{O\xd1\x93\x0c\x91\x8f\xab\xa2p+', 'worker_num': 3, 'host': 'snaprec-deep-w-6', 'tb_port': 0, 'addr': '/tmp/pymp-IorQjq/listener-3OXvo3', 'ppid': 8473, 'task_index': 2, 'job_name': 'worker', 'tb_pid': 0, 'port': 45394}
2017-04-10 21:23:39,674 INFO (MainThread-8480) connected to server at ('snaprec-deep-w-9', 58463)
2017-04-10 21:23:39,676 INFO (MainThread-8480) node: {'addr': '/tmp/pymp-uisw6e/listener-UjwPsq', 'task_index': 0, 'job_name': 'worker', 'authkey': '\xf2\x1bv\xac\x9a\xeaL\x90\xa7\x99\xa1\x08\x88\xd0\xc0\x06', 'worker_num': 1, 'host': 'snaprec-deep-w-5', 'ppid': 8992, 'port': 40473, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,676 INFO (MainThread-8480) node: {'addr': '/tmp/pymp-IorQjq/listener-3OXvo3', 'task_index': 2, 'job_name': 'worker', 'authkey': '\x8a\x15\xab4\xcb{O\xd1\x93\x0c\x91\x8f\xab\xa2p+', 'worker_num': 3, 'host': 'snaprec-deep-w-6', 'ppid': 8473, 'port': 45394, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,855 INFO (MainThread-8480) Starting TensorFlow worker:1 on cluster node 2 on background process
2017-04-10 21:23:40,727 INFO (MainThread-8530) 2: ======== worker:1 ========
2017-04-10 21:23:40,727 INFO (MainThread-8530) 2: Cluster spec: {'worker': ['snaprec-deep-w-5:40473', 'snaprec-deep-w-6:45394']}
2017-04-10 21:23:40,727 INFO (MainThread-8530) 2: Using CPU
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> snaprec-deep-w-5:40473, 1 -> localhost:45394}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:241] Started server with target: grpc://localhost:45394
tensorflow model path: None
Process Process-2:
Traceback (most recent call last):
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000004/Python/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000004/Python/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000004/__pyfiles__/mnist_dist.py", line 122, in map_fun
    save_model_secs=10)
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000004/Python/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 336, in __init__
    self._verify_setup()
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000004/Python/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 881, in _verify_setup
    "their device set: %s" % op)
ValueError: When using replicas, all Variables must have their device set: name: "hid_w"
op: "VariableV2"
attr {
  key: "container"
  value {
    s: ""
  }
}
attr {
  key: "dtype"
  value {
    type: DT_FLOAT
  }
}
attr {
  key: "shape"
  value {
    shape {
      dim {
        size: 784
      }
      dim {
        size: 128
      }
    }
  }
}
attr {
  key: "shared_name"
  value {
    s: ""
  }
}

and

17/04/10 21:23:37 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-04-10 21:23:39,562 INFO (MainThread-8999) connected to server at ('snaprec-deep-w-9', 58463)
2017-04-10 21:23:39,563 INFO (MainThread-8999) TFSparkNode.reserve: {'authkey': '\xf2\x1bv\xac\x9a\xeaL\x90\xa7\x99\xa1\x08\x88\xd0\xc0\x06', 'worker_num': 1, 'host': 'snaprec-deep-w-5', 'tb_port': 0, 'addr': '/tmp/pymp-uisw6e/listener-UjwPsq', 'ppid': 8992, 'task_index': 0, 'job_name': 'worker', 'tb_pid': 0, 'port': 40473}
2017-04-10 21:23:39,711 INFO (MainThread-8998) connected to server at ('snaprec-deep-w-9', 58463)
2017-04-10 21:23:39,713 INFO (MainThread-8998) node: {'addr': '/tmp/pymp-uisw6e/listener-UjwPsq', 'task_index': 0, 'job_name': 'worker', 'authkey': '\xf2\x1bv\xac\x9a\xeaL\x90\xa7\x99\xa1\x08\x88\xd0\xc0\x06', 'worker_num': 1, 'host': 'snaprec-deep-w-5', 'ppid': 8992, 'port': 40473, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,713 INFO (MainThread-8998) node: {'addr': '/tmp/pymp-IorQjq/listener-3OXvo3', 'task_index': 2, 'job_name': 'worker', 'authkey': '\x8a\x15\xab4\xcb{O\xd1\x93\x0c\x91\x8f\xab\xa2p+', 'worker_num': 3, 'host': 'snaprec-deep-w-6', 'ppid': 8473, 'port': 45394, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,861 INFO (MainThread-8998) Starting TensorFlow ps:0 on cluster node 0 on background process
2017-04-10 21:23:45,679 INFO (MainThread-9048) 0: ======== ps:0 ========
2017-04-10 21:23:45,679 INFO (MainThread-9048) 0: Cluster spec: {'worker': ['snaprec-deep-w-5:40473', 'snaprec-deep-w-6:45394']}
2017-04-10 21:23:45,679 INFO (MainThread-9048) 0: Using CPU
Process Process-2:
Traceback (most recent call last):
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000003/Python/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000003/Python/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000003/__pyfiles__/mnist_dist.py", line 39, in map_fun
    cluster, server = TFNode.start_cluster_server(ctx, 1, args.rdma)
  File "./tfspark.zip/com/yahoo/ml/tf/TFNode.py", line 88, in start_cluster_server
    server = tf.train.Server(cluster, ctx.job_name, ctx.task_index)
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000003/Python/lib/python2.7/site-packages/tensorflow/python/training/server_lib.py", line 144, in __init__
    self._server_def.SerializeToString(), status)
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000003/Python/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000003/Python/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
InternalError: Job "ps" was not defined in cluster

and

17/04/10 21:23:38 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-04-10 21:23:39,827 INFO (MainThread-21382) connected to server at ('snaprec-deep-w-9', 58463)
2017-04-10 21:23:39,828 INFO (MainThread-21382) TFSparkNode.reserve: {'authkey': 'TY\xaa\xfe-aGZ\xa5\xb8\xbe\x8b\x8ca&\xd1', 'worker_num': 5, 'host': 'snaprec-deep-w-0', 'tb_port': 0, 'addr': '/tmp/pymp-tYaBWE/listener-_ISeHu', 'ppid': 21376, 'task_index': 4, 'job_name': 'worker', 'tb_pid': 0, 'port': 48886}
2017-04-10 21:23:39,831 INFO (MainThread-21383) connected to server at ('snaprec-deep-w-9', 58463)
2017-04-10 21:23:39,833 INFO (MainThread-21383) node: {'addr': '/tmp/pymp-uisw6e/listener-UjwPsq', 'task_index': 0, 'job_name': 'worker', 'authkey': '\xf2\x1bv\xac\x9a\xeaL\x90\xa7\x99\xa1\x08\x88\xd0\xc0\x06', 'worker_num': 1, 'host': 'snaprec-deep-w-5', 'ppid': 8992, 'port': 40473, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,833 INFO (MainThread-21383) node: {'addr': '/tmp/pymp-IorQjq/listener-3OXvo3', 'task_index': 2, 'job_name': 'worker', 'authkey': '\x8a\x15\xab4\xcb{O\xd1\x93\x0c\x91\x8f\xab\xa2p+', 'worker_num': 3, 'host': 'snaprec-deep-w-6', 'ppid': 8473, 'port': 45394, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,833 INFO (MainThread-21383) node: {'addr': '/tmp/pymp-tYaBWE/listener-_ISeHu', 'task_index': 4, 'job_name': 'worker', 'authkey': 'TY\xaa\xfe-aGZ\xa5\xb8\xbe\x8b\x8ca&\xd1', 'worker_num': 5, 'host': 'snaprec-deep-w-0', 'ppid': 21376, 'port': 48886, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:23:39,978 INFO (MainThread-21383) Starting TensorFlow worker:3 on cluster node 4 on background process
2017-04-10 21:23:40,825 INFO (MainThread-21432) 4: ======== worker:3 ========
2017-04-10 21:23:40,825 INFO (MainThread-21432) 4: Cluster spec: {'worker': ['snaprec-deep-w-5:40473', 'snaprec-deep-w-6:45394', 'snaprec-deep-w-0:48886']}
2017-04-10 21:23:40,825 INFO (MainThread-21432) 4: Using CPU
Process Process-2:
Traceback (most recent call last):
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000002/Python/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000002/Python/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000002/__pyfiles__/mnist_dist.py", line 39, in map_fun
    cluster, server = TFNode.start_cluster_server(ctx, 1, args.rdma)
  File "./tfspark.zip/com/yahoo/ml/tf/TFNode.py", line 88, in start_cluster_server
    server = tf.train.Server(cluster, ctx.job_name, ctx.task_index)
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000002/Python/lib/python2.7/site-packages/tensorflow/python/training/server_lib.py", line 144, in __init__
    self._server_def.SerializeToString(), status)
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000002/Python/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0058/container_1491523243553_0058_01_000002/Python/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
InvalidArgumentError: Task 3 was not defined in job "worker"


leewyang commented on August 20, 2024


jzhusc commented on August 20, 2024

@leewyang I use only 2 executors and get these errors:

17/04/10 21:55:19 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-04-10 21:55:20,813 INFO (MainThread-9005) connected to server at ('snaprec-deep-w-7', 48312)
2017-04-10 21:55:20,814 INFO (MainThread-9005) TFSparkNode.reserve: {'authkey': '\xb2\x94\x91\xdd\xb6d@}\xa4%\x96s^\xec\x07\xfb', 'worker_num': 1, 'host': 'snaprec-deep-w-8', 'tb_port': 0, 'addr': '/tmp/pymp-_VGtFS/listener-tWsOQQ', 'ppid': 8999, 'task_index': 0, 'job_name': 'worker', 'tb_pid': 0, 'port': 41254}
2017-04-10 21:55:20,973 INFO (MainThread-9006) connected to server at ('snaprec-deep-w-7', 48312)
2017-04-10 21:55:20,975 INFO (MainThread-9006) node: {'addr': '/tmp/pymp-_VGtFS/listener-tWsOQQ', 'task_index': 0, 'job_name': 'worker', 'authkey': '\xb2\x94\x91\xdd\xb6d@}\xa4%\x96s^\xec\x07\xfb', 'worker_num': 1, 'host': 'snaprec-deep-w-8', 'ppid': 8999, 'port': 41254, 'tb_pid': 0, 'tb_port': 0}
2017-04-10 21:55:21,120 INFO (MainThread-9006) Starting TensorFlow ps:0 on cluster node 0 on background process
2017-04-10 21:55:27,017 INFO (MainThread-9055) 0: ======== ps:0 ========
2017-04-10 21:55:27,017 INFO (MainThread-9055) 0: Cluster spec: {'worker': ['snaprec-deep-w-8:41254']}
2017-04-10 21:55:27,017 INFO (MainThread-9055) 0: Using CPU
Process Process-2:
Traceback (most recent call last):
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0061/container_1491523243553_0061_01_000003/Python/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0061/container_1491523243553_0061_01_000003/Python/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0061/container_1491523243553_0061_01_000003/__pyfiles__/mnist_dist.py", line 39, in map_fun
    cluster, server = TFNode.start_cluster_server(ctx, 1, args.rdma)
  File "./tfspark.zip/com/yahoo/ml/tf/TFNode.py", line 88, in start_cluster_server
    server = tf.train.Server(cluster, ctx.job_name, ctx.task_index)
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0061/container_1491523243553_0061_01_000003/Python/lib/python2.7/site-packages/tensorflow/python/training/server_lib.py", line 144, in __init__
    self._server_def.SerializeToString(), status)
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0061/container_1491523243553_0061_01_000003/Python/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/hadoop/yarn/nm-local-dir/usercache/jiaxu.zhu/appcache/application_1491523243553_0061/container_1491523243553_0061_01_000003/Python/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
InternalError: Job "ps" was not defined in cluster

Seems like that's the reason: the ps node is not launched for some reason.
And in the other executor, the stderr is:

17/04/10 22:07:53 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


jzhusc commented on August 20, 2024

@leewyang
I think I found the reason.
In 1 worker + 1 ps mode, there were 2 tasks in the same executor, so they had the same ppid and only one node got created, which means the ps and the worker could not both start.

By limiting each executor to one task, the problem is solved.
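
i.e. roughly this, adapting the earlier command (a sketch: --executor-cores 1 forces one task per executor, and the script's cluster_size argument, assuming it is spelled --cluster_size, should match num-executors):

spark-submit --master yarn --deploy-mode cluster --queue ${QUEUE} \
  --num-executors 4 --executor-cores 1 --executor-memory 5G \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.yarn.maxAppAttempts=1 \
  --py-files TensorFlowOnSpark/tfspark.zip,TensorFlowOnSpark/examples/mnist/tf/mnist_dist.py \
  --archives hdfs:///user/${USER}/Python.zip#Python \
  --conf spark.executorEnv.LD_LIBRARY_PATH="$JAVA_HOME/jre/lib/amd64/server" \
  TensorFlowOnSpark/examples/mnist/tf/mnist_spark.py \
  --cluster_size 4 \
  --images mnist/tfr/train --format tfr --mode train --model mnist_model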

