KeyError: 'input' error about tensorflowonspark (CLOSED)

yahoo commented on August 20, 2024
KeyError: 'input' error

from tensorflowonspark.

Comments (3)

leewyang commented on August 20, 2024

Hi @xiaoyongzhu, this error generally occurs when a data feeding task is assigned to the executor running the PS node. This can only happen if you've configured Spark to run more than one task per executor.

For example, suppose you have two hosts and started a PS node on one executor and a worker node on the other, as in this log:

17/06/17 05:20:02 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.0.0.11, executor 1, partition 0, PROCESS_LOCAL, 6088 bytes)
17/06/17 05:20:02 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.0.0.14, executor 2, partition 1, PROCESS_LOCAL, 6088 bytes)
17/06/17 05:20:02 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.0.0.14:36036 (size: 7.8 KB, free: 2004.6 MB)
17/06/17 05:20:02 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.0.0.11:44572 (size: 7.8 KB, free: 2004.6 MB)
2017-06-17 05:20:03,426 INFO (MainThread-15141) waiting for 1 reservations
2017-06-17 05:20:04,427 INFO (MainThread-15141) all reservations completed
2017-06-17 05:20:04,427 INFO (MainThread-15141) All TFSparkNodes started
2017-06-17 05:20:04,427 INFO (MainThread-15141) {'addr': '/tmp/pymp-mWKGgU/listener-dQ0ETu', 'task_index': 0, 'port': 43064, 'authkey': '\xb6\x05\x10\x08\xef\x94@&\x97\x89a\x16\x90\x98\xd0\xbd', 'worker_num': 1, 'host': 'wn1-xiaoyz', 'ppid': 28794, 'job_name': 'worker', 'tb_pid': 0, 'tb_port': 0}
2017-06-17 05:20:04,427 INFO (MainThread-15141) {'addr': ('wn0-xiaoyz', 35537), 'task_index': 0, 'port': 39157, 'authkey': '+\xa5\x05\x17\xd93Db\x80\x16\x19\xa9\x13\x9b%U', 'worker_num': 0, 'host': 'wn0-xiaoyz', 'ppid': 31410, 'job_name': 'ps', 'tb_pid': 0, 'tb_port': 0}

However, the data feeding job then schedules tasks on both executors, including the one running the PS node:

17/06/17 05:20:04 INFO YarnScheduler: Adding task set 1.0 with 10 tasks
17/06/17 05:20:04 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, 10.0.0.11, executor 1, partition 0, PROCESS_LOCAL, 6768 bytes)
17/06/17 05:20:04 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, 10.0.0.14, executor 2, partition 1, PROCESS_LOCAL, 6768 bytes)
17/06/17 05:20:04 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 4, 10.0.0.11, executor 1, partition 2, PROCESS_LOCAL, 6768 bytes)
17/06/17 05:20:04 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 5, 10.0.0.14, executor 2, partition 3, PROCESS_LOCAL, 6768 bytes)
...
17/06/17 05:20:05 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 6, 10.0.0.14, executor 2, partition 4, PROCESS_LOCAL, 6768 bytes)
...
17/06/17 05:20:05 INFO TaskSetManager: Starting task 5.0 in stage 1.0 (TID 7, 10.0.0.11, executor 1, partition 5, PROCESS_LOCAL, 6768 bytes)
17/06/17 05:20:05 INFO TaskSetManager: Starting task 6.0 in stage 1.0 (TID 8, 10.0.0.11, executor 1, partition 6, PROCESS_LOCAL, 6768 bytes)

So you will need to configure Spark to run one task per executor, e.g. by setting --conf spark.executor.cores=1.
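
To check task placement in a driver log like the one above, a small script can group the "Starting task" lines by stage and executor. This is only a rough sketch: the regular expression is written against the log lines quoted in this thread, and the names `TASK_LINE`, `tasks_per_executor` and `SAMPLE_LOG` are made up for illustration. It does not identify which executor hosts the PS node; it only shows where each stage's tasks were sent.

```python
import re
from collections import defaultdict

# Matches Spark TaskSetManager "Starting task" lines as quoted in this thread.
# The pattern is an assumption based on those lines, not an official log format.
TASK_LINE = re.compile(
    r"Starting task (?P<task>\d+)\.\d+ in stage (?P<stage>\d+)\.\d+ "
    r"\(TID \d+, (?P<host>[^,]+), executor (?P<executor>\d+),"
)


def tasks_per_executor(log_lines):
    """Return {stage: {executor: [task indices]}} for the given log lines."""
    placement = defaultdict(lambda: defaultdict(list))
    for line in log_lines:
        match = TASK_LINE.search(line)
        if match:
            stage = int(match.group("stage"))
            executor = int(match.group("executor"))
            placement[stage][executor].append(int(match.group("task")))
    # Convert nested defaultdicts to plain dicts for easier printing/comparison.
    return {stage: dict(execs) for stage, execs in placement.items()}


# A few lines copied from the log excerpts in this thread.
SAMPLE_LOG = """\
17/06/17 05:20:02 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.0.0.11, executor 1, partition 0, PROCESS_LOCAL, 6088 bytes)
17/06/17 05:20:02 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.0.0.14, executor 2, partition 1, PROCESS_LOCAL, 6088 bytes)
17/06/17 05:20:04 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, 10.0.0.11, executor 1, partition 0, PROCESS_LOCAL, 6768 bytes)
17/06/17 05:20:04 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, 10.0.0.14, executor 2, partition 1, PROCESS_LOCAL, 6768 bytes)
17/06/17 05:20:04 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 4, 10.0.0.11, executor 1, partition 2, PROCESS_LOCAL, 6768 bytes)
17/06/17 05:20:04 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 5, 10.0.0.14, executor 2, partition 3, PROCESS_LOCAL, 6768 bytes)
"""

if __name__ == "__main__":
    for stage, execs in sorted(tasks_per_executor(SAMPLE_LOG.splitlines()).items()):
        for executor, tasks in sorted(execs.items()):
            print(f"stage {stage}: executor {executor} got tasks {tasks}")
```

In this sample, stage 0 (cluster startup) places one task on each executor, while stage 1 (data feeding) sends tasks to both executors, which matches the situation described above.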

renato2099 commented on August 20, 2024

Hi @leewyang
Does this mean that we always have to run with --conf spark.executor.cores=1 and scale with the number of executors instead, i.e. create more executors rather than assigning more CPUs to each one?

leewyang commented on August 20, 2024

@renato2099 Yes. Keep in mind that spark.executor.cores is a resource allocation hint, not a hardware limit. It just tells Spark that you only want to run one task at a time on each executor, which in the TFoS setting means that we only want to run one TensorFlow node on each executor. This was chosen as the simplest, easiest-to-reason-about "level of abstraction" (vs. one TensorFlow node per task). For example, each executor's log will only contain log statements from one TensorFlow node.
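
A minimal sketch of that scaling approach: the helper below builds spark-submit arguments that request one executor per TensorFlow node while keeping one task slot per executor. The function name `executor_scaling_args` is hypothetical, and using `--num-executors` assumes a YARN deployment (as the YarnScheduler log line earlier in the thread suggests); other cluster managers may need a different way to set the executor count.

```python
def executor_scaling_args(num_tf_nodes):
    """Build spark-submit arguments for one TensorFlow node per executor.

    num_tf_nodes is the total number of TensorFlow nodes (PS + workers).
    """
    if num_tf_nodes < 1:
        raise ValueError("need at least one TensorFlow node")
    return [
        # Scale out by adding executors (assumes YARN)...
        "--num-executors", str(num_tf_nodes),
        # ...while each executor runs only one task at a time.
        "--conf", "spark.executor.cores=1",
    ]


if __name__ == "__main__":
    # Example: 1 PS node + 3 workers.
    print(" ".join(executor_scaling_args(4)))
```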
