
Cannot place the graph because a reference or resource edge connects colocation groups with incompatible resource devices (deeprec, closed)

formath commented on June 11, 2024
Cannot place the graph because a reference or resource edge connects colocation groups with incompatible resource devices

from deeprec.

Comments (3)

Mesilenceki commented on June 11, 2024

Hi @formath.
There are two things I want to confirm.

  1. It seems that you are using an older DeepRec version, earlier than 2302.
  2. You have customized the worker device in tf.train.replica_device_setter, which may conflict with EV (Embedding Variable) placement.

Could you please turn on logging of variable placement, and set worker_device to '/job:worker/task:%d' % task_index instead?
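For readers following along, the two requests above could look roughly like the sketch below. It assumes a TensorFlow build that exposes tf.compat.v1 (DeepRec's own API surface may differ), and the names worker_device_for, build_placement_setup and cluster_dict are hypothetical helpers introduced only for illustration.

```python
def worker_device_for(task_index):
    """Return the plain worker device string suggested in this comment."""
    return "/job:worker/task:%d" % task_index


def build_placement_setup(cluster_dict, task_index):
    """Return (device_fn, session_config) for a parameter-server job.

    cluster_dict is a hypothetical mapping such as
    {"ps": ["host:port", ...], "worker": ["host:port", ...]}.
    """
    # Lazy import so worker_device_for() can be used without TensorFlow.
    import tensorflow as tf

    tf1 = tf.compat.v1
    cluster = tf1.train.ClusterSpec(cluster_dict)
    # No custom ps_strategy is passed, so the default round-robin
    # placement of variables over the ps tasks is used.
    device_fn = tf1.train.replica_device_setter(
        worker_device=worker_device_for(task_index),
        cluster=cluster,
    )
    # Ask the runtime to log where each op and variable is placed.
    config = tf1.ConfigProto(log_device_placement=True)
    return device_fn, config
```

The returned device_fn would be used as `with tf.device(device_fn):` around model construction, and config would be passed to the session that runs the job.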


formath commented on June 11, 2024

@Mesilenceki

  1. DeepRec's version is in waiting.
  2. For replica_device_setter, I used the default round-robin strategy.
  3. I set log_device_placement=True. With 10 ps tasks I got the error, and the job exited before the full placement information was printed. From the partial log, model/input_layer/userid_embedding/embedding_weights/part_0 and its save/SaveV2 are both placed on /job:ps/task:7, as expected. However, model/input_layer/userid_embedding/embedding_weights/part_1 is placed on /job:ps/task:8, and I can't find a log line for the placement of its save/SaveV2. My guess is that it is placed on /job:ps/task:7, which violates the colocation constraint and causes the error. It is also not an EV problem, because the error persists when I disable EV.
2023-09-24 14:54:23.876226: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_0/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_0' is a reference connection and already has a device field set to /job:ps/task:7
2023-09-24 14:54:23.876278: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_1/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_1' is a reference connection and already has a device field set to /job:ps/task:8
2023-09-24 14:54:23.876287: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_2/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_2' is a reference connection and already has a device field set to /job:ps/task:9
2023-09-24 14:54:23.876295: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_3/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_3' is a reference connection and already has a device field set to /job:ps/task:0
2023-09-24 14:54:23.876302: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_4/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_4' is a reference connection and already has a device field set to /job:ps/task:1
2023-09-24 14:54:23.876311: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_5/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_5' is a reference connection and already has a device field set to /job:ps/task:2
2023-09-24 14:54:23.876320: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_6/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_6' is a reference connection and already has a device field set to /job:ps/task:3
2023-09-24 14:54:23.876328: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_7/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_7' is a reference connection and already has a device field set to /job:ps/task:4
2023-09-24 14:54:23.876336: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_8/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_8' is a reference connection and already has a device field set to /job:ps/task:5
2023-09-24 14:54:23.876345: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_9/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_9' is a reference connection and already has a device field set to /job:ps/task:6

2023-09-24 14:54:23.923634: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'save/SaveV2' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_0' is a reference connection and already has a device field set to /job:ps/task:7
  4. When I set the ps num to 1, the job works fine and the full placement information is shown.
  5. /job:worker/task:%d produces the same error as /job:worker/task:%d/cpu:0.
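To make the placement in the log above easier to follow, here is a small pure-Python sketch. It is a simplified model, not TensorFlow's placer: it assumes partition i goes to ps task (start_task + i) % num_ps, with start_task=7 inferred from part_0 landing on /job:ps/task:7, and it treats a single save op with reference edges to every partition as unplaceable once those partitions sit on more than one device. The function names are hypothetical.

```python
def round_robin_placement(num_partitions, num_ps, start_task=0):
    """Assign partition i to ps task (start_task + i) % num_ps."""
    return {
        "part_%d" % i: "/job:ps/task:%d" % ((start_task + i) % num_ps)
        for i in range(num_partitions)
    }


def single_save_op_conflicts(placement):
    """Simplified check: one op tied by reference edges to all partitions
    can only be colocated with them if they share a single device."""
    return len(set(placement.values())) > 1


# 10 ps tasks, first partition on task 7 (as seen in the log).
ten_ps = round_robin_placement(num_partitions=10, num_ps=10, start_task=7)
for name, device in ten_ps.items():
    print(name, "->", device)
print("conflict with 10 ps:", single_save_op_conflicts(ten_ps))

# 1 ps task: every partition shares the same device.
one_ps = round_robin_placement(num_partitions=10, num_ps=1)
print("conflict with 1 ps:", single_save_op_conflicts(one_ps))
```

Under these assumptions the model reproduces the mapping in the log (part_3 wraps around to task 0) and matches the observation that the job runs with one ps task but not with ten.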


formath commented on June 11, 2024

The cause was tf.train.Saver(sharded=False). Changing it to sharded=True fixed my problem.
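A minimal sketch of that change, for anyone hitting the same error. It assumes a TensorFlow build exposing tf.compat.v1 and uses a stand-in variable instead of the real partitioned embedding weights; build_sharded_saver is a hypothetical helper name. According to the TensorFlow documentation, sharded=True shards the checkpoint with one shard per device, rather than saving everything through a single save op.

```python
def build_sharded_saver():
    """Build a tiny graph and return (graph, saver) with sharding enabled."""
    # Lazy import; assumes a TF build that exposes tf.compat.v1.
    import tensorflow as tf

    tf1 = tf.compat.v1
    graph = tf.Graph()
    with graph.as_default():
        # Stand-in variable; in the real job these would be the
        # partitioned embedding weights spread over the ps tasks.
        tf1.get_variable(
            "demo_weights", shape=[4], initializer=tf1.zeros_initializer()
        )
        # sharded=True: checkpoint is sharded, one shard per device.
        saver = tf1.train.Saver(sharded=True)
    return graph, saver
```

In a real job the Saver would be created inside the same graph and device scope as the model, after all variables have been defined.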
