Comments (3)
Hi @formath.
There are two things I want to confirm.
- It seems that you are using an earlier DeepRec version, older than 2302.
- You have customized the worker device in tf.train.replica_device_setter, which may conflict with EV placement.
Could you please turn on logging of variable placement? And set worker_device to '/job:worker/task:%d' % task_index instead.
- The DeepRec version is still pending confirmation.
- For replica_device_setter, I used the default round-robin strategy.
- I set log_device_placement=True. When the ps num is 10, I got the error and the job exited before the whole placement information was shown. From the limited log, I see that model/input_layer/userid_embedding/embedding_weights/part_0 and its save/SaveV2 are both placed on /job:ps/task:7, which meets expectations. However, model/input_layer/userid_embedding/embedding_weights/part_1 is placed on /job:ps/task:8, but I can't find the log for its save/SaveV2 placement. I guess it is placed on /job:ps/task:7, which violates the colocation condition, so the error occurs. Also, it is not an ev problem, because the error still exists when I disable ev.
2023-09-24 14:54:23.876226: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_0/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_0' is a reference connection and already has a device field set to /job:ps/task:7
2023-09-24 14:54:23.876278: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_1/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_1' is a reference connection and already has a device field set to /job:ps/task:8
2023-09-24 14:54:23.876287: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_2/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_2' is a reference connection and already has a device field set to /job:ps/task:9
2023-09-24 14:54:23.876295: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_3/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_3' is a reference connection and already has a device field set to /job:ps/task:0
2023-09-24 14:54:23.876302: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_4/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_4' is a reference connection and already has a device field set to /job:ps/task:1
2023-09-24 14:54:23.876311: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_5/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_5' is a reference connection and already has a device field set to /job:ps/task:2
2023-09-24 14:54:23.876320: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_6/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_6' is a reference connection and already has a device field set to /job:ps/task:3
2023-09-24 14:54:23.876328: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_7/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_7' is a reference connection and already has a device field set to /job:ps/task:4
2023-09-24 14:54:23.876336: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_8/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_8' is a reference connection and already has a device field set to /job:ps/task:5
2023-09-24 14:54:23.876345: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'model/input_layer/userid_embedding/embedding_weights/part_9/IsInitialized/KvVarIsInitializedOp' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_9' is a reference connection and already has a device field set to /job:ps/task:6
2023-09-24 14:54:23.923634: I tensorflow/core/common_runtime/colocation_graph.cc:241] Ignoring device specification /job:chief/task:0/device:CPU:0 for node 'save/SaveV2' because the input edge from 'model/input_layer/userid_embedding/embedding_weights/part_0' is a reference connection and already has a device field set to /job:ps/task:7
- When I set ps num to 1, the job works fine and the whole placement information is shown.
- Setting the worker device to /job:worker/task:%d gives the same error as /job:worker/task:%d/cpu:0.
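The placement pattern in the log above (part_0 on task 7, part_1 on task 8, part_2 on task 9, then wrapping to task 0) is what a round-robin strategy over 10 ps tasks would produce. The following is a minimal pure-Python sketch of that behavior, not DeepRec or TensorFlow code. The helper names and the starting offset of 7 are assumptions chosen only to reproduce the logged devices; the offset presumably reflects variables created before this embedding.

```python
def round_robin_ps(num_ps, start=0):
    """Yield ps task indices in round-robin order, beginning at `start`."""
    i = start
    while True:
        yield i % num_ps
        i += 1


def place_partitions(prefix, num_parts, num_ps, start):
    """Assign each partition of a variable to the next ps task in turn.

    Hypothetical helper for illustration; `start` is an assumed offset.
    """
    tasks = round_robin_ps(num_ps, start)
    return {
        "%s/part_%d" % (prefix, p): "/job:ps/task:%d" % next(tasks)
        for p in range(num_parts)
    }


PREFIX = "model/input_layer/userid_embedding/embedding_weights"
placement = place_partitions(PREFIX, num_parts=10, num_ps=10, start=7)

for name, device in placement.items():
    print(name, "->", device)
```

With 10 partitions and 10 ps tasks, every partition ends up on a different device, so a single op that takes all partitions as reference inputs cannot be colocated with all of them at once.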
The reason is tf.train.Saver(sharded=False). Changing it to sharded=True fixed my problem.
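As a rough illustration of why this helps, assuming the diagnosis earlier in the thread is right: with sharded=True the Saver shards the checkpoint one per device, so each save op only reads variables from a single ps task. The sketch below is plain Python rather than TensorFlow code, and the function name and the shortened variable names are hypothetical. It only models the grouping, not the actual save ops.

```python
from collections import defaultdict


def group_by_device(placements):
    """Group variable names by the device they are placed on."""
    groups = defaultdict(list)
    for name, device in placements.items():
        groups[device].append(name)
    return dict(groups)


# Devices taken from the placement log; names shortened for readability.
placements = {
    "embedding_weights/part_0": "/job:ps/task:7",
    "embedding_weights/part_1": "/job:ps/task:8",
    "embedding_weights/part_2": "/job:ps/task:9",
}

# Unsharded: one save op would have to read variables from all devices.
unsharded_devices = set(placements.values())

# Sharded: one group (and hence one save op) per device.
sharded_groups = group_by_device(placements)

print("devices seen by a single unsharded save op:", len(unsharded_devices))
for device, names in sharded_groups.items():
    print(device, "->", names)
```

In the unsharded case the single save op spans three devices, which matches the colocation conflict suspected above. In the sharded case each group contains variables from exactly one device.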