Comments (15)
It might be worth using the long-form kubectl kustomize manifests | kubectl apply -f -, for those that may not have an alias k set up for kubectl.
Thanks! That is fixed in the book and I just fixed them in the README file in the repo as well.
invalid spec: templates.tfjob-wf.steps[0].data-ingestion-step template name 'data-ingestion-step' undefined
It seems like the template does not exist. Could you run kubectl apply -f on this file: https://github.com/terrytangyuan/distributed-ml-patterns/blob/main/code/project/manifests/e2e-demo/workflows-templates-tfjob.yaml?
from distributed-ml-patterns.
See https://github.com/terrytangyuan/distributed-ml-patterns/tree/main/code/project#run-workflow
Actually, can you try the latest version on the main branch? That data ingestion template should already exist in https://github.com/terrytangyuan/distributed-ml-patterns/blob/main/code/project/code/workflow.yaml#L26
@terrytangyuan .. thanks for the prompt response. I have been using the latest code as of today from this repo.
Yes, those e2e_demo steps didn't appear anywhere in the book ... and I had overlooked those commands in that README in the project folder. Let me try to get it working.
I also hadn't focused on the README at distributed-ml-patterns/tree/main/code/project/code, which contains some stuff that hasn't made it into the book version (7) I had been following. Maybe that has since been addressed.
Chapter 9 is full of great content, but I sometimes struggle to get the code snippets from the book working. As you explain how each function works, it seems the snippets need to be run as the full Python script (all functions together), like https://github.com/terrytangyuan/distributed-ml-patterns/blob/main/code/project/code/multi-worker-distributed-training.py.
Getting things working in Chapter 9 requires you to have run certain steps in Chapter 8 of the book. If it's not too late, it might be worth recapping those required steps at the beginning of Chapter 9, since some readers might wish to plunge directly into the end-to-end in Chapter 9.
"Getting things working in Chapter 9 requires you to have run certain steps in Chapter 8 of the book. If it's not too late, it might be worth recapping those required steps at the beginning of Chapter 9, since some readers might wish to plunge directly into the end-to-end in Chapter 9."
That's great feedback. Thank you! If you have specific recommendations on which prerequisites are missing for following the code snippets in the last chapter, please let me know.
"Yes, those e2e_demo steps didn't appear anywhere in the book ... and I had overlooked those commands in that README in the project folder. Let me try to get it working."
They should not be part of the book. Please follow the workflow.yaml in the repo for now.
So I ran these, per the README, and they appeared to run OK (see middle job in screenshot below).
kubectl create -f manifests/e2e-demo/workflows-templates-tfjob.yaml
kubectl create -f manifests/e2e-demo/e2e-workflow.yaml
However, each time I try to run kubectl create -f workflow.yaml using this workflow.yaml file, it fails with the error at the top of this thread.
It seems I am missing the various templates referenced here. It somehow expects these workflow-templates to pre-exist, but I can't find them anywhere in the code. Am I missing something?
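For readers hitting the same error: Argo reports a template name as undefined when a step's template field names something that is not declared under spec.templates of the same Workflow (and is not pulled in through a templateRef). The fragment below is a minimal sketch of a self-contained layout, not the repo's actual workflow.yaml. Only the names tfjob-wf and data-ingestion-step come from the error message above; the container image and command are placeholders chosen for illustration.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: tfjob-wf-
spec:
  entrypoint: tfjob-wf
  templates:
    # The steps template: each step refers to another template by name.
    - name: tfjob-wf
      steps:
        - - name: data-ingestion-step
            template: data-ingestion-step
    # The referenced template must be declared in this same list,
    # otherwise validation fails with "template name ... undefined".
    - name: data-ingestion-step
      container:
        image: alpine:3.19                  # placeholder image (assumption)
        command: [echo, "ingesting data"]   # placeholder command (assumption)
```

If the step definitions live in a separate WorkflowTemplate instead, that resource has to be created in the cluster before the Workflow that refers to it.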
Thank you! I just fixed it. Could you try again?
That change permitted the workflow to begin running. It seemed like a small tweak. Do you mind explaining what the fix was?
However, I'm seeing that the multi-worker-training-* pods get stuck in Pending.
If I look into one of them I see the following event: Warning FailedScheduling 5m31s default-scheduler 0/1 nodes are available: 1 persistentvolumeclaim "strategy-volume" not found. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
Maybe this is a namespace issue. My persistentvolume 'strategy-volume' does exist ... but in the default namespace. I think it was generated from here. Perhaps it needs to be in the kubeflow namespace for this to work?
Did you run the following to change the current namespace?
kns kubeflow
Once it has run, any manifests that do not specify a namespace will use the current namespace.
Is kns an alias for something else?
I see it at blendle/kns.
I suppose the kubectl-native way would be kubectl config set-context --current --namespace=kubeflow. I don't do that often.
Ran this:
% kubectl config set-context --current --namespace=kubeflow
Context "k3d-distml" modified.
% kubectl create -f workflow.yaml
workflow.argoproj.io/tfjob-wf-k45fh created
Same issue: Warning FailedScheduling 5m31s default-scheduler 0/1 nodes are available: 1 persistentvolumeclaim "strategy-volume" not found. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
Will have to try to re-create strategy-volume in the kubeflow namespace tomorrow ... and see if that fixes it.
It should be covered in the previous chapter. Instead of switching the current namespace, you can also add -n kubeflow to your kubectl commands to specify the namespace explicitly.
@terrytangyuan ... got things working by recreating the strategy-volume in the kubeflow namespace.
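Some context on why this fixed the scheduling error: a PersistentVolumeClaim is a namespaced resource, so a pod running in kubeflow can only mount a claim that exists in kubeflow. A minimal sketch of a claim pinned to that namespace follows. The claim name is taken from the error message above; the access mode and storage size are assumptions and should be replaced with the values used in the earlier chapter.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: strategy-volume
  namespace: kubeflow      # explicit, so it does not depend on the current context
spec:
  accessModes:
    - ReadWriteOnce        # assumption, not confirmed by the thread
  resources:
    requests:
      storage: 1Gi         # assumption, not confirmed by the thread
```

Setting metadata.namespace in the manifest places the claim in the same namespace as passing -n kubeflow on the kubectl command line would.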
I noticed here that it should be "--model_type", "batch_norm" ... rather than dropout, which is repeated for models 2 and 3.
I'll close out this for now and maybe raise a separate issue ref. other feedback on Chap 9.
Great, thanks!