
Should workflow templates pre-exist prior to running `kubectl create -f workflow.yaml`? about distributed-ml-patterns HOT 15 CLOSED

Analect commented on June 15, 2024

Should workflow templates pre-exist prior to running `kubectl create -f workflow.yaml`?

from distributed-ml-patterns.

Comments (15)

terrytangyuan commented on June 15, 2024

> It might be worth using the long-form `kubectl kustomize manifests | kubectl apply -f -`, for those who may not have an alias `k` set up for `kubectl`.

Thanks! That is fixed in the book and I just fixed them in the README file in the repo as well.

> invalid spec: templates.tfjob-wf.steps[0].data-ingestion-step template name 'data-ingestion-step' undefined

It seems like the template does not exist. Could you run `kubectl apply -f` on this file: https://github.com/terrytangyuan/distributed-ml-patterns/blob/main/code/project/manifests/e2e-demo/workflows-templates-tfjob.yaml?
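
For context, Argo reports this error when a step's `template:` field names a template that is not defined in the same Workflow spec. Templates stored separately in the cluster as a WorkflowTemplate are referenced with `templateRef` instead. The sketch below is a minimal Python illustration of that check, not Argo's validator. The function name and the trimmed-down spec dictionaries are hypothetical, and the actual change made to workflow.yaml is not shown in this thread.

```python
from typing import Dict, List


def undefined_template_refs(workflow: Dict) -> List[str]:
    """Return template names referenced by steps but not defined in the spec.

    Only covers inline `template:` references inside `steps`; it ignores
    `templateRef`, DAG tasks, and everything else a real validator checks.
    """
    templates = workflow["spec"].get("templates", [])
    defined = {t["name"] for t in templates}
    missing: List[str] = []
    for tmpl in templates:
        # `steps` is a list of parallel groups, each group a list of steps.
        for group in tmpl.get("steps", []):
            for step in group:
                ref = step.get("template")
                if ref is not None and ref not in defined and ref not in missing:
                    missing.append(ref)
    return missing


# Hypothetical spec: the entrypoint calls a template that is never defined.
broken = {
    "spec": {
        "entrypoint": "tfjob-wf",
        "templates": [
            {
                "name": "tfjob-wf",
                "steps": [
                    [{"name": "data-ingestion-step", "template": "data-ingestion-step"}]
                ],
            }
        ],
    }
}

# Same spec with the missing template defined inline (image name is made up).
fixed = {
    "spec": {
        "entrypoint": "tfjob-wf",
        "templates": broken["spec"]["templates"]
        + [{"name": "data-ingestion-step", "container": {"image": "example/image"}}],
    }
}

print(undefined_template_refs(broken))
print(undefined_template_refs(fixed))
```

Under this simplified model, a Workflow that uses plain `template:` names needs every one of those templates defined in its own spec, regardless of what WorkflowTemplates already exist in the cluster.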

terrytangyuan commented on June 15, 2024

See https://github.com/terrytangyuan/distributed-ml-patterns/tree/main/code/project#run-workflow

terrytangyuan commented on June 15, 2024

Actually, can you try the latest version in the main branch? That data ingestion template should already exist in https://github.com/terrytangyuan/distributed-ml-patterns/blob/main/code/project/code/workflow.yaml#L26

Analect commented on June 15, 2024

@terrytangyuan .. thanks for the prompt response. I have been using the latest code as of today from this repo.

Yes, those e2e_demo steps didn't appear anywhere in the book ... and I had overlooked those commands in that README in the project folder. Let me try to get it working.

I also hadn't focused on the README at distributed-ml-patterns/tree/main/code/project/code, which contains some stuff that hasn't made it into the book version (7) I had been following. Maybe that has since been addressed.

Chapter 9 is full of great content, but I sometimes struggle to get the code snippets from the book working. As you explain how each function works, it seems they need to be run as the full Python script (all functions together), like https://github.com/terrytangyuan/distributed-ml-patterns/blob/main/code/project/code/multi-worker-distributed-training.py.

Getting things working in Chapter 9 requires you to have run certain steps in Chapter 8 of the book. If it's not too late, it might be worth recapping those required steps at the beginning of Chapter 9, since some readers might wish to plunge directly into the end-to-end in Chapter 9.

terrytangyuan commented on June 15, 2024

> Getting things working in Chapter 9 requires you to have run certain steps in Chapter 8 of the book. If it's not too late, it might be worth recapping those required steps at the beginning of Chapter 9, since some readers might wish to plunge directly into the end-to-end in Chapter 9.

That's great feedback. Thank you! If you have specific recommendations on which prerequisites are missing for following the code snippets in the last chapter, please let me know.

> Yes, those e2e_demo steps didn't appear anywhere in the book ... and I had overlooked those commands in that README in the project folder. Let me try to get it working.

They should not be part of the book. Please follow the workflow.yaml in the repo for now.

Analect commented on June 15, 2024

So I ran these, per the README, and they appeared to run OK (see middle job in screenshot below).

kubectl create -f manifests/e2e-demo/workflows-templates-tfjob.yaml
kubectl create -f manifests/e2e-demo/e2e-workflow.yaml

However, each time I try to run `kubectl create -f workflow.yaml` using this workflow.yaml file, it fails with the error at the top of this thread.

It seems I am missing the various templates referenced here. It somehow expects these workflow-templates to pre-exist, but I can't find them anywhere in the code. Am I missing something?

terrytangyuan commented on June 15, 2024

Thank you! I just fixed it. Could you try again?

Analect commented on June 15, 2024

That change permitted the workflow to begin running. It seemed like a small tweak. Do you mind explaining what the fix was?

However, I'm seeing that the `multi-worker-training-*` pods get stuck in Pending.

If I look into one of them I see the following event: Warning FailedScheduling 5m31s default-scheduler 0/1 nodes are available: 1 persistentvolumeclaim "strategy-volume" not found. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.

Maybe this is a namespace issue. My persistentvolumeclaim `strategy-volume` does exist, but in the default namespace. I think it was generated from here. Perhaps it needs to be in the kubeflow namespace for this to work?
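
A PersistentVolumeClaim is a namespaced object, and a pod can only mount claims from its own namespace, so a claim named strategy-volume in default is not visible to pods scheduled in kubeflow. The following is a minimal Python sketch of that lookup rule. The helper name and the inventory of claims are hypothetical and only model the behavior; nothing here talks to a cluster.

```python
from typing import Iterable, List, Set, Tuple


def unresolved_claims(
    pod_namespace: str,
    claim_names: Iterable[str],
    existing: Set[Tuple[str, str]],
) -> List[str]:
    """Return the claims a pod asks for that do not exist in the pod's namespace.

    `existing` holds (namespace, claim name) pairs. A claim with the right
    name in a different namespace does not count.
    """
    return [name for name in claim_names if (pod_namespace, name) not in existing]


# Assumed state from the thread: the claim exists only in `default`.
claims = {("default", "strategy-volume")}
print(unresolved_claims("kubeflow", ["strategy-volume"], claims))

# After re-creating the claim in `kubeflow`, the lookup succeeds.
claims.add(("kubeflow", "strategy-volume"))
print(unresolved_claims("kubeflow", ["strategy-volume"], claims))
```

The scheduler's "persistentvolumeclaim not found" event corresponds to the first call returning a non-empty list.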

terrytangyuan commented on June 15, 2024

Did you run the following to switch the current namespace?

kns kubeflow

Once it has run, any manifest that doesn't specify a namespace will be created in the current namespace.
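
As a rough illustration of that rule, the Python sketch below models which namespace a `kubectl create -f` call ends up targeting. It is a simplified model under stated assumptions rather than kubectl's actual code: the function and parameter names are hypothetical, and the mismatch case only mirrors the fact that kubectl refuses to proceed when `-n` disagrees with the namespace written in the manifest.

```python
from typing import Optional


def effective_namespace(
    manifest_namespace: Optional[str],
    flag_namespace: Optional[str],
    context_namespace: Optional[str],
) -> str:
    """Simplified model of the namespace targeted by `kubectl create -f`.

    manifest_namespace: metadata.namespace in the YAML, if any
    flag_namespace:     value passed with -n/--namespace, if any
    context_namespace:  namespace stored in the current kubeconfig context
    """
    if manifest_namespace and flag_namespace and manifest_namespace != flag_namespace:
        # kubectl rejects this combination instead of picking one side.
        raise ValueError("namespace in manifest does not match -n/--namespace")
    return manifest_namespace or flag_namespace or context_namespace or "default"


# No namespace anywhere: objects land in `default`.
print(effective_namespace(None, None, None))

# After switching the context namespace to kubeflow.
print(effective_namespace(None, None, "kubeflow"))

# An explicit -n flag wins over the context namespace.
print(effective_namespace(None, "kubeflow", "default"))
```

This is why a claim created before switching namespaces can end up in default while the workflow's pods run in kubeflow.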

Analect commented on June 15, 2024

Is kns an alias for something else?

Analect commented on June 15, 2024

I see it at blendle/kns.

I suppose the kubectl-native way would be `kubectl config set-context --current --namespace=kubeflow`. I don't do that often.

Analect commented on June 15, 2024

Ran this:

% kubectl config set-context --current --namespace=kubeflow
Context "k3d-distml" modified.
% kubectl create -f workflow.yaml
workflow.argoproj.io/tfjob-wf-k45fh created

Same issue: Warning FailedScheduling 5m31s default-scheduler 0/1 nodes are available: 1 persistentvolumeclaim "strategy-volume" not found. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.

I will have to try re-creating strategy-volume in the kubeflow namespace tomorrow and see if that fixes it.

terrytangyuan commented on June 15, 2024

It should be covered in the previous chapter. Instead of switching the current namespace, you can also add `-n kubeflow` to your kubectl commands to specify the namespace explicitly.

Analect commented on June 15, 2024

@terrytangyuan ... got things working by recreating the strategy-volume in kubeflow namespace.

I noticed here that it should be `"--model_type", "batch_norm"` rather than dropout, which is currently repeated for models 2 and 3.

I'll close out this for now and maybe raise a separate issue ref. other feedback on Chap 9.

terrytangyuan commented on June 15, 2024

Great, thanks!
