Comments (15)

terrytangyuan commented on June 15, 2024

It might be worth using the long-form kubectl kustomize manifests | kubectl apply -f -, for those that may not have an alias k set up for kubectl.

Thanks! That is fixed in the book and I just fixed them in the README file in the repo as well.

invalid spec: templates.tfjob-wf.steps[0].data-ingestion-step template name 'data-ingestion-step' undefined

It seems like the template does not exist. Could you run kubectl apply -f on this file: https://github.com/terrytangyuan/distributed-ml-patterns/blob/main/code/project/manifests/e2e-demo/workflows-templates-tfjob.yaml?

from distributed-ml-patterns.

terrytangyuan commented on June 15, 2024

See https://github.com/terrytangyuan/distributed-ml-patterns/tree/main/code/project#run-workflow

terrytangyuan commented on June 15, 2024

Actually, can you try the latest version in the main branch? That data ingestion template should already exist in https://github.com/terrytangyuan/distributed-ml-patterns/blob/main/code/project/code/workflow.yaml#L26

Analect commented on June 15, 2024

@terrytangyuan .. thanks for the prompt response. I have been using the latest code as of today from this repo.

Yes, those e2e_demo steps didn't appear anywhere in the book ... and I had overlooked those commands in that README in the project folder. Let me try to get it working.

I also hadn't focused on the README at distributed-ml-patterns/tree/main/code/project/code, which contains some stuff that hasn't made it into the book version (7) I had been following. Maybe that has since been addressed.

Chapter 9 is full of great content, but I sometimes struggle to get the code snippets from the book working. As you explain how each function works, it seems they need to be run as the full Python script (all functions together), like https://github.com/terrytangyuan/distributed-ml-patterns/blob/main/code/project/code/multi-worker-distributed-training.py.

Getting things working in Chapter 9 requires you to have run certain steps in Chapter 8 of the book. If it's not too late, it might be worth recapping those required steps at the beginning of Chapter 9, since some readers might wish to plunge directly into the end-to-end in Chapter 9.

terrytangyuan commented on June 15, 2024

Getting things working in Chapter 9 requires you to have run certain steps in Chapter 8 of the book. If it's not too late, it might be worth recapping those required steps at the beginning of Chapter 9, since some readers might wish to plunge directly into the end-to-end in Chapter 9.

That's great feedback. Thank you! If you have specific recommendations on which prerequisites are missing to follow the code snippets in the last chapter, please let me know.

Yes, those e2e_demo steps didn't appear anywhere in the book ... and I had overlooked those commands in that README in the project folder. Let me try to get it working.

They should not be part of the book. Please follow the workflow.yaml in the repo for now.

Analect commented on June 15, 2024

So I ran these, per the README, and they appeared to run OK (see middle job in screenshot below).

kubectl create -f manifests/e2e-demo/workflows-templates-tfjob.yaml
kubectl create -f manifests/e2e-demo/e2e-workflow.yaml

image

However, each time I try to run kubectl create -f workflow.yaml using this file workflow.yaml, it fails with the error at the top of this thread.

It seems I am missing the various templates referenced here. It somehow expects these workflow-templates to pre-exist, but I can't find them anywhere in the code. Am I missing something?
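For context, the error at the top of this thread is what Argo reports when a step's `template:` field names a template that is not defined in the same Workflow's `spec.templates` list. Below is a minimal Python sketch of that kind of check. The `example` dict is a hypothetical stand-in reconstructed from the error message, not the contents of the repo's workflow.yaml, and `undefined_templates` is an illustrative helper, not part of Argo. It only looks at `steps` entries with a plain `template` field and ignores `templateRef` and `dag` tasks.

```python
def undefined_templates(workflow):
    """Return (parent template, step name, missing template) for each
    step whose `template` is not defined in spec.templates."""
    templates = workflow["spec"]["templates"]
    defined = {t["name"] for t in templates}
    missing = []
    for parent in templates:
        # `steps` is a list of groups; each group is a list of steps.
        for group in parent.get("steps", []):
            for step in group:
                target = step.get("template")
                if target is not None and target not in defined:
                    missing.append((parent["name"], step["name"], target))
    return missing


# Hypothetical workflow: the entrypoint refers to a template that is
# never defined, mirroring the error message in this thread.
example = {
    "spec": {
        "entrypoint": "tfjob-wf",
        "templates": [
            {
                "name": "tfjob-wf",
                "steps": [
                    [{"name": "data-ingestion-step",
                      "template": "data-ingestion-step"}],
                ],
            },
        ],
    }
}

print(undefined_templates(example))
# [('tfjob-wf', 'data-ingestion-step', 'data-ingestion-step')]
```

Loading a real manifest into a dict (for example with a YAML parser) and passing it to the helper would list every step that points at a missing template before the file is submitted.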

terrytangyuan commented on June 15, 2024

Thank you! I just fixed it. Could you try again?

Analect commented on June 15, 2024

That change permitted the workflow to begin running. It seemed like a small tweak. Do you mind explaining what the fix was?

image

However, I'm seeing that the multi-worker-training-* pods get stuck in pending.

image

If I look into one of them I see the following 'event' - Warning FailedScheduling 5m31s default-scheduler 0/1 nodes are available: 1 persistentvolumeclaim "strategy-volume" not found. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.

Maybe this is a namespace issue. My persistentvolume 'strategy-volume' does exist ... but in the default namespace. I think it was generated from here. Perhaps it needs to be in the kubeflow namespace for this to work?

image

terrytangyuan commented on June 15, 2024

Did you run the following to change the current namespace?

kns kubeflow

Once it's run, all manifests that don't specify a namespace will use the current namespace.

Analect commented on June 15, 2024

Is kns an alias for something else?

Analect commented on June 15, 2024

I see it at blendle/kns.

I suppose the kubectl-native way would be kubectl config set-context --current --namespace=kubeflow. I'm not often doing that.

Analect commented on June 15, 2024

Ran this:

% kubectl config set-context --current --namespace=kubeflow
Context "k3d-distml" modified.
% kubectl create -f workflow.yaml
workflow.argoproj.io/tfjob-wf-k45fh created

Same issue with Warning FailedScheduling 5m31s default-scheduler 0/1 nodes are available: 1 persistentvolumeclaim "strategy-volume" not found. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.

Will have to try to re-create strategy-volume in the kubeflow namespace tomorrow ... and see if that fixes it.
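A PersistentVolumeClaim is a namespaced object, and a pod can only mount claims from its own namespace, which is why a claim in default is reported as "not found" for pods scheduled in kubeflow. Here is a minimal sketch of a claim pinned to the kubeflow namespace, built as a Python dict and printed as JSON (kubectl accepts JSON manifests as well as YAML). Only the claim name and namespace come from this thread; the access mode and storage size are placeholder assumptions, not the values from the book's manifest, and `pvc_manifest` is a hypothetical helper.

```python
import json


def pvc_manifest(name, namespace, storage="1Gi"):
    """Build a PersistentVolumeClaim manifest with an explicit namespace.

    The access mode and default storage size are illustrative
    placeholders, not values taken from the book or the repo.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": storage}},
        },
    }


manifest = pvc_manifest("strategy-volume", "kubeflow")
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and passing it to kubectl apply -f would create the claim in kubeflow regardless of the current context's namespace, because metadata.namespace is set explicitly.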

terrytangyuan commented on June 15, 2024

It should be covered in the previous chapter. Instead of switching the current namespace, you can also add -n kubeflow to your kubectl commands to specify the namespace explicitly.
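As a small illustration of the explicit-namespace approach, here is a sketch that assembles kubectl command lines with `-n` appended, so a script never depends on whatever namespace the current context happens to use. `kubectl_cmd` is a hypothetical helper; it only builds and prints the command and does not run it.

```python
import shlex


def kubectl_cmd(*args, namespace=None):
    """Return a kubectl argv list, adding -n <namespace> when given."""
    cmd = ["kubectl", *args]
    if namespace:
        cmd += ["-n", namespace]
    return cmd


cmd = kubectl_cmd("create", "-f", "workflow.yaml", namespace="kubeflow")
print(shlex.join(cmd))
# kubectl create -f workflow.yaml -n kubeflow
```

The returned list could be handed to subprocess.run to execute the command against a real cluster.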

Analect commented on June 15, 2024

@terrytangyuan ... got things working by recreating the strategy-volume in the kubeflow namespace.

I noticed here, that it should be "--model_type", "batch_norm" ... rather than dropout that is repeated for models 2 and 3.

I'll close out this for now and maybe raise a separate issue ref. other feedback on Chap 9.

terrytangyuan commented on June 15, 2024

Great, thanks!
