terrytangyuan / distributed-ml-patterns
329 stars · 11 watchers · 33 forks · 2.09 MB

Distributed Machine Learning Patterns from Manning Publications by Yuan Tang https://bit.ly/2RKv8Zo

Home Page: https://bit.ly/2RKv8Zo

License: Apache License 2.0

Languages: Python 98.69%, Dockerfile 1.31%
Topics: machine-learning, distributed-systems, cloud-computing, cloud-native, distributed-machine-learning, large-scale-machine-learning, python, tensorflow, kubernetes, argo-workflows

distributed-ml-patterns's Introduction

Distributed Machine Learning Patterns


[image: book front cover]

This repository contains references and code for the book Distributed Machine Learning Patterns from Manning Publications by Yuan Tang.

🔥 Both eBook and physical copies of the book are available!

Manning, Amazon, Barnes & Noble, Powell’s, Bookshop

In Distributed Machine Learning Patterns you will learn how to:

  • Apply patterns to build scalable and reliable machine learning systems.
  • Construct machine learning pipelines with data ingestion, distributed training, model serving, and more.
  • Automate machine learning tasks with Kubernetes, TensorFlow, Kubeflow, and Argo Workflows.
  • Make trade-off decisions between different patterns and approaches.
  • Manage and monitor machine learning workloads at scale.

This book teaches you how to take machine learning models from your personal laptop to large distributed clusters. You’ll explore key concepts and patterns behind successful distributed machine learning systems, and learn technologies like TensorFlow, Kubernetes, Kubeflow, and Argo Workflows directly from a key maintainer and contributor. Real-world scenarios, hands-on projects, and clear, practical DevOps techniques let you easily launch, manage, and monitor cloud-native distributed machine learning pipelines.

About the topic

Scaling up models from personal devices to large distributed clusters is one of the biggest challenges faced by modern machine learning practitioners. Distributed machine learning systems allow developers to handle extremely large datasets across multiple clusters, take advantage of automation tools, and benefit from hardware acceleration. In this book, Yuan Tang shares patterns, techniques, and experience gained from years spent building and managing cutting-edge distributed machine learning infrastructure.

About the book

Distributed Machine Learning Patterns is filled with practical patterns for running machine learning systems on distributed Kubernetes clusters in the cloud. Each pattern is designed to help solve common challenges faced when building distributed machine learning systems, including supporting distributed model training, handling unexpected failures, and dynamic model serving traffic. Real-world scenarios provide clear examples of how to apply each pattern, alongside the potential trade-offs for each approach. Once you’ve mastered these cutting-edge techniques, you’ll put them all into practice and finish up by building a comprehensive distributed machine learning system.

About the reader

For data analysts, data scientists, and software engineers familiar with the basics of machine learning algorithms and running machine learning in production. Readers should be familiar with the basics of Bash, Python, and Docker.

About the author

Yuan is a principal software engineer at Red Hat, working on OpenShift AI. Previously, he led teams building AI infrastructure and platforms at companies including Alibaba and Akuity. He's a project lead of Argo and Kubeflow, a maintainer of TensorFlow and XGBoost, and the author of many popular open source projects. In addition, Yuan has authored three machine learning books and published numerous impactful papers. He's a regular conference speaker, technical advisor, leader, and mentor at various organizations.

Supporting Quotes

"This is a wonderful book for those wanting to understand how to be more effective with Machine Learning at scale, explained clearly and from first principles!"

-- Laurence Moroney, AI Developer Relations Lead at Google

"This book is an exceptionally timely and comprehensive guide to developing, running, and managing machine learning systems in a distributed environment. It covers essential topics such as data partitioning, ingestion, model training, serving, and workflow management. What truly sets this book apart is its discussion of these topics from a pattern perspective, accompanied by real-world examples and widely adopted systems like Kubernetes, Kubeflow, and Argo. I highly recommend it!"

-- Yuan Chen, Principal Software Engineer at Apple

"This book provides a high-level understanding of patterns with practical code examples needed for all MLOps engineering tasks. This is a must-read for anyone in the field."

-- Brian Ray, Global Head of Data Science and Artificial Intelligence at Eviden

"This book weaves together concepts from distributed systems, machine learning, and site reliability engineering in a way that’s approachable for beginners and that’ll excite and inspire experienced practitioners. As soon as I finished reading, I was ready to start building."

-- James Lamb, Staff Data Engineer at SpotHero

"Whatever your role is in the data ecosystem (scientist, analyst, or engineer), if you are looking to take your knowledge and skills to the next level, then this book is for you. This book is an amazing guide to the concepts and state-of-the-art when it comes to designing resilient and scalable, ML systems for both training and serving models. Regardless of what platform you may be working with, this book teaches you the patterns you should be familiar with when trying to scale out your systems."

-- Ryan Russon, Senior Manager of Model Training at Capital One

"AI is the new electricity, and distributed systems is the new power grid. Whether you are a research scientist, engineer, or product developer, you will find the best practices and recipes in this book to scale up your greatest endeavors."

-- Linxi "Jim" Fan, Senior AI Research Scientist at NVIDIA, Stanford PhD

"This book discusses various architectural approaches to tackle common data science problems such as scaling machine learning processes and building robust workflows and pipelines. It serves as an excellent introduction to the world of MLOps for data scientists and ML engineers who want to enhance their knowledge in this field."

-- Rami Krispin, Senior Data Science and Engineering Manager

distributed-ml-patterns's People

Contributors

terrytangyuan


distributed-ml-patterns's Issues

Seeking a job as an AI fintech product manager

Unemployed for 600 days, seeking a job as an AI fintech product manager

  • Finance: I graduated in 2020 from Shanxi University with a degree in finance, earned a "three-good student" award and an academic scholarship, and hold accounting, securities, and fund qualification certificates. My thesis, "An Analysis of the Linkage between Chinese and U.S. Stock Markets," was built with the VAR model from Python's Statsmodel library. I proposed that Baidu's stock was undervalued and could rise above 300 USD, which later came true; I've since noticed that Baidu, positioning itself as the leading AI stock, has relisted in Hong Kong.
  • AI: In my first year I learned about neural network algorithms through a mathematical modeling competition, and later presented slides on the "fish book" (Deep Learning from Scratch) in an internet finance course, where I learned that deep learning is commonly applied to unstructured financial information. I have since read more than twenty books on artificial intelligence. In 2020 I spent four months self-studying the preview edition of Dive into Deep Learning, ported the GAN example from MXNet to PyTorch & TF2 and the DCGAN example from MXNet to PyTorch, and reproduced demos of several mobile deep learning frameworks such as PyTorch and DJL.
  • Product management: Although I have no formal work experience, I have worked to build the relevant skills. Strategy: after failing a first-round interview at Baidu, I came up with the idea of a Baidu developer edition. Communication: I discussed the quietness of the community with MXNet developers. Learning: I studied more than ten books on product management, summarized the responsibilities and skill requirements of an AI product manager, and deployed the result as a book, "To be AI PM," using Docker.

For my full résumé, see https://stevenjokess.github.io/2bPM/get_started.html

Some feedback ref Chap 9

@terrytangyuan, as mentioned, for those looking to run the workflow.yaml end-to-end, there are a few infrastructure-related elements introduced in Chap 8 that might be worth recapping at the beginning of Chap 9, to ensure that someone focusing on that chapter can actually run things. While you capture certain things in the sub-READMEs on the repo, I have found that sometimes these aren't reflected back in the book, when they probably should be.

  • cluster creation using k3d: per Chap 8 listing 8.21, `k3d cluster create distml --image v1.25.3+k3s1`. I think you need the 'rancher' part too for this to work: `k3d cluster create distml --image rancher/k3s:v1.25.3-k3s1`.
  • creation of various kubeflow-related resources per 8.31 and 8.32. If I'm not mistaken, some of the service accounts, roles, and CRDs in distributed-ml-patterns/code/project/manifests/kubeflow-training are presumed to be in place for Chap 9. I would maybe stick to using kubectl and not expect readers to install other shortcut tooling like kns to work with their cluster.
kubectl create ns kubeflow
kubectl config set-context --current --namespace=kubeflow
kubectl kustomize manifests | kubectl apply -f -
  • in listing 9.13, it may be worth explicitly adding the 'kubeflow' namespace, as the workflow won't work otherwise, based on my own experimentation.
kubectl create -n kubeflow -f multi-worker-pvc.yaml
  • what exactly is happening in listing 9.20? Maybe I'm missing something here, but a more explicit explanation of how this is being used for inferencing would be great.
  • For the inference-service in 9.21, this gets replaced by an extended version in 9.28. Should the first one be explicitly deleted before the next one is created? In 9.37, the simpler version (9.21) without scaleTarget and scaleMetric is included as a manifest in the workflow. Is this intentional?
  • section 9.4.2 deals with step memoization. I realise the dataset used for ingestion is a fixed Fashion-MNIST, so it doesn't change. It would be great to see an example where that data is changed/augmented and have that trigger a training/model-selection re-run; maybe the data ingestion could push to a MinIO bucket and Argo Events could then detect changes to the bucket. That's probably beyond scope at this point, though one way to derive a data-change key is sketched after this list.
  • while there is a reference to port-forwarding the argo-workflows service, I think it would benefit from a more explicit write-up of how to log in. I found I ended up having to create my own user with the requisite cluster permissions to get the UI working.
  • I saw some references in the code to an image for training-operator at public.ecr.aws/j1r0q0g6/training/training-operator, but also at https://github.com/orgs/kubeflow/packages/container/package/training%2Ftraining-operator. These are discussed here. It might be worth explicitly mentioning where the canonical source will be, going forward.
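
One way to make the memoization suggestion above concrete would be to derive the step's cache key from a fingerprint of the ingested data, so the memoized step re-runs only when the data actually changes. Below is a minimal, untested sketch of the idea; dataset_fingerprint and the example path are made up for illustration, not from the book:

import hashlib
import pathlib

def dataset_fingerprint(data_dir: str) -> str:
    """Hash file names and contents under data_dir into a short, stable key."""
    digest = hashlib.sha256()
    for path in sorted(pathlib.Path(data_dir).rglob("*")):
        if path.is_file():
            digest.update(path.name.encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()[:16]

# A key like this could be templated into the workflow's memoization cache
# key, so a changed/augmented dataset invalidates the cached ingestion step.
print(dataset_fingerprint("/data/fashion_mnist"))  # hypothetical data path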

Overall, kudos for assembling this book and especially the end-to-end workflow in Chap 9. It offers a great blueprint for handling something 'real-world', which is rare in these sorts of textbooks.

Best, Colum

Question on Argo Workflows UI

Hello, I am trying to follow the examples in Chapter 8. So far it works well except for running the UI locally. When I open the UI in my browser it asks me to log in. I would like to visualize the workflows like the figures in the book, but I am not able to do anything in it. Any suggestions? Thanks.

`batch_norm` fix in `workflow.yaml` results in error

@terrytangyuan ... per the bottom of issue #3, when I try to 'fix' the workflow.yaml to specify --model_type batch_norm, the workflow errors out each time. I have tried multiple times. It will only run through if I revert this to dropout (but then two of the three models are using dropout, which is presumably wrong). I'm including logs from the workflow runs below. Are you experiencing the same with the batch_norm fix?

While it works with `kubectl create -n kubeflow -f workflow.yaml`, I notice that when trying to use the Argo CLI to submit the workflow, it fails with this:

argo submit workflow.yaml
FATA[2023-08-23T16:09:28.397Z] Failed to parse workflow: json: unknown field "successCondition"

This is the section of workflow.yaml that has been modified to fix batch_norm.

# modified `workflow.yaml` with `batch_norm` in #L149
- name: cnn-model-with-batch-norm
  serviceAccountName: training-operator
  resource:
    action: create
    setOwnerReference: true
    # The step succeeds once both workers succeed and fails if any worker fails.
    successCondition: status.replicaStatuses.Worker.succeeded = 2
    failureCondition: status.replicaStatuses.Worker.failed > 0
    manifest: |
      apiVersion: kubeflow.org/v1
      kind: TFJob
      metadata:
        generateName: multi-worker-training-
      spec:
        runPolicy:
          cleanPodPolicy: None
        tfReplicaSpecs:
          Worker:
            replicas: 2
            restartPolicy: Never
            template:
              spec:
                containers:
                  - name: tensorflow
                    image: kubeflow/multi-worker-strategy:v0.1
                    imagePullPolicy: IfNotPresent
                    command: ["python", "/multi-worker-distributed-training.py", "--saved_model_dir", "/trained_model/saved_model_versions/3/", "--checkpoint_dir", "/trained_model/checkpoint", "--model_type", "batch_norm"]
                    volumeMounts:
                      - mountPath: /trained_model
                        name: training
                    resources:
                      limits:
                        cpu: 500m
                volumes:
                  - name: training
                    persistentVolumeClaim:
                      claimName: strategy-volume

These are the logs from the two worker nodes that error out.

kubectl logs multi-worker-training-zckqg-worker-0 -n kubeflow
#returns
Dataset fashion_mnist downloaded and prepared to /root/tensorflow_datasets/fashion_mnist/3.0.1. Subsequent calls will reuse this data.
Training CNN model with batch normalization
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 conv2d (Conv2D)             (None, 26, 26, 32)        320

 batch_normalization (BatchN  (None, 26, 26, 32)       128
 ormalization)

 activation (Activation)     (None, 26, 26, 32)        0

 max_pooling2d (MaxPooling2D  (None, 13, 13, 32)       0
 )

 conv2d_1 (Conv2D)           (None, 11, 11, 64)        18496

 batch_normalization_1 (Batc  (None, 11, 11, 64)       256
 hNormalization)

 activation_1 (Activation)   (None, 11, 11, 64)        0

 max_pooling2d_1 (MaxPooling  (None, 5, 5, 64)         0
 2D)

 conv2d_2 (Conv2D)           (None, 3, 3, 64)          36928

 flatten (Flatten)           (None, 576)               0

 dense (Dense)               (None, 64)                36928

 dense_1 (Dense)             (None, 10)                650

=================================================================
Total params: 93,706
Trainable params: 93,514
Non-trainable params: 192
_________________________________________________________________
2023-08-23 14:39:29.851994: W tensorflow/core/framework/dataset.cc:769] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
70/70 [==============================] - ETA: 0s - loss: 2.3749 - accuracy: 1.2304
Learning rate for epoch 1 is 0.0010000000474974513
70/70 [==============================] - 32s 307ms/step - loss: 1.1875 - accuracy: 0.6152 - lr: 0.0010
2023-08-23 14:40:16.091245: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
2023-08-23 14:40:16.099838: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
WARNING:absl:Found untraced functions such as _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op, _update_step_xla while saving (showing 4 of 4). These functions will not be directly callable after loading.
WARNING:tensorflow:From /usr/local/lib/python3.9/site-packages/tensorflow/python/util/deprecation.py:629: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
WARNING:tensorflow:From /usr/local/lib/python3.9/site-packages/tensorflow/python/util/deprecation.py:629: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
WARNING:absl:Found untraced functions such as _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op, _update_step_xla while saving (showing 4 of 4). These functions will not be directly callable after loading.
Traceback (most recent call last):
  File "/multi-worker-distributed-training.py", line 234, in <module>
    main(parsed_args)
  File "/multi-worker-distributed-training.py", line 205, in main
    tf.saved_model.save(multi_worker_model, model_path, signatures=signatures)
  File "/usr/local/lib/python3.9/site-packages/tensorflow/python/saved_model/save.py", line 1231, in save
    save_and_return_nodes(obj, export_dir, signatures, options)
  File "/usr/local/lib/python3.9/site-packages/tensorflow/python/saved_model/save.py", line 1267, in save_and_return_nodes
    _build_meta_graph(obj, signatures, options, meta_graph_def))
  File "/usr/local/lib/python3.9/site-packages/tensorflow/python/saved_model/save.py", line 1440, in _build_meta_graph
    return _build_meta_graph_impl(obj, signatures, options, meta_graph_def)
  File "/usr/local/lib/python3.9/site-packages/tensorflow/python/saved_model/save.py", line 1395, in _build_meta_graph_impl
    asset_info, exported_graph = _fill_meta_graph_def(
  File "/usr/local/lib/python3.9/site-packages/tensorflow/python/saved_model/save.py", line 793, in _fill_meta_graph_def
    object_map, tensor_map, asset_info = saveable_view.map_resources()
  File "/usr/local/lib/python3.9/site-packages/tensorflow/python/saved_model/save.py", line 400, in map_resources
    tensors = obj._export_to_saved_model_graph(  # pylint: disable=protected-access
  File "/usr/local/lib/python3.9/site-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py", line 2277, in _export_to_saved_model_graph
    raise ValueError(
ValueError: Unable to save function b'__inference_signature_wrapper_146598' for the following reason(s):

ConcreteFunction that uses distributed variables in certain way cannot be saved.
If you're saving with

tf.saved_model.save(..., signatures=f.get_concrete_function())

do

@tf.function(input_signature=...)
def f_with_input_signature():
  ...

tf.saved_model.save(..., signatures=f_with_input_signature)

instead.
kubectl logs multi-worker-training-zckqg-worker-1 -n kubeflow
# returns
Dataset fashion_mnist downloaded and prepared to /root/tensorflow_datasets/fashion_mnist/3.0.1. Subsequent calls will reuse this data.
Training CNN model with batch normalization
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 conv2d (Conv2D)             (None, 26, 26, 32)        320

 batch_normalization (BatchN  (None, 26, 26, 32)       128
 ormalization)

 activation (Activation)     (None, 26, 26, 32)        0

 max_pooling2d (MaxPooling2D  (None, 13, 13, 32)       0
 )

 conv2d_1 (Conv2D)           (None, 11, 11, 64)        18496

 batch_normalization_1 (Batc  (None, 11, 11, 64)       256
 hNormalization)

 activation_1 (Activation)   (None, 11, 11, 64)        0

 max_pooling2d_1 (MaxPooling  (None, 5, 5, 64)         0
 2D)

 conv2d_2 (Conv2D)           (None, 3, 3, 64)          36928

 flatten (Flatten)           (None, 576)               0

 dense (Dense)               (None, 64)                36928

 dense_1 (Dense)             (None, 10)                650

=================================================================
Total params: 93,706
Trainable params: 93,514
Non-trainable params: 192
_________________________________________________________________
2023-08-23 14:39:29.845471: W tensorflow/core/framework/dataset.cc:769] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
70/70 [==============================] - ETA: 0s - loss: 2.3749 - accuracy: 1.2304
Learning rate for epoch 1 is 0.0010000000474974513
70/70 [==============================] - 32s 307ms/step - loss: 1.1875 - accuracy: 0.6152 - lr: 0.0010
2023-08-23 14:40:02.230116: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
2023-08-23 14:40:02.236010: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
WARNING:absl:Found untraced functions such as _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op, _update_step_xla while saving (showing 4 of 4). These functions will not be directly callable after loading.
WARNING:tensorflow:From /usr/local/lib/python3.9/site-packages/tensorflow/python/util/deprecation.py:629: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
WARNING:tensorflow:From /usr/local/lib/python3.9/site-packages/tensorflow/python/util/deprecation.py:629: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
WARNING:absl:Found untraced functions such as _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op, _update_step_xla while saving (showing 4 of 4). These functions will not be directly callable after loading.
Traceback (most recent call last):
  File "/multi-worker-distributed-training.py", line 234, in <module>
    main(parsed_args)
  File "/multi-worker-distributed-training.py", line 205, in main
    tf.saved_model.save(multi_worker_model, model_path, signatures=signatures)
  File "/usr/local/lib/python3.9/site-packages/tensorflow/python/saved_model/save.py", line 1231, in save
    save_and_return_nodes(obj, export_dir, signatures, options)
  File "/usr/local/lib/python3.9/site-packages/tensorflow/python/saved_model/save.py", line 1267, in save_and_return_nodes
    _build_meta_graph(obj, signatures, options, meta_graph_def))
  File "/usr/local/lib/python3.9/site-packages/tensorflow/python/saved_model/save.py", line 1440, in _build_meta_graph
    return _build_meta_graph_impl(obj, signatures, options, meta_graph_def)
  File "/usr/local/lib/python3.9/site-packages/tensorflow/python/saved_model/save.py", line 1395, in _build_meta_graph_impl
    asset_info, exported_graph = _fill_meta_graph_def(
  File "/usr/local/lib/python3.9/site-packages/tensorflow/python/saved_model/save.py", line 793, in _fill_meta_graph_def
    object_map, tensor_map, asset_info = saveable_view.map_resources()
  File "/usr/local/lib/python3.9/site-packages/tensorflow/python/saved_model/save.py", line 400, in map_resources
    tensors = obj._export_to_saved_model_graph(  # pylint: disable=protected-access
  File "/usr/local/lib/python3.9/site-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py", line 2277, in _export_to_saved_model_graph
    raise ValueError(
ValueError: Unable to save function b'__inference_signature_wrapper_146542' for the following reason(s):

ConcreteFunction that uses distributed variables in certain way cannot be saved.
If you're saving with

tf.saved_model.save(..., signatures=f.get_concrete_function())

do

@tf.function(input_signature=...)
def f_with_input_signature():
  ...

tf.saved_model.save(..., signatures=f_with_input_signature)

instead.
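
For what it's worth, the traceback's own suggestion points at a likely fix: trace the serving function with an explicit input signature before calling tf.saved_model.save, rather than passing a get_concrete_function() result. Here is a minimal, untested sketch of what that could look like; save_with_signature, the Fashion-MNIST input shape, and the signature names are assumptions, not the book's actual code:

import tensorflow as tf

def save_with_signature(model: tf.keras.Model, model_path: str) -> None:
    # Trace a single ConcreteFunction with a fixed input signature, as the
    # error message recommends.
    @tf.function(input_signature=[
        tf.TensorSpec(shape=[None, 28, 28, 1], dtype=tf.float32, name="image")
    ])
    def serving_fn(images):
        return {"predictions": model(images, training=False)}

    tf.saved_model.save(
        model, model_path, signatures={"serving_default": serving_fn}
    )

Whether this interacts correctly with MultiWorkerMirroredStrategy in the book's setup would still need testing.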

Should workflow templates pre-exist prior to running `kubectl create -f workflow.yaml`?

@terrytangyuan .. I've been working my way through your book. Thanks for putting together such an informative book. The latest version available on Manning is v7 from May; perhaps there is something more up to date that might explain why I'm hitting the issue described below?


From chapter 8, various CRDs are created with `kubectl kustomize manifests | k apply -f -`. It might be worth using the long form `kubectl kustomize manifests | kubectl apply -f -`, for those who may not have an alias `k` set up for `kubectl`.

I notice that this calls the distributed-ml-patterns/code/project/manifests/kustomization.yaml file, which in turn activates various manifests in the argo-workflows and kubeflow-training folders.

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow

resources:
- argo-workflows/
- kubeflow-training/

It seems that when I try to run `kubectl create -f workflow.yaml` in Chapter 9, it fails (see below). I think it might be due to the absence of the correct workflow templates pre-populated in Argo. Could it be that manifests from the e2e-demo folder should have been included in the kustomization.yaml above, or is something else missing?

Appreciate your input. Thanks.

~ % kubectl get workflows -n kubeflow
NAME             STATUS   AGE   MESSAGE
tfjob-wf-lzwv9   Failed   60m   invalid spec: templates.tfjob-wf.steps[0].data-ingestion-step template name 'data-ingestion-step' undefined
~ % kubectl describe workflows tfjob-wf-lzwv9 -n kubeflow
Name:         tfjob-wf-lzwv9
Namespace:    kubeflow
Labels:       workflows.argoproj.io/completed=true
              workflows.argoproj.io/phase=Failed
Annotations:  <none>
API Version:  argoproj.io/v1alpha1
Kind:         Workflow
Metadata:
  Creation Timestamp:  2023-08-21T14:56:06Z
  Generate Name:       tfjob-wf-
  Generation:          2
  Resource Version:    8141
  UID:                 3a573b29-ef27-4940-9c99-5f2c541850ea
Spec:
  Arguments:
  Entrypoint:  tfjob-wf
  Pod GC:
    Strategy:  OnPodSuccess
  Templates:
    Inputs:
    Metadata:
    Name:  tfjob-wf
    Outputs:
    Steps:
      [map[arguments:map[] name:data-ingestion-step template:data-ingestion-step]]
      [map[arguments:map[] name:distributed-tf-training-steps template:distributed-tf-training-steps]]
      [map[arguments:map[] name:model-selection-step template:model-selection-step]]
      [map[arguments:map[] name:create-model-serving-service template:create-model-serving-service]]
  Volumes:
    Name:  model
    Persistent Volume Claim:
      Claim Name:  strategy-volume
    Name:          data-ingestion-step
    Name:          distributed-tf-training-steps
    Name:          cnn-model
    Name:          cnn-model-with-dropout
    Name:          cnn-model-with-batch-norm
    Name:          model-selection-step
    Name:          create-model-serving-service
Status:
  Conditions:
    Status:     True
    Type:       Completed
  Finished At:  2023-08-21T14:56:06Z
  Message:      invalid spec: templates.tfjob-wf.steps[0].data-ingestion-step template name 'data-ingestion-step' undefined
  Phase:        Failed
  Progress:     0/0
  Started At:   2023-08-21T14:56:06Z
Events:
  Type     Reason          Age   From                 Message
  ----     ------          ----  ----                 -------
  Warning  WorkflowFailed  60m   workflow-controller  invalid spec: templates.tfjob-wf.steps[0].data-ingestion-step template name 'data-ingestion-step' undefined
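
As a quick way to test the hypothesis above without involving the controller, the workflow spec can be linted locally to confirm that every step references a template defined in the same file, which is exactly what the "template name ... undefined" message complains about. A rough sketch, assuming PyYAML is installed and workflow.yaml contains a single Workflow document (check_templates is a made-up helper, not part of the book's code):

import sys
import yaml  # assumes PyYAML is installed

def check_templates(path: str) -> None:
    spec = yaml.safe_load(open(path))["spec"]
    defined = {t["name"] for t in spec.get("templates", [])}
    for template in spec.get("templates", []):
        # Argo steps are a list of parallel groups, each a list of steps.
        for group in template.get("steps", []):
            for step in group:
                if step["template"] not in defined:
                    print(f"step {step['name']!r} references undefined "
                          f"template {step['template']!r}")

check_templates(sys.argv[1] if len(sys.argv) > 1 else "workflow.yaml")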
