This example demonstrates how you can use Kubeflow end-to-end to train and
serve a distributed PyTorch model on an existing Kubernetes cluster. This
tutorial is based on the following projects:
- Distributed data parallel (DDP) training on CPU and GPU in the PyTorch operator example
- Google Codelabs - "Introduction to Kubeflow on Google Kubernetes Engine"
- IBM FfDL - PyTorch MNIST Classifier
There are two primary goals for this tutorial:
- Demonstrate an end-to-end Kubeflow example
- Present an end-to-end PyTorch model
By the end of this tutorial, you should know how to:
- Set up a Kubeflow cluster on an existing Kubernetes deployment
- Provision shared persistent storage across the cluster to store models
- Train a distributed model using PyTorch and GPUs on the cluster
- Serve the model using Seldon Core
- Query the model from a simple front-end application
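Once the model is served, a client queries it with a JSON prediction request. As a rough illustration of what such a request might look like (the `{"data": {"ndarray": ...}}` envelope is the generic Seldon Core request shape; the exact schema your deployment expects may differ), a payload for a single 28x28 MNIST image can be assembled like this:

```python
import json

def build_seldon_payload(image):
    """Wrap a flattened 28x28 grayscale image (a list of 784 floats)
    in a Seldon-style JSON prediction envelope.

    Note: this is a sketch of the generic Seldon Core request format,
    not necessarily the exact schema used by this tutorial's deployment.
    """
    if len(image) != 28 * 28:
        raise ValueError("expected a flattened 28x28 image (784 values)")
    return json.dumps({"data": {"ndarray": [image]}})

# Example: an all-zero "image" as a stand-in for real pixel data.
payload = build_seldon_payload([0.0] * 784)
```

The resulting string would typically be POSTed to the Seldon deployment's prediction endpoint, either directly or through the front-end application.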
The tutorial walks through the following steps:
- Setting up a Kubeflow cluster
- Training the model using PyTorchJob
- Serving the model
- Querying the model
- Teardown
TODO:
- 01_setup_a_kubeflow_cluster
- 02_distributed_training
- 03_serving_the_model
- 04_querying_the_model
- 05_teardown