sayakpaul / ml-deployment-k8s-fastapi

This project shows how to serve an ONNX-optimized image classification model as a web service with FastAPI, Docker, and Kubernetes.

Home Page: https://medium.com/google-developer-experts/load-testing-tensorflow-serving-and-fastapi-on-gke-411bc14d96b2

License: Apache License 2.0

Languages: Jupyter Notebook 83.81%, Python 10.30%, Dockerfile 5.88%
Topics: docker, fastapi, google-cloud-platform, kubernetes, onnx, rest, tensorflow

ml-deployment-k8s-fastapi's Introduction

Deploying ML models with FastAPI, Docker, and Kubernetes

By: Sayak Paul and Chansung Park


Figure developed by Chansung Park

This project shows how to serve an ONNX-optimized image classification model as a RESTful web service with FastAPI, Docker, and Kubernetes (k8s). The idea is to first Dockerize the API and then deploy it on a k8s cluster running on Google Kubernetes Engine (GKE). We do this integration using GitHub Actions.
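
Below is a minimal sketch, not the project's actual api/ code, of how an ONNX-optimized classifier can be exposed through FastAPI with onnxruntime. The file name model.onnx, the 224x224 input size, and the response fields are illustrative assumptions.

import io

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI, File, UploadFile
from PIL import Image

app = FastAPI()

# Hypothetical path to the exported ONNX model; CPU execution matches the CPU-only pods used here.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name


@app.post("/predict/image")
async def predict_image(image_file: UploadFile = File(...)):
    # Decode the uploaded bytes and resize to the (assumed) input size of the model.
    image = Image.open(io.BytesIO(await image_file.read())).convert("RGB").resize((224, 224))
    batch = np.expand_dims(np.asarray(image, dtype=np.float32), axis=0)
    scores = session.run(None, {input_name: batch})[0][0]
    top = int(np.argmax(scores))
    return {"Label": str(top), "Score": f"{float(scores[top]):.3f}"}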

👋 Note: Even though this project uses an image classification model, its structure and techniques can be used to serve other models as well. We also worked on a TF Serving equivalent of this project. Check it out here.

Update (July 19, 2022): This project won the #TFCommunitySpotlight award.

Deploying the model as a service with k8s

  • We decouple the model optimization part from our API code. The optimization part is available within the notebooks/TF_to_ONNX.ipynb notebook (a minimal conversion sketch follows this list).

  • Then we locally test the API. You can find the instructions within the api directory.

  • To deploy the API, we define our deployment.yaml workflow file inside .github/workflows. It does the following tasks:

    • Looks for any changes in the specified directory. If there are any changes:
    • Builds and pushes the latest Docker image to Google Container Registry (GCR).
    • Deploys the Docker container on the k8s cluster running on GKE.
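
As referenced in the first bullet, the TF-to-ONNX conversion lives in notebooks/TF_to_ONNX.ipynb. The snippet below is only a minimal sketch of that kind of conversion using tf2onnx; the concrete model and options in the notebook may differ.

import tensorflow as tf
import tf2onnx

# Hypothetical classifier standing in for the project's image classification model.
model = tf.keras.applications.ResNet50(weights="imagenet")
input_signature = [tf.TensorSpec([None, 224, 224, 3], tf.float32, name="input")]

# Convert the Keras model to ONNX and write it to disk.
tf2onnx.convert.from_keras(
    model, input_signature=input_signature, opset=13, output_path="model.onnx"
)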

Configurations needed beforehand

  • Create a k8s cluster on GKE. Here's a relevant resource. We used 8 nodes (each with 2 vCPUs and 4 GBs of RAM) for the cluster.

  • Create a service account key (JSON) file. It's a good practice to only grant it the roles required for the project. For example, for this project, we created a fresh service account and granted it permissions for the following: Storage Admin, GKE Developer, and GCR Developer.

  • Create a secret named GCP_CREDENTIALS on your GitHub repository and paste the contents of the service account key file into the secret.

  • Configure bucket storage related permissions for the service account:

    $ export PROJECT_ID=<PROJECT_ID>
    $ export ACCOUNT=<ACCOUNT>
    
    $ gcloud -q projects add-iam-policy-binding ${PROJECT_ID} \
        --member=serviceAccount:${ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com \
        --role roles/storage.admin
    
    $ gcloud -q projects add-iam-policy-binding ${PROJECT_ID} \
        --member=serviceAccount:${ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com \
        --role roles/storage.objectAdmin
    
    $ gcloud -q projects add-iam-policy-binding ${PROJECT_ID} \
        --member=serviceAccount:${ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com \
        --role roles/storage.objectCreator
  • If you're already on the main branch, then upon a new push, the workflow defined in .github/workflows/deployment.yaml should run automatically. Here's what the final outputs should look like (run link):

Notes

  • Since we use CPU-based pods within the k8s cluster, we use ONNX optimizations, which are known to provide performance speed-ups in CPU-based environments. If you are using GPU-based pods, then look into TensorRT.

  • We use Kustomize to manage the deployment on k8s.

  • We conducted load tests varying the number of workers, RAM, nodes, etc. From those experiments, we found that for our setup, 8 nodes (each with 2 vCPUs and 4 GBs of RAM) work best in terms of throughput and latency. The figure below summarizes our results:

    You can find the load-testing details under the locust directory; a minimal Locust sketch follows these notes.
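
The following is a minimal Locust sketch rather than the project's actual locust/ script; the endpoint and form field names mirror the cURL example further below and are assumptions.

from locust import HttpUser, between, task


class PredictionUser(HttpUser):
    # Each simulated user waits 1-2 seconds between requests.
    wait_time = between(1, 2)

    @task
    def predict(self):
        # Send an image to the prediction endpoint, as the cURL example does.
        with open("cat.jpg", "rb") as f:
            self.client.post(
                "/predict/image",
                files={"image_file": f},
                data={"with_resize": "True", "with_post_process": "True"},
            )

You could run it with, for example, locust -f locustfile.py --host http://{EXTERNAL-IP} and vary the user count to reproduce this kind of experiment.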

Querying the API endpoint

From the workflow outputs, you should see something like this:

NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
fastapi-server   LoadBalancer   xxxxxxxxxx   xxxxxxxxxx        80:30768/TCP   23m
kubernetes       ClusterIP      xxxxxxxxxx     <none>          443/TCP        160m

Note the EXTERNAL-IP corresponding to fastapi-server (if you have named your service like so). Then cURL it:

curl -X POST -F image_file=@cat.jpg -F with_resize=True -F with_post_process=True http://{EXTERNAL-IP}:80/predict/image

You should get the following output (if you're using the cat.jpg image present in the api directory):

"{\"Label\": \"tabby\", \"Score\": \"0.538\"}"

The request assumes that you have a file called cat.jpg present in your working directory.
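
If you prefer Python over cURL, the following sketch sends the same request with the requests library; replace the placeholder IP with your service's EXTERNAL-IP.

import requests

EXTERNAL_IP = "xxxxxxxxxx"  # EXTERNAL-IP of the fastapi-server service.

with open("cat.jpg", "rb") as f:
    response = requests.post(
        f"http://{EXTERNAL_IP}:80/predict/image",
        files={"image_file": f},
        data={"with_resize": "True", "with_post_process": "True"},
    )

print(response.json())  # Expected: something like {"Label": "tabby", "Score": "0.538"}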

Note that if you don't see an external IP address in your GitHub Actions console log, then after a successful deployment, do the following:

# Authenticate to your GKE cluster.
$ gcloud container clusters get-credentials ${GKE_CLUSTER} --zone ${GKE_ZONE} --project ${GCP_PROJECT_ID}
$ kubectl get services -o wide

From there, note the external IP.
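
As an optional alternative to kubectl, the sketch below reads the external IP with the official kubernetes Python client; it assumes the service is named fastapi-server and lives in the default namespace.

from kubernetes import client, config

# Reuses the credentials fetched by `gcloud container clusters get-credentials`.
config.load_kube_config()
v1 = client.CoreV1Api()

service = v1.read_namespaced_service(name="fastapi-server", namespace="default")
ingress = service.status.load_balancer.ingress
print(ingress[0].ip if ingress else "External IP not provisioned yet.")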

Acknowledgements

ml-deployment-k8s-fastapi's People

Contributors

deep-diver · sayakpaul


ml-deployment-k8s-fastapi's Issues

Setup TF Serving based deployment

In this new feature, the following work is expected:

  • Create a new notebook with the TF Serving prototype based on both gRPC (Ref) and REST API (Ref); see the sketch after this list.

  • Update the newly created notebook to check %%timeit on the TF Serving server locally.

  • Build/commit a Docker image based on the TF Serving base image using this method.

  • Deploy the built Docker image on the GKE cluster.

  • Check the deployed model's performance under various scenarios (maybe the same ones applied to the ONNX+FastAPI setup).
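
For reference, a REST-style request of the kind mentioned in the first item could look like the sketch below; it is not from the repository and assumes a local TF Serving instance exposing a model named "classifier" on port 8501.

import json

import numpy as np
import requests

batch = np.zeros((1, 224, 224, 3), dtype=np.float32)  # Placeholder input tensor.
payload = json.dumps({"instances": batch.tolist()})

response = requests.post("http://localhost:8501/v1/models/classifier:predict", data=payload)
print(response.json())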

Inquiry Regarding Scalability Best Practices for FastAPI ML Model Deployment on Kubernetes

Dear Sayak, Chansung, and Contributors,

First and foremost, I would like to extend my gratitude for the comprehensive guide on deploying machine learning models with FastAPI, Docker, and Kubernetes. The repository serves as an invaluable resource for practitioners aiming to operationalise their machine learning workflows in a cloud-native environment.

Upon perusing your documentation and workflow configurations, I have gathered substantial insights into the deployment process. However, I am particularly interested in understanding the scalability aspects of the deployment strategy in greater detail. As we are aware, machine learning workloads can be quite erratic in terms of resource consumption, and the ability to scale efficiently is paramount to maintaining performance and cost-effectiveness.

I am keen to learn about the following:

  1. Auto-Scaling Practices: Could you elucidate on the auto-scaling strategies that one might employ with the current setup? Specifically, I am curious about the implementation of Horizontal Pod Autoscaling (HPA) and whether there are any recommended thresholds or metrics that we should monitor to trigger scaling events.

  2. Load Balancing Considerations: With the deployment leveraging a LoadBalancer service type, how does the current configuration ensure even distribution of traffic amongst the pods, especially during a scaling event? Are there any particular load balancing algorithms or configurations that you would recommend?

  3. Resource Quotas and Limits: In the context of Kubernetes, setting appropriate resource quotas and limits is crucial to prevent any single service from monopolising cluster resources. Could you provide guidance on setting these parameters in a way that balances resource utilisation and availability, particularly for machine learning inference services that may have variable resource demands?

  4. Node Pool Management: The deployment utilises a cluster with a fixed number of nodes. In a production scenario, how would you approach the management of node pools to accommodate the scaling of pods? Is there a strategy in place to scale the node pool itself, and if so, what are the considerations for such a strategy?

  5. Cost Management: Lastly, could you share any insights on managing costs associated with running such a deployment on GKE? Are there any best practices or tools that you would recommend for monitoring and optimising the costs of the compute resources utilised by the Kubernetes cluster?

I believe that addressing these queries would greatly benefit the community, providing a deeper understanding of how to manage and scale machine learning deployments effectively in Kubernetes.

Thank you for your time and consideration. I eagerly await your response and any further discussions this might engender.

Best regards,
yihong1120
