sayakpaul / ml-deployment-k8s-fastapi

This project shows how to serve an ONNX-optimized image classification model as a web service with FastAPI, Docker, and Kubernetes.

Home Page: https://medium.com/google-developer-experts/load-testing-tensorflow-serving-and-fastapi-on-gke-411bc14d96b2

License: Apache License 2.0

Languages: Jupyter Notebook 83.81%, Python 10.30%, Dockerfile 5.88%
Topics: docker, fastapi, google-cloud-platform, kubernetes, onnx, rest, tensorflow

ml-deployment-k8s-fastapi's Introduction

Deploying ML models with FastAPI, Docker, and Kubernetes

By: Sayak Paul and Chansung Park


Figure developed by Chansung Park

This project shows how to serve an ONNX-optimized image classification model as a RESTful web service with FastAPI, Docker, and Kubernetes (k8s). The idea is to first Dockerize the API and then deploy it on a k8s cluster running on Google Kubernetes Engine (GKE). We do this integration using GitHub Actions.
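
Below is a minimal sketch, not the project's actual api/ code, of how an ONNX-optimized classifier can be exposed through FastAPI with onnxruntime. The file name model.onnx, the 224x224 input size, and the response fields are illustrative assumptions.

import io

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI, File, UploadFile
from PIL import Image

app = FastAPI()

# Hypothetical path to the exported ONNX model; CPU execution matches the CPU-only pods used here.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name


@app.post("/predict/image")
async def predict_image(image_file: UploadFile = File(...)):
    # Decode the uploaded bytes and resize to the (assumed) input size of the model.
    image = Image.open(io.BytesIO(await image_file.read())).convert("RGB").resize((224, 224))
    batch = np.expand_dims(np.asarray(image, dtype=np.float32), axis=0)
    scores = session.run(None, {input_name: batch})[0][0]
    top = int(np.argmax(scores))
    return {"Label": str(top), "Score": f"{float(scores[top]):.3f}"}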

👋 Note: Even though this project uses an image classification model, its structure and techniques can be used to serve other models as well. We also worked on a TF Serving equivalent of this project. Check it out here.

Update (July 19, 2022): This project won the #TFCommunitySpotlight award.

Deploying the model as a service with k8s

  • We decouple the model optimization part from our API code. The optimization part is available within the notebooks/TF_to_ONNX.ipynb notebook (a minimal conversion sketch follows this list).

  • Then we locally test the API. You can find the instructions within the api directory.

  • To deploy the API, we define our deployment.yaml workflow file inside .github/workflows. It does the following tasks:

    • Looks for any changes in the specified directory. If there are any changes:
    • Builds and pushes the latest Docker image to Google Container Registry (GCR).
    • Deploys the Docker container on the k8s cluster running on GKE.
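
As referenced in the first bullet, the TF-to-ONNX conversion lives in notebooks/TF_to_ONNX.ipynb. The snippet below is only a minimal sketch of that kind of conversion using tf2onnx; the concrete model and options in the notebook may differ.

import tensorflow as tf
import tf2onnx

# Hypothetical classifier standing in for the project's image classification model.
model = tf.keras.applications.ResNet50(weights="imagenet")
input_signature = [tf.TensorSpec([None, 224, 224, 3], tf.float32, name="input")]

# Convert the Keras model to ONNX and write it to disk.
tf2onnx.convert.from_keras(
    model, input_signature=input_signature, opset=13, output_path="model.onnx"
)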

Configurations needed beforehand

  • Create a k8s cluster on GKE. Here's a relevant resource. We used 8 nodes (each with 2 vCPUs and 4 GBs of RAM) for the cluster.

  • Create a service account key (JSON) file. It's a good practice to only grant it the roles required for the project. For example, for this project, we created a fresh service account and granted it permissions for the following: Storage Admin, GKE Developer, and GCR Developer.

  • Create a secret named GCP_CREDENTIALS on your GitHub repository and paste the contents of the service account key file into the secret.

  • Configure bucket storage related permissions for the service account:

    $ export PROJECT_ID=<PROJECT_ID>
    $ export ACCOUNT=<ACCOUNT>
    
    $ gcloud -q projects add-iam-policy-binding ${PROJECT_ID} \
        --member=serviceAccount:${ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com \
        --role roles/storage.admin
    
    $ gcloud -q projects add-iam-policy-binding ${PROJECT_ID} \
        --member=serviceAccount:${ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com \
        --role roles/storage.objectAdmin
    
    $ gcloud -q projects add-iam-policy-binding ${PROJECT_ID} \
        --member=serviceAccount:${ACCOUNT}@${PROJECT_ID}.iam.gserviceaccount.com \
        --role roles/storage.objectCreator
  • If you're already on the main branch, then upon a new push, the workflow defined in .github/workflows/deployment.yaml should run automatically. Here's what the final outputs should look like (run link):

Notes

  • Since we use CPU-based pods within the k8s cluster, we use ONNX optimizations, which are known to provide performance speed-ups in CPU-based environments. If you are using GPU-based pods, then look into TensorRT.

  • We use Kustomize to manage the deployment on k8s.

  • We conducted load tests varying the number of workers, RAM, nodes, etc. From those experiments, we found that for our setup, 8 nodes (each with 2 vCPUs and 4 GBs of RAM) work best in terms of throughput and latency. The figure below summarizes our results:

    You can find the load-testing details under the locust directory; a minimal Locust sketch follows these notes.
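
The following is a minimal Locust sketch rather than the project's actual locust/ script; the endpoint and form field names mirror the cURL example further below and are assumptions.

from locust import HttpUser, between, task


class PredictionUser(HttpUser):
    # Each simulated user waits 1-2 seconds between requests.
    wait_time = between(1, 2)

    @task
    def predict(self):
        # Send an image to the prediction endpoint, as the cURL example does.
        with open("cat.jpg", "rb") as f:
            self.client.post(
                "/predict/image",
                files={"image_file": f},
                data={"with_resize": "True", "with_post_process": "True"},
            )

You could run it with, for example, locust -f locustfile.py --host http://{EXTERNAL-IP} and vary the user count to reproduce this kind of experiment.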

Querying the API endpoint

From the workflow outputs, you should see something like this:

NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
fastapi-server   LoadBalancer   xxxxxxxxxx   xxxxxxxxxx        80:30768/TCP   23m
kubernetes       ClusterIP      xxxxxxxxxx     <none>          443/TCP        160m

Note the EXTERNAL-IP corresponding to fastapi-server (if you have named your service like so). Then cURL it:

curl -X POST -F image_file=@cat.jpg -F with_resize=True -F with_post_process=True http://{EXTERNAL-IP}:80/predict/image

You should get the following output (if you're using the cat.jpg image present in the api directory):

"{\"Label\": \"tabby\", \"Score\": \"0.538\"}"

The request assumes that you have a file called cat.jpg present in your working directory.
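
If you prefer Python over cURL, the following sketch sends the same request with the requests library; replace the placeholder IP with your service's EXTERNAL-IP.

import requests

EXTERNAL_IP = "xxxxxxxxxx"  # EXTERNAL-IP of the fastapi-server service.

with open("cat.jpg", "rb") as f:
    response = requests.post(
        f"http://{EXTERNAL_IP}:80/predict/image",
        files={"image_file": f},
        data={"with_resize": "True", "with_post_process": "True"},
    )

print(response.json())  # Expected: something like {"Label": "tabby", "Score": "0.538"}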

Note that if you don't see an external IP address in your GitHub Actions console log, then after a successful deployment, do the following:

# Authenticate to your GKE cluster.
$ gcloud container clusters get-credentials ${GKE_CLUSTER} --zone ${GKE_ZONE} --project ${GCP_PROJECT_ID}
$ kubectl get services -o wide

From there, note the external IP.
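
As an optional alternative to kubectl, the sketch below reads the external IP with the official kubernetes Python client; it assumes the service is named fastapi-server and lives in the default namespace.

from kubernetes import client, config

# Reuses the credentials fetched by `gcloud container clusters get-credentials`.
config.load_kube_config()
v1 = client.CoreV1Api()

service = v1.read_namespaced_service(name="fastapi-server", namespace="default")
ingress = service.status.load_balancer.ingress
print(ingress[0].ip if ingress else "External IP not provisioned yet.")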

Acknowledgements

ml-deployment-k8s-fastapi's People

Contributors

deep-diver · sayakpaul


ml-deployment-k8s-fastapi's Issues

Setup TF Serving based deployment

In this new feature, the following work is expected:

  • Create a new notebook with the TF Serving prototype based on both gRPC (Ref) and REST API (Ref); see the sketch after this list.

  • Update the newly created notebook to check %%timeit on the TF Serving server locally.

  • Build/commit a Docker image based on the TF Serving base image using this method.

  • Deploy the built Docker image on the GKE cluster.

  • Check the deployed model's performance under various scenarios (maybe the same ones applied to the ONNX+FastAPI setup).
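
For reference, a REST-style request of the kind mentioned in the first item could look like the sketch below; it is not from the repository and assumes a local TF Serving instance exposing a model named "classifier" on port 8501.

import json

import numpy as np
import requests

batch = np.zeros((1, 224, 224, 3), dtype=np.float32)  # Placeholder input tensor.
payload = json.dumps({"instances": batch.tolist()})

response = requests.post("http://localhost:8501/v1/models/classifier:predict", data=payload)
print(response.json())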

Inquiry Regarding Scalability Best Practices for FastAPI ML Model Deployment on Kubernetes

Dear Sayak, Chansung, and Contributors,

First and foremost, I would like to extend my gratitude for the comprehensive guide on deploying machine learning models with FastAPI, Docker, and Kubernetes. The repository serves as an invaluable resource for practitioners aiming to operationalise their machine learning workflows in a cloud-native environment.

Upon perusing your documentation and workflow configurations, I have gathered substantial insights into the deployment process. However, I am particularly interested in understanding the scalability aspects of the deployment strategy in greater detail. As we are aware, machine learning workloads can be quite erratic in terms of resource consumption, and the ability to scale efficiently is paramount to maintaining performance and cost-effectiveness.

I am keen to learn about the following:

  1. Auto-Scaling Practices: Could you elucidate on the auto-scaling strategies that one might employ with the current setup? Specifically, I am curious about the implementation of Horizontal Pod Autoscaling (HPA) and whether there are any recommended thresholds or metrics that we should monitor to trigger scaling events.

  2. Load Balancing Considerations: With the deployment leveraging a LoadBalancer service type, how does the current configuration ensure even distribution of traffic amongst the pods, especially during a scaling event? Are there any particular load balancing algorithms or configurations that you would recommend?

  3. Resource Quotas and Limits: In the context of Kubernetes, setting appropriate resource quotas and limits is crucial to prevent any single service from monopolising cluster resources. Could you provide guidance on setting these parameters in a way that balances resource utilisation and availability, particularly for machine learning inference services that may have variable resource demands?

  4. Node Pool Management: The deployment utilises a cluster with a fixed number of nodes. In a production scenario, how would you approach the management of node pools to accommodate the scaling of pods? Is there a strategy in place to scale the node pool itself, and if so, what are the considerations for such a strategy?

  5. Cost Management: Lastly, could you share any insights on managing costs associated with running such a deployment on GKE? Are there any best practices or tools that you would recommend for monitoring and optimising the costs of the compute resources utilised by the Kubernetes cluster?

I believe that addressing these queries would greatly benefit the community, providing a deeper understanding of how to manage and scale machine learning deployments effectively in Kubernetes.

Thank you for your time and consideration. I eagerly await your response and any further discussions this might engender.

Best regards,
yihong1120
