jeremyjordan / ml-monitoring

A demo of Prometheus+Grafana for monitoring an ML model served with FastAPI.

Home Page: https://www.jeremyjordan.me/ml-monitoring/

License: MIT License

Languages: Python 95.01%, Dockerfile 4.99%

ml-monitoring's Introduction

ml-monitoring

Jeremy Jordan

This repository provides an example setup for monitoring an ML system deployed on Kubernetes.

Blog post: https://www.jeremyjordan.me/ml-monitoring/

Components:

  • ML model served via FastAPI
  • Export server metrics via prometheus-fastapi-instrumentator (a minimal sketch follows this list)
  • Simulate production traffic via locust
  • Monitor and store metrics via Prometheus
  • Visualize metrics via Grafana
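
For concreteness, here is a minimal sketch of how the model server and the metrics exporter fit together. The request schema and prediction logic are illustrative assumptions; the repository's actual application lives in model/.

from fastapi import FastAPI
from pydantic import BaseModel
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()

# Instrument all routes and expose Prometheus metrics at /metrics.
Instrumentator().instrument(app).expose(app)

class WineFeatures(BaseModel):
    # Hypothetical request schema; the real app defines its own fields.
    fixed_acidity: float
    volatile_acidity: float
    alcohol: float

@app.post("/predict")
def predict(features: WineFeatures):
    # Model inference would happen here; a constant stands in for it.
    return {"quality": 5.0}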

Setup

  1. Ensure you can connect to a Kubernetes cluster and have kubectl and helm installed.
    • You can easily spin up a Kubernetes cluster on your local machine using minikube.
minikube start --driver=docker --memory 4g --nodes 2
  2. Deploy Prometheus and Grafana onto the cluster using the community Helm chart.
kubectl create namespace monitoring
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-stack prometheus-community/kube-prometheus-stack -n monitoring
  3. Verify the resources were deployed successfully.
kubectl get all -n monitoring
  4. Connect to the Grafana dashboard.
kubectl port-forward svc/prometheus-stack-grafana 8000:80 -n monitoring
  • Go to http://127.0.0.1:8000/
  • Log in with the credentials:
    • Username: admin
    • Password: prom-operator
    • (This password can be configured in the Helm chart values.yaml file)
  5. Import the model dashboard.
    • On the left sidebar, click the "+" and select "Import".
    • Copy and paste the JSON defined in dashboards/model.json into the text area.

Deploy a model

This repository includes an example REST service which exposes an ML model trained on the UCI Wine Quality dataset.

You can launch the service on Kubernetes by running:

kubectl apply -f kubernetes/models/

You can also build and run the Docker container locally.

docker build -t wine-quality-model -f model/Dockerfile model/
docker run -d -p 3000:80 -e ENABLE_METRICS=true wine-quality-model
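
With the container running locally, you can send a test request. The exact feature names are defined by the model's request schema, so the payload below is illustrative:

curl -X POST http://localhost:3000/predict \
  -H "Content-Type: application/json" \
  -d '{"fixed_acidity": 7.4, "volatile_acidity": 0.7, "alcohol": 9.4}'

You can also confirm that metrics are being exported:

curl http://localhost:3000/metrics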

Note: In order for Prometheus to scrape metrics from this service, we need to define a ServiceMonitor resource. This resource must have the label release: prometheus-stack in order to be discovered. This is configured in the Prometheus resource spec via the serviceMonitorSelector attribute.

You can verify the required label by running:

kubectl get prometheuses.monitoring.coreos.com prometheus-stack-kube-prom-prometheus -n monitoring -o yaml
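
For reference, a minimal ServiceMonitor of the required shape is sketched below; the names here are illustrative, and the actual manifest lives in kubernetes/models/.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: wine-quality-model
  labels:
    release: prometheus-stack  # required so Prometheus discovers this monitor
spec:
  selector:
    matchLabels:
      app: wine-quality-model  # must match the model Service's labels
  endpoints:
    - port: http  # named port on the Service that serves /metrics
      path: /metrics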

Simulate production traffic

We can simulate production traffic using a Python load testing tool called locust. This will make HTTP requests to our model server and provide us with data to view in the monitoring dashboard.

You can begin the load test by running:

kubectl apply -f kubernetes/load_tests/

By default, production traffic will be simulated for a duration of 5 minutes. This can be changed by updating the image arguments in the kubernetes/load_tests/locust_master.yaml manifest.
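
For reference, the traffic generator boils down to a locustfile along the lines of the sketch below; the endpoint and payload are assumptions, and the real locustfile lives in load_test/.

from locust import HttpUser, task, between

class ModelUser(HttpUser):
    # Each simulated user waits 1-2 seconds between requests.
    wait_time = between(1, 2)

    @task
    def predict(self):
        # Hypothetical payload; the real locustfile defines the actual schema.
        self.client.post("/predict", json={"fixed_acidity": 7.4, "volatile_acidity": 0.7})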

You can also modify the community Helm chart instead of using the manifests defined in this repo.

Uploading new images

This process could eventually be automated with a GitHub Action, but remains manual for now.
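
Such a workflow might look like the sketch below; the trigger, tag convention, and action versions are assumptions.

name: publish-images
on:
  push:
    tags: ["model-v*"]  # hypothetical tag convention
jobs:
  build-and-push:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - run: |
          TAG=${GITHUB_REF_NAME#model-v}
          docker build -t ghcr.io/jeremyjordan/wine-quality-model:$TAG -f model/Dockerfile model/
          docker push ghcr.io/jeremyjordan/wine-quality-model:$TAG

Until then, the manual process is: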

  1. Obtain a personal access token to connect with the GitHub Container Registry.
echo "INSERT_TOKEN_HERE" >> ~/.github/cr_token
  2. Authenticate with the GitHub Container Registry.
cat ~/.github/cr_token | docker login ghcr.io -u jeremyjordan --password-stdin
  3. Build and tag new Docker images.
MODEL_TAG=0.3
docker build -t wine-quality-model:$MODEL_TAG -f model/Dockerfile model/
docker tag wine-quality-model:$MODEL_TAG ghcr.io/jeremyjordan/wine-quality-model:$MODEL_TAG
LOAD_TAG=0.2
docker build -t locust-load-test:$LOAD_TAG -f load_test/Dockerfile load_test/
docker tag locust-load-test:$LOAD_TAG ghcr.io/jeremyjordan/locust-load-test:$LOAD_TAG
  4. Push Docker images to the container registry.
docker push ghcr.io/jeremyjordan/wine-quality-model:$MODEL_TAG
docker push ghcr.io/jeremyjordan/locust-load-test:$LOAD_TAG
  5. Update Kubernetes manifests to use the new image tag.
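
The update itself is a one-line change to the image field in the relevant manifest, e.g. (hypothetical excerpt from a deployment spec in kubernetes/models/):

    image: ghcr.io/jeremyjordan/wine-quality-model:0.3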

Teardown instructions

To stop the model REST server, run:

kubectl delete -f kubernetes/models/

To stop the load tests, run:

kubectl delete -f kubernetes/load_tests/

To remove the Prometheus stack, run:

helm uninstall prometheus-stack -n monitoring

ml-monitoring's People

Contributors

bateman, dependabot[bot], jeremyjordan, rajgupt


ml-monitoring's Issues

Latency & Counter Metrics Not Detected By Prometheus

Hello @jeremyjordan,

I've been following your FastAPI ml-monitoring repository as a template for my own project, and it's been super helpful! Thanks so much for setting this up. Unfortunately, I'm having a lot of trouble getting Prometheus to scrape my Counter metric and latency. Interestingly, when I run your wine-quality application and add a Counter metric, it works fine; but my application, which follows nearly the same approach (the only difference being that I set it up using the application factory design pattern), doesn't seem to work. Histogram and Summary metrics do seem to be going through, though.

Do you have any insight as to what the issue could be? Would really appreciate your guidance as I've been trying to figure this out for 3 days.

Here is my monitoring.py file: https://github.com/rileyhun/fastapi-ml-example/blob/main/app/core/monitoring.py

Reproducible example:

git clone https://github.com/rileyhun/fastapi-ml-example.git

docker build -t ${IMAGE_NAME}:${IMAGE_TAG} -f Dockerfile .
docker tag ${IMAGE_NAME}:${IMAGE_TAG} rhun/${IMAGE_NAME}:${IMAGE_TAG}
docker push rhun/${IMAGE_NAME}:${IMAGE_TAG}

minikube start --driver=docker --memory 4g --nodes 2
kubectl create namespace monitoring
helm install prometheus-stack prometheus-community/kube-prometheus-stack -n monitoring

kubectl apply -f deployment/wine-model-local.yaml
kubectl port-forward svc/wine-model-service 8080:80

python api_call.py

Adding the Prometheus instrumentation package results in some requests taking a long time

Hello again @jeremyjordan,

We are trying to decrease the latency of our BERT model prediction service, which is deployed using FastAPI. Predictions are served through the /predict endpoint. We looked into the tracing and found that one of the bottlenecks is prometheus-fastapi-instrumentator: about 1% of requests time out because they exceed 10s.

We also discovered that some metrics are not reported at 4 requests/second. Some requests took 30-50 seconds, with Starlette/FastAPI internals accounting for most of that time. It seems that under high load the /metrics endpoint doesn't get enough resources, so all /metrics requests wait for some time and eventually fail. Having a separate container for metrics could help, or, if possible, delaying/pausing metrics collection under high load. Any insight/guidance would be much appreciated.

[Three screenshots of request traces attached]

How to Monitor NLP Models?

Hi Jeremy,

I'm following your template for a POC, and it's been very helpful. I'm creating a REST API for an NLP model (Multinomial Naive Bayes), and I'm not sure how to monitor this particular model, since its predictions are classes rather than float values like those of the wine quality model. How would the Prometheus instrumentation be used to capture metrics for classification models?

Thanks,

Riley
