spotahome / service-level-operator
Manage application's SLI and SLO's easily with the application lifecycle inside a Kubernetes cluster
License: Apache License 2.0
Hi,
I noticed that when you register an SLO for the first time and there are no metrics in Prometheus for it, or the system stops sending metrics, the graph shows the SLO as 100% compliant.
Is this by design?
I just want to make sure before I start to rely on it too much.
Thanks!
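One way a result like this can arise, sketched under assumptions rather than taken from the operator's code: if the error ratio is derived from two query results and an empty result is treated as zero, then "no data" and "no errors" become indistinguishable. The function names and the use of None for "query returned no samples" are hypothetical.

```python
from typing import Optional


def naive_compliance_percent(total: Optional[float], errors: Optional[float]) -> float:
    """Treat missing data as zero, which makes a silent service look perfect."""
    total = total or 0.0
    errors = errors or 0.0
    error_ratio = errors / total if total > 0 else 0.0
    return 100.0 * (1.0 - error_ratio)


def compliance_percent(total: Optional[float], errors: Optional[float]) -> Optional[float]:
    """Return availability in percent, or None when there is nothing to measure.

    None stands for "the query returned no samples" (a convention made up
    for this sketch, not the operator's API).
    """
    if total is None or total == 0:
        # No traffic observed: report "unknown" instead of 100%.
        return None
    errors = errors or 0.0
    return 100.0 * (1.0 - errors / total)


if __name__ == "__main__":
    print(naive_compliance_percent(None, None))  # 100.0
    print(compliance_percent(None, None))        # None
```

Whether "unknown" or "100%" is the right answer for a silent service is a design decision; the sketch only shows that the two cases can be told apart.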
Hi, it would be very nice to be able to configure the input backend in the operator instead of in the CRD. IMO, having the address in the CRD means every service that defines an SLO has to know the Prometheus (or any other input) location.
Otherwise, awesome work! I really like this operator :)
The alerts based on burn rate thresholds can be made easier if the operator exposes metrics based on the CRD thresholds.
My idea at this moment is having something like this on the CRD:
apiVersion: measure.slok.xyz/v1alpha1
kind: ServiceLevel
metadata:
  name: awesome-service
spec:
  serviceLevelObjectives:
    # A typical 5xx request SLO.
    - name: "9999_http_request_lt_500"
      description: 99.99% of requests must be served with <500 status code.
      disable: false
      availabilityObjectivePercent: 99.99
      burnRates:
        - errorBudgetDays: 30
          thresholds:
            - timeRangeHours: 1
              errorBudgetPercent: 2
            - timeRangeHours: 6
              errorBudgetPercent: 5
            - timeRangeHours: 72
              errorBudgetPercent: 10
      serviceLevelIndicator:
        prometheus:
          address: http://127.0.0.1:9091
          totalQuery: |
            sum(
              increase(skipper_serve_host_duration_seconds_count{host="www_spotahome_com"}[2m]))
          errorQuery: |
            sum(
              increase(skipper_serve_host_duration_seconds_count{host="www_spotahome_com", code=~"5.."}[2m]))
      output:
        prometheus: {}
We could have multiple burnRates, and in each burn rate multiple thresholds.
I have a branch that creates the threshold metrics and sets the threshold information on labels:
# HELP service_level_slo_burn_rate_threshold Is the threshold for a burn rate period.
# TYPE service_level_slo_burn_rate_threshold gauge
service_level_slo_burn_rate_threshold{burn_rate_range="168h",error_budget_spent="10%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo0",team="fake-team0",total_error_budget_range="70d"} 1
service_level_slo_burn_rate_threshold{burn_rate_range="1h",error_budget_spent="2%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo0",team="fake-team0",total_error_budget_range="30d"} 14.4
service_level_slo_burn_rate_threshold{burn_rate_range="1h",error_budget_spent="2%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo1",team="fake-team1",total_error_budget_range="30d"} 14.4
service_level_slo_burn_rate_threshold{burn_rate_range="1h",error_budget_spent="2%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo2",team="fake-team2",total_error_budget_range="30d"} 14.4
service_level_slo_burn_rate_threshold{burn_rate_range="24h",error_budget_spent="7%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo0",team="fake-team0",total_error_budget_range="70d"} 4.9
service_level_slo_burn_rate_threshold{burn_rate_range="6h",error_budget_spent="3%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo0",team="fake-team0",total_error_budget_range="70d"} 8.4
service_level_slo_burn_rate_threshold{burn_rate_range="6h",error_budget_spent="5%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo0",team="fake-team0",total_error_budget_range="30d"} 6
service_level_slo_burn_rate_threshold{burn_rate_range="6h",error_budget_spent="5%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo1",team="fake-team1",total_error_budget_range="30d"} 6
service_level_slo_burn_rate_threshold{burn_rate_range="6h",error_budget_spent="5%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo2",team="fake-team2",total_error_budget_range="30d"} 6
service_level_slo_burn_rate_threshold{burn_rate_range="72h",error_budget_spent="10%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo0",team="fake-team0",total_error_budget_range="30d"} 1
service_level_slo_burn_rate_threshold{burn_rate_range="72h",error_budget_spent="10%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo1",team="fake-team1",total_error_budget_range="30d"} 1
service_level_slo_burn_rate_threshold{burn_rate_range="72h",error_budget_spent="10%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo2",team="fake-team2",total_error_budget_range="30d"} 1
# HELP service_level_slo_objective_ratio Is the objective of the SLO in ratio unit.
Any thoughts? @ese
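The sample values above are consistent with a simple relation: the threshold is the fraction of the error budget spent, multiplied by the budget window and divided by the burn-rate range. The sketch below reproduces the listed numbers; the formula is inferred from those samples, not taken from the branch, and the function name is hypothetical.

```python
def burn_rate_threshold(error_budget_days: int,
                        time_range_hours: int,
                        error_budget_percent: float) -> float:
    """Burn rate at which `error_budget_percent` of the budget for
    `error_budget_days` is consumed within `time_range_hours`."""
    budget_hours = error_budget_days * 24
    return (error_budget_percent * budget_hours) / (100 * time_range_hours)


if __name__ == "__main__":
    # Thresholds from the proposed CRD (30-day error budget).
    for hours, percent in [(1, 2), (6, 5), (72, 10)]:
        print(f"{hours}h, {percent}% -> {burn_rate_threshold(30, hours, percent)}")
    # 1h, 2% -> 14.4
    # 6h, 5% -> 6.0
    # 72h, 10% -> 1.0
```

The 70-day samples follow the same relation: 6h at 3% gives 8.4, 24h at 7% gives 4.9, and 168h at 10% gives 1.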
Kooper v2 has been released, we will update the operator to remove the CRD lifecycle from the operator and adapt to the new library APIs.
Hi,
How can I use the SLO operator to define a CRD for a simple SLO?
For example, an SLO to serve 95% of requests within 300ms:
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job)
/
sum(rate(http_request_duration_seconds_count[5m])) by (job)
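One possible approach, offered as a sketch and not confirmed by the maintainers: since the CRD shown earlier in this thread takes a totalQuery and an errorQuery, a latency objective can be expressed by counting every request slower than 300ms as an error, that is, total requests minus the requests in the le="0.3" bucket. The script below builds such a manifest using the field names from that earlier example. The SLO name, the service name, the address, and the 2m window are placeholders, and the by (job) grouping from the question is dropped for brevity.

```python
import json

WINDOW = "2m"  # same range as the 5xx example in this thread; adjust as needed


def latency_service_level(name: str, address: str,
                          threshold_le: str = "0.3",
                          objective_percent: float = 95.0) -> dict:
    """Build a ServiceLevel manifest where slow requests count as errors."""
    total = f"sum(increase(http_request_duration_seconds_count[{WINDOW}]))"
    fast = ('sum(increase(http_request_duration_seconds_bucket'
            f'{{le="{threshold_le}"}}[{WINDOW}]))')
    slo = {
        "name": "95_http_request_lt_300ms",  # hypothetical SLO name
        "description": "95% of requests must be served within 300ms.",
        "disable": False,
        "availabilityObjectivePercent": objective_percent,
        "serviceLevelIndicator": {
            "prometheus": {
                "address": address,
                "totalQuery": total,
                # Errors = all requests minus those served within the threshold.
                "errorQuery": f"{total} - {fast}",
            }
        },
        "output": {"prometheus": {}},
    }
    return {
        "apiVersion": "measure.slok.xyz/v1alpha1",
        "kind": "ServiceLevel",
        "metadata": {"name": name},
        "spec": {"serviceLevelObjectives": [slo]},
    }


if __name__ == "__main__":
    manifest = latency_service_level("my-service", "http://127.0.0.1:9091")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl, which accepts JSON manifests as well as YAML.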
I am building on macOS and get the error below. Thank you.
make build
docker build -t service-level-operator --build-arg uid=501 --build-arg gid=20 -f ./docker/dev/Dockerfile .
Sending build context to Docker daemon 28.01MB
Step 1/13 : FROM golang:1.13-alpine
---> 3024b4e742b0
Step 2/13 : RUN apk --no-cache add bash git g++ curl openssl openssh-client
---> Using cache
---> acf3cd708880
Step 3/13 : RUN go get -u github.com/vektra/mockery/.../
---> Using cache
---> 0e91154912cf
Step 4/13 : RUN mkdir /src
---> Using cache
---> f36d86241b67
Step 5/13 : ARG uid=1000
---> Using cache
---> 9107ab54794b
Step 6/13 : ARG gid=1000
---> Using cache
---> 39996406cbad
Step 7/13 : RUN addgroup -g $gid service-level-operator && adduser -D -u $uid -G service-level-operator service-level-operator && chown service-level-operator:service-level-operator -R /src && chown service-level-operator:service-level-operator -R /go
---> Running in 3ba607420f62
addgroup: gid '20' in use
The command '/bin/sh -c addgroup -g $gid service-level-operator && adduser -D -u $uid -G service-level-operator service-level-operator && chown service-level-operator:service-level-operator -R /src && chown service-level-operator:service-level-operator -R /go' returned a non-zero code: 1
make: *** [build] Error 1
Hello,
Been trying this operator in our development environment. It was working fine until we started noticing a lot of restarts...
After digging through the logs I found this:
{"level":"error","msg":"error processing SLO: 2.843567 can't be higher than 0.000000","sl":"device-info-api","slo":"9999_http_request_lt_500","src":"asm_amd64.s:1357","time":"2020-01-15T17:37:17Z"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x106b4bc]
goroutine 673 [running]:
github.com/spotahome/service-level-operator/pkg/service/output.(*prometheusOutput).Collect(0xc000046280, 0xc0004d3680)
/src/pkg/service/output/prometheus.go:137 +0x3fc
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:445 +0x164
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:535 +0xe12
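The logged message suggests that, for this evaluation, the error count (2.843567) exceeded the total count (0.000000), which can happen when the total query returns nothing while the error query still returns samples. Below is a sketch of that kind of guard with hypothetical names; it is not the operator's code, and it does not by itself explain the nil pointer dereference in Collect.

```python
from dataclasses import dataclass


@dataclass
class SLIResult:
    """Result of evaluating an SLI (hypothetical type for this sketch)."""
    total_q: float  # value returned by the total query
    error_q: float  # value returned by the error query

    def error_ratio(self) -> float:
        """Return errors / total, rejecting results that cannot be valid."""
        if self.error_q > self.total_q:
            # Mirrors the wording of the logged error message.
            raise ValueError(
                f"{self.error_q:f} can't be higher than {self.total_q:f}")
        if self.total_q == 0:
            return 0.0
        return self.error_q / self.total_q


if __name__ == "__main__":
    try:
        SLIResult(total_q=0.0, error_q=2.843567).error_ratio()
    except ValueError as err:
        print(f"error processing SLO: {err}")
```

A caller that receives such an error should skip the SLO for that cycle without storing a partial result that is read later during metric collection.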