spotahome / service-level-operator

Manage applications' SLIs and SLOs easily, following the application lifecycle inside a Kubernetes cluster

License: Apache License 2.0

Makefile 3.55% Go 93.23% Dockerfile 1.03% Shell 2.19%

service-level-objective service-level-indicator service-level slo sli sla kubernetes controller operator kubernetes-operator

service-level-operator's Issues

Grafana - Error budget line changes

Hi,

Any idea what would cause the error budget line to change in the graph? It should be constant.

Also notice the duplicate legends at the bottom.

Any tips are appreciated.

Thx

If SLI is missing SLO is 100%

Hi,

I noticed that when you register an SLO for the first time and there are no metrics in Prometheus for it, or the system stops sending metrics, the graph shows the SLO as 100% compliant.

Is this by design?

Is this project alive?

Just want to make sure it is before I start to rely on it too much.

Thanks!

Configure Input backend in Operator instead of in CRD

Hi, it would be very nice to be able to configure the input backend in the operator instead of in the CRD. IMO, having the address in the CRD forces every service that defines an SLO to know the Prometheus (or any other input) location.

Otherwise, awesome work! I really like that operator :)

Add burn rate threshold levels for the SLO

Alerting on burn rate thresholds would be easier if the operator exposed metrics derived from thresholds defined in the CRD.

My idea at this moment is having something like this on the CRD:

apiVersion: measure.slok.xyz/v1alpha1
kind: ServiceLevel
metadata:
  name: awesome-service
spec:
  serviceLevelObjectives:
    # A typical 5xx request SLO.
    - name: "9999_http_request_lt_500"
      description: 99.99% of requests must be served with <500 status code.
      disable: false
      availabilityObjectivePercent: 99.99
      burnRates:
        - errorBudgetDays: 30
          thresholds:
            - timeRangeHours: 1
              errorBudgetPercent: 2
            - timeRangeHours: 6
              errorBudgetPercent: 5
            - timeRangeHours: 72
              errorBudgetPercent: 10
      serviceLevelIndicator:
        prometheus:
          address: http://127.0.0.1:9091
          totalQuery: |
            sum(
              increase(skipper_serve_host_duration_seconds_count{host="www_spotahome_com"}[2m]))
          errorQuery: |
            sum(
              increase(skipper_serve_host_duration_seconds_count{host="www_spotahome_com", code=~"5.."}[2m]))
      output:
        prometheus: {}

We could have multiple burnRates and in each burn rate multiple thresholds.

I have a branch that creates the threshold metrics and sets the threshold information on labels:

# HELP service_level_slo_burn_rate_threshold Is the threshold for a burn rate period.
# TYPE service_level_slo_burn_rate_threshold gauge
service_level_slo_burn_rate_threshold{burn_rate_range="168h",error_budget_spent="10%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo0",team="fake-team0",total_error_budget_range="70d"} 1
service_level_slo_burn_rate_threshold{burn_rate_range="1h",error_budget_spent="2%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo0",team="fake-team0",total_error_budget_range="30d"} 14.4
service_level_slo_burn_rate_threshold{burn_rate_range="1h",error_budget_spent="2%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo1",team="fake-team1",total_error_budget_range="30d"} 14.4
service_level_slo_burn_rate_threshold{burn_rate_range="1h",error_budget_spent="2%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo2",team="fake-team2",total_error_budget_range="30d"} 14.4
service_level_slo_burn_rate_threshold{burn_rate_range="24h",error_budget_spent="7%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo0",team="fake-team0",total_error_budget_range="70d"} 4.9
service_level_slo_burn_rate_threshold{burn_rate_range="6h",error_budget_spent="3%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo0",team="fake-team0",total_error_budget_range="70d"} 8.4
service_level_slo_burn_rate_threshold{burn_rate_range="6h",error_budget_spent="5%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo0",team="fake-team0",total_error_budget_range="30d"} 6
service_level_slo_burn_rate_threshold{burn_rate_range="6h",error_budget_spent="5%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo1",team="fake-team1",total_error_budget_range="30d"} 6
service_level_slo_burn_rate_threshold{burn_rate_range="6h",error_budget_spent="5%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo2",team="fake-team2",total_error_budget_range="30d"} 6
service_level_slo_burn_rate_threshold{burn_rate_range="72h",error_budget_spent="10%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo0",team="fake-team0",total_error_budget_range="30d"} 1
service_level_slo_burn_rate_threshold{burn_rate_range="72h",error_budget_spent="10%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo1",team="fake-team1",total_error_budget_range="30d"} 1
service_level_slo_burn_rate_threshold{burn_rate_range="72h",error_budget_spent="10%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo2",team="fake-team2",total_error_budget_range="30d"} 1
# HELP service_level_slo_objective_ratio Is the objective of the SLO in ratio unit.
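
The threshold values in the sample output above are consistent with a simple relation: the burn rate is the fraction of the error budget spent, multiplied by the ratio of the total budget period to the alert window. For example, 2% of a 30-day budget in 1 hour gives 0.02 × 720 = 14.4. Below is a minimal Go sketch assuming that relation. The function name `burnRateThreshold` and its signature are hypothetical and are not taken from the operator's code.

```go
package main

import "fmt"

// burnRateThreshold returns the burn rate at which errorBudgetPercent of the
// error budget for a period of errorBudgetDays is consumed within
// timeRangeHours. Hypothetical helper, not part of the operator's API.
func burnRateThreshold(errorBudgetDays, timeRangeHours, errorBudgetPercent float64) float64 {
	totalHours := errorBudgetDays * 24
	return errorBudgetPercent * totalHours / (100 * timeRangeHours)
}

func main() {
	// Thresholds from the CRD example above, using its 30-day error budget.
	thresholds := []struct{ hours, percent float64 }{{1, 2}, {6, 5}, {72, 10}}
	for _, t := range thresholds {
		rate := burnRateThreshold(30, t.hours, t.percent)
		fmt.Printf("%gh %g%% -> %.1f\n", t.hours, t.percent, rate)
	}
}
```

With the 30-day values from the CRD this yields 14.4, 6 and 1, matching the `total_error_budget_range="30d"` series. The 70-day series (for example 7% in 24h giving 4.9) follows from the same formula.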

Any thoughts? @ese

Migrate to Kooper v2

Kooper v2 has been released. We will update the operator to remove the CRD lifecycle from the operator and adapt to the new library APIs.

Update project building blocks structure (#17).
Add validation tags to CRDs.
Set up a way of generating CRD manifests.
Update Kooper libraries and adapt the handler/retrieval of resources.

Calculating Latency

Hi,

How can I use the SLO operator to define a CRD for a simple latency SLO?

The SLO is to serve 95% of requests within 300ms:

sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job)
/
sum(rate(http_request_duration_seconds_count[5m])) by (job)
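
One possible approach, given that the `ServiceLevel` CRD shown elsewhere on this page takes a `totalQuery` and an `errorQuery`, is to count every request slower than 300ms as an error: the error query is then the total count minus the `le="0.3"` bucket, and `availabilityObjectivePercent` would be set to 95. The Go sketch below only builds the two query strings. The helper `latencyQueries` is hypothetical, and using `increase` (as in the 5xx CRD example) instead of `rate` is an assumption.

```go
package main

import "fmt"

// latencyQueries builds a total/error query pair for a latency SLO on a
// Prometheus histogram. Requests that fall outside the `le` bucket are
// treated as errors. Hypothetical helper, not part of the operator.
func latencyQueries(metric, le, window string) (totalQuery, errorQuery string) {
	totalQuery = fmt.Sprintf("sum(increase(%s_count[%s]))", metric, window)
	// Errors = all requests minus the requests served within the bucket.
	errorQuery = fmt.Sprintf("%s - sum(increase(%s_bucket{le=%q}[%s]))",
		totalQuery, metric, le, window)
	return totalQuery, errorQuery
}

func main() {
	total, errs := latencyQueries("http_request_duration_seconds", "0.3", "5m")
	fmt.Println("totalQuery:", total)
	fmt.Println("errorQuery:", errs)
}
```

The printed strings could be pasted into the `prometheus` section of a `serviceLevelIndicator`. Whether the operator accepts this shape for latency objectives is not confirmed here.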

running make on mac OSX returns error "addgroup: gid '20' in use"

I am building on macOS and get the error below. Thank you.

make build
docker build -t service-level-operator --build-arg uid=501 --build-arg gid=20 -f ./docker/dev/Dockerfile .
Sending build context to Docker daemon 28.01MB
Step 1/13 : FROM golang:1.13-alpine
---> 3024b4e742b0
Step 2/13 : RUN apk --no-cache add bash git g++ curl openssl openssh-client
---> Using cache
---> acf3cd708880
Step 3/13 : RUN go get -u github.com/vektra/mockery/.../
---> Using cache
---> 0e91154912cf
Step 4/13 : RUN mkdir /src
---> Using cache
---> f36d86241b67
Step 5/13 : ARG uid=1000
---> Using cache
---> 9107ab54794b
Step 6/13 : ARG gid=1000
---> Using cache
---> 39996406cbad
Step 7/13 : RUN addgroup -g $gid service-level-operator && adduser -D -u $uid -G service-level-operator service-level-operator && chown service-level-operator:service-level-operator -R /src && chown service-level-operator:service-level-operator -R /go
---> Running in 3ba607420f62
addgroup: gid '20' in use
The command '/bin/sh -c addgroup -g $gid service-level-operator && adduser -D -u $uid -G service-level-operator service-level-operator && chown service-level-operator:service-level-operator -R /src && chown service-level-operator:service-level-operator -R /go' returned a non-zero code: 1
make: *** [build] Error 1

panic: runtime error: invalid memory address or nil pointer dereference

Hello,
Been trying this operator in our development environment. It was working fine until we started noticing a lot of restarts...
After digging through the logs I found this:

{"level":"error","msg":"error processing SLO: 2.843567 can't be higher than 0.000000","sl":"device-info-api","slo":"9999_http_request_lt_500","src":"asm_amd64.s:1357","time":"2020-01-15T17:37:17Z"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x106b4bc]

goroutine 673 [running]:
github.com/spotahome/service-level-operator/pkg/service/output.(*prometheusOutput).Collect(0xc000046280, 0xc0004d3680)
        /src/pkg/service/output/prometheus.go:137 +0x3fc
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
        /go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:445 +0x164
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
        /go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:535 +0xe12
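
The log line suggests that the SLI returned more errors (2.843567) than total requests (0.000000), and the panic follows in the metrics collector. The Go sketch below shows one defensive pattern for that situation: validate the measurement and skip it instead of dereferencing a missing result. The type `sliResult` and the functions `validate` and `errorRatio` are hypothetical, and this is not the operator's actual code or its fix.

```go
package main

import (
	"errors"
	"fmt"
)

// sliResult is a hypothetical container for one SLI measurement.
type sliResult struct {
	Total  float64
	Errors float64
}

// validate rejects missing results and results where errors exceed total,
// the condition reported in the log line above.
func validate(r *sliResult) error {
	if r == nil {
		return errors.New("nil SLI result")
	}
	if r.Errors > r.Total {
		return fmt.Errorf("%f can't be higher than %f", r.Errors, r.Total)
	}
	return nil
}

// errorRatio returns ok=false for invalid or empty measurements so that a
// caller (for example a metrics collector) can skip them safely.
func errorRatio(r *sliResult) (ratio float64, ok bool) {
	if err := validate(r); err != nil || r.Total == 0 {
		return 0, false
	}
	return r.Errors / r.Total, true
}

func main() {
	_, ok := errorRatio(&sliResult{Total: 0, Errors: 2.843567})
	fmt.Println("usable:", ok)
}
```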
