spotahome / service-level-operator

Manage applications' SLIs and SLOs easily with the application lifecycle inside a Kubernetes cluster

License: Apache License 2.0

Makefile 3.55% Go 93.23% Dockerfile 1.03% Shell 2.19%
service-level-objective service-level-indicator service-level slo sli sla kubernetes controller operator kubernetes-operator

service-level-operator's People

Contributors

ese, slok

service-level-operator's Issues

Grafana - Error budget line changes

Hi,

Any idea what would cause the error budget line in the graph below to change? It should be constant.

[image]

Also notice the duplicate legends at the bottom.

Any tips are appreciated.

Thx

If SLI is missing SLO is 100%

Hi,

I noticed that when you register an SLO for the first time and there are no metrics in Prometheus for it, or the system stops sending metrics, the graph shows the SLO as 100% compliant.

Is this by design?
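
One way to picture the question: if availability is derived as 1 - errors/total, a missing series has to be mapped to some value, and treating "no data" as "zero errors" gives 100%. The Python sketch below is purely illustrative and is not the operator's code; the function name and the convention of returning None for missing data are assumptions.

```python
from typing import Optional


def availability_percent(total: Optional[float],
                         errors: Optional[float]) -> Optional[float]:
    """Return availability in percent, or None when no traffic was observed.

    `total` and `errors` stand for the results of a total query and an error
    query. None models a query that returned no series (hypothetical convention).
    """
    if total is None or total == 0:
        # No requests observed: report "unknown" instead of 100% compliant.
        return None
    # A missing error series while traffic exists is read as zero errors.
    errors = 0.0 if errors is None else errors
    return 100.0 * (1.0 - errors / total)
```

With this convention a dashboard can show a gap for an SLO that has no data, rather than a flat 100% line.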

Configure Input backend in Operator instead of in CRD

Hi, it would be very nice to be able to configure the input backend in the operator instead of in the CRD. IMO, having the address in the CRD forces every service that defines an SLO to know the location of Prometheus (or any other input).

Otherwise, awesome work! I really like this operator :)

Add burn rate threshold levels for the SLO

Alerting on burn rate thresholds would be easier if the operator exposed metrics derived from the thresholds defined in the CRD.

My idea at this moment is having something like this on the CRD:

apiVersion: measure.slok.xyz/v1alpha1
kind: ServiceLevel
metadata:
  name: awesome-service
spec:
  serviceLevelObjectives:
    # A typical 5xx request SLO.
    - name: "9999_http_request_lt_500"
      description: 99.99% of requests must be served with <500 status code.
      disable: false
      availabilityObjectivePercent: 99.99
      burnRates:
        - errorBudgetDays: 30
          thresholds:
            - timeRangeHours: 1
              errorBudgetPercent: 2
            - timeRangeHours: 6
              errorBudgetPercent: 5
            - timeRangeHours: 72
              errorBudgetPercent: 10
      serviceLevelIndicator:
        prometheus:
          address: http://127.0.0.1:9091
          totalQuery: |
            sum(
              increase(skipper_serve_host_duration_seconds_count{host="www_spotahome_com"}[2m]))
          errorQuery: |
            sum(
              increase(skipper_serve_host_duration_seconds_count{host="www_spotahome_com", code=~"5.."}[2m]))
      output:
        prometheus: {}

We could have multiple burnRates and in each burn rate multiple thresholds.

I have a branch that creates the threshold metrics and sets the threshold information on labels:

# HELP service_level_slo_burn_rate_threshold Is the threshold for a burn rate period.
# TYPE service_level_slo_burn_rate_threshold gauge
service_level_slo_burn_rate_threshold{burn_rate_range="168h",error_budget_spent="10%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo0",team="fake-team0",total_error_budget_range="70d"} 1
service_level_slo_burn_rate_threshold{burn_rate_range="1h",error_budget_spent="2%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo0",team="fake-team0",total_error_budget_range="30d"} 14.4
service_level_slo_burn_rate_threshold{burn_rate_range="1h",error_budget_spent="2%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo1",team="fake-team1",total_error_budget_range="30d"} 14.4
service_level_slo_burn_rate_threshold{burn_rate_range="1h",error_budget_spent="2%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo2",team="fake-team2",total_error_budget_range="30d"} 14.4
service_level_slo_burn_rate_threshold{burn_rate_range="24h",error_budget_spent="7%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo0",team="fake-team0",total_error_budget_range="70d"} 4.9
service_level_slo_burn_rate_threshold{burn_rate_range="6h",error_budget_spent="3%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo0",team="fake-team0",total_error_budget_range="70d"} 8.4
service_level_slo_burn_rate_threshold{burn_rate_range="6h",error_budget_spent="5%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo0",team="fake-team0",total_error_budget_range="30d"} 6
service_level_slo_burn_rate_threshold{burn_rate_range="6h",error_budget_spent="5%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo1",team="fake-team1",total_error_budget_range="30d"} 6
service_level_slo_burn_rate_threshold{burn_rate_range="6h",error_budget_spent="5%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo2",team="fake-team2",total_error_budget_range="30d"} 6
service_level_slo_burn_rate_threshold{burn_rate_range="72h",error_budget_spent="10%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo0",team="fake-team0",total_error_budget_range="30d"} 1
service_level_slo_burn_rate_threshold{burn_rate_range="72h",error_budget_spent="10%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo1",team="fake-team1",total_error_budget_range="30d"} 1
service_level_slo_burn_rate_threshold{burn_rate_range="72h",error_budget_spent="10%",fake="true",namespace="ns0",service_level="fake-service0",slo="fake_slo2",team="fake-team2",total_error_budget_range="30d"} 1
# HELP service_level_slo_objective_ratio Is the objective of the SLO in ratio unit.
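
The sample values above are consistent with threshold = (errorBudgetPercent / 100) * (errorBudgetDays * 24) / timeRangeHours. This formula is inferred from the listed numbers, not taken from the branch's code, and the function name below is hypothetical. A minimal Python sketch:

```python
def burn_rate_threshold(error_budget_days: int, time_range_hours: int,
                        error_budget_percent: float) -> float:
    """Burn rate at which `error_budget_percent` of the error budget for a
    window of `error_budget_days` days is spent in `time_range_hours` hours."""
    budget_window_hours = error_budget_days * 24
    # Multiply first and divide once to limit floating-point rounding.
    return (error_budget_percent * budget_window_hours) / (100 * time_range_hours)


# Thresholds from the CRD example above (30-day error budget).
for hours, percent in [(1, 2), (6, 5), (72, 10)]:
    print(f"{hours}h / {percent}%: {burn_rate_threshold(30, hours, percent):g}")
```

For the 30-day budget this yields 14.4, 6 and 1, matching the metric values listed for the 1h, 6h and 72h ranges. The 70-day samples (8.4, 4.9 and 1) follow from the same formula.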

Any thoughts? @ese

Migrate to Kooper v2

Kooper v2 has been released. We will update the operator to remove the CRD lifecycle from the operator and adapt to the new library APIs.

  • Update project building blocks structure (#17).
  • Add validation tags to CRDs.
  • Set up a way of generating CRD manifests.
  • Update Kooper libraries and adapt the handler/retrieval of resources.

Calculating Latency

Hi,

How can I use the SLO operator to define a CRD for a simple latency SLO? For example:

SLO: serve 95% of requests within 300ms

sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job)
/
sum(rate(http_request_duration_seconds_count[5m])) by (job)
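
One possible mapping onto the totalQuery/errorQuery fields shown in the ServiceLevel example earlier on this page: count every request as the total, and count requests slower than the threshold (total minus the le bucket) as errors. The Python sketch below only builds the two PromQL strings. It assumes the operator derives the error ratio as errorQuery / totalQuery, the helper name is hypothetical, and the "by (job)" grouping is dropped to match the single-value queries in that example.

```python
def latency_slo_queries(metric: str, threshold_le: str, window: str = "5m") -> dict:
    """Build total/error PromQL queries for a histogram-based latency SLO.

    `metric` is the histogram base name (without _bucket/_count) and
    `threshold_le` is the bucket upper bound exactly as it appears in the
    `le` label.
    """
    total = f"sum(rate({metric}_count[{window}]))"
    fast = f'sum(rate({metric}_bucket{{le="{threshold_le}"}}[{window}]))'
    # Requests slower than the threshold = all requests - fast requests.
    return {"totalQuery": total, "errorQuery": f"{total} - {fast}"}


queries = latency_slo_queries("http_request_duration_seconds", "0.3")
print(queries["totalQuery"])
print(queries["errorQuery"])
```

Under these assumptions the two strings would go into serviceLevelIndicator.prometheus, with availabilityObjectivePercent set to 95.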

Running make on macOS returns error "addgroup: gid '20' in use"

I am building on macOS and get the error below. Thank you.

make build
docker build -t service-level-operator --build-arg uid=501 --build-arg gid=20 -f ./docker/dev/Dockerfile .
Sending build context to Docker daemon 28.01MB
Step 1/13 : FROM golang:1.13-alpine
---> 3024b4e742b0
Step 2/13 : RUN apk --no-cache add bash git g++ curl openssl openssh-client
---> Using cache
---> acf3cd708880
Step 3/13 : RUN go get -u github.com/vektra/mockery/.../
---> Using cache
---> 0e91154912cf
Step 4/13 : RUN mkdir /src
---> Using cache
---> f36d86241b67
Step 5/13 : ARG uid=1000
---> Using cache
---> 9107ab54794b
Step 6/13 : ARG gid=1000
---> Using cache
---> 39996406cbad
Step 7/13 : RUN addgroup -g $gid service-level-operator && adduser -D -u $uid -G service-level-operator service-level-operator && chown service-level-operator:service-level-operator -R /src && chown service-level-operator:service-level-operator -R /go
---> Running in 3ba607420f62
addgroup: gid '20' in use
The command '/bin/sh -c addgroup -g $gid service-level-operator && adduser -D -u $uid -G service-level-operator service-level-operator && chown service-level-operator:service-level-operator -R /src && chown service-level-operator:service-level-operator -R /go' returned a non-zero code: 1
make: *** [build] Error 1

panic: runtime error: invalid memory address or nil pointer dereference

Hello,
Been trying this operator in our development environment. It was working fine until we started noticing a lot of restarts...
After digging through the logs I found this:

{"level":"error","msg":"error processing SLO: 2.843567 can't be higher than 0.000000","sl":"device-info-api","slo":"9999_http_request_lt_500","src":"asm_amd64.s:1357","time":"2020-01-15T17:37:17Z"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x106b4bc]

goroutine 673 [running]:
github.com/spotahome/service-level-operator/pkg/service/output.(*prometheusOutput).Collect(0xc000046280, 0xc0004d3680)
        /src/pkg/service/output/prometheus.go:137 +0x3fc
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
        /go/pkg/mod/github.com/prometheus/client_golang@<version>/prometheus/registry.go:445 +0x164
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
        /go/pkg/mod/github.com/prometheus/client_golang@<version>/prometheus/registry.go:535 +0xe12
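
The error logged just before the panic reads like a failed sanity check: the error query returned 2.843567 while the total query returned 0. The Python sketch below reproduces that kind of check only to help interpret the message. The operator itself is written in Go, and the function name here is hypothetical.

```python
def check_sli(total: float, errors: float) -> None:
    """Reject an SLI sample whose error count exceeds its total count."""
    if errors > total:
        raise ValueError(f"{errors:f} can't be higher than {total:f}")


try:
    check_sli(total=0.0, errors=2.843567)
except ValueError as exc:
    print(f"error processing SLO: {exc}")
```

This does not explain the nil pointer dereference in Collect. It only shows that the two queries disagreed for this SLO, which is worth checking in the errorQuery and totalQuery definitions.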
