
Open specification for defining and expressing service level objectives (SLO)

Home Page: https://www.openslo.com

License: Apache License 2.0

Introduction

The intent of this document is to outline the OpenSLO specification.

The goal of this project is to provide an open specification for defining SLOs to enable a common, vendor–agnostic approach to tracking and interfacing with SLOs. Platform-specific implementation details are purposefully excluded from the scope of this specification.

OpenSLO is an open specification i.e., it is a specification created and controlled, in an open and fair process, by an association or a standardization body intending to achieve interoperability and interchangeability. An open specification is not controlled by a single company or individual or by a group with discriminatory membership criteria. Additionally, this specification is designed to be extended where needed to meet the needs of the implementation.

Before making a contribution, please read our contribution guidelines.

Specification

Goals

  • Compliance with the Kubernetes YAML format
  • Vendor-agnostic
  • Be flexible enough to be extended elsewhere

General Schema

apiVersion: openslo/v1
kind: DataSource | SLO | SLI | AlertPolicy | AlertCondition | AlertNotificationTarget | Service
metadata:
  name: string
  displayName: string # optional
  labels: # optional, it's allowed to assign multiple values to a single key
    # example labels
    organization: "acme"
    team:
      - "identity"
      - "rbac"
    costCentre: "project1"
    serviceTier:
      - "tier-1"
  annotations: # optional
    # example annotations
    openslo.com/key1: value1
    fooimplementation.com/key2: value2
spec:

Notes (General Schema)

  • kind string - required, one of: DataSource, SLO, SLI, AlertPolicy, AlertCondition, AlertNotificationTarget, Service
  • metadata.name: string - required field
    • all implementations must at least support object names that follow RFC1123:
      • are up to 63 characters in length
      • contain lowercase alphanumeric characters or -
      • start with an alphanumeric character
      • end with an alphanumeric character
    • implementations are additionally encouraged to support names that:
      • are up to 255 characters in length
      • contain lowercase alphanumeric characters or -, ., |, /, \
  • metadata.labels: map[string]string|string[] - optional field key <> value
    • the key segment is required and must contain at most 63 characters beginning and ending with an alphanumeric character [a-z0-9A-Z] with dashes -, underscores _, dots . and alphanumerics between.
    • the value of the key segment can be a string or an array of strings
  • metadata.annotations: map[string]string - optional field key <> value
    • annotations should be used to define implementation / system specific metadata about the SLO. For example, it can be metadata about a dashboard url, or how to name a metric created by the SLI, etc.
    • keys have two segments: an optional prefix and a name, separated by a slash /
    • the name segment is required and must contain at most 63 characters beginning and ending with an alphanumeric character [a-z0-9A-Z] with dashes -, underscores _, dots . and alphanumerics between.
    • the prefix is optional and must be a DNS subdomain: a series of DNS labels separated by dots ., containing at most 253 characters, followed by a slash /.
    • the openslo.com/ prefix is reserved for OpenSLO usage

Custom Data Types

duration-shorthand

The duration shorthand is specified as a single–word string (no whitespaces) consisting of a positive integer number followed by a case–sensitive single–character postfix.

Allowed postfixes are:

  • m – minutes
  • h – hours
  • d – days
  • w – weeks
  • M – months
  • Q – quarters
  • Y – years

Examples: 12h, 4w, 1M, 1Q, 365d, 1Y.

This specification does not put requirements on how (or whether) to implement each postfix, therefore implementers are free to pick an implementation that best suits their environments.

There is however the possibility that future versions of this spec will take a more prescriptive stance on this issue.
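
For illustration only (not a normative rule, and assuming no leading zeros or sign are allowed), a duration-shorthand value could be checked against a pattern such as:

^[1-9][0-9]*[mhdwMQY]$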

Object Types

💡 Note: Specific attributes are described in detail in the Notes subsection of each object type's section.

DataSource

A DataSource represents connection details for a particular metric source.

Check work in progress for v2.

apiVersion: openslo/v1
kind: DataSource
metadata:
  name: string
  displayName: string # optional
spec:
  description: string # optional up to 1050 characters
  type: string # predefined type e.g. Prometheus, Datadog, etc.
  connectionDetails:
    # fields used for creating a connection with particular datasource e.g. AccessKeys, SecretKeys, etc.
    # everything that is valid YAML can be put here
Notes (DataSource)

DataSource enables reusing one source between many SLOs and moving connection-specific details (e.g. authentication) away from SLO definitions.

This spec does not enforce naming conventions for data source types, however the OpenSLO project will publish guidelines in the form of supplementary materials once common patterns start emerging from implementations.

An example of the DataSource kind can be:

apiVersion: openslo/v1
kind: DataSource
metadata:
  name: string
  displayName: string # optional
spec:
  type: CloudWatch
  connectionDetails:
    accessKeyID: accessKey
    secretAccessKey: secretAccessKey

SLO

A service level objective (SLO) is a target value or a range of values for a service level that is described by a service level indicator (SLI).

Check work in progress for v2.

apiVersion: openslo/v1
kind: SLO
metadata:
  name: string
  displayName: string # optional
spec:
  description: string # optional up to 1050 characters
  service: string # name of the service to associate this SLO with, may refer (depends on implementation) to existing object Kind: Service
  indicator: # see SLI below for details
  indicatorRef: string # name of the SLI. Required if indicator is not given.
  timeWindow:
    # exactly one item; either a rolling or a calendar-aligned time window
    ## rolling time window
    - duration: duration-shorthand # duration of the window eg 1d, 4w
      isRolling: true
    # or
    ## calendar–aligned time window
    - duration: duration-shorthand # duration of the window eg 1M, 1Q, 1Y
      calendar:
        startTime: 2020-01-21 12:30:00 # date with time in 24h format, format without time zone
        timeZone: America/New_York # name as in IANA Time Zone Database
      isRolling: false # if omitted assumed `false` if `calendar:` is present
  budgetingMethod: Occurrences | Timeslices | RatioTimeslices
  objectives: # see objectives below for details
  alertPolicies: # see alert policies below for details
Notes (SLO)
  • indicator optional, represents the Service Level Indicator (SLI), described in the SLI section. One of indicator or indicatorRef must be given. If declaring a composite SLO, it must be moved into objectives[].

  • indicatorRef optional, this is the name of the Service Level Indicator (SLI). One of indicator or indicatorRef must be given. If declaring a composite SLO, it must be moved into objectives[].

  • timeWindow[ ] optional, TimeWindow is a list accepting exactly one item, either a rolling or a calendar-aligned time window:

    • Rolling time window. Duration should be provided in shorthand format e.g. 5m, 4w, 31d.
    • Calendar Aligned time window. Duration should be provided in shorthand format e.g. 1d, 2M, 1Q, 366d.
  • description string optional field, contains at most 1050 characters

  • budgetingMethod enum(Occurrences | Timeslices | RatioTimeslices), required field

    • Occurrences method uses a ratio of counts of good events to the total count of the events.
    • Timeslices method uses a ratio of good time slices to total time slices in a budgeting period.
    • RatioTimeslices method uses an average of all time slices' success ratios in a budgeting period.
  • objectives[ ] Threshold, required field, described in Objectives section. If thresholdMetric has been defined, only one Threshold can be defined. However, if using ratioMetric then any number of Thresholds can be defined.

  • alertPolicies[ ] AlertPolicy, optional field, described in the AlertPolicy section. An alert policy can be defined inline or can refer to an AlertPolicy object, in which case the following are required:

    • alertPolicyRef string: this is the name or path to the AlertPolicy
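
Based on the notes above, a hedged, non-normative sketch of an SLO that references an existing AlertPolicy by name (foo-error and foo-alert-policy are assumed to exist elsewhere):

apiVersion: openslo/v1
kind: SLO
metadata:
  name: foo-slo
spec:
  service: foo
  indicatorRef: foo-error
  timeWindow:
    - duration: 28d
      isRolling: true
  budgetingMethod: Occurrences
  objectives:
    - displayName: Foo Availability
      target: 0.999
  alertPolicies:
    - alertPolicyRef: foo-alert-policy
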
Objectives

Objectives are the thresholds for your SLOs. You can use objectives to define the tolerance levels for your metrics.

objectives:
  - displayName: string # optional
    op: lte | gte | lt | gt # conditional operator used to compare the SLI against the value. Only needed when using a thresholdMetric
    value: numeric # optional, value used to compare threshold metrics. Only needed when using a thresholdMetric
    target: numeric [0.0, 1.0) # budget target for given objective of the SLO, can't be used with targetPercent
    targetPercent: numeric [0.0, 100) # budget target for given objective of the SLO, can't be used with target
    timeSliceTarget: numeric (0.0, 1.0] # required only when budgetingMethod is set to Timeslices
    timeSliceWindow: number | duration-shorthand # required only when budgetingMethod is set to Timeslices or RatioTimeslices
    indicator: # required only when creating composite SLO, see SLI below for more details
    indicatorRef: string # required only when creating composite SLO, required if indicator is not given.
    compositeWeight: numeric (0.0, inf+] # optional, supported only when declaring multiple objectives, default value 1.

Example:

objectives:
  - displayName: Foo Total Errors
    target: 0.98
  - displayName: Bar Total Errors
    targetPercent: 99.99
Notes (Objectives)
  • op enum( lte | gte | lt | gt ), operator used to compare the SLI against the value. Only needed when using a thresholdMetric

  • value numeric, required field, used to compare values gathered from metric source. Only needed when using a thresholdMetric.

Either target or targetPercent must be used.

  • target numeric [0.0, 1.0), optional, but either this or targetPercent must be used. Budget target for a given objective of the SLO. A target: 0.9995 is equivalent to targetPercent: 99.95.

  • targetPercent: numeric [0.0, 100), optional, but either this or target must be used. Budget target for a given objective of the SLO. A targetPercent: 99.95 is equivalent to target: 0.9995.

  • timeSliceTarget numeric [0.0, 1.0], required only when the budgeting method is set to Timeslices

  • timeSliceWindow (numeric | duration-shorthand), required only when the budgeting method is set to Timeslices or RatioTimeslices. Denotes the size of a time slice for which data will be evaluated, e.g. 5, 1m, 10m, 2h, 1d, and also determines the frequency at which to run the queries. If specified as a bare number, the unit defaults to minutes. See the sketch after these notes.

  • indicator optional, represents the Service Level Indicator (SLI), described in SLI section. One of indicator or indicatorRef must be given in objective when creating composite SLO.

  • indicatorRef optional, this is the name of Service Level Indicator (SLI). One of indicator or indicatorRef must be given when creating composite SLO.
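
As a hedged, non-normative sketch, an objective for a Timeslices-budgeted SLO (assuming budgetingMethod: Timeslices at the SLO level and a thresholdMetric SLI measuring latency in milliseconds) could combine these fields as follows:

objectives:
  - displayName: Foo Latency
    op: lte
    value: 300 # assumed latency threshold in milliseconds
    target: 0.99
    timeSliceTarget: 0.95
    timeSliceWindow: 1m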

Notes (Composite SLO)

The goal of a composite SLO is to enable the user to capture an end-to-end journey; this is done by defining many independent objectives. Each objective can have different queries, data sources and targets. The basic implementation assumes that the Composite Error Budget burns if the Error Budget for any of the SLO objectives within the Composite SLO is burning. The logic of those calculations is the same for Composite SLOs as for regular (standard) objectives and SLOs.

Weight allows the user to change the impact of a given SLO on the whole composite SLO. Weight is simply a multiplier: if the weight is 0.5, the SLO has half the default impact; if the weight is 100, the SLO is 100 times more impactful. By default, weight has the value 1 and doesn't need to be specified.

Calculations should be as simple as possible to make composite SLOs intuitive and easy to implement. It is hard to compare different error budget calculation methods, therefore all composite objectives need to be calculated with one type of error budget calculation method. Here is a brief description of how each budgeting method should impact the composite SLO and how weight scales its impact:

  • Occurrences - if an SLO burns its budget, the composite burns its budget at the same rate. Each violation that consumes an SLO's budget impacts the composite at the same rate. Weight multiplies the SLO's rate of burning (referred to as burn rate) as it burns the composite.
  • Timeslices - this is binary, depending on whether it was a good or bad minute. If it was a bad minute for any individual objective, it's considered a bad minute for the Composite SLO.
  • RatioTimeslices - the shortfalls from 100 percent are summed. If two SLOs each have an average RatioTimeslices of 95%, the composite will have an average RatioTimeslices of 90%. Weight multiplies the missing part of a given SLO.
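
A hedged, non-normative sketch of composite objectives (the SLI names are assumptions), where the second objective carries double weight:

objectives:
  - displayName: ATM withdrawal API
    indicatorRef: atm-withdrawal-availability
    target: 0.999
  - displayName: Core ledger
    indicatorRef: ledger-availability
    target: 0.9995
    compositeWeight: 2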

SLI

A service level indicator (SLI) represents how to read metrics from data sources.

Check work in progress for v2.

apiVersion: openslo/v1
kind: SLI
metadata:
  name: string
  displayName: string # optional
spec:
  description: string # optional up to 1050 characters
  thresholdMetric: # either thresholdMetric or ratioMetric must be provided
    metricSource:
      metricSourceRef: string # optional, this field can be used to refer to DataSource object
      type: string # optional, this field describes predefined metric source type e.g. Prometheus, Datadog, etc.
      spec:
        # arbitrary chosen fields for every data source type to make it comfortable to use
        # anything that is valid YAML can be put here.
  ratioMetric: # either thresholdMetric or ratioMetric must be provided
    counter: true | false # true if the metric is a monotonically increasing counter,
                          # or false, if it is a single number that can arbitrarily go up or down
                          # ignored when using "raw"
    good: # the numerator, either "good" or "bad" must be provided if "total" is used
      metricSource:
        metricSourceRef: string # optional
        type: string # optional
        spec:
          # arbitrary chosen fields for every data source type to make it comfortable to use.
    bad: # the numerator, either "good" or "bad" must be provided if "total" is used
      metricSource:
        metricSourceRef: string # optional
        type: string # optional
        spec:
          # arbitrary chosen fields for every data source type to make it comfortable to use.
    total: # the denominator used with either "good" or "bad", either this or "raw" must be used
      metricSource:
        metricSourceRef: string # optional
        type: string # optional
        spec:
          # arbitrary chosen fields for every data source type to make it comfortable to use.

    rawType: success | failure # required with "raw", indicates how the stored ratio was calculated:
                               #  success – good/total
                               #  failure – bad/total
    raw: # the precomputed ratio stored as a metric, can't be used together with good/bad/total
      metricSource:
        metricSourceRef: string # optional
        type: string # optional
        spec:
          # arbitrary chosen fields for every data source type to make it comfortable to use.
Notes (SLI)
  • description string optional field, contains at most 1050 characters

Either ratioMetric or thresholdMetric must be used.

  • thresholdMetric Metric, represents the query used for gathering data from metric sources. Raw data is used to compare objectives (threshold) values.

  • ratioMetric Metric {good, total}, {bad, total} or raw.

    • counter enum(true | false), specifies whether the metric is a monotonically increasing counter. Has no effect when using a raw query.

    • good represents the query used for gathering data from metric sources used as the numerator. Received data is used to compare objectives (threshold) values to find good values. If bad is defined then good must not be set.

    • bad represents the query used for gathering data from metric sources used as the numerator. Received data is used to compare objectives (threshold) values to find bad values. If good is defined then bad must not be set.

    • total represents the query used for gathering data from metric sources that is used as the denominator. Received data is used to compare objectives (threshold) values to find total number of metrics.

    • rawType enum(success | failure), required when using raw, specifies whether the ratios represented by the "raw" ratio metric are of successes or failures. Not to be used with good and bad as picking one of those determines the type of ratio.

    • raw represents the query used for gathering already precomputed ratios. The type of ratio (success or failure) is specified using rawType.

An example of an SLO where SLI is inlined:

apiVersion: openslo/v1
kind: SLO
metadata:
  name: foo-slo
  displayName: Foo SLO
spec:
  service: foo
  indicator:
    metadata:
      name: foo-error
      displayName: Foo Error
    spec:
      ratioMetric:
        counter: true
        good:
          metricSource:
            metricSourceRef: datadog-datasource
            type: Datadog
            spec:
              query: sum:trace.http.request.hits.by_http_status{http.status_code:200}.as_count()
        total:
          metricSource:
            metricSourceRef: datadog-datasource
            type: Datadog
            spec:
              query: sum:trace.http.request.hits.by_http_status{*}.as_count()
  objectives:
    - displayName: Foo Total Errors
      target: 0.98
Ratio Metric

If a service level indicator has ratioMetric defined, the following maths can be used to calculate the value of the SLI. Below we describe the advised formulas for calculating the indicator value.

Good-Total queries

If the good and total queries are given, then the following formula can be used to calculate the value:

indicatorValue = good / total

If we have 99 good requests out of a total of 100 requests, the calculated value for the indicator would be: 99 / 100 = 0.99. This represents 99% on a 0-100 scale using the formula 0.99 * 100 = 99.

Bad-Total queries

If the bad and total queries are given, then the following formula can be used to calculate the value:

indicatorValue = ( total - bad ) / total

If we have 1 error out of a total of 100 requests, the calculated value for the indicator would be: (100 - 1) / 100 = 0.99. This represents 99% on a 0-100 scale using the formula 0.99 * 100 = 99.

💡 Note: As you can see for both query combinations we end up with the same calculated value for the service level indicator.

The required spec key will be used to pass additional, source-specific data to the data source. The goal of this approach is to provide maximum flexibility when querying data from a particular source. In the following examples we can see that this works fine for both simple and more complex cases.

An example of ratioMetric:

ratioMetric:
  counter: true
  good:
    metricSource:
      type: Prometheus
      metricSourceRef: prometheus-datasource
      spec:
        query: sum(localhost_server_requests{code=~"2xx|3xx",host="*",instance="127.0.0.1:9090"})
  total:
    metricSource:
      type: Prometheus
      metricSourceRef: prometheus-datasource
      spec:
        query: localhost_server_requests{code="total",host="*",instance="127.0.0.1:9090"}
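
A hedged, non-normative sketch of a ratioMetric that uses a precomputed ratio via raw (the Prometheus recording rule name is an assumption):

ratioMetric:
  rawType: success
  raw:
    metricSource:
      type: Prometheus
      metricSourceRef: prometheus-datasource
      spec:
        query: job:requests_success:ratio_rate5m{job="foo"}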

An example of thresholdMetric:

thresholdMetric:
  metricSource:
    metricSourceRef: redshift-datasource
    spec:
      region: eu-central-1
      clusterId: metrics-cluster
      databaseName: metrics-db
      query: SELECT value, timestamp FROM metrics WHERE timestamp BETWEEN :date_from AND :date_to

Field type can be omitted because the type will be inferred from the DataSource when metricSourceRef is specified.

An example thresholdMetric that does not reference a defined DataSource:

thresholdMetric:
  metricSource:
    type: Redshift
    spec:
      region: eu-central-1
      clusterId: metrics-cluster
      databaseName: metrics-db
      query: SELECT value, timestamp FROM metrics WHERE timestamp BETWEEN :date_from AND :date_to
      accessKeyID: accessKey
      secretAccessKey: secretAccessKey

Field type can't be omitted because the reference to an existing DataSource is not specified.


AlertPolicy

An Alert Policy allows you to define the alert conditions for an SLO.

apiVersion: openslo/v1
kind: AlertPolicy
metadata:
  name: string
  displayName: string # optional
spec:
  description: string # optional up to 1050 characters
  alertWhenNoData: boolean
  alertWhenResolved: boolean
  alertWhenBreaching: boolean
  conditions: # list of alert conditions
    - conditionRef: # required when alert condition is not inlined
  notificationTargets:
  - targetRef: # required when alert notification target is not inlined
Notes (AlertPolicy)
  • description string, optional description about the alert policy, contains at most 1050 characters
  • alertWhenBreaching boolean, true, false, whether the alert should be triggered when the condition is breaching
  • alertWhenResolved boolean, true, false, whether the alert should be triggered when the condition is resolved
  • alertWhenNoData boolean, true, false, whether the alert should be triggered when the condition indicates that no data is available
  • conditions[ ] Alert Condition, an array (max of one condition), required field. A condition can be defined inline or can refer to an external AlertCondition object, in which case the following are required:
    • conditionRef string: this is the name or path to the AlertCondition
  • notificationTargets[ ] Alert Notification Target, required field. A notification target can be defined inline or can refer to an AlertNotificationTarget object, in which case the following are required:
    • targetRef string: this is the name or path to the AlertNotificationTarget

💡 Note: The conditions field is of the type array of AlertCondition but only allows one single condition to be defined. The use of an array is for future-proofing purposes.

An example of an Alert policy which refers to another Alert Condition:

apiVersion: openslo/v1
kind: AlertPolicy
metadata:
  name: AlertPolicy
  displayName: Alert Policy
spec:
  description: Alert policy for cpu usage breaches, notifies on-call devops via email
  alertWhenBreaching: true
  alertWhenResolved: false
  conditions:
    - conditionRef: cpu-usage-breach
  notificationTargets:
    - targetRef: OnCallDevopsMailNotification

An example of an Alert Policy where the Alert Condition is inlined:

apiVersion: openslo/v1
kind: AlertPolicy
metadata:
  name: AlertPolicy
  displayName: Alert Policy
spec:
  description: Alert policy for cpu usage breaches, notifies on-call devops via email
  alertWhenBreaching: true
  alertWhenResolved: false
  conditions:
    - kind: AlertCondition
      metadata:
        name: cpu-usage-breach
        displayName: CPU Usage breaching
      spec:
        description: SLO burn rate for cpu-usage-breach exceeds 2
        severity: page
        condition:
          kind: burnrate
          op: lte
          threshold: 2
          lookbackWindow: 1h
          alertAfter: 5m
  notificationTargets:
    - targetRef: OnCallDevopsMailNotification

AlertCondition

An Alert Condition allows you to define under which conditions an alert for an SLO needs to be triggered.

apiVersion: openslo/v1
kind: AlertCondition
metadata:
  name: string
  displayName: string # optional
spec:
  description: string # optional up to 1050 characters
  severity: string # required
  condition: # required
    kind: string
    op: enum
    threshold: number
    lookbackWindow: duration-shorthand
    alertAfter: duration-shorthand
Notes (AlertCondition)
  • description string, optional description about the alert condition, contains at most 1050 characters
  • severity string, required field describing the severity level of the alert (ex. "sev1", "page", etc.)
  • condition, required field. Defines the conditions of the alert
    • kind enum(burnrate), the kind of alerting condition that's checked, defaults to burnrate

If the kind is burnrate the following fields are required:

  • op enum(lte | gte | lt | gt), required field, the conditional operator used to compare against the threshold
  • threshold number, required field, the threshold that you want to alert on
  • lookbackWindow duration-shorthand, required field, the time-frame for which to calculate the threshold e.g. 5m
  • alertAfter duration-shorthand: required field, the duration the condition needs to be valid for before alerting, defaults to 0m

If the alert condition is breaching, and the alert policy has alertWhenBreaching set to true, the alert will be triggered.

If the alert condition is resolved, and the alert policy has alertWhenResolved set to true, the alert will be triggered.

If the service level objective associated with the alert condition returns no value for the burn rate, for example due to the service level indicator missing data (e.g. no time series being returned), and alertWhenNoData is set to true, the alert will be triggered.

💡 Note: alertWhenBreaching, alertWhenResolved, and alertWhenNoData can be combined if you want an alert to trigger whenever at least one of these conditions is true.


An example of an alert condition:

apiVersion: openslo/v1
kind: AlertCondition
metadata:
  name: cpu-usage-breach
  displayName: CPU usage breach
spec:
  description: If the CPU usage is too high for given period then it should alert
  severity: page
  condition:
    kind: burnrate
    op: lte
    threshold: 2
    lookbackWindow: 1h
    alertAfter: 5m

AlertNotificationTarget

An Alert Notification Target defines the possible targets where alert notifications should be delivered to. For example, this can be a web-hook, Slack or any other custom target.

apiVersion: openslo/v1
kind: AlertNotificationTarget
metadata:
  name: string
  displayName: string # optional, human readable name
spec:
  target: # required
  description: # optional

An example Alert Notification Target:

apiVersion: openslo/v1
kind: AlertNotificationTarget
metadata:
  name: OnCallDevopsMailNotification
spec:
  description: Notifies by a mail message to the on-call devops mailing group
  target: email

Alternatively, a similar notification target can be defined for Slack like this:

apiVersion: openslo/v1
kind: AlertNotificationTarget
metadata:
  name: OnCallDevopsSlackNotification
spec:
  description: "Sends P1 alert notifications to the slack channel"
  target: slack
Notes (AlertNotificationTarget)
  • target string, describes the target of the notification, e.g. Slack, email, web-hook, Opsgenie, etc.
  • description string, optional description about the notification target, contains at most 1050 characters

💡 Note: How the alert notification targets are handled is an implementation detail of the system that consumes the OpenSLO specification.

For example, if the OpenSLO is consumed by a solution that generates Prometheus recording rules, and alerts, you can imagine that the name of the alert notification gets passed as a label to Alertmanager which then can be routed accordingly based on this label.
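
As a purely hypothetical illustration (not part of this spec; the label name notification_target and the receiver names are assumptions), such a label could be matched in an Alertmanager routing configuration:

route:
  receiver: default
  routes:
    - match:
        notification_target: OnCallDevopsMailNotification # hypothetical label emitted by the consuming tool
      receiver: on-call-devops-email
receivers:
  - name: default
  - name: on-call-devops-email
    email_configs:
      - to: devops-oncall@example.com # illustrative address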


Service

A Service is a high-level grouping of SLOs. It may be defined before creating an SLO so that it can be referred to in the SLO's spec.service. Multiple SLOs can refer to the same Service.

apiVersion: openslo/v1
kind: Service
metadata:
  name: string
  displayName: string # optional
spec:
  description: string # optional up to 1050 characters
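
An illustrative (non-normative) example of a Service; the name and description are assumptions:

apiVersion: openslo/v1
kind: Service
metadata:
  name: web-frontend
  displayName: Web Frontend
spec:
  description: Serves the customer-facing web UI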

SDK

DISCLAIMER: The SDK is a work in progress and is subject to change.

The OpenSLO SDK is a set of utilities designed for programmatic access to the OpenSLO specification.

openslo's Issues

`AlertCondition.alertAfter` type should be string

Summary

Currently the alertAfter field of AlertCondition, see here is defined as a number but the example uses 5m as a value.

To cater for a value of 5m, which is correct, the type should be string.

What is the current bug behavior?

Type for alertAfter field set to number, but 5m is an example value.

What is the expected correct behavior?

Type for alertAfter field set to string.

Possible fixes

Modify the type of alertAfter to be a string.

Clarify `AlertCondition.threshold`

Summary

Currently AlertCondition.threshold is defined as a number in the spec but an int in the Oslo struct.

I can see use cases of wanting to threshold on more than 100%, 200%, 300%, etc, such as 150%.

I don't know if "number" in our definition would support a value of 1.5, or whether it needs to change.

Once we've clarified the meaning in the spec, a follow-on issue would be to adjust the Oslo structs to align with it.

Possible fixes

Field is a float to support decimal values.

Allowed object name rules should be looser

Current rules for names are:

  • contain at most 63 characters
  • contain only lowercase alphanumeric characters or -
  • start with an alphanumeric character
  • end with an alphanumeric character

When thinking about a large enough SLO management system (and considering how many object types the current spec supports), I propose giving a freer hand to implementers to make sure large systems can handle "namespacing" objects. Therefore the naming rules should imo change to something like this:

Implementations MUST support the following obj naming rules:

  • contain at most 63 characters
  • contain only lowercase alphanumeric characters or -
  • start with an alphanumeric character
  • end with an alphanumeric character

Implementations SHOULD additionally support the following obj naming rules:

  • contain up to 255 characters
  • may contain the following special characters: - . | / \

Add operator to Alert Condition

Problem to solve

Right now, spec.condition doesn't have an operator (ie, "gt", "gte", "lt", etc) for the threshold. We should add that in order to be more specific about the alerting.

RatioMetrics support for single metric

In some cases (Dynatrace), ratio metrics are pre-computed by the platform and recording them won't fit with the OpenSLO spec. Might it be possible to support a flavor of the ratioMetrics that uses such a value?

Proposed behaviors/format

ratioMetrics:
  incremental: true
  rateMetric: true
  good:
    source: <source>
    queryType: <query_type>
    query: '<query>'
    filter: '<filter>' # optional
ratioMetrics:
  incremental: true
  good:
    source: <source>
    queryType: <query_type>
    query: '<query>'
    filter: '<filter>' # optional
  total:
    source: <source>
    queryType: <query_type>
    query: '<query>'
    filter: '<filter>' # optional

This would give the OpenSLO spec compatibility with the data elements of the Dynatrace SLO implementation.

Rate Metric Implementation

{
  "enabled": true,
  "name": "Payment service availability",
  "customDescription": "Rate of successful payments per week",
  "useRateMetric": true,
  "metricRate": "builtin:service.successes.server.rate",
  "evaluationType": "AGGREGATE",
  "filter": "type(\"HOST\")",
  "target": 95,
  "warning": 97.5,
  "timeframe": "-1d"
}

No Rate Metric Implementation

{
  "enabled": true,
  "name": "Payment service availability",
  "customDescription": "Rate of successful payments per week",
  "useRateMetric": false,
  "metricNumerator": "builtin:service.errors.server.successCount",
  "metricDenominator": "builtin:service.requestCount.total",
  "evaluationType": "AGGREGATE",
  "filter": "type(\"HOST\")",
  "target": 95,
  "warning": 97.5,
  "timeframe": "-1d"
}

API Version should probably be v1alpha1 not v1alpha

The typical pattern in Kubernetes APIs is v<eventual-version-number>[alpha|beta]<alpha/beta-version-number>, e.g. v1alpha1

Without the additional version number it becomes impossible to revise the spec using semantic versioning.

e.g. v1alpha1, v1alpha2, etc.

Right now, you would have to do v2alpha or jump to v1beta1 neither of which is probably desirable.

Allow defining alerts associated with SLOs

Problem to solve

Typically, when you define a SLO you want to be informed or alerted when they are not met. I would like to be able to define alerts for a SLO.

Proposal

I would like to propose an enhancement to the SLO specification so that you can associate or link alert definition to the SLO, e.g.

alerts:
  - ./alerts/my-slo-alert.yaml
  - ./alerts/my-slo/alert*.yaml

An alert definition for an SLO could be something like:

--- 
apiVersion: openslo/v1alpha
description: "A multi-window multi-burn alert is triggered when needed\n"
kind: Alert
metadata: 
  name: HighResponseTime
spec: 
  name: '<slo_1>'
  labels: 
    owner: "shopifiy"
    repo: "shopfiy/shopcart"
    tier: 2
  annotations:
    summary: "Drop in swift responses for Shopcart API"
    displayName: "High response times"
    runbook: https://guide.shopify.com/runbooks/high-response-time
  severity: ticket # severity of the alert (e.g. ticket, page, P1 board meeting request)
  method: simple # define the alert kind, e.g. multi window multi burn
  query: 
    measurement: burnRate # measurement to use
    expression: |  # the query used?
      some_query_for_resource

Further details

None

Links / references

https://nobl9.github.io/techdocs_YAML_Guide/

Add description length to kind spec

Problem to solve

For each kind that has a spec.description we should add a max length in order to be more specific, and in line with the metadata.description

Proposal

Add a max 1050 character limit to all spec.description

thresholdMetric and ratioMetric

Why is thresholdMetric under indicator and ratioMetric under objectives? Does this mean we can't have an objective for a thresholdMetric?

Also is it ratioMetric or ratioMetrics? The example in the README has ratioMetrics but the value under Objectives has ratioMetric

Spec is nice, but status is twice as nice

In kube-native resource definitions, we see a lot of implementations that neglect to define the expected shape of the status, but it is vital for the specification to define the minimum shape expected for the status of an observed resource.spec. Without this, observers of the resource will be unable to reconcile their state with the resource.

I recommend adding metadata.generation; this is the mutation count for .spec changes.

Then the minimum .status shape should be:

status:
  observedGeneration: <int> # generation of observed spec of resource.
  conditions:                            # array of conditions
    - lastTransitionTime: <datetime> # in kube api datetime of last meaningful transition of status.
      status: <triloolean> # "True", "False", "Unknown"
      type: Ready # the string "Ready"

Introduce `queryFrequency` to ratioMetric

Problem to solve

Hey, I'd like to suggest the addition of queryFrequency as an optional first-level field to ratioMetric. We already have metadata where we can pass extra information, but I believe we need this information for pretty much all data sources, so I'd like to suggest it be first level. This is to pass information on how often we want to run the query.

Thoughts on that or how else we can solve this?

Proposal

  objectives:
    - displayName: APILatency
      op: gt
      value: 1
      target: 0.95
      ratioMetric:
        incremental: true
        bad:
          source: MySource
          queryType: query
          queryFrequency: number # integer in seconds, e.g. 60 , 3600
          query: "some query"
        total:
          source: MySource
          queryType: query
          query: "some query"

ratioMetric is missing a `raw:` type

ratioMetric combines the good query or the bad query with the total query to get a ratio.

Except the ratio itself can already be present as a precalculated metric. Sloth supports this option with a sli.raw.error_ratio_query option (example).

I'm proposing adding a similar option to ratioMetric. Something like:

raw: # raw ratio query, either "good/bad" + "total" or "raw" must be provided
  type: success | error  # indicates whether query returns the success or the error ratio
  metricSource:
    (…)

(I'm not married to calling it raw, so if anybody has better ideas, feel free.)

End to end demo

Problem to solve

As a user, I want to be able to see a working demo of OpenSLO, from SLO definition all the way through to alerting.

Proposal

We should develop an end to end demo of OpenSLO, including:

  • Definition of SLOs and SLIs in code
  • Validation of the OpenSLO files in CI using oslo
  • Ingestion of the SLO and SLI definitions into a platform (in CI/CD?)
  • Having that platform query the data source as defined in the SLI
  • Representation of the SLO data as defined in the SLO file
  • Alerting based on alerting defined in the SLO file

Further details

We should probably choose and define the use cases, including:

  • Build validation
  • Production
  • Performance regression

Introduce `annotations` to SLO metadata

Problem to solve / Proposal

Hey, similar to k8s annotation system, introduce annotations to OpenSLO, where we can pass metadata to OpenSLO implementations. We have a use case where we want to have annotations for Cloudwatch or any kind of metrics publisher, that we would like to set on SLO level. By adding this the Spec is more customizable and by default is not adding any new implementation details.

---
apiVersion: openslo/v1alpha
kind: SLO
metadata:
  name: checkout-front-end-latency
  displayName: Checkout Front End Latency
  annotations:
    paddle.com/cloudwatch-publisher-enabled: true
    paddle.com/cloudwatch-namespace: SLI

Durations aren't consistent in spec

Summary

There are currently a couple different ways to describe time durations in the spec. One of them is an object, ex

  count : numeric
  unit : enum (minutes/hours/days/etc.)

and the other is common duration shorthand:

   count : numeric | duration shorthand (1m/3h/etc.)

We should unify on duration shorthand as discussed in the most recent community meeting. It is supported by common Go libraries, so consistent parsing shouldn't be a problem.

Consider supporting SLA4OAI spec

Problem to solve

SLA + SLO are things that go together. This spec defines the second.
However, there is also another spec with the goal of describing SLA, called SLA4OAI.

Would it make sense to add support, somehow? Perhaps just as a linked doc to this spec.

Proposal

To somehow have the ability to link an OpenSLO file with a SLA4OAI one.

--

Congrats on this great initiative folks! 🚀

Umbrella issue for defining schema in some common machine readable format

Problem to solve

As discussed many times, having a definition of schema in some parsable format is highly desirable. Let's discuss here possible solutions, pros, and cons for them and decide which one we want to choose.

Please keep in mind that YAML is only a format that we use to describe this because it's typical for the configuration/infrastructure world, but our YAMLs are convertible to JSONs. Furthermore, probably APIs of the platform will expect those definitions in JSON anyways. Basically, YAMLs are more human readable than bare schema. YAMLs should be able to be easily validated against the schema defined with the chosen solution.

Nice to have

  • popular / widely supported
  • support validation (a conditional one is required)
  • support aggregation (define a piece of schema and reuse it)
  • code generation for popular languages
  • use it seamlessly in https://github.com/OpenSLO/oslo
  • anything else?

Proposal

Document planned use cases

What are the expected consumers for these SLO definitions? There should be a section in the introduction which gives a few examples, to give people a better understanding of why they would use this.

For example:

  • Auto-generating documentation
  • Auto-generating alerts and reports
  • Providing an agenda for SLO review meetings
  • etc.

Drop .metricSource. nesting level

PR #111 added DataSource to the spec, which increased the flexibility.

But I think the implementation ended up more verbose than it could've. This is the level of nesting for a simple ratioMetric: spec.indicator.spec.ratioMetric.good.metricSource.spec.query.

I'm not sure what purpose the .metricSource. serves in there. I wonder if we couldn't drop it. Instead define thresholdMetric and ratioMetric's good, bad and ugly as being "of type metricSource", so the nesting would drop by at least one level to: spec.indicator.spec.ratioMetric.good.spec.query.

(We could leave defining it with .metricSource. included for backwards compatibility.)

Separation of SLO and SLI

In order to promote separation of concern between the SLOs and the SLIs, we should separate out the two. This has the added benefit of portability as well as allowing for better access control for the files. In addition, it keeps the SLO definition separate and independent of the data sources

Add additional alert condition measurements

Problem to solve

Currently for Alert Condition, we only have averageBurnRate. We should have some others that we support.

Proposal

We should have others, including timeToBurnBudget and burnedBudget

Create a shared demo app as a hello world for OpenSLO

Problem to solve

We currently have many different examples floating around all the various projects.

Proposal

Let's come together and create a decent hello world example SLO based on real world data.

We could use the https://openslo.com/ website itself and export its metrics.
Another idea would be using the example from the "Implementing Service Level Objectives" book!
https://www.wienershirtzel.com/

Note: Sorry for the typos, this issue was created during a live stream.

Clarify SLO AlertPolicy

Summary

In the SLO section, alertPolicies are not defined as being either inlined or a reference. We should decide on one, and clarify it.

What is the expected correct behavior?

Have in the spec if the alertPolicies in the SLO kind is a reference, inlined, or both.

Support Composite SLO in OpenSLO.

Problem to solve

Support Composite SLO in OpenSLO.
The goal of Composite SLO is to enable the user to capture an end-to-end journey.

Proposal

The Composite SLO will provide users the ability to define an SLO that is based on the health of many independent objectives that are defined with different queries, targets, etc. (even data sources).

As an example, the banking industry normally has 10-20 services for a user scenario. Withdrawing money from an ATM requires multiple services and transactions to happen in a low-latency, high-availability environment. The SLO in this case is the success of a customer withdrawing money from an ATM. One might imagine we would aim for 99.99% reliability, with all the underlying services and their SLOs making up the Composite SLO.

To enable the creation of a Composite SLO we will have to modify the current structure of our YAML files to enable specifying different SLI for each objective instead of having one SLI for one SLO object. We could achieve that by adding indicatorRef:

objectives:
  - displayName: string # optional
    op: lte | gte | lt | gt # conditional operator used to compare the SLI against the value. Only needed when using a thresholdMetric
    value: numeric # optional, value used to compare threshold metrics. Only needed when using a thresholdMetric
    target: numeric [0.0, 1.0) # budget target for given objective of the SLO
    timeSliceTarget: numeric (0.0, 1.0] # required only when budgetingMethod is set to TimeSlices
    timeSliceWindow: number | duration-shorthand # required only when budgetingMethod is set to TimeSlices
    indicatorRef: name of the SLI

or we could simply inline our SLI in the objective:

objectives:
  - displayName: string # optional
    op: lte | gte | lt | gt # conditional operator used to compare the SLI against the value. Only needed when using a thresholdMetric
    value: numeric # optional, value used to compare threshold metrics. Only needed when using a thresholdMetric
    target: numeric [0.0, 1.0) # budget target for given objective of the SLO
    timeSliceTarget: numeric (0.0, 1.0] # required only when budgetingMethod is set to TimeSlices
    timeSliceWindow: number | duration-shorthand # required only when budgetingMethod is set to TimeSlices
    thresholdMetric:
      metricSource:
        metricSourceRef: string
        spec:
          # arbitrary chosen fields for every data source type to make it comfortable to use

Further details

We could decide on our upcoming community meeting and discuss this way of handling Composite SLO in our standard.

Best practices repo

Problem to solve

A lot of people as they get started, are not quite sure where to start. Additionally, there are a number of best practices that we use, that are not documented or centralized

Proposal

In order to help people get started, as well as to utilize the learnings of others, we should start a collection of best practices for common scenarios.

Additionally we can reach out to other projects and show them how they can use the OpenSLO spec in their projects

`Second` unit referenced where minimum value should be `Minute`

- unit: Second

Currently the minimum RollingWindow is mentioned at 5 minutes, but the example references a Second value.

- **timeWindows[ ]** *TimeWindow* is a list but accepting only exactly one
  item, one of the rolling or calendar aligned time window:

  - Rolling time window. Minimum duration for rolling time window is 5
    minutes, maximum 31 days).

    ```yaml
    unit: Day | Hour | Minute
    count: numeric
    isRolling: true
    ```

We should update to be consistent across both pieces of documentation

allow ratioMetric to support single query

Currently, the ratioMetric requires two separate queries, one for good responses, the other for total. In some circumstances, it would be desirable to have a single query where good and total are inferred based on the query response. The spec would need to support both the tallying of the total number, as well as mapping of the values to the appropriate group e.g. true maps to good and increments the total, and false simply maps to total

SLO Aggregation

Problem to solve

Currently, multiple SLOs can encompass a single user journey, without a single SLO that measures the user experience.

Proposal

Add the ability to aggregate SLOs, and roll them up into a single SLO

This is similar to Keptn's Quality Gates, and proposed by Andres Grabner. More info here: https://www.youtube.com/watch?v=bMnMkOKVzdg

Further details

Key features:

  • Performance Signature
  • Synthetic SLI from multiple SLOs
  • Key SLOs
    • All will fail if this fails
  • Weighted
    • Total weight is the aggregation/sum of all weights
  • Performance testings
  • Regression Detection

Standardized OpenSLO to Markdown Support

Problem to solve

OpenSLO definitions are great for programmatic interfaces but aren't great for human reading. As an SLO adopter, I want a way to synchronize information between my SLO Documents and OpenSLO definitions. The former (SLO Documents) often include information that isn't as useful from a programmatic point of view, such as verbose descriptions, architecture diagrams, data workflow diagrams, etc.

This information is still critical to the SLO lifecycle, for communicating with stakeholders and gaining alignment, it's just not useful as a core part of the programmatic specification.

Proposal

This still needs a bit of brainstorming, but I'd love a tool that can look at an OpenSLO definition and generate a basic SLO Document in markdown. Additional fields (like Architecture Diagram images, for example) could be stored in metadata with a standard naming convention, which would help organize and generate the resulting markdown.

Too much flexibility here gets really close to a full blown CMS, so we'll want to brainstorm ways to keep things simple while still providing sufficient value to SLO adopters.

Further details

This was discussed a bit in the OpenSLO slack : https://openslo.slack.com/archives/C0202J83M3R/p1656536916703469

Links / references

Making TimeSlice-budgeted SLOs clearer

Issues

  1. Move timeSliceWindow directly under SLO.spec due to reasons stated in the following section.
  2. Defining timeSliceTarget is required for SLOs using Timeslices. I'm not familiar with real-world scenarios where such a calculation method is being used (anyone who is, please comment), but wouldn't it be true that the default timeSliceTarget would be expected to be the same as target? If yes, I'd suggest amending the spec to reflect that – that it defaults to the same value as target unless specified otherwise.
  3. Timeslices or TimeSlices – the spec says it should be the former, but some comments use the latter. We should probably keep with what's already present in v1 (Timeslices), on the other hand in general we tend to use CamelCase (AlertPolicy, DataSource, etc.).

Reasoning for timeSliceWindow move

This bit is expanded on in #160

Currently defining TimeSlice-budgeted SLOs happens in two places:

  1. SLO.spec.budgetingMethod: Timeslices to indicate how to perform SLO calculations.
  2. SLO.spec.objectives[].timeSliceTarget and .timeSliceWindow to provide the required details.

This makes me wonder why, if budgetingMethod: Timeslices is a property of the whole SLO, timeSliceWindow is hidden under .objectives[].. This is inconsistent to me, considering both parameters are required to actually have a working TimeSlice-budgeted SLO, therefore I'd expect them to be defined at the same level: either both directly under .SLO.spec or both in the .objectives[]..

The latter seems like the wrong approach (because if timeSliceWindow is under .objectives[]., then why not timeWindow as well), so I'm in favor of the first one – moving timeSliceWindow directly under SLO.spec.

(There's a wider discussion to be had on why we have .objectives[]. at all – under what circumstances a single Service Level Objective should have multiple objectives; but this isn't the place.)

Figure out where indicator belongs

Right now there seems to be a bit of an identity crisis between ratio-based metrics and threshold metrics. To drive more consistency in where each thing is defined, move the data in the spec around.

Current thresholdMetric Implementation

apiVersion: openslo/v1alpha
kind: SLO
metadata:
  name: '<slo_1>'
spec:
  description: '<slo_1_description>'
  service: <service_a>
  indicator:
    thresholdMetric:
      source: '<source>'
      query: '<query_1>'
      queryType: '<query_type>'
  timeWindows:
    - unit: <unit>
      count: <numeric>
      isRolling: <boolean>
  budgetingMethod: <method>
  objectives:
    - displayName: <display_name>
      op: <op>
      value: <value>
      target: <target>

Current ratioMetric Implementation

apiVersion: openslo/v1alpha
kind: SLO
metadata:
  name: '<slo_1>'
spec:
  description: '<slo_1_description>'
  service: <service_a>
  indicator:
  timeWindows:
    - unit: <unit>
      count: <numeric>
      isRolling: <boolean>
  budgetingMethod: <method>
  objectives:
    - displayName: <display_name>
       value: <value>
       target: <target>
       ratioMetric:
         incremental: <boolean>
         good:
           source: <source>
           queryType: <query>
           query: <query_string>
         total: # the denominator
           source: <source>
           queryType: <query>
           query: <query_string>

It would seem easier to manage the spec if the concept of the indicator was either universally handled outside of the objective or inside the objective, not split between them based on the metric type

Having this consistency would drive more universal consistency in SLO definitions.

Choices for implementation might look something like:

(a) ratioMetric as indicator

apiVersion: openslo/v1alpha
kind: SLO
metadata:
  name: '<slo_1>'
spec:
  description: '<slo_1_description>'
  service: <service_a>
  indicator:
    ratioMetric:
       incremental: <boolean>
       good:
         source: <source>
         queryType: <query>
         query: <query_string>
       total: # the denominator
         source: <source>
         queryType: <query>
         query: <query_string>
  timeWindows:
    - unit: <unit>
      count: <numeric>
      isRolling: <boolean>
  budgetingMethod: <method>
  objectives:
    - displayName: <display_name>
       value: <value>
       target: <target>

(b) indicator inside objective

apiVersion: openslo/v1alpha
kind: SLO
metadata:
  name: '<slo_1>'
spec:
  description: '<slo_1_description>'
  service: <service_a>
  timeWindows:
    - unit: <unit>
      count: <numeric>
      isRolling: <boolean>
  budgetingMethod: <method>
  objectives:
    - displayName: <display_name>
      op: <op>
      value: <value>
      target: <target>
      thresholdMetric:
        source: '<source>'
        query: '<query_1>'
        queryType: '<query_type>'

For the record, my vote is option (a) as this properly separates SLI from SLO in the spec

Labels and annotations are inconsistent, they should be fleshed out or dropped from the spec

General Schema for objects does not mention anything other than name and displayName. And yet the SLO definition also mentions labels: and annotations, while AlertNotificationTarget sort of mentions metadata.labels, but in an incomplete way (mentioned in Notes, missing in schema).

This needs to either be made consistent or dropped altogether.

Making it consistent

Drop those definitions from specific objects and put them in General Schema section.

Drop altogether

I understand that this is a vestige of the prevalence of k8s, but why should this even exist in the spec? If one uses the spec to create CRDs from, then all the standard k8s metadata content norms and habits will apply anyway, no matter what we do or don't mention in the spec.

But if we don't, then what's the point of expecting an implementer to support these? I can imagine implementations where labels or annotations serve no purpose.

IMO

I vote for dropping these from the spec and adding a short note in "General Schema" section mentioning that implementations of this spec on top of existing architectures (like k8s) will likely put a lot of stuff in metadata that's architecture–specific (e.g. annotations and labels in k8s) and the spec has no problem with these existing.

Evaluate synergies with Backstage service catalog

Just had this pop onto my radar thanks to a colleague, and didn't know quite how to get this into discussion. There's a current CNCF SIG for service catalogs called Backstage that seems to have a lot of synergy with this project. In particular, the "service" construct here seems to have some strong alignment with the Component or Resource constructs in Backstage's model. I wondered if it might be worth a look at supporting attachment of SLOs defined here to backstage components as well as the OpenSLO native service types.

https://github.com/backstage/backstage

Only 0.0 - 1.0 objectives?

target: numeric [0.0, 1.0) # budget target for given objective of the SLO

timeSliceTarget: numeric (0.0, 1.0] # required only when budgetingMethod is set to TimeSlices

This assumes the values will only be between 0% and 100%, which matches typical availability service level indicators and is generally good practice. That said, I have seen other kinds of SLIs in real life.

Relevant SRE book chapter:
https://sre.google/workbook/implementing-slos/
which generally advocates for defining the 0%-100% ones, and even introduces the notions of "SLI specification" and "SLI implementation", with only the latter matching 0%-100%.

TL;DR: I suggest opening up to other kinds of SLIs, not only the [0, 1] ones.

Create Custom Resource Definitions (CRDs) to represent the spec

Problem to solve

At the moment there is YAML within the README defining the desired structure, and it appears the specification is embedded into https://github.com/OpenSLO/oslo, making it difficult to grab the specification and utilize it.

Proposal

To align with the Kubernetes ecosystem, it would be great if there were CRDs that defined the various parts of the specification.
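
A minimal sketch of what one such CRD might look like, covering only the Service kind with a single description field. The openslo.com API group, the version name, and the schema details are placeholders invented for this sketch, not something the project has decided on:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  # hypothetical name; must be <plural>.<group>
  name: services.openslo.com
spec:
  group: openslo.com
  names:
    kind: Service
    singular: service
    plural: services
  scope: Namespaced
  versions:
    - name: v1alpha1 # illustrative version name
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                description:
                  type: string

Similar CRDs could then be generated (or hand-written) for SLO, SLI, AlertPolicy, and the other kinds, so that the spec is directly consumable in a Kubernetes cluster.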

Links / references

https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/

Better define sections/components of the spec

Problem to solve

There seem to be two different sets of data relevant to SLO/SLI discussions: the definition and the implementation. These have been alluded to in other conversations and issues as an important line to consider when deciding where the OpenSLO spec asserts itself and where it leaves things up to the specific implementation.

Consider, for the purposes of this discussion:
I have a service that returns checking account balance data.

I promise that this service will be available 99.99% of the time every month, else I will pay customers who experience a failure $15 per failure.

What additional information would we need to define an SLI/SLO for the above service?
What key attributes of the SLI/SLO do we want to capture as part of the agreement?
How would we record such an agreement in OpenSLO?

Proposal

Use this skeletal real-life example to build a better understanding of what goes into the definition of an SLI/SLO versus what data needs to be captured to actually derive the SLI/SLO from the underlying instrumentation, and propose adjustments to the spec accordingly.
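
For instance, the SLO half of the agreement above might be captured roughly like this under the current v1alpha shape. The service name, data source, and queries are invented for this sketch, and the $15-per-failure penalty has no home in the spec today, which is exactly the kind of gap this discussion should surface:

apiVersion: openslo/v1alpha
kind: SLO
metadata:
  name: checking-balance-availability
spec:
  description: 99.99% of balance requests succeed in each calendar month
  service: checking-balance-service # hypothetical Service name
  budgetingMethod: Occurrences
  timeWindows:
    - unit: Month
      count: 1
      isRolling: false
  objectives:
    - displayName: availability
      target: 0.9999
      ratioMetrics:
        incremental: true
        good:
          source: prometheus # illustrative data source and queries
          queryType: promql
          query: sum(increase(http_requests_total{service="balance",code=~"2.."}[1m]))
        total:
          source: prometheus
          queryType: promql
          query: sum(increase(http_requests_total{service="balance"}[1m]))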

Relation of Services and SLOs not clear

It is not entirely clear from the documentation that a service needs to be defined before you can reference it from an SLO.

The SLO's service attribute should call that out, and the Service documentation should explain what is done with the service string. Some people expect a Service to contain a list of SLOs, when in fact the relation is defined the other way around.
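
To make the direction of the relation concrete, a small sketch (names invented): the Service is defined on its own, and each SLO points to it by name; the Service never lists its SLOs.

apiVersion: openslo/v1alpha
kind: Service
metadata:
  name: web-frontend
spec:
  description: Customer-facing web frontend

---
apiVersion: openslo/v1alpha
kind: SLO
metadata:
  name: web-frontend-availability
spec:
  description: Availability SLO for the web frontend
  service: web-frontend # references the Service above by name
  # remaining SLO fields omitted for brevity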

Introduce `Bad` as ratio-metric value

Problem to solve

👋 Hey, there are some use cases where it's easier to measure the bad events of an SLO (e.g. count of 5xx responses) than the good ones, and then compute Total - Bad = Good. This matters especially if we want this to be usable universally with any kind of data source.

Proposal

I recommend adding a bad ratio-metric value and updating the SLO spec so that either good or bad must be set.

     ratioMetrics:
        incremental: true
        bad:
          source: datadog
          queryType: query
          query: sum:requests.error{*}
        total:
          source: datadog
          queryType: query
          query: sum:requests.total{*}

Looking forward to your thoughts 😄

Resolve inconsistency in labels example

Summary

The label example in the spec uses the form shown below, but the definition has labels as an array of strings, and Oslo defines it differently as well.

We should resolve the inconsistency by updating the example in the spec.

What is the current bug behavior?

Label example uses:

metadata:
  name: string
  displayName: string # optional
  labels:
    userImpacting: "true"
    team: "identity"

Possible fixes

Based on the definition the example should be:

metadata:
  name: string
  displayName: string # optional
  labels:
    userImpacting:
      - "true"
    team:
      - "identity"

Define schema in .cue?

I think defining the schema using https://cuelang.org/ would bring a lot of benefits at this early stage of the project, e.g.:

  • Simplified configuration/schema generation via flexible data and schema expression language
  • Constraints, e.g. bounds (numbers), optional/required/default fields, etc.
  • Import/export from/to standard formats, incl. CRD/OpenAPI

Related: OpenSLO/oslo#23

Tracking dependencies

It would be nice if we had a way of tracking inter-component dependency relationships.

Would be useful in the following cases:

  1. Ensure a dependent's SLO never exceeds its dependencies'
    I.e. if A depends on (B and C), and B and C each have an SLO of 99%, then A should not have an SLO exceeding 99^2 / 100 = 98.01% (see the sketch after this list).

  2. Circular dependencies between services: optional (opt-out) circular dependency detection. Two (or more) services that are circularly dependent would share the same SLO, which is the minimum SLO among them.
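
A purely hypothetical sketch of what declaring such a dependency might look like; no dependsOn field exists in the spec today, it is only here to illustrate the idea:

apiVersion: openslo/v1alpha
kind: Service
metadata:
  name: service-a
spec:
  description: Depends on service-b and service-c
  dependsOn: # hypothetical field, not part of the spec
    - service-b
    - service-c

A validating implementation could then warn when service-a's SLO target exceeds the product of its dependencies' targets (0.99 × 0.99 = 0.9801 in the example above), or when it detects a dependency cycle.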

Make source specification more general

Problem to solve

Currently, the fields for defining the data-gathering step in the YAMLs look like this:

  source: string # data source for the metric
  queryType: string # a name for the type of query to run on the data source
  query: string # the query to run to return the metric
  metadata: # optional, allows data source specific details to be passed

which works nicely for data sources in which each metric is specified with a single string, e.g.:

  • Prometheus -> xd_server_requests{code="2xx",host="xd.com"}
  • Datadog -> avg:trace.http.request.duration{*}
  • Dynatrace -> builtin:synthetic.http.duration.geo:filter(and(in("dt.entity.http_check",entitySelector("type(http_check),entityName(~"API Sample~")")),in("dt.entity.synthetic_location",entitySelector("type(synthetic_location),entityName(~"N. California~")")))):splitBy("dt.entity.http_check","dt.entity.synthetic_location"):avg:auto:sort(value(avg,descending)):limit(20) (referenced as metric selector)
  • Graphite -> stats.response.200 (referenced as metric path)
  • Splunk -> search index=xd-events source=udp:5072 sourcetype=syslog status<400 | bucket _time span=1m | stats avg(response_time) as n9value by _time | rename _time as n9time | fields n9time n9value
  • Splunk Observability -> "data('demo.trans.latency', filter=filter('demo_datacenter', 'Tokyo') and filter('demo_host', 'server4')).mean().publish()" (referenced as program)
  • New Relic -> SELECT average(duration*1000) FROM Transaction WHERE appName='production' TIMESERIES

Even when this string is referred to by a different name in the data source's docs, it's quite obvious where to put it.

But for some other data sources, fetching data comfortably requires specifying more, e.g.:

  • Lightstep -> two values required XzpycSRa (referenced as stream ID) and good (referenced as type of data)
  • Pingdom -> two values required 5435381 (referenced as check ID) and up (referenced as status)
  • App Dynamics -> two values required xd (referenced as application name) and End User Experience|App|Very Slow Requests (referenced as metricPath)
  • For Redshift, CloudWatch, BigQuery, and other integrations from big cloud providers it may be handy to pass region, project id, name of database, etc.

The above is a little hard to describe with our current approach to the schema; it's not flexible enough. I know that extra details can be put in the metadata section, but that may not be obvious to users. And what should be done with queryType and query in such a situation?

Proposal

I'm proposing a flexible approach that can be customized for each data source vendor; below is an initial draft proposal to discuss.

  source:
    type: string # predefined type, e.g. Prometheus, Datadog, etc.
    spec:
      # arbitrarily chosen fields for each data source type, to make it comfortable to use
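
For illustration, two hypothetical filled-in fragments under this proposal (each belonging to a different SLI), one for a stream-based source and one for a cloud-provider source. The type values and spec field names are invented for this sketch; each vendor would define its own:

  # stream-based source: two identifiers instead of a single query string
  source:
    type: Lightstep
    spec:
      streamId: XzpycSRa
      dataType: good

  # cloud-provider source: regional and metric details instead of a query
  source:
    type: CloudWatch
    spec:
      region: eu-central-1
      namespace: AWS/ApplicationELB
      metricName: HTTPCode_Target_5XX_Count
      stat: Sum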

Further details

It will make the OpenSLO spec flexible enough for any metrics vendor.

Define target between 0-100

During my past work on slo-libsonnet I had several people point out that writing targets between 0 and 1 by hand seems weird and unintuitive.
Targets like 99.5 are a lot more readable than 0.995. Given that OpenSLO is also meant to be written by hand, I wanted to bring it up here, as it's been brought up with me before.

It's somewhat related to #24 but still different enough to have a separate issue, I believe.
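
A quick side-by-side of the two forms under discussion; the percent variant is only the proposal, not something the spec supports today:

  objectives:
    - displayName: availability
      target: 0.995 # current spec: a fraction in [0.0, 1.0)
      # proposed alternative (not in the spec): a percentage in [0, 100)
      # target: 99.5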
