openslo / slogen

tool to create and manage content for reliability tracking from logs/event data.

License: Apache License 2.0

Go 98.07% Makefile 0.08% HCL 1.49% Shell 0.36%

command-line-tool golang openslo reliability slo sumologic terraform

slogen's Introduction

Introduction
Specification
Examples
Glossary
Work in progress for future versions
- v2alpha1

Introduction

The intent of this document is to outline the OpenSLO specification.

The goal of this project is to provide an open specification for defining SLOs to enable a common, vendor–agnostic approach to tracking and interfacing with SLOs. Platform-specific implementation details are purposefully excluded from the scope of this specification.

OpenSLO is an open specification i.e., it is a specification created and controlled, in an open and fair process, by an association or a standardization body intending to achieve interoperability and interchangeability. An open specification is not controlled by a single company or individual or by a group with discriminatory membership criteria. Additionally, this specification is designed to be extended where needed to meet the needs of the implementation.

Before making a contribution, please read our contribution guidelines.

Specification

Goals

Compliance with the Kubernetes YAML format
Vendor-agnostic
Be flexible enough to be extended elsewhere

General Schema

apiVersion: openslo/v1
kind: DataSource | SLO | SLI | AlertPolicy | AlertCondition | AlertNotificationTarget | Service
metadata:
  name: string
  displayName: string # optional
  labels: # optional, it's allowed to assign multiple values to a single key
    # example labels
    organization: "acme"
    team:
      - "identity"
      - "rbac"
    costCentre: "project1"
    serviceTier:
      - "tier-1"
  annotations: # optional
    # example annotations
    openslo.com/key1: value1
    fooimplementation.com/key2: value2
spec:

Notes (General Schema)

kind string - required, one of: DataSource, SLO, SLI, AlertPolicy, AlertCondition, AlertNotificationTarget, Service
metadata.name: string - required field
- all implementations must at least support object names that follow RFC1123:
  - are up to 63 characters in length
  - contain lowercase alphanumeric characters or -
  - start with an alphanumeric character
  - end with an alphanumeric character
- implementations are additionally encouraged to support names that:
  - are up to 255 characters in length
  - contain lowercase alphanumeric characters or -, ., |, /, \
metadata.labels: map[string]string|string[] - optional field key <> value
- the key segment is required and must contain at most 63 characters beginning and ending with an alphanumeric character [a-z0-9A-Z] with dashes -, underscores _, dots . and alphanumerics between.
- the value of key segment can be a string or an array of strings
metadata.annotations: map[string]string - optional field key <> value
- annotations should be used to define implementation / system specific metadata about the SLO. For example, it can be metadata about a dashboard url, or how to name a metric created by the SLI, etc.
- a key has two segments: an optional prefix and a name, separated by a slash /
- the name segment is required and must contain at most 63 characters beginning and ending with an alphanumeric character [a-z0-9A-Z] with dashes -, underscores _, dots . and alphanumerics between.
- the prefix is optional and must be a DNS subdomain: a series of DNS labels separated by dots ., it must contain at most 253 characters, followed by a slash /.
- the openslo.com/ is reserved for OpenSLO usage
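
The naming rules above can be checked mechanically. The following is a minimal sketch, not part of the specification: the function names are hypothetical, only the baseline RFC1123 name rule is covered (not the optional 255-character extension), and treating each DNS label of an annotation prefix like an object name is an assumption.

```python
import re

# Baseline rule from the notes above: lowercase alphanumerics or "-",
# starting and ending with an alphanumeric, at most 63 characters.
_NAME_RE = re.compile(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?")

# Label keys and annotation name segments: alphanumeric at both ends,
# with "-", "_", "." and alphanumerics in between, at most 63 characters.
_SEGMENT_RE = re.compile(r"[A-Za-z0-9]([A-Za-z0-9._-]*[A-Za-z0-9])?")


def is_valid_name(name: str) -> bool:
    """Check metadata.name against the baseline RFC1123 rule."""
    return len(name) <= 63 and _NAME_RE.fullmatch(name) is not None


def is_valid_label_key(key: str) -> bool:
    """Check a metadata.labels key."""
    return len(key) <= 63 and _SEGMENT_RE.fullmatch(key) is not None


def is_valid_annotation_key(key: str) -> bool:
    """Check an annotation key of the form [prefix/]name."""
    prefix, slash, name = key.rpartition("/")
    if slash:
        # A slash was given, so the prefix must be a non-empty DNS subdomain.
        if not prefix or len(prefix) > 253:
            return False
        # Assumption: every DNS label in the prefix has the shape of a name.
        if not all(is_valid_name(label) for label in prefix.split(".")):
            return False
    return is_valid_label_key(name)
```

For example, `is_valid_annotation_key("openslo.com/key1")` accepts the annotation key used in the general schema, while a name such as `Foo` is rejected because it contains an uppercase character.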

Custom Data Types

duration-shorthand

The duration shorthand is specified as a single–word string (no whitespaces) consisting of a positive integer number followed by a case–sensitive single–character postfix.

Allowed postfixes are:

m – minutes
h – hours
d – days
w – weeks
M – months
Q – quarters
Y – years

Examples: 12h, 4w, 1M, 1Q, 365d, 1Y.

This specification does not put requirements on how (or whether) to implement each postfix, therefore implementers are free to pick an implementation that best suits their environments.

There is however the possibility that future versions of this spec will take a more prescriptive stance on this issue.
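
As an illustration of the grammar described above, here is a small parsing sketch. It only splits the shorthand into a number and a unit name; how a unit such as `M` or `Q` maps to real time is left to the implementation, as the specification says. The `Duration` type and `parse_duration` function are hypothetical names.

```python
import re
from typing import NamedTuple

# Case-sensitive postfixes listed in the specification.
_UNITS = {
    "m": "minutes",
    "h": "hours",
    "d": "days",
    "w": "weeks",
    "M": "months",
    "Q": "quarters",
    "Y": "years",
}
_DURATION_RE = re.compile(r"([0-9]+)([mhdwMQY])")


class Duration(NamedTuple):
    value: int
    unit: str


def parse_duration(text: str) -> Duration:
    """Parse a duration shorthand such as '12h' or '1Q'."""
    match = _DURATION_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a duration shorthand: {text!r}")
    value = int(match.group(1))
    if value <= 0:
        raise ValueError(f"duration must be a positive integer: {text!r}")
    return Duration(value, _UNITS[match.group(2)])
```

Because the postfix is case-sensitive, `parse_duration("1M")` yields months while `parse_duration("1m")` yields minutes, and inputs containing whitespace are rejected.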

Object Types

💡 Note: Specific attributes are described in detail in the Notes subsection of each object type's section.

DataSource

A DataSource represents connection details with a particular metric source.

Check work in progress for v2.

apiVersion: openslo/v1
kind: DataSource
metadata:
  name: string
  displayName: string # optional
spec:
  description: string # optional up to 1050 characters
  type: string # predefined type e.g. Prometheus, Datadog, etc.
  connectionDetails:
    # fields used for creating a connection with particular datasource e.g. AccessKeys, SecretKeys, etc.
    # everything that is valid YAML can be put here

Notes (DataSource)

DataSource enables reusing one source between many SLOs and moving connection specific details (e.g. authentication) away from SLO definitions.

This spec does not enforce naming conventions for data source types, however the OpenSLO project will publish guidelines in the form of supplementary materials once common patterns start emerging from implementations.

An example of the DataSource kind can be:

apiVersion: openslo/v1
kind: DataSource
metadata:
  name: string
  displayName: string # optional
spec:
  type: CloudWatch
  connectionDetails:
    accessKeyID: accessKey
    secretAccessKey: secretAccessKey

SLO

A service level objective (SLO) is a target value or a range of values for a service level that is described by a service level indicator (SLI).

Check work in progress for v2.

apiVersion: openslo/v1
kind: SLO
metadata:
  name: string
  displayName: string # optional
spec:
  description: string # optional up to 1050 characters
  service: string # name of the service to associate this SLO with, may refer (depends on implementation) to existing object Kind: Service
  indicator: # see SLI below for details
  indicatorRef: string # name of the SLI. Required if indicator is not given.
  timeWindow:
    # exactly one item; one of possible: rolling or calendar–aligned time window
    ## rolling time window
    - duration: duration-shorthand # duration of the window eg 1d, 4w
      isRolling: true
    # or
    ## calendar–aligned time window
    - duration: duration-shorthand # duration of the window eg 1M, 1Q, 1Y
      calendar:
        startTime: 2020-01-21 12:30:00 # date with time in 24h format, format without time zone
        timeZone: America/New_York # name as in IANA Time Zone Database
      isRolling: false # if omitted assumed `false` if `calendar:` is present
  budgetingMethod: Occurrences | Timeslices | RatioTimeslices
  objectives: # see objectives below for details
  alertPolicies: # see alert policies below for details

Notes (SLO)

indicator optional, represents the Service Level Indicator (SLI), described in the SLI section. One of indicator or indicatorRef must be given. When declaring a composite SLO, it must be moved into objectives[].
indicatorRef optional, this is the name of the Service Level Indicator (SLI). One of indicator or indicatorRef must be given. When declaring a composite SLO, it must be moved into objectives[].
timeWindow[ ] optional, a list that accepts exactly one item, either a rolling or a calendar-aligned time window:
- Rolling time window. Duration should be provided in shorthand format, e.g. 5m, 4w, 31d.
- Calendar-aligned time window. Duration should be provided in shorthand format, e.g. 1d, 2M, 1Q, 366d.
description string optional field, contains at most 1050 characters
budgetingMethod enum(Occurrences | Timeslices | RatioTimeslices), required field
- Occurrences method uses a ratio of counts of good events to the total count of the events.
- Timeslices method uses a ratio of good time slices to total time slices in a budgeting period.
- RatioTimeslices method uses an average of all time slices' success ratios in a budgeting period.
objectives[ ] Threshold, required field, described in Objectives section. If thresholdMetric has been defined, only one Threshold can be defined. However, if using ratioMetric then any number of Thresholds can be defined.
alertPolicies[ ] AlertPolicy, optional field. An alert policy can be defined inline or can refer to an AlertPolicy object, in which case the following is required:
- alertPolicyRef string: this is the name or path to the AlertPolicy
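
The three budgeting methods in the notes above can be compared on the same data. This is a rough sketch under stated assumptions: each time slice is given as a hypothetical `(good, total)` pair with `total > 0`, and a slice counts as good when its success ratio is at least the `timeSliceTarget`. The specification does not prescribe these details.

```python
from typing import Sequence, Tuple

Slice = Tuple[int, int]  # (good events, total events) for one time slice


def occurrences(slices: Sequence[Slice]) -> float:
    """Ratio of all good events to all events in the budgeting period."""
    good = sum(g for g, _ in slices)
    total = sum(t for _, t in slices)
    return good / total


def timeslices(slices: Sequence[Slice], time_slice_target: float) -> float:
    """Ratio of good time slices to total time slices.

    Assumption: a slice is good when good/total >= time_slice_target.
    """
    good_slices = sum(1 for g, t in slices if g / t >= time_slice_target)
    return good_slices / len(slices)


def ratio_timeslices(slices: Sequence[Slice]) -> float:
    """Average of the per-slice success ratios."""
    return sum(g / t for g, t in slices) / len(slices)
```

With the slices `[(100, 100), (90, 100), (10, 20)]`, Occurrences weighs every event equally, Timeslices only asks whether each slice met its target, and RatioTimeslices gives every slice the same weight regardless of its event count, so the three methods generally return different values.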

Objectives

Objectives are the thresholds for your SLOs. You can use objectives to define the tolerance levels for your metrics.

objectives:
  - displayName: string # optional
    op: lte | gte | lt | gt # conditional operator used to compare the SLI against the value. Only needed when using a thresholdMetric
    value: numeric # optional, value used to compare threshold metrics. Only needed when using a thresholdMetric
    target: numeric [0.0, 1.0) # budget target for given objective of the SLO, can't be used with targetPercent
    targetPercent: numeric [0.0, 100) # budget target for given objective of the SLO, can't be used with target
    timeSliceTarget: numeric (0.0, 1.0] # required only when budgetingMethod is set to Timeslices
    timeSliceWindow: number | duration-shorthand # required only when budgetingMethod is set to Timeslices or RatioTimeslices
    indicator: # required only when creating composite SLO, see SLI below for more details
    indicatorRef: string # required only when creating composite SLO, required if indicator is not given.
    compositeWeight: numeric (0.0, inf+] # optional, supported only when declaring multiple objectives, default value 1.

Example:

objectives:
  - displayName: Foo Total Errors
    target: 0.98
  - displayName: Bar Total Errors
    targetPercent: 99.99

Notes (Objectives)

op enum( lte | gte | lt | gt ), operator used to compare the SLI against the value. Only needed when using a thresholdMetric
value numeric, required field, used to compare values gathered from metric source. Only needed when using a thresholdMetric.

Either target or targetPercent must be used.

target numeric [0.0, 1.0), optional, but either this or targetPercent must be used. Budget target for a given objective of the SLO. A target: 0.9995 is equivalent to targetPercent: 99.95.
targetPercent: numeric [0.0, 100), optional, but either this or target must be used. Budget target for a given objective of the SLO. A targetPercent: 99.95 is equivalent to target: 0.9995.
timeSliceTarget numeric [0.0, 1.0], required only when the budgeting method is set to Timeslices
timeSliceWindow (numeric | duration-shorthand), required only when the budgeting method is set to Timeslices or RatioTimeslices. Denotes the size of a time slice for which data will be evaluated, e.g. 5, 1m, 10m, 2h, 1d. It also determines the frequency at which to run the queries. If specified as a plain number, the unit is interpreted as minutes.
indicator optional, represents the Service Level Indicator (SLI), described in SLI section. One of indicator or indicatorRef must be given in objective when creating composite SLO.
indicatorRef optional, this is the name of Service Level Indicator (SLI). One of indicator or indicatorRef must be given when creating composite SLO.
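
Since target and targetPercent express the same budget target on different scales, an implementation will typically normalise them to one value. A minimal sketch, assuming the objective has been loaded into a plain dictionary (the `resolve_target` helper is hypothetical):

```python
def resolve_target(objective: dict) -> float:
    """Return the budget target of an objective as a fraction in [0.0, 1.0)."""
    has_target = "target" in objective
    has_percent = "targetPercent" in objective
    if has_target == has_percent:
        raise ValueError("exactly one of target or targetPercent must be set")

    if has_target:
        target = float(objective["target"])
        if not 0.0 <= target < 1.0:
            raise ValueError("target must be in [0.0, 1.0)")
        return target

    percent = float(objective["targetPercent"])
    if not 0.0 <= percent < 100.0:
        raise ValueError("targetPercent must be in [0.0, 100)")
    return percent / 100.0
```

Checking the range before dividing keeps the half-open interval exact; an objective that sets both fields, or neither, is rejected.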

Notes (Composite SLO)

Composite SLO: the goal of a composite SLO is to let the user describe an end-to-end journey by defining many independent objectives. Each objective can have different queries, data sources and targets. The basic implementation assumes that the Composite Error Budget burns if the Error Budget for any of the SLO objectives within the Composite SLO is burning. The logic of those calculations is the same for Composite SLOs as for regular (standard) objectives and SLOs.

Weight allows the user to change the impact of a given SLO on the whole composite SLO. Weight is just a multiplier: with a weight of 0.5 an SLO has half the default impact, while with a weight of 100 it is 100 times more impactful. By default, weight has the value 1 and doesn't need to be specified.

Calculations should be as simple as possible to make composite SLOs intuitive and easy to implement. It is hard to compare different error budget calculation methods, therefore all objectives of a composite SLO need to use the same budgeting method. Here is a brief description of how each budgeting method should impact a composite SLO and how weight scales that impact:

Occurrences - if an SLO burns its budget, the composite burns its budget at the same rate. Each violation that consumes the SLO's budget impacts the composite at the same rate. Weight multiplies the rate at which the SLO burns the composite's budget (referred to as the burn rate).
Timeslices - this is binary, depending on whether it was a good or bad minute. If it was a bad minute for any individual objective, it is considered a bad minute for the Composite SLO.
RatioTimeslices - the composite is 100 percent minus the sum of each SLO's shortfall from 100 percent. If two SLOs each have a RatioTimeslices average of 95%, the composite will have an average of 90%. Weight multiplies the shortfall of the given SLO.
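
The Timeslices and RatioTimeslices rules for composites can be written down in a few lines. This is only a sketch of one reading of the description above: the function names are hypothetical, and clamping the composite at 0 when the weighted shortfalls exceed 100 percent is an assumption the specification does not state.

```python
from typing import Optional, Sequence


def composite_ratio_timeslices(
    ratios: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Combine per-objective RatioTimeslices averages (fractions in [0, 1]).

    Each objective contributes its shortfall (1 - ratio) multiplied by its
    compositeWeight, which defaults to 1.
    """
    if weights is None:
        weights = [1.0] * len(ratios)
    if len(weights) != len(ratios):
        raise ValueError("ratios and weights must have the same length")
    shortfall = sum(w * (1.0 - r) for r, w in zip(ratios, weights))
    # Assumption: the composite value cannot drop below 0.
    return max(0.0, 1.0 - shortfall)


def composite_timeslice_is_good(objective_slice_is_good: Sequence[bool]) -> bool:
    """A time slice is good for the composite only if it is good for every objective."""
    return all(objective_slice_is_good)
```

With two objectives at 95% and default weights this gives roughly 90%, matching the example in the text; halving the weight of one objective halves its 5% shortfall and yields roughly 92.5%.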

SLI

A service level indicator (SLI) represents how to read metrics from data sources.

Check work in progress for v2.

apiVersion: openslo/v1
kind: SLI
metadata:
  name: string
  displayName: string # optional
spec:
  description: string # optional up to 1050 characters
  thresholdMetric: # either thresholdMetric or ratioMetric must be provided
    metricSource:
      metricSourceRef: string # optional, this field can be used to refer to DataSource object
      type: string # optional, this field describes predefined metric source type e.g. Prometheus, Datadog, etc.
      spec:
        # arbitrary chosen fields for every data source type to make it comfortable to use
        # anything that is valid YAML can be put here.
  ratioMetric: # either thresholdMetric or ratioMetric must be provided
    counter: true | false # true if the metric is a monotonically increasing counter,
                          # or false, if it is a single number that can arbitrarily go up or down
                          # ignored when using "raw"
    good: # the numerator, either "good" or "bad" must be provided if "total" is used
      metricSource:
        metricSourceRef: string # optional
        type: string # optional
        spec:
          # arbitrary chosen fields for every data source type to make it comfortable to use.
    bad: # the numerator, either "good" or "bad" must be provided if "total" is used
      metricSource:
        metricSourceRef: string # optional
        type: string # optional
        spec:
          # arbitrary chosen fields for every data source type to make it comfortable to use.
    total: # the denominator used with either "good" or "bad", either this or "raw" must be used
      metricSource:
        metricSourceRef: string # optional
        type: string # optional
        spec:
          # arbitrary chosen fields for every data source type to make it comfortable to use.

    rawType: success | failure # required with "raw", indicates how the stored ratio was calculated:
                               #  success – good/total
                               #  failure – bad/total
    raw: # the precomputed ratio stored as a metric, can't be used together with good/bad/total
      metricSource:
        metricSourceRef: string # optional
        type: string # optional
        spec:
          # arbitrary chosen fields for every data source type to make it comfortable to use.

Notes (SLI)

description string optional field, contains at most 1050 characters

Either ratioMetric or thresholdMetric must be used.

thresholdMetric Metric, represents the query used for gathering data from metric sources. Raw data is used to compare objectives (threshold) values.
ratioMetric Metric {good, total}, {bad, total} or raw.
- counter enum(true | false), specifies whether the metric is a monotonically increasing counter. Has no effect when using a raw query.
- good represents the query used for gathering data from metric sources used as the numerator. Received data is used to compare objectives (threshold) values to find good values. If bad is defined then good must not be set.
- bad represents the query used for gathering data from metric sources used as the numerator. Received data is used to compare objectives (threshold) values to find bad values. If good is defined then bad must not be set.
- total represents the query used for gathering data from metric sources that is used as the denominator. Received data is used to compare objectives (threshold) values to find total number of metrics.
- rawType enum(success | failure), required when using raw, specifies whether the ratios represented by the "raw" ratio metric are of successes or failures. Not to be used with good and bad as picking one of those determines the type of ratio.
- raw represents the query used for gathering already precomputed ratios. The type of ratio (success or failure) is specified using rawType.

An example of an SLO where SLI is inlined:

apiVersion: openslo/v1
kind: SLO
metadata:
  name: foo-slo
  displayName: Foo SLO
spec:
  service: foo
  indicator:
    metadata:
      name: foo-error
      displayName: Foo Error
    spec:
      ratioMetric:
        counter: true
        good:
          metricSource:
            metricSourceRef: datadog-datasource
            type: Datadog
            spec:
              query: sum:trace.http.request.hits.by_http_status{http.status_code:200}.as_count()
        total:
          metricSource:
            metricSourceRef: datadog-datasource
            type: Datadog
            spec:
              query: sum:trace.http.request.hits.by_http_status{*}.as_count()
  objectives:
    - displayName: Foo Total Errors
      target: 0.98

Ratio Metric

If a service level indicator has ratioMetric defined, the following maths can be used to calculate the value of the SLI. Below we describe the advised formulas for calculating the indicator value.

Good-Total queries: if the good and total queries are given, then the following formula can be used to calculate the value:

indicatorValue = good / total

If we have 99 good requests out of a total of 100 requests, the calculated value for the indicator would be: 99 / 100 = 0.99. This represents 99% on a 0-100 scale using the formula 0.99 * 100 = 99.

Bad-Total queries: if the bad and total queries are given, then the following formula can be used to calculate the value:

indicatorValue = ( total - bad ) / total

If we have 1 error out of a total of 100 requests, the calculated value for the indicator would be: (100 - 1) / 100 = 0.99. This represents 99% on a 0-100 scale using the formula 0.99 * 100 = 99.

💡 Note: As you can see for both query combinations we end up with the same calculated value for the service level indicator.
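
The two advised formulas can be expressed as one small helper, shown here as a sketch. The function name and signature are hypothetical; it assumes the good, bad and total query results have already been reduced to numbers.

```python
from typing import Optional


def indicator_value(
    total: float,
    good: Optional[float] = None,
    bad: Optional[float] = None,
) -> float:
    """Compute the SLI value from good/total or bad/total query results."""
    if (good is None) == (bad is None):
        raise ValueError("exactly one of good or bad must be provided")
    if total <= 0:
        raise ValueError("total must be positive")
    if good is not None:
        return good / total
    return (total - bad) / total
```

Both `indicator_value(100, good=99)` and `indicator_value(100, bad=1)` evaluate to 0.99, which is the equivalence the note above points out.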

The required spec key will be used to pass extraneous data to the data source. The goal of this approach is to provide maximum flexibility when querying data from a particular source. In the following examples we can see that this works fine for both simple and more complex cases.

An example of ratioMetric:

ratioMetric:
  counter: true
  good:
    metricSource:
      type: Prometheus
      metricSourceRef: prometheus-datasource
      spec:
        query: sum(localhost_server_requests{code=~"2xx|3xx",host="*",instance="127.0.0.1:9090"})
  total:
    metricSource:
      type: Prometheus
      metricSourceRef: prometheus-datasource
      spec:
        query: localhost_server_requests{code="total",host="*",instance="127.0.0.1:9090"}

An example of thresholdMetric:

thresholdMetric:
  metricSource:
    metricSourceRef: redshift-datasource
    spec:
      region: eu-central-1
      clusterId: metrics-cluster
      databaseName: metrics-db
      query: SELECT value, timestamp FROM metrics WHERE timestamp BETWEEN :date_from AND :date_to

The type field can be omitted because the type will be inferred from the DataSource when metricSourceRef is specified.

An example thresholdMetric that does not reference a defined DataSource:

thresholdMetric:
  metricSource:
    type: Redshift
    spec:
      region: eu-central-1
      clusterId: metrics-cluster
      databaseName: metrics-db
      query: SELECT value, timestamp FROM metrics WHERE timestamp BETWEEN :date_from AND :date_to
      accessKeyID: accessKey
      secretAccessKey: secretAccessKey

The type field can't be omitted because no reference to an existing DataSource is specified.

AlertPolicy

An Alert Policy allows you to define the alert conditions for an SLO.

apiVersion: openslo/v1
kind: AlertPolicy
metadata:
  name: string
  displayName: string # optional
spec:
  description: string # optional up to 1050 characters
  alertWhenNoData: boolean
  alertWhenResolved: boolean
  alertWhenBreaching: boolean
  conditions: # list of alert conditions
    - conditionRef: # required when alert condition is not inlined
  notificationTargets:
  - targetRef: # required when alert notification target is not inlined

Notes (AlertPolicy)

description string, optional description about the alert policy, contains at most 1050 characters
alertWhenBreaching boolean, true, false, whether the alert should be triggered when the condition is breaching
alertWhenResolved boolean, true, false, whether the alert should be triggered when the condition is resolved
alertWhenNoData boolean, true, false, whether the alert should be triggered when the condition indicates that no data is available
conditions[ ] Alert Condition, an array (max of one condition), required field. A condition can be defined inline or can refer to an external AlertCondition object, in which case the following is required:
- conditionRef string: this is the name or path to the AlertCondition
notificationTargets[ ] Alert Notification Target, required field. A target can be defined inline or can refer to an AlertNotificationTarget object, in which case the following is required:
- targetRef string: this is the name or path to the AlertNotificationTarget

💡 Note: The conditions field is of the type array of AlertCondition but only allows one single condition to be defined. The use of an array is for future-proofing purposes.

An example of an Alert policy which refers to another Alert Condition:

apiVersion: openslo/v1
kind: AlertPolicy
metadata:
  name: AlertPolicy
  displayName: Alert Policy
spec:
  description: Alert policy for cpu usage breaches, notifies on-call devops via email
  alertWhenBreaching: true
  alertWhenResolved: false
  conditions:
    - conditionRef: cpu-usage-breach
  notificationTargets:
    - targetRef: OnCallDevopsMailNotification

An example of an Alert Policy where the Alert Condition is inlined:

apiVersion: openslo/v1
kind: AlertPolicy
metadata:
  name: AlertPolicy
  displayName: Alert Policy
spec:
  description: Alert policy for cpu usage breaches, notifies on-call devops via email
  alertWhenBreaching: true
  alertWhenResolved: false
  conditions:
    - kind: AlertCondition
      metadata:
        name: cpu-usage-breach
        displayName: CPU Usage breaching
      spec:
        description: SLO burn rate for cpu-usage-breach exceeds 2
        severity: page
        condition:
          kind: burnrate
          op: lte
          threshold: 2
          lookbackWindow: 1h
          alertAfter: 5m
  notificationTargets:
    - targetRef: OnCallDevopsMailNotification

AlertCondition

An Alert Condition allows you to define under which conditions an alert for an SLO needs to be triggered.

apiVersion: openslo/v1
kind: AlertCondition
metadata:
  name: string
  displayName: string # optional
spec:
  description: string # optional up to 1050 characters
  severity: string # required
  condition: # required
    kind: string
    op: enum
    threshold: number
    lookbackWindow: duration-shorthand
    alertAfter: duration-shorthand

Notes (AlertCondition)

description string, optional description about the alert condition, contains at most 1050 characters
severity string, required field describing the severity level of the alert (ex. "sev1", "page", etc.)
condition, required field. Defines the conditions of the alert
- kind enum(burnrate), the kind of alerting condition that is checked, defaults to burnrate

If the kind is burnrate the following fields are required:

op enum(lte | gte | lt | gt), required field, the conditional operator used to compare against the threshold
threshold number, required field, the threshold that you want to alert on
lookbackWindow duration-shorthand, required field, the time-frame for which to calculate the threshold e.g. 5m
alertAfter duration-shorthand, required field, the duration for which the condition needs to hold before alerting, defaults to 0m

If the alert condition is breaching, and the alert policy has alertWhenBreaching set to true, the alert will be triggered.

If the alert condition is resolved, and the alert policy has alertWhenResolved set to true, the alert will be triggered.

If the service level objective associated with the alert condition returns no value for the burn rate, for example due to the service level indicators missing data (e.g. no time series being returned), and alertWhenNoData is set to true, the alert will be triggered.

💡 Note: alertWhenBreaching, alertWhenResolved and alertWhenNoData can be combined if you want an alert to trigger whenever at least one of these conditions is true.
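
The interplay between a burnrate condition and the three alertWhen flags can be sketched as follows. This is an illustration only: the `should_alert` helper is hypothetical, the burn rate is assumed to be computed elsewhere over the lookbackWindow, alertAfter is ignored, and "resolved" is interpreted here as "was breaching on the previous evaluation and no longer is".

```python
import operator
from typing import Optional

# Comparison operators allowed by the op field.
_OPS = {
    "lte": operator.le,
    "gte": operator.ge,
    "lt": operator.lt,
    "gt": operator.gt,
}


def should_alert(
    burn_rate: Optional[float],
    was_breaching: bool,
    condition: dict,
    policy: dict,
) -> bool:
    """Decide whether a policy triggers for one evaluation of its condition.

    burn_rate is None when the SLO returned no value (no data).
    """
    if burn_rate is None:
        return bool(policy.get("alertWhenNoData", False))

    breaching = _OPS[condition["op"]](burn_rate, condition["threshold"])
    if breaching:
        return bool(policy.get("alertWhenBreaching", False))
    if was_breaching:
        # Assumption: a previously breaching condition that stopped is "resolved".
        return bool(policy.get("alertWhenResolved", False))
    return False
```

For instance, with `condition = {"op": "gt", "threshold": 2}` and a policy that only sets `alertWhenBreaching: true`, a burn rate of 3 triggers the alert, while a later drop to 1 does not unless `alertWhenResolved` is also enabled.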

An example of an alert condition:

apiVersion: openslo/v1
kind: AlertCondition
metadata:
  name: cpu-usage-breach
  displayName: CPU usage breach
spec:
  description: If the CPU usage is too high for given period then it should alert
  severity: page
  condition:
    kind: burnrate
    op: lte
    threshold: 2
    lookbackWindow: 1h
    alertAfter: 5m

AlertNotificationTarget

An Alert Notification Target defines the possible targets where alert notifications should be delivered to. For example, this can be a web-hook, Slack or any other custom target.

apiVersion: openslo/v1
kind: AlertNotificationTarget
metadata:
  name: string
  displayName: string # optional, human readable name
spec:
  target: # required
  description: # optional

An example Alert Notification Target:

apiVersion: openslo/v1
kind: AlertNotificationTarget
metadata:
  name: OnCallDevopsMailNotification
spec:
  description: Notifies by a mail message to the on-call devops mailing group
  target: email

Alternatively, a similar notification target can be defined for Slack like this:

apiVersion: openslo/v1
kind: AlertNotificationTarget
metadata:
  name: OnCallDevopsSlackNotification
spec:
  description: "Sends P1 alert notifications to the slack channel"
  target: slack

Notes (AlertNotificationTarget)

target string, describes the target of the notification, e.g. Slack, email, web-hook, Opsgenie etc
description string, optional description about the notification target, contains at most 1050 characters

💡 Note: The way the alert notification targets are handled is an implementation detail of the system that consumes the OpenSLO specification.

For example, if the OpenSLO is consumed by a solution that generates Prometheus recording rules, and alerts, you can imagine that the name of the alert notification gets passed as a label to Alertmanager which then can be routed accordingly based on this label.

Service

A Service is a high-level grouping of SLOs. It may be defined before creating an SLO so that the SLO can refer to it in its spec.service. Multiple SLOs can refer to the same Service.

apiVersion: openslo/v1
kind: Service
metadata:
  name: string
  displayName: string # optional
spec:
  description: string # optional up to 1050 characters

slogen's People

Contributors

kumoroku arunsechergy tzmvp arpitjain305 patkaehuaea

slogen's Issues

binary for linux x86_64 does not work

From the release page for v1.0.1, I downloaded slogen_1.0.0_Linux_x86_64.tar.gz (strange that the version number is different!). And I attempted to run it:

$ ./slogen --help
./slogen: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by ./slogen)
./slogen: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by ./slogen)
./slogen: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by ./slogen)
./slogen: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.35' not found (required by ./slogen)
./slogen: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.29' not found (required by ./slogen)

I'm not sure what the problem is here, but the previous release works fine on my system (HP Elitebook G5 running Ubuntu 18.04.6).

Here's the release from https://github.com/OpenSLO/slogen/releases/tag/v1.0-beta :

$ slogen --help

Generates terraform files from openslo compatible yaml configs. 
Generated terraform files can be used to configure SLO monitors, scheduled views & dashboards in sumo.
One or more config or directory containing configs can be given as arg. Doesn't supports regex/wildcards as input.

Usage:
  slogen [paths to yaml config]... [flags]
  slogen [command]

Examples:
slogen service/search.yaml 
slogen ~/team-a/slo/ ~/team-b/slo ~/core/slo/login.yaml
slogen ~/team-a/slo/ -o team-a/tf


Available Commands:
  completion  Generate the autocompletion script for the specified shell
  destroy     destroy the content generated from the slogen command, equivalent to 'terraform destroy'
  docs        generate markdown documents of this tool in the specified path
  help        Help about any command
  list        utility command to get additional info about your sumo resources e.g. 
  new         create a sample config from given profile
  validate    A brief description of your command

Flags:
  -o, --out string                output directory where to create the terraform files (default "tf")
  -d, --dashboardFolder string    root dashboard folder where to organise the dashboards per service (default "slogen-tf-dashboards")
  -m, --monitorFolder string      root monitor folder where to organise the monitors per service (default "slogen-tf-monitors")
  -i, --ignoreErrors              whether to continue validation even after encountering errors
  -p, --plan                      show plan output after generating the terraform config
  -a, --apply                     apply the generated terraform config as well
  -c, --clean                     clean the old tf files for which openslo config were not found in the path args
      --asModule                  whether to generate the terraform config as a module
      --useViewHash               whether to use descriptive or hashed name for the scheduled views, hashed names ensure data for old view is not used when the query for it changes
      --onlyNative                whether to generate only the native slo resources
      --sloFolder string          root slo folder where to organise native slos by service (default "slogen")
      --sloMonitorFolder string   root monitor folder where to organise native slo monitors by service (default "slogen")
  -h, --help                      help for slogen

Use "slogen [command] --help" for more information about a command.

And here's my machine info:

$ cat /etc/os-release 
NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.6 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

$ uname -a
Linux SunilChop2019-Lunix 4.15.0-194-generic #205-Ubuntu SMP Fri Sep 16 19:49:27 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Hierarchical SLO rollup/aggregation

An advanced version of #6, so as to provide insights for various service/product groups, business units, etc.

OpenSLO discussion on this : #6

Burnrate alerts aren't working correctly

I have an SLO burn-rate alert with a 30m short window and a 6h long window. I've set the same threshold on both.

When the SLO was triggered, it was quite quick (within 5m) but the alert took 6 hours to resolve after it went back to normal.

I would have expected it to be resolved quickly according to https://sre.google/workbook/alerting-on-slos/

Looking into this a bit deeper, I think that the threshold values on the monitor take 6 hours to evaluate, and it might not be possible to do "Multiwindow, Multi-Burn-Rate Alerts" using sumologic's monitors.
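For reference, the multiwindow condition described in the linked workbook chapter can be sketched as below. This is a minimal illustration only, not slogen's or Sumo Logic's implementation: `burn_rate` and `should_alert` are hypothetical names, and the error ratios are made-up inputs. It shows why the alert would be expected to clear soon after recovery: the short window stops exceeding its limit even while the long window still contains the incident.

```python
def burn_rate(error_ratio: float, target: float) -> float:
    """Burn rate = observed error ratio divided by the allowed error ratio (1 - target)."""
    return error_ratio / (1.0 - target)


def should_alert(short_error_ratio: float, long_error_ratio: float,
                 target: float, short_limit: float, long_limit: float) -> bool:
    """Fire only while BOTH windows burn faster than their limits."""
    return (burn_rate(short_error_ratio, target) > short_limit
            and burn_rate(long_error_ratio, target) > long_limit)


# During the incident: both the 30m and the 6h window are burning fast (made-up ratios).
firing = should_alert(0.10, 0.08, target=0.99, short_limit=6, long_limit=6)

# Shortly after recovery: the 6h window still contains the incident,
# but the 30m window is clean, so the combined condition no longer holds.
recovered = should_alert(0.001, 0.08, target=0.99, short_limit=6, long_limit=6)

print(firing, recovered)
```

If a monitor evaluates only the long-window threshold, the second case would stay in alert until the incident ages out of the 6h window, which matches the behaviour reported here.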

pass proxy config to terraform session

Explore gaps in metric query language to do SLO tracking on them

add forecasted values to end of month in the panel for SLO/budget remaining

day labels for honeycomb displaying weekly trend in SLO dashboard

support for dynamic specification of unplanned maintenance windows

to be done via a separate yaml config or user uploaded lookup table

ability to create monitors on sub-SLIs generated via the "fields" specified

e.g. if customerID is specified as a field to aggregate SLI data on, also allow creating monitors on it, so that alerts can be triggered if the error budget for even a single customer goes below 0%
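A rough sketch of the per-field check this request describes, assuming good/total event counts are already available per customerID. The function name, the customer ids, and the counts are all hypothetical; this is not how slogen computes budgets.

```python
def budget_remaining(good: int, total: int, target: float) -> float:
    """Fraction of the error budget left; negative once the budget is overspent."""
    if total == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_bad = (1.0 - target) * total
    bad = total - good
    return 1.0 - bad / allowed_bad


# Made-up (good, total) counts per customerID for one compliance period.
counts = {"cust-a": (9990, 10000), "cust-b": (940, 1000)}
target = 0.99

# Customers whose individual error budget has dropped below 0%.
breached = sorted(
    customer for customer, (good, total) in counts.items()
    if budget_remaining(good, total, target) < 0
)
print(breached)
```

Here cust-a has spent about 10% of its budget while cust-b has overspent it, so only cust-b would trigger a per-customer alert.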

The "Overview Dashboard" terraform code isn't static

The "tf/dashboards/overview-.tf" file is constantly changing, each time I run "slogen . --apply --clean".

I believe this is because the map is unordered and the rows are being pulled out arbitrarily: https://github.com/SumoLogic-Labs/slogen/blob/main/libs/overview.go#L193

This means that my terraform code changes on each run, rather than being hermetic.

This is a low priority, but it would be nice to have these rows sorted.
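The linked code is Go, where map iteration order is deliberately randomized, so the usual fix is to collect the keys and sort them before rendering. The idea is sketched below in Python purely for illustration; `render_rows` and the service/panel names are hypothetical and do not come from slogen.

```python
from typing import Dict, List


def render_rows(panels_by_service: Dict[str, List[str]]) -> str:
    """Render one line per panel, visiting services and panels in sorted order."""
    lines = []
    for service in sorted(panels_by_service):
        for panel in sorted(panels_by_service[service]):
            lines.append(f"{service}/{panel}")
    return "\n".join(lines)


# The same content arriving in two different orders (as an unordered map might yield it).
first_run = {"payments": ["latency", "errors"], "accounts": ["availability"]}
second_run = {"accounts": ["availability"], "payments": ["errors", "latency"]}

# Sorting makes the generated text identical across runs.
print(render_rows(first_run) == render_rows(second_run))
```

With sorted keys the generated file would only change when the underlying SLO set changes, which keeps terraform plans free of spurious diffs.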

SLO dashboards should also be able to do 30 days rolling

All of the dashboards created by slogen are "This Month", and since it's the 1st of March, most of the data is only showing for 1 day.

I'd like the dashboards to be the last 30 days rolling instead so that I can better assess if the SLO is being met.

enable dash variables to be regex/wildcards

Currently the filter is in the form where ("{{task}}"="*" or task="{{task}}"); it needs to be | where ("{{task}}"="*" or task matches "{{task}}") instead, so that wildcards are also applied while filtering.

SLO Burn rate monitoring is incorrect

Hi team, I've been using your tool extensively and I am loving it!

I have come across an issue with the monitoring of an SLO.

My current alerting configuration is as follows:

apiVersion: openslo/v1alpha
kind: SLO
metadata:
  displayName: xxx
  name: xxx
spec:
  service: xxx
  budgetingMethod: Occurrences
  objectives:
    - ratioMetrics:
        total:
          source: sumologic
          queryType: Logs
          query: |
            xxx
        good: 
          source: sumologic
          queryType: Logs
          query: 'xxx'
        incremental: true
      displayName: xxx
      target: 0.99
alerts:
  burnRate:
    - shortWindow: '10m'
      shortLimit: 14
      longWindow: '1h'
      longLimit: 14
      notifications:
        - connectionType: 'Email'
          recipients:
            - 'xxx@xxx'
          triggerFor:
            - Warning
            - ResolvedWarning
    - shortWindow: '30m'
      shortLimit: 6
      longWindow: '6h'
      longLimit: 6
      notifications:
        - connectionType: 'Email'
          recipients:
            - 'xxx@xxx'
          triggerFor:
            - Warning
            - ResolvedWarning
    - shortWindow: '6h'
      shortLimit: 1
      longWindow: '24h'
      longLimit: 1
      notifications:
        - connectionType: 'Email'
          recipients:
            - 'xxx@xxx'
          triggerFor:
            - Warning
            - ResolvedWarning

When I evaluate the SLO over a 24h period, it is currently at 98.41897% (which is below the 99% needed to meet the SLO).

I would have expected that I would receive at least 1 email stating that this SLO is not being met, however all the monitors generated aren't being triggered.

I'm wondering if the calculation of one of these items may be incorrect?
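As a sanity check on the numbers above, the 24h burn rate implied by the reported SLI can be worked out directly. This is plain arithmetic under the usual definition of burn rate (error ratio divided by error budget), not a description of how the generated monitors compute it; the variable names are illustrative.

```python
target = 0.99
sli_24h = 0.9841897  # 24h figure quoted in the report

error_budget = 1.0 - target                      # allowed error ratio, about 0.01
burn_rate_24h = (1.0 - sli_24h) / error_budget   # about 1.58

# The 6h/24h rule uses longLimit 1, so the long window alone is over its limit.
exceeds_long_limit = burn_rate_24h > 1

# Error ratio each rule needs (in both of its windows) before it can fire.
min_error_ratio = {limit: limit * error_budget for limit in (14, 6, 1)}

print(round(burn_rate_24h, 3), exceeds_long_limit, min_error_ratio)
```

So the 24h window does exceed the limit of 1, but a multiwindow rule would also need the 6h short window above 1 at the same time. The report does not include the short-window figures, so this alone does not show whether the monitors are miscalculating.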

Current version: There is no slogen command to output the version, but I'm pointing to the latest of the main branch.

dashboard variable for fields specified for each SLO dashboard

currently only added to Global overview dashboards for top 4 common labels/fields

Missing Data Alerts

The Monitors created don't have useful missing data alerts.

I'd like to be alerted if I screw up a query and the total count goes to 0 for x minutes.

Panel : Worst hours with respect to error budget depletion

Top N slots where error budget was most affected

OOB samples/templates for standard services

Based on terraform-sumologic-sumo-logic-monitor//monitor_packages

Changing SLOs causing issues with scheduled views

I've noticed that when I update an SLO, often the dashboards and graphs do not update with the new data for that SLO.

As an example, I'm currently writing latency SLOs as follows:

apiVersion: openslo/v1alpha
kind: SLO
metadata:
  displayName: api Latency
  name: api-ltc
spec:
  service: account-opening
  description: The amount of POST requests to /api that are faster than 2s
  budgetingMethod: Occurrences
  objectives:
    - ratioMetrics:
        total:
          source: sumologic
          queryType: Logs
          query: |
            _sourceCategory=/xxx
            | kv "path", "elapsed_time"
            | where path matches /api/
        good: 
          source: sumologic
          queryType: Logs
          query: 'elapsed_time < 2000'
        incremental: true
      displayName: api calls that are faster than 2s
      target: 0.90

Now, if I update the SLO to a new value of 1s, the graphs and data that populate the new dashboards are still the data from the original elapsed_time < 2000 query.

I have a feeling this is because the data kept in a scheduled view is not deleted, but just disabled.
https://help.sumologic.com/Manage/Scheduled-Views/Pause-or-Disable-Scheduled-Views#disable-a-scheduled-view

Once disabled, no additional data can be indexed in a scheduled view. A disabled scheduled view is not technically deleted, but it can't be re-enabled. If you disable a view and later create a new view with the same name, you won't see duplicate results; instead all the data from both scheduled views are treated as one.

If this is true, I wonder if it's worth hashing the query and appending it to the scheduled view name?
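A minimal sketch of that idea, assuming the view name is built from the SLO name plus a short digest of the query text. `view_name`, the separator, and the 8-character length are hypothetical choices, not slogen's actual naming scheme (the --useViewHash flag in the help output earlier describes the tool's own option for this).

```python
import hashlib


def view_name(slo_name: str, query: str, hash_len: int = 8) -> str:
    """Append a short SHA-256 prefix of the query so a changed query gets a new view name."""
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return f"{slo_name}_{digest[:hash_len]}"


old = view_name("api-ltc", "elapsed_time < 2000")
new = view_name("api-ltc", "elapsed_time < 1000")

# Different queries map to different view names; the same query is stable across runs.
print(old != new, old == view_name("api-ltc", "elapsed_time < 2000"))
```

Because a new name means a new scheduled view, data indexed under the old query would not be merged into the results for the updated SLO.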

Often when I switch an SLO, I expect it to stop firing and re-evaluate the data.

SLO breakdown panel by fields specified in the config

add documentation on getting connection-id from sumo ui

aggregated overview dashboards at service level

Apps are made of services, and apps might be part of a portfolio. For example, a bank may have online banking as a portfolio consisting of the following hierarchy:

mobile banking app -> consisting of account service and bill pay service
credit card app -> consisting of account service and payment service

In this example, SLOs/SLIs defined at the service level would roll up to the corresponding app.

Pre-requisite:

One or more SLOs have been defined for various services

Success Scenario:

1. Operations user (e.g. developer) picks which services should be grouped (ideally by consulting the service map)
2. System validates the following:
   - the evaluation type is identical for the chosen services (e.g. periodic or aggregate, but not mixing the two)
   - compliance period and type match
3. If the groupings are valid, SLIs/SLOs/error budgets are visualised

support for timeslices based budgeting

Timeslices method uses a ratio of good time slices vs. total time slices in a budgeting period.
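The ratio described above can be sketched as follows, assuming a slice counts as "good" when its own success ratio meets a per-slice target. The function name, the per-slice target, and the sample ratios are illustrative assumptions, not part of slogen.

```python
from typing import Sequence


def timeslice_sli(slice_ratios: Sequence[float], slice_target: float) -> float:
    """Return good_slices / total_slices for one budgeting period."""
    if not slice_ratios:
        raise ValueError("at least one time slice is required")
    good_slices = sum(1 for ratio in slice_ratios if ratio >= slice_target)
    return good_slices / len(slice_ratios)


# Made-up success ratios for five time slices; one slice (0.80) misses the 0.95 target.
ratios = [1.0, 0.99, 0.80, 0.97, 1.0]
sli = timeslice_sli(ratios, slice_target=0.95)
print(sli)
```

Unlike the Occurrences method, a single bad slice costs the same amount of budget regardless of how many requests fell inside it.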

add subcommand to get list of monitor connections with ids

Work around the issue that there is currently no way in the UI to get the connection id needed for the alerting config.

allow timezone param to set in yaml for monitor notifications

Support for other Sumologic connections

The current sumologic email connection is helpful, but I need to use the Webhook connection as well, with the payload_override described in the terraform module.

This will allow me to add the custom template variables seen here: https://github.com/SumoLogic-Labs/slogen/blob/main/libs/templates/terra/monitor.tf.gotf#L57.

samples config for timeslices based budgeting

detailed documentation on configuring multi-window multi-burn alerts

samples for advanced use cases of SLO

min,max etc
bracketed as used by search team
