sensu-go-fatigue-check-filter's Introduction

Sensu Go Fatigue Check Filter

Overview

The Sensu Go Fatigue Check Filter is a Sensu Event Filter for managing alert fatigue.

A typical use of filters is to reduce alert fatigue. One of the most common examples is the following filter, which only passes through events on their first occurrence and every hour after that.

---
type: EventFilter
api_version: core/v2
metadata:
  name: hourly
  namespace: default
spec:
  action: allow
  expressions:
  - event.check.occurrences == 1 || event.check.occurrences % (3600 / event.check.interval)
    == 0
  runtime_assets: []
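
To make the arithmetic in the hourly expression concrete, here is a minimal sketch that evaluates the same condition as a plain JavaScript function. The function name `hourlyFilterPasses` and the 60-second check interval are assumptions chosen for illustration; they are not part of Sensu or of this filter.

```javascript
// Sketch of the "hourly" expression as a plain function.
// `occurrences` and `interval` stand in for event.check.occurrences
// and event.check.interval (in seconds).
function hourlyFilterPasses(occurrences, interval) {
    return occurrences === 1 || occurrences % (3600 / interval) === 0;
}

// With a 60-second check interval, 3600 / 60 = 60, so the event passes
// on occurrence 1 and then on every 60th occurrence.
var passing = [];
for (var n = 1; n <= 130; n++) {
    if (hourlyFilterPasses(n, 60)) {
        passing.push(n);
    }
}
console.log(passing.join(", ")); // 1, 60, 120
```

Note that both the initial occurrence (1) and the repeat period (3600 seconds) are fixed inside the expression, so changing either means defining a new filter.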

However, the filter above has some limitations. Suppose you have one check in particular that you want to alert only after three (3) occurrences. Typically that might mean creating another handler and filter pair to assign to that check. If you have to do this often enough, you end up with an unwieldy mass of handlers and filters.

That's where this Fatigue Check Filter comes in. Using annotations, it makes the number of occurrences and the interval tunable on a per-check or per-entity basis. It also allows you to control whether or not resolution events are passed through.

Usage examples

N/A

Configuration

Sensu Go

Asset registration

Assets are the best way to make use of this plugin. If you're not using an asset, please consider doing so! If you're using sensuctl 5.13 or later, you can use the following command to add the asset:

sensuctl asset add sensu/sensu-go-fatigue-check-filter --rename fatigue-check-filter

Note that the --rename is not necessary, but references to the runtime asset in the filter definition as in the example below would need to be updated to match.

If you're using an earlier version of sensuctl, you can download the asset definition from this project's Bonsai asset index page.

You can create your own asset by creating a tar file containing lib/fatigue_check.js and creating your asset definition accordingly.

Asset definition

If not using sensuctl asset add:

---
type: Asset
api_version: core/v2
metadata:
  name: fatigue-check-filter
  namespace: default
spec:
  sha512: 2e67975df7d993492cd5344edcb9eaa23b38c1eef7000576b396804fc2b33362b02a1ca2f7311651c175c257b37d8bcbbce1e18f6dca3ca04520e27fda552856
  url: http://example.com/sensu/assets/fatigue-check.tar.gz

Filter definition

---
type: EventFilter
api_version: core/v2
metadata:
  name: fatigue_check
  namespace: default
spec:
  action: allow
  expressions:
  - fatigue_check(event)
  runtime_assets:
  - fatigue-check-filter

Handler definition

---
type: Handler
api_version: core/v2
metadata:
  namespace: default
  name: email
spec:
  type: pipe
  command: sensu-email-handler -f [email protected] -t [email protected] -s smtp.example.com
    -u emailuser -p sup3rs3cr3t
  timeout: 10
  filters:
  - is_incident
  - not_silenced
  - fatigue_check

Check definition

---
type: CheckConfig
api_version: core/v2
metadata:
  name: linux-cpu-check
  namespace: default
  annotations:
    fatigue_check/occurrences: '3'
    fatigue_check/interval: '900'
    fatigue_check/allow_resolution: 'false'
spec:
  command: check-cpu -w 90 -c 95
  handlers:
  - email
  interval: 60
  publish: true
  runtime_assets: 
  subscriptions:
  - linux

Entity definition

Via the agent.yml:

---
##
# agent configuration
##

#name: ""

#namespace: "default"

#subscriptions: 
#  - "localhost"

annotations:
  fatigue_check/occurrences: "3"
  fatigue_check/interval: "900"
  fatigue_check/keepalive_occurrences: "1"
  fatigue_check/keepalive_interval: "300"
  fatigue_check/allow_resolution: "false"

[...]

Keepalives

Keepalives do not have check resources with annotations that can be used to tune this filter. Using standard entity annotations would override the settings for all other checks. To address this specific case, two additional tunables exist for customizing this filter for keepalive events. These can be set as arguments to the fatigue_check() function in the filter definition or as entity annotations to override the defaults on a per entity basis.
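
As a rough illustration of why separate keepalive tunables are needed, the sketch below picks which annotation keys apply to a given event. It assumes keepalive events can be recognized by the check name `keepalive`; the helper names `isKeepalive` and `annotationKeysFor` are hypothetical and are not taken from the filter's source.

```javascript
// Assumption: a keepalive event carries a check named "keepalive".
function isKeepalive(event) {
    return event.check.metadata.name === "keepalive";
}

// Hypothetical helper: choose the annotation keys that apply to this event,
// so keepalive settings never collide with the settings for normal checks.
function annotationKeysFor(event) {
    if (isKeepalive(event)) {
        return {
            occurrences: "fatigue_check/keepalive_occurrences",
            interval: "fatigue_check/keepalive_interval"
        };
    }
    return {
        occurrences: "fatigue_check/occurrences",
        interval: "fatigue_check/interval"
    };
}

var keepaliveEvent = { check: { metadata: { name: "keepalive" } } };
var cpuEvent = { check: { metadata: { name: "linux-cpu-check" } } };
console.log(annotationKeysFor(keepaliveEvent).interval); // fatigue_check/keepalive_interval
console.log(annotationKeysFor(cpuEvent).interval);       // fatigue_check/interval
```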

Cron scheduled checks

Since cron-scheduled checks do not provide an explicit interval, this filter has to compute the apparent interval using the history available in the event. For this to work, there must be at least two entries in the event's check history. The only time there would be fewer than two history entries is during the first two check executions. By default this should not be an issue, but in the unlikely case that the interval is set this low, a default interval of 60 seconds is used.
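
The interval computation described above might look roughly like the following sketch. It is not the filter's actual code: the helper name `apparentInterval` is hypothetical, and it assumes that a cron-scheduled check reports an interval of 0 and that each history entry carries an `executed` Unix timestamp in seconds.

```javascript
// Hypothetical helper: derive a usable interval (in seconds) for a check.
function apparentInterval(check) {
    // Interval-scheduled checks already carry an explicit interval.
    if (check.interval > 0) {
        return check.interval;
    }
    var history = check.history || [];
    // Fewer than two history entries: fall back to 60 seconds.
    if (history.length < 2) {
        return 60;
    }
    // Otherwise use the gap between the two most recent executions.
    var last = history[history.length - 1].executed;
    var previous = history[history.length - 2].executed;
    return last - previous;
}

var cronCheck = { interval: 0, history: [{ executed: 1000 }, { executed: 1300 }] };
var newCronCheck = { interval: 0, history: [{ executed: 1000 }] };
console.log(apparentInterval(cronCheck));        // 300
console.log(apparentInterval(newCronCheck));     // 60
console.log(apparentInterval({ interval: 30 })); // 30
```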

Annotations

The Fatigue Check Filter makes use of four annotations within the check and/or entity metadata for normal checks, with an additional two keepalive annotations available in the entity metadata. Entity annotations take precedence over check annotations. All annotations take precedence over the fatigue_check() function arguments and defaults.

| Annotation | Default | Usage |
| --- | --- | --- |
| fatigue_check/occurrences | 1 | On which occurrence to allow the initial event to pass through for normal checks |
| fatigue_check/interval | 1800 | In seconds, at what interval to allow subsequent events to pass through, ideally a multiple of the check interval for normal checks |
| fatigue_check/allow_resolution | true | Determines whether or not a resolution event is passed through |
| fatigue_check/suppress_flapping | true | Determines whether or not to suppress events for checks that are marked as flapping |
| fatigue_check/keepalive_occurrences | 1 | On which occurrence to allow the initial event to pass through for keepalives (entity only) |
| fatigue_check/keepalive_interval | 1800 | In seconds, at what interval to allow subsequent events to pass through, ideally a multiple of the check interval for keepalives (entity only) |
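
The precedence order described above (entity annotation, then check annotation, then function argument, then built-in default) can be sketched as follows for the numeric settings. The helper `resolveSetting` is hypothetical, and the annotation maps are passed in as plain objects so the sketch makes no claim about where they live inside the event.

```javascript
// Hypothetical helper: resolve one numeric setting by precedence.
// Annotation values are strings, so they are parsed as base-10 integers.
function resolveSetting(key, entityAnnotations, checkAnnotations, argument, builtin) {
    var has = Object.prototype.hasOwnProperty;
    if (entityAnnotations && has.call(entityAnnotations, key)) {
        return parseInt(entityAnnotations[key], 10);
    }
    if (checkAnnotations && has.call(checkAnnotations, key)) {
        return parseInt(checkAnnotations[key], 10);
    }
    if (typeof argument !== "undefined") {
        return argument;
    }
    return builtin;
}

var key = "fatigue_check/occurrences";
var entityAnn = { "fatigue_check/occurrences": "5" };
var checkAnn = { "fatigue_check/occurrences": "3" };

console.log(resolveSetting(key, entityAnn, checkAnn, 2, 1)); // 5 (entity wins)
console.log(resolveSetting(key, {}, checkAnn, 2, 1));        // 3 (check annotation)
console.log(resolveSetting(key, {}, {}, 2, 1));              // 2 (function argument)
console.log(resolveSetting(key, {}, {}, undefined, 1));      // 1 (built-in default)
```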

Arguments

The fatigue_check() function can take up to five arguments.

fatigue_check(event, occurrences, interval, keepalive_occurrences, keepalive_interval)

The first one is the event and is required. The remaining four are optional and allow you to override the built-in defaults for occurrences, interval, keepalive_occurrences, and keepalive_interval, respectively. For example, if you'd like a version of the filter that, by default on non-keepalive checks, matches on the second occurrence instead of the first you could create a filter similar to below:

---
type: EventFilter
api_version: core/v2
metadata:
  name: fatigue_check_two_occurrences
  namespace: default
spec:
  action: allow
  expressions:
  - fatigue_check(event, 2)
  runtime_assets:
  - fatigue-check-filter

If you'd like one that overrides the default 30 minute interval for non-keepalive checks with a 10 minute one you could create one similar to below (note that in order to specify the third argument, you have to provide the second):

---
type: EventFilter
api_version: core/v2
metadata:
  name: fatigue_check_10m_interval
  namespace: default
spec:
  action: allow
  expressions:
  - fatigue_check(event, 1, 600)
  runtime_assets:
  - fatigue-check-filter

If you'd like one that overrides the default occurrences for keepalives and alerts on the second occurrence rather than the first you could create one similar to below (note that in order to specify the fourth argument, you have to provide the second and third):

---
type: EventFilter
api_version: core/v2
metadata:
  name: fatigue_check_two_occurrences
  namespace: default
spec:
  action: allow
  expressions:
  - fatigue_check(event, 1, 1800, 2)
  runtime_assets:
  - fatigue-check-filter

If you'd like one that overrides the default 30 minute interval for keepalives with a 10 minute one you could create one similar to below (note that in order to specify the fifth argument, you have to provide the second, third, and fourth as well, even if you want the defaults):

---
type: EventFilter
api_version: core/v2
metadata:
  name: fatigue_check_10m_interval
  namespace: default
spec:
  action: allow
  expressions:
  - fatigue_check(event, 1, 1800, 1, 600)
  runtime_assets:
  - fatigue-check-filter

Non-repeating alerts

If you need to have alerts which will not repeat, meaning the alert is only ever sent on the first occurrence and none after (aside from the resolution, if allow_resolution is true, which is the default), then you will need to set the interval (or keepalive_interval) to zero (0) via an annotation.
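
The effect of a zero interval can be illustrated with a small decision function. This is a sketch only, not the filter's source: `shouldPass` is a hypothetical name, resolution and flapping handling are left out, and it assumes repeats are counted from the occurrence that produced the initial alert.

```javascript
// Sketch: decide whether a non-resolution event passes.
// checkOccurrences: current event.check.occurrences
// checkInterval:    the check's interval in seconds
// occurrences:      occurrence on which the initial alert is sent
// interval:         seconds between repeat alerts; 0 means never repeat
function shouldPass(checkOccurrences, checkInterval, occurrences, interval) {
    if (checkOccurrences === occurrences) {
        return true; // initial alert
    }
    if (interval === 0 || checkOccurrences < occurrences) {
        return false; // no repeats, or threshold not yet reached
    }
    var every = Math.ceil(interval / checkInterval);
    return (checkOccurrences - occurrences) % every === 0;
}

function passingOccurrences(interval) {
    var result = [];
    for (var n = 1; n <= 5; n++) {
        if (shouldPass(n, 60, 1, interval)) {
            result.push(n);
        }
    }
    return result.join(", ");
}

console.log(passingOccurrences(0));   // 1
console.log(passingOccurrences(120)); // 1, 3, 5
```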

Installation from source

Sensu Go

See the instructions above for asset registration.

Additional notes

  • This filter makes use of the occurrences_watermark attribute that was buggy up until Sensu Go 5.9. Your mileage may vary on prior versions.

  • If the interval is not a multiple of the check's interval, then the actual interval is computed by rounding up the result of dividing the interval by the check's interval. For example, an interval of 180s with a check interval of 25s would pass the event through on every 8 occurrences (200s).
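
The rounding in the last note works out as follows. The function name `occurrencesBetweenAlerts` is only illustrative; the calculation itself is the one described above.

```javascript
// Number of check executions between repeat alerts, rounded up.
function occurrencesBetweenAlerts(interval, checkInterval) {
    return Math.ceil(interval / checkInterval);
}

// 180 / 25 = 7.2, which rounds up to 8 occurrences, i.e. 8 * 25 = 200 seconds.
var every = occurrencesBetweenAlerts(180, 25);
console.log(every, every * 25); // 8 200

// When the interval is an exact multiple, nothing is rounded.
console.log(occurrencesBetweenAlerts(900, 60)); // 15
```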

Contributing

Please submit an issue if you have problems or suggestions.

sensu-go-fatigue-check-filter's People

Contributors

asachs01, flowerysong, hillaryfraley, jspaleta, nixwiz

sensu-go-fatigue-check-filter's Issues

Resolution events not suppressed.

I've applied the following type of configuration within every check in my environment.

          "annotations": {
            "fatigue_check/occurrences": "3",
            "fatigue_check/interval": "900",
            "fatigue_check/allow_resolution": "false"
          }

Agents have their two fatigue_check options set.

    "annotations": {
      "fatigue_check/keepalive_occurrences": "1",
      "fatigue_check/keepalive_interval": "0"
    }

Occurrence configuration is functioning as expected; checks that have issues or alert for fewer than their configured occurrences do not generate warning or critical notifications.

However, resolution messages are still passing through the pipeline by the hundreds each second.

Description
NTP OK: Offset 0.001697361469 secs|offset=0.001697s;30.000000;60.000000;
Status
Resolved
Entity
host.place-hold-domain.com

Filter blocks every message random logs

{"asset":"fatigue-check-filter","component":"asset-manager","level":"info","msg":"asset does not exist","time":"2019-09-19T20:04:43+03:00"}
{"component":"pipelined","error":"ReferenceError: 'fatigue_check' is not defined","level":"error","msg":"error executing JS","time":"2019-09-19T20:04:43+03:00"}
{"assets":["fatigue-check-filter"],"check_name":"check_disk_usage","check_namespace":"default","component":"pipelined","entity_name":"sensu","entity_namespace":"default","filter":"fatigue_check","level":"debug","msg":"denying event that does not match filter","time":"2019-09-19T20:04:43+03:00"}
{"check_name":"check_disk_usage","check_namespace":"default","component":"pipelined","entity_name":"sensu","entity_namespace":"default","filter":"fatigue_check","handler":"slack","level":"debug","msg":"denying event with custom filter","time":"2019-09-19T20:04:43+03:00"}
{"component":"pipelined","error":"ReferenceError: 'fatigue_check' is not defined","level":"error","msg":"error executing JS","time":"2019-09-19T20:04:43+03:00"}
{"assets":["fatigue-check-filter"],"check_name":"check_disk_usage","check_namespace":"default","component":"pipelined","entity_name":"sensu","entity_namespace":"default","filter":"fatigue_check","level":"debug","msg":"denying event that does not match filter","time":"2019-09-19T20:04:43+03:00"}
{"check_name":"check_disk_usage","check_namespace":"default","component":"pipelined","entity_name":"sensu","entity_namespace":"default","handler":"alerta","level":"info","msg":"event filtered","time":"2019-09-19T20:04:43+03:00"}

Everything was set up according to the examples.

Feature request: Option to change default values

Would anyone be interested in an option to change the default values via parameters?
For example, I want to change the default value of occurrences to fatigue_check/occurrences: 2 instead of having to change every single check.

I wouldn't call myself experienced in JS, but if someone else wants this feature I'll try to code it.

Alert on occurrences only when within a time window

I have a use case of alerting on occurrences, but only if they happen within a time window.

For example, alert if there are 3 consecutive events within 1 hour.

This need arises especially when generating alerts via the event API, and not via regular checks. So, let's say you have a cronjob that is known to fail intermittently, and we have set occurrences to 3. Without the time window, an alert will be generated every time occurrences cross 3, even if they happen over a week.

Allow alerts which do not repeat

The filter doesn't seem to allow for alerts which do not repeat.

It would be good to be able to set interval to 0 to get an initial alert and then no repetition.

Currently, that'll get set to 1800 by:

var interval = interval || 1800; // and every 30 minutes thereafter

Thanks,

Ian

Feature request: Add support for filtering based on entity annotations

This filter currently only supports filtering based on check level annotations. While that is great for normal checks, keepalives do not have annotations inside event.check.annotations. As keepalives are always associated with some entity, we can leverage the entity level annotations to filter keepalives based on the same rules that checks have.

Issues with default parameters

https://github.com/nixwiz/sensu-go-fatigue-check-filter/blob/d46f851a32367e50ac4b21d4aac05697737288cb/lib/fatigue_check.js#L1

Hi @nixwiz again :)
I'm not sure why, but the code updated in #18 does not work.

May 10 10:39:24 sensu-go sensu-backend[10256]: {"component":"pipelined","error":"error evaluating /var/cache/sensu/sensu-backend/79142cb649b43e595520c5a1ccb8c365211c83fc1b7509cb3165ef055d51825fe8958b28417eab767cc7f4a3478fe0fe1c5361aff67753a890f8d82ad35d53e9/lib/fatigue_check.js: (anonymous): Line 1:41 Unexpected token = (and 3 more errors)","level":"error","msg":"error executing JS","time":"2020-05-10T10:39:24+02:00"}

The following works as expected:

function fatigue_check(event,occurences,interval) {

    // my defaults
    var occurrences = occurences || 1; // only the first occurrence
    var interval = interval || 1800; // and every 30 minutes thereafter
...

I found out that assigning default values in the function definition was introduced in ES6, but I don't know how to check which JS version sensu-go currently uses. I'm using the latest sensu-go version from the Debian repo:

root@sensu-go:~# sensu-backend version
sensu-backend version 5.19.3, build 9f7a94a9e4c8fb27abd92721d0949fc265240352, built 2020-05-02T00:27:55Z
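
The parse error above is consistent with an interpreter that only accepts ES5 syntax, where default values cannot be written in the function signature. A minimal ES5-compatible sketch of argument defaults follows; `withDefault` and `fatigueDefaults` are hypothetical names, not part of the filter. Unlike the `||` form, this version keeps an explicitly passed 0.

```javascript
// Hypothetical helper: use the fallback only when the value is missing.
function withDefault(value, fallback) {
    return (typeof value === "undefined" || value === null) ? fallback : value;
}

// ES5-style defaults: no default values in the parameter list.
function fatigueDefaults(occurrences, interval) {
    return {
        occurrences: withDefault(occurrences, 1), // only the first occurrence
        interval: withDefault(interval, 1800)     // and every 30 minutes thereafter
    };
}

console.log(fatigueDefaults().interval);     // 1800
console.log(fatigueDefaults(2, 0).interval); // 0 (with "interval || 1800" this would be 1800)
```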

Can't make fatigue filter work with proxy entities

Hi!

I'm having trouble getting the filter to work. I'm using it with the typical cpu_check and also with a custom check that simply sends a curl request to the locally installed agent and creates a proxy entity. When I enable the filter, none of the events with the fatigue annotations pass, even when I run the curl command more times than the defined occurrences.

I ran lib/fatigue_check.js locally, passing it the event.json obtained using the debug handler, and noticed that it tries to find the annotations inside check.annotations, but the annotations in event.json are inside check.metadata.annotations.

Also, the Sensu documentation says that the annotations are at check.annotations, so I must be missing something.
Aren't the event received by the filter and the event received by the handler the same (without any mutator)?

Any ideas why the filter never lets the event pass even when the conditions are met?

fatigue_asset.yml:

type: Asset
api_version: core/v2
metadata:
  name: sensu-go-fatigue-check-filter
  labels:
  annotations:
    io.sensu.bonsai.url: https://bonsai.sensu.io/assets/nixwiz/sensu-go-fatigue-check-filter
    io.sensu.bonsai.api_url: https://bonsai.sensu.io/api/v1/assets/nixwiz/sensu-go-fatigue-check-filter
    io.sensu.bonsai.tier: Community
    io.sensu.bonsai.version: 0.7.0
    io.sensu.bonsai.namespace: nixwiz
    io.sensu.bonsai.name: sensu-go-fatigue-check-filter
    io.sensu.bonsai.tags: eventfilter, filter
spec:
  builds:
  - url: https://assets.bonsai.sensu.io/fe1401d4e174d493dc2976e01140a70f94d6532c/sensu-go-fatigue-check-filter_0.7.0.tar.gz
    sha512: 1229e25b49e944ba0dc9534fa2150cf592e1dc31f2ad51c1dab60dadf1d989df82295f932364c5e2fc055015dc15c78ca0c643e8e1f1c3d1afb4e8a9aa51d38b
    filters: []

fatigue_filter.yml:

type: EventFilter
api_version: core/v2
metadata:
  name: fatigue_check
  namespace: default
spec:
  action: allow
  expressions:
  - fatigue_check(event)
  runtime_assets:
  - sensu-go-fatigue-check-filter

slack_handler.yml:

type: Handler
api_version: core/v2
metadata:
  created_by: admin
  name: slack
  namespace: default
spec:
  command: sensu-slack-handler --channel '#xxxxxx'
  env_vars:
  - SLACK_WEBHOOK_URL=https://hooks.slack.com/services/xxxxxxxxxxxxxxxxxxxxxx
  filters:
  - is_incident
  - not_silenced
  - state_change_only
  - fatigue_check
  handlers: null
  runtime_assets:
  - sensu-slack-handler
  secrets: null
  timeout: 0
  type: pipe

cpu_check.yml:

type: CheckConfig
api_version: core/v2
metadata:
  created_by: admin
  name: check_cpu
  namespace: default
  annotations:
    fatigue_check/occurrences: '3'
    fatigue_check/interval: '600'
    fatigue_check/allow_resolution: 'false'
spec:
  command: /etc/sensu/plugins/sensu-plugins-cpu-checks/bin/check-cpu.sh -w 90 -c 95
  handlers:
  - slack
  - email
  - asterisk
  - debug
  interval: 30
  publish: true
  subscriptions:
  - cpu

curl:

curl -X POST -H 'Content-Type: application/json' -d '{
    "check": {
        "metadata": {
            "name": "CHECKNAME",
            "labels": {
                "environment": "production"
            },
            "annotations": {
                "fatigue_check/occurrences": "2",
                "fatigue_check/interval": "3600",
                "fatigue_check/allow_resolution": "false"
            }
        },
        "handlers": [
            "slack",
            "email",
            "asterisk",
            "debug"
        ],
        "status": 2,
        "output": "critical test",
        "proxy_entity_name": "Test"
    }
}' http://127.0.0.1:3031/events

incorrect notification interval

Hello,

sensu-go-agent 5.16.1-8521
sensu-go-backend 5.16.1-8521
fatigue-check-filter //assets.bonsai.sensu.io/.../sensu-go-fatigue-check-filter_0.3.2.tar.gz

I have a check configured like so:

type: CheckConfig
api_version: core/v2
metadata:
  annotations:
    fatigue_check/allow_resolution: "true"
    fatigue_check/interval: "3600"
    fatigue_check/occurrences: "1"
  name: check-supervisor
  namespace: default
spec:
  check_hooks: null
  command: /opt/sensu-plugins-ruby/embedded/bin/check-supervisor.rb
  env_vars: null
  handlers:
  - slack
  high_flap_threshold: 0
  interval: 60
  low_flap_threshold: 0
  output_metric_format: ""
  output_metric_handlers: null
  proxy_entity_name: ""
  publish: true
  round_robin: true
  runtime_assets: null
  stdin: false
  subdue: null
  subscriptions:
  - supervisor
  timeout: 0
  ttl: 0

This is the slack handler:

type: Handler
api_version: core/v2
metadata:
  name: slack
  namespace: default
spec:
  command: sensu-slack-handler --channel '#monitoring'
  env_vars:
  - SLACK_WEBHOOK_URL=https://hooks.slack.com/services/xxxxxx
  filters:
  - is_incident
  - fatigue_check
  handlers: null
  runtime_assets:
  - sensu-slack-handler
  timeout: 0
  type: pipe

The filter:

type: EventFilter
api_version: core/v2
metadata:
  name: fatigue_check
  namespace: default
spec:
  action: allow
  expressions:
  - fatigue_check(event)
  runtime_assets:
  - fatigue-check-filter

The entity:

type: Entity
api_version: core/v2
metadata:
  labels:
    mqtt_auth_params: -a /etc/mosquitto/ca_certificates/cacert.pem -u xxx -p xxx
      -n
    mqtt_host: 192.168.12.71
  name: edi
  namespace: default
spec:
  deregister: false
  deregistration: {}
  entity_class: agent
  last_seen: 1578919497
  redact:
  - password
  - passwd
  - pass
  - api_key
  - api_token
  - access_key
  - secret_key
  - private_key
  - secret
  sensu_agent_version: 5.16.1
  subscriptions:
  - system
  - mqtt
  - supervisor
  - teltonika-mqtt
  - fail-test
  - entity:edi
  system:
    arch: amd64
    hostname: edi
    network:
      interfaces:
      - addresses:
        - 127.0.0.1/8
        - ::1/128
        name: lo
      - addresses:
        - fe80::21e:c9ff:fed9:62a3/64
        mac: 00:1e:c9:d9:62:a3
        name: eno1
      - addresses: null
        mac: 00:1e:c9:d9:62:a4
        name: eno2
      - addresses:
        - 192.168.12.71/24
        - fe80::21e:c9ff:fed9:62a3/64
        mac: 00:1e:c9:d9:62:a3
        name: br0
      - addresses:
        - 10.0.3.1/24
        mac: 00:16:3e:00:00:00
        name: lxcbr0
      - addresses:
        - 10.200.200.1/24
        name: wg0
      - addresses:
        - 172.11.0.1/24
        - fe80::9233:dea2:a93e:6952/64
        name: tun0
      - addresses:
        - fe80::fc09:63ff:febd:d061/64
        mac: fe:09:63:bd:d0:61
        name: vethMHFW4P
      - addresses:
        - fe80::fc06:45ff:fe37:e3ca/64
        mac: fe:06:45:37:e3:ca
        name: vethIIEVDU
      - addresses:
        - fe80::fc4c:67ff:fe94:a802/64
        mac: fe:4c:67:94:a8:02
        name: vethVPM60G
      - addresses:
        - fe80::fce0:16ff:fecb:cd2e/64
        mac: fe:e0:16:cb:cd:2e
        name: vethE6205T
      - addresses:
        - fe80::fc8b:7dff:fe72:917d/64
        mac: fe:8b:7d:72:91:7d
        name: vethWPO1HU
      - addresses:
        - fe80::fc04:a1ff:fe25:3e08/64
        mac: fe:04:a1:25:3e:08
        name: veth2BC8IP
      - addresses:
        - fe80::fc8c:30ff:fef7:ea11/64
        mac: fe:8c:30:f7:ea:11
        name: veth1VGWS5
    os: linux
    platform: ubuntu
    platform_family: debian
    platform_version: "18.04"
  user: agent

So I should be getting an alert every 60 minutes, but instead I'm getting a notification every 120 minutes.

What else should I check to find out where the problem is?
Thanks

Ability to set defaults with annotations

The filter defaults are hard coded and the only way of overriding these are with annotations on all entities or all checks if the default values are not desired.

This feature request is for the fatigue filter to accept annotations on the filter itself, to provide the ability to have custom defaults (which can then be overridden with annotations on a specific entity or check if required).

Thanks,

Ian
