Custom Sensu Handlers to support a multi-tenant environment, allowing checks themselves to emit the type of handler behavior they need in the event json

License: Apache License 2.0

sensu_handlers's Introduction

Yelp sensu_handlers

Warning: These handlers are intended for use by advanced Sensu users. Do not use them if you are setting up Sensu for the first time. Use the standard handlers from the community plugins repo instead.

These work best with the Yelp monitoring_check or the pysensu-yelp python library to make checks that these handlers act upon.

To Repeat: these handlers are special and require special event data to work. If the special event data (like team) is not provided, these handlers will do nothing.

Available Handlers

Base

The base handler is the only handler necessary to use. It is the default. All other handler behavior is derived from the event data.

This allows checks to use one handler, and we can add new features or deprecate old ones without changing client-side configuration.

The base handler also handles advanced filtering. It respects the following tunables:

alert_after - Seconds to wait before any handler is activated. Defaults to

realert_every - Integer which filters out repeat events (uses "mod"). realert_every => 2 would filter every other event. Defaults to -1 which is treated as a special input and does exponential backoff.
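
The realert_every behavior described above can be sketched as follows. This is a minimal illustration, not the base handler's actual code: the method names `power_of_two?` and `realert?` are hypothetical, and it assumes that "exponential backoff" means alerting only when the number of failed attempts is a power of two.

```ruby
# Hypothetical helpers, not taken from base.rb.

# True for 1, 2, 4, 8, ... using integer arithmetic only (no log call).
def power_of_two?(n)
  n > 0 && (n & (n - 1)).zero?
end

# failed_attempts: failed occurrences counted after alert_after has elapsed.
# realert_every:   -1 selects exponential backoff, otherwise a "mod" filter.
def realert?(failed_attempts, realert_every)
  return false if failed_attempts < 1
  return power_of_two?(failed_attempts) if realert_every == -1

  every = [realert_every, 1].max # guard against zero or other negative values
  ((failed_attempts - 1) % every).zero?
end
```

With realert_every => 2 this sketch lets attempts 1, 3, 5, ... through; with -1 it lets attempts 1, 2, 4, 8, ... through.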

This handler also provides many helping functions to extract team data, etc.

All other handlers inherit the base handler.

nodebot (irc)

Uses the nodebot tool to send IRC notifications. Nodebot is helpful here as it retains a persistent connection to the IRC server, which can be expensive to set up.

Sends notification to the pages_irc_channel or ${team_name}-pages if the alert has page => true
Sends notification IRC messages to the array of irc_channels specified by the check, otherwise sends to the notifications_irc_channel specified in the team data.
If out of all that there are no channels, then no notifications will be sent.
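
The channel selection rules above can be sketched like this. It is an illustration under stated assumptions, not the nodebot handler's code: `irc_channels_for` is a hypothetical name, and `check` and `team_data` are assumed to be plain hashes with string keys as they would appear after parsing the event JSON and the team data.

```ruby
# Hypothetical sketch of the routing rules listed above.
def irc_channels_for(check, team_name, team_data)
  channels = []

  # Paging alerts go to the team's pages channel(s), or "<team>-pages" by default.
  if check['page'] == true
    channels.concat(Array(team_data['pages_irc_channel'] || "#{team_name}-pages"))
  end

  # Channels named by the check win over the team's notifications channel.
  check_channels = Array(check['irc_channels'])
  if check_channels.empty?
    channels.concat(Array(team_data['notifications_irc_channel']))
  else
    channels.concat(check_channels)
  end

  # An empty result means no notification is sent.
  channels.uniq
end
```

`Array()` is used so that a single channel name and an array of channel names are handled the same way.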

mailer (notification emails)

Modification of the sensu-community-plugins mailer that can route emails to different destinations depending on the circumstance.

Sends an email to the notification_email destination if specified in the check.
Otherwise it uses the notification_email specified by the team.
Will refuse to send any email if notification_email => false.

pagerduty (pages)

Modification of the sensu-community-plugins handler that can open events on different Pagerduty services depending on the inputs.

Only activates if the page boolean key in the event data is set to true
Only creates an incident on "Critical" levels (Does not page for warnings)
Uses the pagerduty_api_key config set to the team to determine which service to open or close an event in.
Tries to provide maximum context in the pagerduty event details
Automatically closes events that are resolved.
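
As a rough sketch of the rules above, the decision could look like the following. The method name `pagerduty_action` is hypothetical, and the sketch assumes the standard Sensu event layout in which `action` is "create" or "resolve" and a check `status` of 2 means critical.

```ruby
# Hypothetical decision helper, not the handler's actual code.
# Returns :trigger, :resolve, or :ignore for a parsed Sensu event hash.
def pagerduty_action(event)
  check = event['check'] || {}
  return :ignore unless check['page'] == true      # only paging checks
  return :resolve if event['action'] == 'resolve'  # close resolved events
  return :trigger if check['status'] == 2          # page on critical only

  :ignore                                          # warnings and unknowns do not page
end
```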

jira (tickets)

This handler can make a JIRA ticket for an alert.

The alert must have ticket => true
Derives the Project to make the ticket in from the project key set in the event data
Falls back to the default project for the team if unset.
WARNING: The Jira project must not have special required fields
WARNING: Jira has special "transition" states in order to close tickets, so this handler won't work if you have a custom "workflow" (specifically, it won't be able to close/fix/done issues. Patches welcome)
WARNING: Be sure to use exponential backoff in order to not overload your Jira server.

Other

There are other handlers included here that are not yelp-specific in the sense that they do not use the team construct, and are included out of convenience.

Puppet Usage

If you are using the module itself, it can deploy the handlers and configure them.

class { 'sensu_handlers':
  # See the teams section
  teams => $team_data,
}

Puppet Parameters

See the inline docstrings in init.pp for parameter documentation.

Teams

The Sensu handlers must have the team declarations available for consumption. This data must be in hiera because currently the monitoring_check module also utilizes it.

On the plus side, hiera allows you to describe your team configuration easily:

sensu_handlers::teams:
  dev1:
    pagerduty_api_key: 1234
    pages_irc_channel: 'dev1-pages'
    notifications_irc_channel: 'dev1'
  dev2:
    pagerduty_api_key: 4567
    pages_irc_channel: 'dev2-pages'
    notifications_irc_channel: 'dev2'
  frontend:
    # The frontend team doesn't use pagerduty yet, just emails
    notifications_irc_channel: 'frontend'
    pages_irc_channel: 'frontend'
    notification_email: 'frontend+pages@localhost'
    project: WWW
  ops:
    pagerduty_api_key: 78923
    pages_irc_channel: 'ops-pages'
    notifications_irc_channel: 'operations-notifications'
    notification_email: 'operations@localhost'
    project: OPS
  hardware:
    # Uses the ops Pagerduty service for page-worthy events,
    # but otherwise just jira tickets
    pagerduty_api_key: 78923
    project: METAL

Team Syntax

This is a very important aspect of the configuration of these sensu handlers. The team syntax determines the default behavior of the handlers, given an input team.

Warning: If you typo a team name, the Sensu handlers will not know how to associate an alert with the right outputs. This is a common source of mistakes.

Let's look at the team syntax in more detail:

sensu_handlers::teams:
  ops:
    pagerduty_api_key: 78923
    pages_irc_channel: 'ops-pages'
    notifications_irc_channel: 'operations-notifications'
    notification_email: 'operations@localhost'
    project: OPS

sensu_handlers::teams: - Normal puppet-hiera lookup name. Matches 1:1 with the sensu_handlers module, teams parameter. This is a hash
ops: - Team name. This is the primary lookup key
pagerduty_api_key: deadbeef - In pagerduty, this corresponds to a "service". That service must use the "generic" or "sensu" api format. Sharing the api key with a "Nagios" service will NOT work
pages_irc_channel: ops-pages - If there is an event with page=>true, a notification will go to this channel. This parameter defaults to $team-pages. It can take an array of channels. No need to have the leading "#".
notifications_irc_channel: 'operations-notifications' - Non-paging events will appear here. If omitted, defaults to $team-notifications. This also can accept an array, and does not need a leading "#"
notification_email: 'operations@localhost' - If set, the handler will send emails for every event to this address. If omitted, it will send no emails. You can send the email to multiple destinations by using a comma-separated list (like any email client)
project: OPS - Used by the JIRA handler. If an event comes in that has ticket=>true, the jira handler will open a ticket on this project. There is no default for this parameter. Special considerations have to be made for the JIRA project to enable auto-opening and auto-closing of tickets, see the docs on the jira handler.

Manually Invoking These Handlers

You can manually invoke these handlers in order to test them, ensuring that (for example) a JIRA ticket is correctly raised. Simply pipe the Sensu alert in JSON into one of the handlers, and it should parse it as if it were a fresh alert.

$ grep 'failed' /var/log/sensu/sensu-server.log  | tail -n 1 | jq .event > last_failed_event.json
$ cat last_failed_event.json | sudo -u sensu ruby jira.rb
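
If no suitable event is available in the log, a synthetic one can be generated instead. The script below is only a sketch: `build_test_event` is a hypothetical helper, the host name and output text are made up, and the set of fields is an assumption modelled on typical Sensu event JSON. The handlers may need more fields than are shown here.

```ruby
require 'json'

# Hypothetical helper: builds a minimal synthetic event for piping into a handler.
# Which fields each handler actually requires is an assumption.
def build_test_event(team:, check_name: 'test_check', status: 2, page: false, ticket: false)
  {
    'client' => { 'name' => 'test-host.example.com', 'address' => '127.0.0.1' },
    'check' => {
      'name' => check_name,
      'team' => team,       # must match a team name from the team data
      'page' => page,
      'ticket' => ticket,
      'status' => status,   # 0 ok, 1 warning, 2 critical
      'output' => 'synthetic event for handler testing',
      'interval' => 60,
      'alert_after' => 0,
      'realert_every' => '-1',
      'issued' => Time.now.to_i
    },
    'occurrences' => 1,
    'action' => status.zero? ? 'resolve' : 'create'
  }
end

# Print the event so it can be piped straight into a handler.
puts JSON.pretty_generate(build_test_event(team: 'ops', ticket: true))
```

Saved as, say, `make_event.rb`, the output can be piped the same way as the logged event: `ruby make_event.rb | sudo -u sensu ruby jira.rb`.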

Support

Please open a github issue for support.

sensu_handlers's Issues

Can someone please help me install these Handlers

Hi There,
I need to use the Jira handler and I am having a hard time installing the gem on my sensu server. Can someone please send me instructions on how to install this gem?

Thanks
Hadie

Generate an alert if the API query fails

Hi guys,

This is more of a discussion than anything, just wondering if maybe you have any thoughts.

I have a monkey patched version of your JIRA handler, but in our org's infinite wisdom, changes to transitions, required custom fields or suchlike will often mean the handler fails. It's difficult to notice this unless you're actively catching the exceptions and checking events vs open jira tickets.

Can you think of a way whereby a sensu alert (with another handler) is generated without having to pull in a bunch of existing gems?

Obviously we can rescue the exception and go from there, just wondering if you have any novel ideas..

Remove 'nail' references

After removing habitat lines, this looks like it is just the nodebot path.
We should be able to just use the path to find it.

Use the new at_exit disabling mechanism rather than monkeypatching at_exit

sensu-plugins/sensu-plugin@82c4cd2

I added a neater way to disable the at_exit handler in the above commit.

Shift our Sensu handlers to use it.

Replace fog dependency with aws-sdk

Fog is a gigantic gem (6000 files), which is directly referenced in about 3 or 4 lines.

Given we already include aws-sdk in puppet-omnibus and that aws-sdk is a far more natural dependency, let's migrate to that? I'm happy to volunteer.

aws_prune should only consider "running" instances to be available

We currently only consider "non terminated" instances when considering which ec2 instances to respect when activating handlers.

This causes alerts in the time between a server being terminated (in the shutting-down state) and when the server is actually terminated.

But I think we can do better by simply ignoring servers that are shutting-down or terminated, and then they will be pruned asap.

Support related alerts functionality

So something that I've wanted for a long time is the ability for symptom alerting to include causal alert information. After chatting a bit with @solarkennedy we think that the right place to put this functionality is here in the sensu_handlers.

The idea is that the pagerduty/jira handler can query the sensu api for "related" events, probably by tag or host etc ... Then it would include this contextual information in the call to action alert.

Thoughts? Is this the right place to do it, is this a crazy idea?

Init class doesn't use the split out sub classes

I think when you merged in my review, you missed init.pp?

Ability to set :context_path for jira handler

In the jira.rb file there is a field "context_path"
:context_path => '',

This is set to empty in the .rb file. My jira setup needs this set to /jira.
Therefore it would be great if this option would be read from the sensu handler config as well.

Can you add this option?
Thank you for your support!

Document the specifics of each handler

While there is "code", it would be nice if you could understand the behavior of the handlers (jira specifically) by reading documentation.

This issue is to write more documentation on each handler and how it works.

aws_prune.pp uses cron::d which is a yelpism

We haven't open-sourced our cron module (right?), so we should probably switch the cron::d define to just a regular cron.

Purge opsgenie references?

If we no longer use it, I'm tempted to not have it in here as it is going to be dead code and it will rot.

Thoughts @bobtfish?

Develop watchdog handler

The sensu watchdog requires handler help to detect the presence of a watchdog_timer key. If present, it should update/add a stash to record the last event. The key is an int that represents seconds. The deployed stash should probably be something like watchdog/$fqdn/$check_name to be similar to silences.

This is all that it needs to do. A separate process is in charge of detecting stale watchdogs checks and spawning new events based on them.

Acceptance criteria is when there is some watchdog.rb file in our default handler array that does this, and you can see stashes show up after an event that uses the watchdog_timer key.

Use the send-test-sensu-alert script to aid in troubleshooting. Use hiera to selectively deploy the handler in production.

There should be tests that go with it. I want to see tests that show that:

when an event comes in with no watchdog_timer key, nothing happens
when an event comes in with a watchdog_timer => nil nothing happens
when an event comes in with a watchdog_timer => 60 expect it to receive create_stash or something like that.
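
The three cases above could be covered by a small decision helper like the sketch below. Everything here is hypothetical: no watchdog.rb exists in the text above, `watchdog_stash` is an invented name, and the sketch assumes the watchdog_timer key lives in the check data and that the client name stands in for $fqdn. Creating the stash through the Sensu API is left out.

```ruby
# Hypothetical sketch: decide whether a stash should be written for an event.
# Returns nil when nothing should happen, otherwise the stash path and timer.
def watchdog_stash(event)
  timer = event.dig('check', 'watchdog_timer')
  return nil if timer.nil? # key missing or explicitly nil: do nothing

  {
    path: "watchdog/#{event.dig('client', 'name')}/#{event.dig('check', 'name')}",
    timer: Integer(timer)  # seconds
  }
end
```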

Remove `habitat` from pagerduty event ids in favor of `region` (I guess)

We need to know what habitat the pagerduty events come from in order to do webhooks.
@bobtfish what are these webhooks anyway, Nagios stuff?

Till then we'll get duplicate pagerduty incidents for any kind of source based event. (server_side, sensu stuff, etc)

remove dependency on puppet-omnibus embedded ruby

https://github.com/Yelp/sensu_handlers/blob/master/manifests/aws_prune.pp#L46

Install necessary gems for sensu embedded ruby and use those.

pagerduty handler can sometimes emit too long of a description, and fail to page, which is bad

We should truncate like we do in the irc handler. PD requires the description to be <= 1024 chars.

errors: Description is too long (maximum is 1024 characters)
message: Event object is invalid
status: invalid event

The other wtf is that the error is not printed because of the elegance of our ruby.

Please also make the error reporting easier to discover by simply printing the output of the result if it didn't work:

    result = Redphone::Pagerduty.trigger_incident(
      :service_key  => api_key,
      :incident_key => incident_key,
      :description  => description,
      :details      => full_description_hash
    )
    puts result if  result['status'] != 'success'
    result['status'] == 'success'
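
A minimal sketch of the truncation suggested in this issue, assuming the 1024-character limit quoted above. The helper name `truncate_description` and the trailing "..." marker are illustrative choices, not what the irc handler actually does.

```ruby
# Hypothetical helper: keep the description within PagerDuty's stated limit.
def truncate_description(text, max = 1024)
  return text if text.length <= max

  # Reserve three characters for the ellipsis so the result is exactly max long.
  text[0, max - 3] + '...'
end
```

The description would be passed through this helper before being handed to `trigger_incident`.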

exponential backoff sometimes crashes

I think this might be happening in a log(0) event:

severities":["ok","warning","critical","unknown"],"name":"jira"},"output":"/etc/sensu/handlers/base.rb:173:in `log': Numerical result out of range - log (Errno::ERANGE)\n"}

I think this is the corresponding event data:

{"timestamp":"2014-09-04T00:37:24.241628-0700","level":"info","message":"handling event","event":{"client":{"name":"REDACTED","address":"REDACTED","subscriptions":[REDACTED"],"safe_mode":true,"keepalive":{"handlers":["default"],"team":"operations","page":true,"realert_every":"-1","runbook":"REDACTED},"timestamp":1409816239},"check":{"handlers":["default"],"command":" /usr/lib/nagios/plugins/check_http -H localhost -p 5052","subscribers":[],"standalone":true,"interval":60,"aggregate":false,"handle":true,"alert_after":300,"realert_every":"-1","runbook":"REDACTED","annotation":REDACTED","sla":"No SLA defined.","dependencies":[],"team":"REDACTED","irc_channels":"REDACTED","notification_email":"undef","ticket":false,"project":false,"page":true,"tip":false,"name":"check_marathon","issued":1409816234,"executed":1409816234,"output":"CRITICAL - Socket timeout after 10 seconds\n","status":2,"duration":10.004,"history":["2","2","0","0","0","0","0","0","2","2","0","0","0","0","0","0","2","2","2","2","2"]},"occurrences":5,"action":"create"},"handler":{"command":"/etc/sensu/handlers/jira.rb","type":"pipe","severities":["ok","warning","critical","unknown"],"name":"jira"}}

I think I need to detect this case and.. do something?

Develop autofixer handler

Autofixer could be a system that receives events from sensu and attempts to fix them if it's told how to do it. This handler will provide the plumbing necessary to make it happen.

Potential sequence of steps could be receive an event from sensu, determine if this event is actionable by autofixer at this point in time (for example, attempt to autofix only after N minutes of ongoing event), perform a specified action, mark event as "autofix has been applied" in case the autofix action fails to resolve the event (to provide means for escalation to higher intelligence entity like a human).

Or make "autofix has been applied" as a counter to support situations where we want to apply autofix up to N times and after that give up.

Let's build plumbing to teach the machines how to fix problems themselves!

Thank you. :) Team based config

I don't have another place to communicate with you guys. I wanted to say thank you for your handler model and the team-based configuration. Because of the team-based configuration, we were able to transition from the hipchat handler to the slack handler by just updating a team's channel information, rather than having to find every check defined for those teams and modify them all.

thought you might like to know that that bit of forward looking design paid off for us in this transition.

Thank you.

Gemfile has bobtfish stuff in it

@bobtfish can you update the gem file to have released gems and still have the tests work?

Jira handler has "magic numbers" and makes some assumptions about your Jira setup

The first magic number is "1" for the issue type. (which is a "Bug" at yelp in most projects)

The second is "731", which is a common resolving "transition_id" that most of our projects have. Other JIRA installs will probably have different numbers.

I think these need to be tunable on a per-team/project basis.

Assigning to @hashbrowncipher as he needs this to scratch his own itch.

remove habitat stuff

The habitat stuff is too specific and doesn't make sense for other consumers.

This should be generalized in some way, probably passed in as a setting that the handler can read

When creating a jira ticket, assign it to the on-call person in pagerduty

My teams have always had the problem of ignoring sensu created JIRA tickets. I believe that this happens because of the bystander effect due to the tickets being created without an assignee.

It would be great if we could assign JIRA tickets to the on-call person in pagerduty who would be notified if the check was paging.

handler not triggering resolve action for an event transitioned from crit -> warn -> ok

TL; DR

It looks like the resolve event does not make it through to the handlers when an event resolves as
crit -> warn -> ok (in some cases, see below)

but it does make it through to the handlers when the event resolves as
crit -> ok

sensu while transitioning from action = create(crit) to action = create (warn), will reset the occurrences to 1.
https://github.com/sensu/sensu/blob/bb91fea7797d2402349a3e86b7cd3f43b78621c8/lib/sensu/server/process.rb#L549

sensu while transitioning from action = create(crit) to action = resolve(ok), will retain the occurrences from create.
https://github.com/sensu/sensu/blob/bb91fea7797d2402349a3e86b7cd3f43b78621c8/lib/sensu/server/process.rb#L536

given a scenario

interval: 10
alert_after: 40
initial_failing_occurrence = 4 (from https://github.com/Yelp/sensu_handlers/blob/master/files/base.rb#L231)

simulate trigger incident

occurrences = 1
action = create
status = 2
number_of_failed_attempts = 1 - 4 = -3 < 1 -> do not trigger

occurrences = 2
action = create
status = 2
number_of_failed_attempts = 2 - 4 = -2 < 1 -> do not trigger

occurrences = 3
action = create
status = 2
number_of_failed_attempts = 3 - 4 = -1 < 1 -> do not trigger

occurrences = 4
action = create
status = 2
number_of_failed_attempts = 4 - 4 = 0 < 1 -> do not trigger

occurrences = 5
action = create
status = 2
number_of_failed_attempts = 5 - 4 = 1 (!<1) -> trigger (create PD !!)

#crit -> warn
#occurrences set to 1 after status = 1
https://github.com/sensu/sensu/blob/bb91fea7797d2402349a3e86b7cd3f43b78621c8/lib/sensu/server/process.rb#L549

occurrences = 1
action = create
status = 1
number_of_failed_attempts = 1 - 4 = -3 < 1 -> do not trigger

simulate resolve incident
#immediately warn -> ok
#occurrences retained from warn after status = 0
occurrences = 1
action = resolve
status = 0
number_of_failed_attempts = 1 - 4 = -3 < 1 -> do not trigger

After this the event is deleted, so the resolve is never passed through to the PD handler (to resolve the incident),
which results in orphaned PD incidents.

whereas (simulate where occurrences is not yet reset to 1)

simulate trigger incident

occurrences = 1
action = create
status = 2
number_of_failed_attempts = 1 - 4 = -3 < 1 -> do not trigger

occurrences = 2
action = create
status = 2
number_of_failed_attempts = 2 - 4 = -2 < 1 -> do not trigger

occurrences = 3
action = create
status = 2
number_of_failed_attempts = 3 - 4 = -1 < 1 -> do not trigger

occurrences = 4
action = create
status = 2
number_of_failed_attempts = 4 - 4 = 0 < 1 -> do not trigger

occurrences = 5
action = create
status = 2
number_of_failed_attempts = 5 - 4 = 1 (!<1) -> trigger (create PD !!)

occurrences = 6
action = create
status = 2
number_of_failed_attempts = 6 - 4 = 2 (!<1) -> do not trigger until next alert_after …and so on.

simulate resolve incident
#crit -> ok
#occurrences retained from crit after status = 0
occurrences = 6
action = resolve
status = 0
number_of_failed_attempts = 6 - 4 = 2 (!<1) -> trigger (resolve PD !!)

Not sure I was able to explain this clearly, it is mostly code references.

But I guess this is the reason why we have had some cases where PD incidents were not cleared from PagerDuty even though the underlying events had resolved in Sensu.

fix:

https://github.com/Yelp/sensu_handlers/blob/master/files/base.rb#L231
we short-circuit filter_repeated so that the event always reaches the handler when action is resolve.
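
The proposed short circuit can be sketched as below. This is not the code in base.rb: `passes_alert_after?` is a hypothetical stand-in for the relevant part of filter_repeated, using the same arithmetic as the walkthrough above (initial failing occurrences = alert_after / interval).

```ruby
# Hypothetical sketch of the alert_after check with the resolve short circuit.
def passes_alert_after?(event, interval:, alert_after:)
  # Proposed fix: a resolve is always passed through, whatever the occurrence count.
  return true if event['action'] == 'resolve'

  initial_failing_occurrences = interval > 0 ? alert_after / interval : 0
  number_of_failed_attempts = event['occurrences'] - initial_failing_occurrences
  number_of_failed_attempts >= 1
end
```

With interval 10 and alert_after 40, the warn -> ok resolve with occurrences = 1 now passes, while create events with occurrences 1 through 4 are still held back. One trade-off of this sketch is that a resolve also passes for events that never alerted in the first place, so the downstream handler has to tolerate resolving something it never opened.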

Document how to invoke these handlers directly for troubleshooting purposes

This is "just" cat'ing event json and piping it, but it isn't obvious.

Just an example taking some event data from the log and "replaying" it, pipe to a handler would be helpful for troubleshooting purposes.

filter alerts with "Execution timed out" in pagerduty and jira handlers

I would like to open this up for discussion.

If a check is taking longer to run than expected, it often would exit 2 (critical) with output of "Execution timed out".

This comes from sensu-spawn gem - https://github.com/sensu/sensu-spawn/blob/master/lib/sensu/spawn.rb#L163

What if we filter these out of pagerduty and jira handlers? After all when we get this, we can't be certain it's the check that failed - in fact almost always it's not the actual check but a bad or hung ec2 instance etc.

A positive outcome of this is we would cut down on (frequently) unactionable tickets and pages. Also, if we ever decide to do more auto-remediation on these, we would be more confident autoremediation is attempted when it's needed - since running autoremediation in response to "execution timed out" (essentially an unknown exit code from a check) may not always be desirable.

A negative outcome would be we would lose implicit info about some problems that oncall often derives from seeing "Execution timed out".

Discuss.

@solarkennedy @bobtfish

Make aws-prune run "out of band" and not as a handler

Running the aws-prune script as an event handler has two downsides:

It is wasteful cpu-wise
It means there is always a warning event while a server is being shut down

Running the aws-prune thing as a cron job could make this more efficient and prune terminated instances faster, without causing spurious alerts.

plans for sensu core filtering support?

Sensu 0.26 and later have started deprecating filtering in handlers in favor of the filtering built into Sensu core.
Are there any plans to support Sensu 0.26+ with these changes? :)

Handlers are not CPU efficient

Due to the design, all handlers spawn for every event, each filters themselves, each queries the api for the same stashes, etc.

@bobtfish We talked about doing this as an extension. This sounds good, but it is pretty scary.

Could we do some intermediate thing where the base handler is activated, does filtering, and then it decides what additional handlers to spawn depending on the event data? Pretty much just making 1 mega handler?

Remove irc.local references

This should be a parameter in the sensu_handlers class

Licensing

Thanks for the awesome work done here, especially the base handler. There are many parts of that I'd like to use in my own base handler along with much of my own functionality.

In the spirit of sharing, I'd like to open source my work under the MIT license but you have no license specified so I'm not sure if I can.

Can you clarify?