grafana / alerting

Set of libraries used to build alerting systems at Grafana - including the Alertmanager.

License: GNU Affero General Public License v3.0

Makefile 0.15% Go 89.14% Jsonnet 0.04% HTML 10.67%

alerting's Introduction

The open-source platform for monitoring and observability

Grafana allows you to query, visualize, alert on and understand your metrics no matter where they are stored. Create, explore, and share dashboards with your team and foster a data-driven culture:

Visualizations: Fast and flexible client side graphs with a multitude of options. Panel plugins offer many different ways to visualize metrics and logs.
Dynamic Dashboards: Create dynamic & reusable dashboards with template variables that appear as dropdowns at the top of the dashboard.
Explore Metrics: Explore your data through ad-hoc queries and dynamic drilldown. Split view and compare different time ranges, queries and data sources side by side.
Explore Logs: Experience the magic of switching from metrics to logs with preserved label filters. Quickly search through all your logs or stream them live.
Alerting: Visually define alert rules for your most important metrics. Grafana will continuously evaluate and send notifications to systems like Slack, PagerDuty, VictorOps, OpsGenie.
Mixed Data Sources: Mix different data sources in the same graph! You can specify a data source on a per-query basis. This works even for custom data sources.

Get started

Unsure if Grafana is for you? Watch Grafana in action on play.grafana.org!

Documentation

The Grafana documentation is available at grafana.com/docs.

Contributing

If you're interested in contributing to the Grafana project:

Start by reading the Contributing guide.
Learn how to set up your local environment, in our Developer guide.
Explore our beginner-friendly issues.
Look through our style guide and Storybook.

Get involved

Follow @grafana on Twitter.
Read and subscribe to the Grafana blog.
If you have a specific question, check out our discussion forums.
For general discussions, join us on the official Slack team.

This project is tested with BrowserStack

License

Grafana is distributed under AGPL-3.0-only. For Apache-2.0 exceptions, see LICENSING.md.

alerting's Issues

Templating : Add URL buttons

Using the default template, we have access to visual information such as the colored firing state, as well as the Panel, Dashboard and Silence URLs:

However, this feature does not seem to work with custom templates. I tried using the default template as a custom template and the result is different:

As you can see, there are no buttons and no colored state. Is there a way to add them?

Alertmanager peerReconnectTimeout config

We are using Grafana in HA with alerting in Kubernetes. This enables the Alertmanager gossip network. Alertmanager has a default of:

--cluster.reconnect-timeout value: length of time to attempt to reconnect to a lost peer (default: "6h0m0s")

Ref: https://github.com/prometheus/alertmanager/blob/main/cmd/alertmanager/main.go#L230

This works fine, except that due to pod turnover the gossip network keeps trying to reach expired or terminated pods every 10 seconds for 6 hours. I didn't see an easy way to change this, but it would be nice if it were configurable.

Need a way to differentiate between a new and a repeat notification

When I configure a notification policy, I can set the "Repeat interval", which causes notifications to be repeated after a certain time.

Right now, a new notification is identical to a "repeat" notification. It would really help us if there were a way to differentiate these. One way I can think of solving this is by having a function that I can call from the alert template.

I'll be happy to send out a patch after we have clarity on how to handle this.

Replace golint with revive

Per:

level=warning msg="[runner] The linter 'golint' is deprecated (since v1.41.0) due to: The repository of the linter has been archived by the owner. Replaced by revive."

Integrate gops labels in alerting

Tasks

Suggest gops labels in alerting rule form
Full integration of alerting labels and gops labels

Slack mentions should check length of the rendered template instead of the template source

The Slack user and group mentions should check the length of the rendered Go template when deciding whether or not to add a mention to the Slack pretext. Currently, the logic checks the length of the source string for its conditional logic, and renders the template when interpolating it into the Slack pretext.

This is a problem, because if the go template renders into an empty string, a broken mention will be added to the pretext.

To reproduce, use {{- /* empty */ -}} as the "Mention Groups" configuration setting.

alerting/receivers/slack/slack.go

Lines 368 to 380 in 939f557

	if len(sn.settings.MentionGroups) > 0 {
		appendSpace()
		for _, g := range sn.settings.MentionGroups {
			mentionsBuilder.WriteString(fmt.Sprintf("<!subteam^%s>", tmpl(g)))
		}
	}
	if len(sn.settings.MentionUsers) > 0 {
		appendSpace()
		for _, u := range sn.settings.MentionUsers {
			mentionsBuilder.WriteString(fmt.Sprintf("<@%s>", tmpl(u)))
		}
	}
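
One possible shape for the fix is to render each mention first and only emit it when the rendered value is non-empty. The following is a minimal sketch, not the receiver's actual code: buildMentions and fakeRender are hypothetical names, the render parameter stands in for the tmpl helper in the snippet above, and the separator handling is simplified to a single space between mentions.

```go
package main

import (
	"fmt"
	"strings"
)

// buildMentions renders every mention template first and only emits a
// mention when the rendered value is non-empty. The render parameter
// stands in for the tmpl helper from the quoted snippet (assumption).
func buildMentions(groups, users []string, render func(string) string) string {
	var parts []string
	for _, g := range groups {
		if id := strings.TrimSpace(render(g)); id != "" {
			parts = append(parts, fmt.Sprintf("<!subteam^%s>", id))
		}
	}
	for _, u := range users {
		if id := strings.TrimSpace(render(u)); id != "" {
			parts = append(parts, fmt.Sprintf("<@%s>", id))
		}
	}
	return strings.Join(parts, " ")
}

// fakeRender is a hypothetical renderer used only for this example: it
// treats the reproduction template from the issue as rendering to "".
func fakeRender(s string) string {
	if s == "{{- /* empty */ -}}" {
		return ""
	}
	return s
}

func main() {
	// The empty group template no longer produces a broken mention.
	fmt.Printf("%q\n", buildMentions([]string{"{{- /* empty */ -}}"}, []string{"U123"}, fakeRender))
}
```

With this approach the length check and the interpolation both operate on the same rendered string, so they cannot disagree.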

Proposal: Package Structure

.
├── receivers/
│   ├── alertmanager/
│   │   ├── config.go
│   │   └── alertmanager.go
│   └── sensu/
│       ├── config.go
│       └── sensu.go
├── images/
│   └── images.go
├── logging/
│   └── log.go
└── crypto/
    └── crypto.go

Make `processNotifierError` public

And ensure all of its types are public

Slack workflow webhook always results in "failed to send Slack message: unexpected empty response"

What is happening?

Using Slack workflow webhook results in errors thrown in Grafana server code, even though Slack messages are sent successfully.
Example logs:

logger=ngalert.notifier.alertmanager org=1 t=2024-04-24T18:43:15.211109257Z level=error component=alertmanager orgID=1 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="slack/slack[0]: notify retry canceled due to unrecoverable error after 1 attempts: failed to send Slack message: unexpected empty response"

Slack message sent successfully:

Important note:

If running Grafana in HA mode, this results in alerts being sent multiple times, since HA peers think that their peers failed to send the notification, even though it was sent successfully.

What is expected to happen?

Webhook "200 ok" responses are gracefully handled without error.
Using Slack webhook contact point when running in HA mode results in only 1 notification sent.

Steps to reproduce

Create a Slack webhook (docs)
Run a grafana server locally (e.g. docker run --rm -it -p "3000:3000" grafana/grafana-oss:10.4.1)
Create a Slack contact point and enter Slack webhook URL created in step 1
Set default notification policy to use Slack contact point
Create a new Alert with a threshold that causes it to always be firing (e.g. >-1)
Observe errors in Grafana server logs

Other Notes

I believe the problematic code is here, where a JSON object is always expected in the response. However, the Slack docs state that a plain-text "ok" is returned on success: https://api.slack.com/messaging/webhooks#handling_errors

in most cases you'll receive a "HTTP 200" response with a plain text ok indicating that your message posted successfully
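
To illustrate the expected behaviour, here is a minimal sketch of a response check that accepts the plain-text "ok" body before falling back to JSON parsing. checkSlackResponse is a hypothetical helper, not the function used in the receiver, and treating an empty 2xx body as success is an assumption based on the error message in the logs above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// checkSlackResponse is a hypothetical helper. It accepts the plain-text
// "ok" body that incoming webhooks return and only parses JSON when the
// body is something else.
func checkSlackResponse(statusCode int, body []byte) error {
	if statusCode < 200 || statusCode >= 300 {
		return fmt.Errorf("unexpected status code %d: %s", statusCode, body)
	}
	trimmed := bytes.TrimSpace(body)
	// Assumption: an empty 2xx body also means the message was accepted.
	if len(trimmed) == 0 || bytes.Equal(trimmed, []byte("ok")) {
		return nil
	}
	var resp struct {
		OK    bool   `json:"ok"`
		Error string `json:"error"`
	}
	if err := json.Unmarshal(trimmed, &resp); err != nil {
		return fmt.Errorf("failed to parse Slack response: %w", err)
	}
	if !resp.OK {
		return errors.New("slack API error: " + resp.Error)
	}
	return nil
}

func main() {
	fmt.Println(checkSlackResponse(200, []byte("ok")))
	fmt.Println(checkSlackResponse(200, []byte(`{"ok":false,"error":"channel_not_found"}`)))
}
```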

Not calling `GC` on the notification log as part of the Grafana Alertmanager

https://github.com/grafana/alerting/blob/main/alerting/grafana_alertmanager.go#L171

We're not cleaning up old entries unless we do.

Webhook Receiver should support custom headers

The webhook receiver should have the ability to add arbitrary headers to an HTTP request.

When configuring a webhook contact point, it is common for the receiving endpoint to require additional arbitrary headers to process the request (e.g. "X-" headers).

This issue would also require a UI change in Grafana to allow the additional headers to be defined in the UI.

Alertmanager notifier overrides the `image` annotation

from @yuri-tceretian: We need to copy the alert due to it being a pointer and it might be different across contact points
We probably shouldn't use the "image" annotation and instead replace it with __grafana_image___ to signal it's a private one.

UI Bug in provisioned opsgenie alerts

On Provisioned Contact Points for OpsGenie, the UI appears to have a bug where it is requiring the OpsGenie URL be filled in, even though it is defaulting to the correct URL on the backend (we did not provide a URL in our Terraform config for the contact point, relying on the default).

It appears to only surface when you are using the Test button.

Steps to reproduce:

Provision an OpsGenie Contact Point without the Alert URL set through the API
Click Test, run a test
Close the test frame
Now you can no longer click test as it is erroring out on the missing alert URL

Looks like the default is being set here:

alerting/receivers/opsgenie/config.go

Lines 64 to 66 in 2dda1c6

	if raw.APIUrl == "" {
		raw.APIUrl = DefaultAlertsURL
	}

The UI should fill this in on the user facing side as well.

Update Google Hangouts to Google Chat

To do grafana/grafana#70044 in grafana/alerting.

ruleURL should open the alert rule page and not list the rules, previously it used to open the dashboard

https://github.com/grafana/alerting/blob/793c67215ba14d6515b93707caeb7e6d9bf82189/receivers/slack/slack.go#LL290C90-L290C90

[Feature request] Passing alert label value to rendered panel variable

Hi Team,

OS: Red Hat Enterprise Linux release 8.6 (Ootpa)
Grafana: v. 9.1.5
Plugins: Grafana Image Renderer

I have an alert rule for VM CPU usage based on a Google Cloud Monitoring data source.
As required by image rendering for email notifications, this rule is associated with a dashboard/panel by providing DashboardID/PanelID.
However, the panel content depends on the $instance variable set in the dashboard, so the panel actually shows the CPU usage of one specific selected VM.

Anyway, here is what would be nice to have: passing a label value to the rendered panel/dashboard (I'm guessing it should be included in the render URL).
For example, I can currently see this in the log file:

req="url:"https://10.10.20.20:3000/d-solo/0yDwS6W4k/xxx-vm-instances-gcp?orgId=1&panelId=47&render=1\"

and assuming I have a label $metric.label.instance_name="instance_1" in alert rule I need:

req="url:"https://10.10.20.20:3000/d-solo/0yDwS6W4k/xxx-vm-instances-gcp?orgId=1&panelId=47&render=1&instance=instance_1\"

Or maybe it would be possible in another way.
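
For illustration only, this is roughly what appending a variable to the render URL could look like. withDashboardVar is a hypothetical helper. Grafana dashboard URLs normally carry template variables as var-<name> query parameters, so the sketch uses var-instance rather than the plain instance parameter shown above; whether the screenshot request would honor it is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
)

// withDashboardVar is a hypothetical helper that appends a dashboard
// variable to a render URL using the var-<name> query parameter form.
func withDashboardVar(rawURL, name, value string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("var-"+name, value)
	// Encode sorts parameters by key and escapes the values.
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	out, err := withDashboardVar(
		"https://10.10.20.20:3000/d-solo/0yDwS6W4k/xxx-vm-instances-gcp?orgId=1&panelId=47&render=1",
		"instance", "instance_1")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```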

Many thanks in advance !

ged

Alert fires twice with two different image URLs when enabling Azure storage

Hello,

I've configured my Grafana instance to upload alert screenshots to Azure:

GF_EXTERNAL_IMAGE_STORAGE_AZURE_BLOB_ACCOUNT_NAME: <redacted>
GF_EXTERNAL_IMAGE_STORAGE_AZURE_BLOB_ACCOUNT_KEY:  <redacted>
GF_EXTERNAL_IMAGE_STORAGE_AZURE_BLOB_CONTAINER_NAME: <redacted>
GF_EXTERNAL_IMAGE_STORAGE_AZURE_BLOB_SAS_TOKEN_EXPIRATION_DAYS: 14

grafana.ini:
  external_image_storage:
    provider: azure_blob
  unified_alerting.screenshots:
    capture: true
    upload_external_image_storage: true
    max_concurrent_screenshots: 5
imageRenderer:
  enabled: true
  networkPolicy:
    limitIngress: false

It all works well and screenshots are uploaded to Blob storage, but each alert is sent twice to the webhook contact point. The only difference between the two payloads is the imageURL:

First payload:

{
    "receiver": "Webhook",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "imageURL": "https://<public grafana url>/<image path>"
            ...
        }
    ],
    ...

Second payload:

{
    "receiver": "Webhook",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "imageURL": "https://<redacted>.blob.core.windows.net/<rest of the url>"
            ...
        }
    ],
    ...

Is this expected behavior? Is there a way to disable the local upload and only get the notification with the azure blob link?

Dummy 'summary' message for Pagerduty integration discussion

Slack thread: https://grafana.slack.com/archives/C043CEX9MBK/p1689199648823409

Summary:
I ran into a situation recently where the Pagerduty-Grafana integration was failing to send 'alert resolved' notifications to a Pagerduty contact point after the alert resolved. The problem was due to a golang template for the 'summary' section of the contact point being wrapped in an {{ if .Alerts.Firing }} tag (which of course would result in an empty summary upon resolution of an alert). Pagerduty was responding with a 400 every time this empty summary came through, resulting in a failure to auto-resolve a page.

I was asked by @yuri-tceretian to open an issue so y'all could discuss adding "protect the user from their own misconfiguration" dummy text when the summary is empty.

My own thoughts:
The addition of golang templating to grafana alerts in grafana 9 looks really powerful, but it's still pretty complex and not entirely transparent to the userbase yet... I'm not sure I (as a user) would appreciate the addition of an extra 'sane default' in an ecosystem where there's already a lot that can go wrong and not quite as much transparency on the alert compilation process...

... but you guys are the experts, go get 'em. :)
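
To make the discussion concrete, here is a minimal sketch of the "dummy summary" idea: render the summary template and substitute a fallback when the result is empty. The types, the renderSummary function, and the fallback text are all hypothetical and much simpler than the real template data, where Firing is a method rather than a field.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"text/template"
)

// alerts and templateData are simplified stand-ins for the real template
// data (assumption for this example).
type alerts struct {
	Firing []string
}

type templateData struct {
	Alerts alerts
}

// renderSummary executes the summary template and returns the fallback
// text when the rendered summary is empty or whitespace only.
func renderSummary(tmplText string, d templateData, fallback string) (string, error) {
	t, err := template.New("summary").Parse(tmplText)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, d); err != nil {
		return "", err
	}
	if s := strings.TrimSpace(buf.String()); s != "" {
		return s, nil
	}
	return fallback, nil
}

func main() {
	const tmplText = "{{ if .Alerts.Firing }}{{ len .Alerts.Firing }} firing{{ end }}"
	// No firing alerts: the template renders empty, so the fallback is used.
	s, err := renderSummary(tmplText, templateData{}, "(no summary provided)")
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

The trade-off raised above still applies: a silent fallback keeps resolve notifications flowing, but it also hides the misconfiguration from the user.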

Review discrepancies between Grafana and Alertmanager contact points

This is part of the process of unification of Alertmanagers (Mimir and Grafana). Grafana notifiers are derived from the legacy alerting system and have different formats of configuration, use different ways of communicating with the APIs, etc. Therefore, to unify (or not) notifiers we need to review the discrepancies, document them, and then make decisions in each case.

Alertmanager notifiers:

Notifier	Mimir	Grafana	Details
Alertmanager	🔲	✅
Ding Ding	🔲	✅
Discord	✅	✅
Email	✅	✅	link
Google Chat	🔲	✅
Kafka	🔲	✅
LINE	🔲	✅
OPS Genie	✅	✅	link
PagerDuty	✅	✅	link
Pushover	✅	✅
Sensugo	🔲	✅
Slack	✅	✅
SNS	✅	🔲
MS Teams	🔲	✅
Telegram	✅	✅	link
Threema	🔲	✅
VictorOps	✅	✅
Webex	✅	✅
Webhook	✅	✅
WeChat	✅	🔲
WeCom	🔲	✅

Global difference:

In Alertmanager there is a flag send_resolved, whereas in Grafana it is called disableResolveMessage.
Grafana notifiers support images.

Add Amazon SNS as receiver

As noted here Amazon SNS is missing as an alert receiver in Grafana, while supported in Mimir

This package should support Amazon SNS as a receiver to enable Grafana to send alert notifications to SNS for Grafana-managed alerts. Amazon Managed Grafana supports this today, but it is not present in the open source Grafana.

I can contribute this.

Cleanup

Remove the KV comment related to Grafana, see #4 (comment)
Create and simplify a crypto interface #11 (comment)
Revise if we need nilPeer around #11 (comment)

Implementing streaming mode for kafka plugin

Recently, we have been hitting rate limits with the Kafka plugin, which calls the Kafka REST v3 endpoint to send alerts.
We have a Druid data source with some rules defined. Grafana periodically evaluates data fetched from Druid and sends an event with all the necessary metadata to one of our Kafka topics, where it is processed further to send an alert.
For every (cluster, rule) pair we have, Grafana sends a unique event to Kafka. This can lead to a large influx of events to our Kafka topic, which is why we are hitting rate limit errors.
I took a look at the Grafana codebase and here are my observations:
Grafana sends the event to the Kafka topic using the Kafka REST V3/V2 API as defined here (notifyWithAPIV3). We use V3 in our case.
This call eventually reaches this file and calls sendWebRequestSync, which essentially makes an HTTP POST request.

The following client is used, along with the defined transport:

var netTransport = &http.Transport{
	TLSClientConfig: &tls.Config{
		Renegotiation: tls.RenegotiateFreelyAsClient,
	},
	Proxy: http.ProxyFromEnvironment,
	Dial: (&net.Dialer{
		Timeout: 30 * time.Second,
	}).Dial,
	TLSHandshakeTimeout: 5 * time.Second,
}
var netClient WebhookClient = &http.Client{
	Timeout:   time.Second * 30,
	Transport: netTransport,
}

The http.Client internally maintains a pool of persistent TCP connections per host to improve efficiency of requests, which can be controlled using some transport parameters.
In Grafana's case, the transport does not define these two parameters: MaxIdleConnsPerHost and MaxConnsPerHost.
Hence the default values are used: MaxIdleConnsPerHost = 2, MaxConnsPerHost = 0 (no limit).
From the documentation:

// MaxIdleConns controls the maximum number of idle (keep-alive)
// connections across all hosts. Zero means no limit.
MaxIdleConns int

// MaxIdleConnsPerHost, if non-zero, controls the maximum idle
// (keep-alive) connections to keep per-host. If zero,
// DefaultMaxIdleConnsPerHost is used.
MaxIdleConnsPerHost int

When Grafana has a large number of events to send to our Kafka topic, it can open as many connections to the Kafka REST host as it needs to fulfil the requests.
Since it keeps at most two connections per host in its idle pool, all the other connections are created and never reused.
For example, if we receive 15 concurrent requests, the client creates 15 connections to the host, 13 of which are closed soon afterwards because at most 2 idle connections per host are kept.
Our Kafka has a rate limit of 25 connections per second, so the limit is breached when there is a large volume of events. Closing the connection after every request is also recommended, but this may not be feasible.

Tentative solutions:
Using the Kafka V3 streaming mode: The V3 API supports streaming, which lets us send 1000 requests per second. We could modify the code to send multiple request bodies over a single HTTP connection at the application level. I have implemented a basic version of streaming as a POC, but it is not entirely correct. I am able to send requests over a single connection (streaming), but I am not able to read the responses in streaming mode: the entire response is read all at once. So if I send 200 requests over the connection, the response waits until all 200 are sent, and only then can I read the responses, all 200 together.

Looking at the grafana code, integrating it with the current implementation would need a big overhaul. As of now, each thread receives a single alert request, and we send the post request.

We also would need to specify the parameters MaxIdleConnsPerHost , MaxConnsPerHost to set an appropriate limit per host to take into consideration the rate limits.
Looking for inputs on how to go about this. Thank you for your valuable time!

ValueString in Grafana 10

I noticed this comment here:

alerting/templates/template_data.go

Line 41 in 793c672

ValueString string `json:"valueString"` // TODO: Remove in Grafana 10

Are there plans to replace ValueString with something easier to work with? I'm currently parsing this string in order to associate my Graphite query string with the float64 value, since

Values        map[string]float64 `json:"values"`

does not offer the context I need (it only maps the var name to the number whereas I need the name label).

Include screenshot URL to Telegram message

Currently, Telegram integration sends two messages when an alert has a screenshot attached to it.
However, Grafana screenshot service can upload screenshots to a remote host and provides URL to the image in the alert.

When Grafana is configured this way, we probably can let users control how screenshots are sent by Telegram integration: add URL to the first message or send it as the second message.

Add Everbridge for contact point

I am looking to have the built-in alert manager within grafana send to the everbridge endpoint, but due to the issue here:
https://community.grafana.com/t/alerting-webhook-contact-point-authorization-header/73220, I am unable to do so because of custom headers. I see the following PR has been open for a long while and hasn't merged in. this could also work but there's no OOTB solution for me to send to this endpoint currently. #135.

How can I best request the ability to either increase the flexibility of the web hook contact point, or add an additional one?

Cheers!

Consistent error names in the images package

Error names in our images package use the singular and plural forms of "image(s)" interchangeably. We should remove "Image(s)" from the error names to avoid inconsistency and stuttering when using e.g. images.ErrImagesNoURL.

Marshalling an `alerting.GrafanaReceiver` into an `apimodels.PostableGrafanaReceiver` is a pain

In Grafana, we're keeping the core types of apimodels right where they belong: in Grafana. However, we want to avoid cyclic dependencies, so we can't bring apimodels into grafana/alerting.

Because of this, we have to do a two-way conversion of Grafana receivers to and from the alerting package. Let's figure out a way to unify them.

Slack allows configuration of both incoming webhooks and bot tokens at the same time

The Slack receiver allows configuration of both incoming webhooks and bot tokens at the same time. This is a bug. The intended behaviour is that users configure one or the other, but not both. The UI prevents users from being able to set both, but I think it's still possible when using the Provisioning API.

There are a number of incorrect tests that also assert both incoming webhooks and bot tokens can be configured at the same time. This is incorrect and should be fixed too.
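
A minimal sketch of the intended validation, under simplified assumptions: slackSettings and validate are hypothetical, and the real receiver configuration has more fields and defaults than are shown here.

```go
package main

import (
	"errors"
	"fmt"
)

// slackSettings is a simplified, hypothetical view of the receiver config.
type slackSettings struct {
	URL   string // incoming webhook URL
	Token string // bot token
}

// validate enforces that exactly one delivery method is configured.
func validate(s slackSettings) error {
	switch {
	case s.URL != "" && s.Token != "":
		return errors.New("only one of incoming webhook URL or bot token may be set")
	case s.URL == "" && s.Token == "":
		return errors.New("either an incoming webhook URL or a bot token must be set")
	}
	return nil
}

func main() {
	fmt.Println(validate(slackSettings{URL: "https://hooks.slack.com/services/T000/B000/XXX", Token: "xoxb-example"}))
	fmt.Println(validate(slackSettings{Token: "xoxb-example"}))
}
```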

[Feature Request] add more alerting "default" templates

So, this is what made me give up on moving to Grafana unified alerting for a year:

Here is a default slack notification from grafana 8 with a basic alert:

This shows in very clear ways:

where this happened
which alert
the value that broke the threshold
All done by defining slack as destination and creating a simple alert.

While the new alerting:
If I do the same, I get an ugly alert

After working a lot, I managed to implement a generic template ... but I need to define each alert to emit the exact right data.
This is quite brittle and quite hard to maintain when having thousands of alerts.

=> would it be possible to see if a default template could be added that would look like the old one ? ( I guess it might be difficult - so maybe just for "classic_conditions" alerts?)
We could then define a receiver for those alerts, making life so much easier for people starting with grafana alerting or migrating

Pending period in the Grafana UI is bound to the wrong value

It seems pending period is bound to the evaluation interval when the UI for an alert rule is loaded even if there is a custom pending period defined.
In the following example an evaluation group has been assigned with the Evaluation interval of 1m. A pending period of 30m has been configured, incorrectly visualized as 1m. If you look in the edit evaluation group modal you can see the rule having a pending period of 30m. If I save the rule now the pending period will become 1m.

Grafana v11.1.0-69051 (9d44c8e8cf)

Support for slack mrkdwn

It would be helpful if the Slack receiver permitted Slack mrkdwn in the text of Slack notifications. This would allow us to add mentions and text hyperlinks to the body of Slack notifications. This could significantly improve the readability of our slack alerts.

Prometheus makes this configurable using the mrkdwn_in setting for the Slack receiver.

Alert notifications are lost when grafana restarts

What happened:
When Grafana restarts, the Alertmanager, which handles notifications, loses all alerts. It keeps only the notification log (whether each alert was firing or resolved, and when that happened) and has no knowledge of notifications that were pending at the time of the restart. Repopulating the Alertmanager can take up to 2-3 evaluation intervals of a rule.

What you expected to happen:
It would help if the AlertManager keeps a history of pending notifications and sends them after it restarts (or after the first rule evaluation if the alert is still firing).

Environment:
Grafana version: 10.2.2

Customize "OPEN IN GRAFANA" widget in Google Chat

Hello,

I recently upgraded Grafana to version 9.x. Previously, the "OPEN IN GRAFANA" button linked to the panel the alert comes from.

But now it always links to the alert list page, which isn't useful for analysing the alert itself because the alert list does not show the metrics.

I'd like to request a way to customize this button, maybe from a template or something similar.

Thanks.

Slack: add an option to send the image as a new message

When Slack is configured via a bot token, it can only send images as part of the thread of the alert message.
Would it be possible to add more options?
Whatever is easiest, between:

same as old notification where the screen shot is a separate message that comes right after the alert message.
or as part of the message itself (either via templating or via an option).

Allow customization of Slack thread message (for images)

Currently it's not possible to customize the slack comment: https://github.com/grafana/alerting/blob/main/receivers/slack/slack.go#L506C6-L532

Templating: allow to transform json text to object

A custom annotation or label containing JSON as text should be transformable into an object.
For example

[
  {
    "annotations": {
      "summary": "Instance instance1 has been down for more than 5 minutes",
      "data": "{\"field1\": \"value1\"}"
    },
    "labels": {
      "instance": "instance1"
    },
    "startsAt": "2024-03-12T14:09:43.689Z"
  }]

We should be allowed to do something like this in a template:

{{ range (data .Annotations.data).SortedPairs }} - {{ .Name }} = {{ .Value }}
{{ end }}
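
For illustration, here is a minimal sketch of how such a function could be exposed with the standard text/template package. The data function name follows the request above, but parseJSONObject and renderAnnotation are hypothetical, and the sketch ranges over a plain map instead of using SortedPairs (text/template visits map keys in sorted order).

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"text/template"
)

// parseJSONObject is a hypothetical template function that turns a JSON
// object stored as text into a map the template can range over.
func parseJSONObject(s string) (map[string]interface{}, error) {
	var out map[string]interface{}
	if err := json.Unmarshal([]byte(s), &out); err != nil {
		return nil, err
	}
	return out, nil
}

// renderAnnotation registers the function as "data" and lists each field.
func renderAnnotation(annotation string) (string, error) {
	const text = "{{ range $name, $value := data . }}- {{ $name }} = {{ $value }}\n{{ end }}"
	t, err := template.New("msg").Funcs(template.FuncMap{"data": parseJSONObject}).Parse(text)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, annotation); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderAnnotation(`{"field1": "value1"}`)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Because the function returns an error as its second value, invalid JSON in the annotation aborts template execution instead of rendering a partial message.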
