samber / awesome-prometheus-alerts

🚨 Collection of Prometheus alerting rules

Home Page: https://samber.github.io/awesome-prometheus-alerts/

License: Other

prometheus alertmanager alert rule collection awesome monitoring alerting query promql

awesome-prometheus-alerts's People

Contributors

agrinfauzi, alexandrumarian-portal, apmartins85, asluck, asteny, bdossantos, billabongrob, dependabot[bot], faust64, fsschmitt, jdorel, jlosito, jpds, kongslund, mcrauwel, meoww-bot, michaelact, mikael-lindstrom, nabilbendafi, ozarklake, perlun, robert-will-brown, roock, samber, strangeman, testtest2227, timp87, tosin-ogunrinde, vietdien2005, yasharne


awesome-prometheus-alerts's Issues

HostUnusualDiskReadLatency

Hello, I do not understand: why is the read time divided by the total number of reads?

  - alert: HostUnusualDiskReadLatency
    expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host unusual disk read latency (instance {{ $labels.instance }})"
      description: "Disk latency is growing (read operations > 100ms)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

Context switch rate rule always alerts

Hello,

Thanks for your prometheus alerts which are, indeed, well and truly awesome! They've been very helpful in migrating a project away from icinga2 towards prometheus.

There's one alert here that I find will always trigger: the node alert named ContextSwitching. If I plot the graph for the query, I find that idle servers generally sit at around 2100 context switches, while a moderately busy one shows 50.5k.

These are generally multi-core processors; does that factor into it at all? Whatever the case, I think there is an issue with the PromQL expression that you might want to be aware of. Thanks again.
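One possible adjustment (a sketch, not taken from the repository) is to normalise the context-switch rate by the number of CPU cores before comparing it against a threshold, which still needs tuning per workload:

  - alert: HostContextSwitchingPerCore
    # Hypothetical per-core variant; the 10000 threshold is a placeholder.
    expr: rate(node_context_switches_total[5m]) / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 10000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host context switching (instance {{ $labels.instance }})"
      description: "Context switch rate per core is high\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"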

Autogenerate a "real" set of rules files consumable by prometheus

Copying and pasting these rules is OK, but it would be easier if there were a generated set of rule files, built from rules.yml, that would "just work".

Is there any reason this isn't generated? If not, does anyone know of templating tools that could achieve this?
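For illustration, the generation step would roughly map the site's data entries (the format quoted in the "Refactor" issue below) onto a Prometheus-consumable rule group. A sketch of the input and output shapes, reusing one rule from this page as the example; the group name, for, labels and annotations are filled in with the defaults used elsewhere on this page:

# Source entry in the site's data file:
# - name: Prometheus rule evaluation failures
#   description: 'Prometheus encountered {{ $value }} rule evaluation failures.'
#   query: 'increase(prometheus_rule_evaluation_failures_total[3m]) > 0'
#
# Generated rule file that Prometheus can load directly:
groups:
  - name: prometheus-self-monitoring
    rules:
      - alert: PrometheusRuleEvaluationFailures
        expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus rule evaluation failures (instance {{ $labels.instance }})"
          description: "Prometheus encountered {{ $value }} rule evaluation failures\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"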

Add Prometheus alert for openebs

for https://openebs.io/

- alert: openebs_used_pool_capacity_percent
  expr: (openebs_used_pool_capacity_percent) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "OpenEBS pool uses more than 80% of its capacity (instance {{ $labels.instance }})"
    description: "OpenEBS pool uses more than 80% of its capacity\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

KubernetesPodNotHealthy expr problem

  - alert: KubernetesPodNotHealthy
    expr: min_over_time(sum by (namespace, pod, env, stage) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h]) > 0
    for: 5m
    labels:
      severity: error
    annotations:
      summary: "Kubernetes Pod not healthy (instance {{ $labels.instance }})"
      description: "Pod has been in a non-ready state for longer than an hour.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

I want to use this, but the expr doesn't seem right. I get an error like:

Error executing query: invalid parameter 'query': 1:107: parse error: ranges only allowed for vector selectors

If I use min_over_time(sum by (namespace, pod, env, stage) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:]) instead, the result is OK.
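For completeness, a full rule using the subquery syntax (a sketch based on the fix above; subqueries require Prometheus 2.7 or later):

- alert: KubernetesPodNotHealthy
  expr: min_over_time(sum by (namespace, pod, env, stage) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:]) > 0
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "Kubernetes Pod not healthy (instance {{ $labels.instance }})"
    description: "Pod has been in a non-ready state for longer than an hour.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"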

ContainerCpuUsage reports kubernetes-cadvisor metrics

kubernetes-cadvisor is a process that runs in the kubelet. Consider changing the query for ContainerCpuUsage from

(sum(rate(container_cpu_usage_seconds_total[3m])) BY (instance, name) * 100) > 80

to

(sum(rate(container_cpu_usage_seconds_total{job!="kubernetes-cadvisor"}[3m])) BY (instance, name) * 100) > 80

so that the alert won't fire constantly for cadvisor?


Change in network average transmit

This is a good way to be notified in case of an intrusion, a DDoS attack, or just a dumb user doing something stupid with a server.

If average traffic increases a lot over the last 5 min. compared to the last 5h, you get an alert.

(sum by (instance) (avg_over_time(node_network_transmit_bytes_total[5m])))/(sum by (instance) (avg_over_time(node_network_transmit_bytes_total[5h])))

(sum by (instance) (avg_over_time(node_network_receive_bytes_total[5m])))/(sum by (instance) (avg_over_time(node_network_receive_bytes_total[5h])))
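Wrapped into an alert rule, this could look like the sketch below. Note that it swaps avg_over_time on the raw counters for rate(), so the comparison tracks throughput rather than counter growth; the 2x ratio and the durations are placeholders to tune. A matching rule for node_network_receive_bytes_total would follow the same shape.

- alert: HostUnusualNetworkTransmit
  expr: sum by (instance) (rate(node_network_transmit_bytes_total[5m])) / sum by (instance) (rate(node_network_transmit_bytes_total[5h])) > 2
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Unusual network transmit rate (instance {{ $labels.instance }})"
    description: "Transmit traffic over the last 5m is more than twice the 5h average\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"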

Add Elastic Search Alert - Disk Usage

Hi,

An idea to add for Elasticsearch:

    - alert: ElasticsearchDiskUsageTooHigh
      expr: ( 1-(elasticsearch_filesystem_data_available_bytes{}/elasticsearch_filesystem_data_size_bytes{}) ) * 100 > 90
      for: 5m
      labels:
        severity: error
      annotations:
        summary: "Elasticsearch Disk Usage Too High (instance {{ $labels.instance }})"
        description: "The disk usage is over 90% for 5m\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
    - alert: ElasticsearchDiskUsageWarning
      expr: ( 1-(elasticsearch_filesystem_data_available_bytes{}/elasticsearch_filesystem_data_size_bytes{}) ) * 100 > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Elasticsearch Disk Usage warning (instance {{ $labels.instance }})"
        description: "The disk usage is over 80% for 5m\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

CorednsPanicCount

I use the Helm chart prometheus-9.1.0. The alert rule:

- alert: CorednsPanicCount
  expr: increase(coredns_panic_count_total[10min]) > 0
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "CoreDNS Panic Count (instance {{ $labels.instance }})"
    description: "Number of CoreDNS panics encountered\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

seems to be incorrect; I see in the log:


 \"CorednsPanicCount\": could not parse expression: parse error at char 36: bad duration syntax: \"10mi\""
level=error ts=2019-09-13T15:07:29.344Z caller=main.go:757 msg="Failed to apply 

Something strange: it looks like 10min becomes 10mi in the log.
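Prometheus durations only accept the units ms, s, m, h, d, w and y, so 10min is not a valid duration. The same rule with a valid duration (a sketch, keeping the metric name quoted above):

- alert: CorednsPanicCount
  expr: increase(coredns_panic_count_total[10m]) > 0
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "CoreDNS Panic Count (instance {{ $labels.instance }})"
    description: "Number of CoreDNS panics encountered\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"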

Translate rules from node_exporter to Netdata metrics

Do you think it would be worthwhile to translate all those rules from node_exporter to Netdata metrics, so a different source can be used for the metrics?

I use Netdata metrics as a source for Prometheus, so I will translate the rules I consider interesting for my own use. I can share them if you think that could be useful.

Check node exporter alerts, please

Since node_exporter 0.16, all metrics related to sizes in bytes have had "_bytes" appended to their original names.
E.g. node_filesystem_free became node_filesystem_free_bytes.
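For example, a free-disk-space expression written for older exporters needs the renamed metrics (a sketch showing both forms):

# node_exporter < 0.16:
#   (node_filesystem_avail / node_filesystem_size) * 100 < 10
# node_exporter >= 0.16:
(node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10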

Refactor: write more accurate descriptions for faster troubleshooting

Example:

From:

- name: Prometheus rule evaluation failures
  description: 'Prometheus encountered {{ $value }} rule evaluation failures.'
  query: 'increase(prometheus_rule_evaluation_failures_total[3m]) > 0'

To:

- name: Prometheus rule evaluation failures
  description: 'Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.'
  query: 'increase(prometheus_rule_evaluation_failures_total[3m]) > 0'

An "effect" field would enable us to improve the alert template.

How to filter out master in ReplicationLag

I have data from postgres-exporter: https://github.com/socialwifi/docker-postgres-exporter/blob/master/queries.yaml#L1
But it also returns a very big value for the master node. I don't know how to filter it out at the Alertmanager level.

604800 > pg_replication_lag > 10

It looks crude, but it works fine, assuming alerts should fire and be fixed within one week.

However, the master node can change at any time, so the replication lag for the master will be lower during the change, and I keep looking for a better solution.
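One option is to only evaluate the lag on standbys. This is a sketch that assumes you also export a hypothetical pg_replication_is_replica gauge via a custom query (1 when pg_is_in_recovery() is true, 0 on the primary) and joins against it so the primary drops out:

- alert: PostgresqlReplicationLag
  # pg_replication_is_replica is an assumed custom metric, not part of the default exporter queries.
  expr: pg_replication_lag > 10 and on (instance) pg_replication_is_replica == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "PostgreSQL replication lag (instance {{ $labels.instance }})"
    description: "Replication lag on a standby is above 10 seconds\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"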

MDRaid Alert

It would be handy if we could add an MDRaid alert for md RAID array degradation. Here's what I've got:

  - alert: MDRaidDegrade
    expr: (node_md_disk - node_md_disk_active) != 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "CRITICAL - Node {{ $labels.instance}} has DEGRADED RAID."
      description: "CRITICAL - Node {{ $labels.instance}} has DEGRADED RAID {{$labels.device}}. VALUE - {{ $value }}.

MysqlRestarted will never fire

Looking at this alert

  - alert: MysqlRestarted
    expr: mysql_global_status_uptime < 60
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "MySQL restarted (instance {{ $labels.instance }})"
      description: "MySQL has just been restarted, less than one minute ago on {{ $labels.instance }}.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

we need to change either for or expr, because mysql_global_status_uptime is in seconds and it won't stay < 60 for 5 minutes.

I'm very new to Prometheus, and I have these two ideas:

1.

    expr: mysql_global_status_uptime < 70
    for: 1m

This will fire an alert almost immediately after a MySQL restart.

2.

    expr: mysql_global_status_uptime < 310
    for: 5m

This will fire after 5 minutes.

I would personally go with the first option, but let me know if I'm mistaken about anything here.

Information about absent values

I recently checked the alerts of our cluster and found that some of them would never fire, as the value we wanted to be alerted on was not exported.

I tackled it by adding an alert that fires on "absent" values:
expr: absent(up{job="myjob"})

Might it be helpful to include information about this issue in general?
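A full rule for that pattern might look like the following sketch (myjob is a placeholder for your own scrape job):

- alert: MetricsAbsent
  # Fires when the job's up series disappears entirely, so alerts that
  # depend on its metrics don't silently stop evaluating.
  expr: absent(up{job="myjob"})
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Metrics absent (job {{ $labels.job }})"
    description: "No up metric received from job myjob\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"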

Restrict Fstype for Disk Fill Alert

After adding alert 2.7, I noticed that I'm getting disk-fill alerts for the /run/user/1113 mountpoint, since the rule also applies to tmpfs filesystems. Perhaps we should restrict the rule to ext4, xfs, etc., or maybe add a not-equal condition to rule out tmpfs?

predict_linear(node_filesystem_free_bytes{fstype=~"ext4|xfs"}[1h], 4 * 3600) < 0
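Wrapped into a full rule, that could look like this sketch (the fstype list is only an example; adjust it to the filesystems you care about):

- alert: HostDiskWillFillIn4Hours
  expr: predict_linear(node_filesystem_free_bytes{fstype=~"ext4|xfs"}[1h], 4 * 3600) < 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host disk will fill in 4 hours (instance {{ $labels.instance }})"
    description: "Filesystem is predicted to run out of space within the next 4 hours at the current write rate\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"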

Change Rule Status Code

I think we should change this expr

probe_http_status_code <= 199 OR probe_http_status_code >= 300

to

probe_http_status_code <= 199 OR probe_http_status_code >= 400

Because status codes 301 and 302 are HTTP redirects (e.g. HTTP to HTTPS) and are valid.
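A full rule with the relaxed range could look like this sketch (whether 3xx responses should count as healthy also depends on whether your blackbox probe module follows redirects):

- alert: BlackboxProbeHttpFailure
  expr: probe_http_status_code <= 199 or probe_http_status_code >= 400
  for: 5m
  labels:
    severity: error
  annotations:
    summary: "Blackbox probe HTTP failure (instance {{ $labels.instance }})"
    description: "HTTP status code is not 200-399\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"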

7.7. Dead locks

pg_stat_database_deadlocks{pg_stat_database_de}[1m]
is an invalid expression

7.7. Dead locks
PostgreSQL has dead-locks

  - alert: DeadLocks
    expr: rate(pg_stat_database_deadlocks{pg_stat_database_de}[1m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Dead locks (instance {{ $labels.instance }})"
      description: "PostgreSQL has dead-locks\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

node-exporter nfs alerts

I collect NFS stats using node_exporter and I want to monitor them, but I have no idea how to set the thresholds, for example for NFS request latency. Can someone give me advice about thresholds for NFS stats?

rabbitmq rules don't work

Hello,
Do you maybe have updated RabbitMQ rules? Most of these are not working.
The metrics you have specified in the expressions do not exist, so I have made similar ones (kind of). Any feedback is appreciated, and thanks.

CPU load

I'm currently testing some of your alert rules and stumbled across the CPU load snippet. I'm not 100% sure how exactly it works, but if I've got this straight, it should trigger an alert when the average CPU usage over 30 minutes is greater than 2 ... right?

- alert: CpuLoad
  expr: node_load15 / (count without (cpu, mode) (node_cpu_seconds_total{mode="system"})) > 2
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "CPU load (instance {{ $labels.instance }})"
    description: "CPU load (15m) is high\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

I ran the stress tool yesterday for ~1.5h, which pushed CPU usage to 100% for that timespan. During the load test I executed the following expression as a Prometheus query, and the returned value crawled upwards towards 1.0 but never quite reached it:

node_load15 / (count without (cpu, mode) (node_cpu_seconds_total{mode="system"}))

As expected, no alert was triggered. Am I doing something wrong, or is something wrong with the CPU load snippet?

UPDATE: I found that the following expr seems to deliver more accurate results in my case. I'm not quite sure why one would want to track mode="system" (as in your expr) instead of inverting the CPU's mode="idle" metric.

- alert: CpuLoad
  expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 70
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "High CPU load (instance {{ $labels.instance }})"
    description: "CPU load is high\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

Consensus on severity label

Is there any standard or agreed meaning on the severity levels?

I can't see anywhere that documents it; apologies if it exists.

Add a "for" clause to each alert

Today, we add for: 5m to every alert automatically.

Use case: mysql_global_status_uptime < 60 would never trigger with a fixed 5-minute for clause.

Each alert should be able to define its own for value; 5m can remain the default.

Progression:

Basic resource monitoring

  • Prometheus self-monitoring
  • Host/Hardware
  • Docker Containers
  • Blackbox
  • Windows

Databases and brokers

  • MySQL
  • PostgreSQL
  • PGBouncer
  • Redis
  • MongoDB
  • RabbitMQ
  • Elasticsearch
  • Cassandra
  • Zookeeper
  • Kafka

Reverse proxies and load balancers

  • Nginx
  • Apache
  • HaProxy
  • Traefik

Runtimes

  • PHP-FPM
  • JVM
  • Sidekiq

Orchestrators

  • Kubernetes
  • Nomad
  • Consul
  • Etcd
  • Linkerd
  • Istio

Network and storage

  • Ceph
  • ZFS
  • OpenEBS
  • Minio
  • Juniper
  • CoreDNS

Other

  • Thanos

Docker Containers Alert Rules

Hi,
Thanks for this great overview!
I am testing the rules and I have a question about them.

I just implemented these rules: https://awesome-prometheus-alerts.grep.to/rules#docker-containers. They are firing, but I am not sure if the query is correct.

The following queries:

  • 3.2. Container CPU usage
  • 3.3. Container Memory usage
  • 3.4. Container Volume usage
  • 3.5. Container Volume IO usage

are all using a sum by (ip), but those metrics do not carry an ip label.

I am running following versions:

  • cAdvisor v0.32.0
  • prometheus v2.11.1
  • alertmanager v0.18.0

Are there some requirements to get the ip label, or should it be changed to instance?
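Since the cAdvisor metrics carry instance and name labels rather than ip, one adjustment (a sketch mirroring rule 3.2) would be:

- alert: ContainerCpuUsage
  expr: (sum by (instance, name) (rate(container_cpu_usage_seconds_total[3m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Container CPU usage (instance {{ $labels.instance }})"
    description: "Container CPU usage is above 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"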

There's no easy way to copy-paste the rules

One thing that was slightly frustrating when copying these rules over to my deployment: if you just want to chuck all the alerts into the relevant YAML file, you have to copy-paste each one individually.

Some sort of presentation that would allow you to get a relevant exporter's rules in one solid chunk for ease of copypasta would be very welcome.

Should cadvisor metrics filter out metrics with no name?

Hello and thanks for this great list of alerts!

I've been using them for a while on my Raspberry Pi cluster and have noticed that, although I pass --docker_only=true, for each metric there is a series with no name label (I don't know what it stands for), which can trigger alerts quite often. I name all my containers, and when I filter with {name != ""} I get just the metrics for the containers and no false alarms.

Should the cAdvisor alerts filter out these metrics with no name label, or am I missing something here? What do they signify, especially since I run cAdvisor with --docker_only=true?
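For reference, the filter applied to the CPU expression from the Docker rules looks like this (a sketch; the same name!="" matcher can be added to the memory and volume rules):

(sum by (instance, name) (rate(container_cpu_usage_seconds_total{name!=""}[3m])) * 100) > 80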

Best regards

Panos
