Hello Maybe this is a simple question and you can help me quickly. I have 3 ku

labels "cluster" in alerts rules, several k8s clusters,about victoriametrics/helm-charts

Comments (28)

schechev-a commented on June 16, 2024 1

Thank you very much, you helped a lot!

from helm-charts.

Haleygo commented on June 16, 2024

Hello @schechev-a
Can you see the cluster label when you query the cluster2 vmselect directly?


schechev-a commented on June 16, 2024

Hello Haleygo

For example query:
sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace)
/
sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace) > ( 50 / 100 )

I get:

{container="cron",namespace="production-name",pod="name_pod"}
avg:0.96115, min:0.94102, max:0.97421, last:0.96542

I did it in:

If I add cluster to the by clause, i.e. (container, pod, namespace, cluster),
the label then shows up in the result.

Which of these settings should I keep so that the cluster label is present?
vmagent.spec:
  externalLabels:
    cluster: vm-agent-1{2,3}
  extraArgs:
    remoteWrite.label: cluster=vm-agent-1{2,3}
  remoteWriteSettings:
    label:
      cluster: vm-agent-1,{2,3}


Haleygo commented on June 16, 2024

externalLabels:
  cluster: vm-agent-1{2,3}

should be enough.

Can you share one of your alerting rules? Since the raw series do carry the cluster label, do your rules contain a clause like by (container, pod, namespace)? That clause removes every label besides those three.


schechev-a commented on June 16, 2024

I'm using rules from helm-chart victoriametrics-stack
https://github.com/VictoriaMetrics/helm-charts/blob/master/charts/victoria-metrics-k8s-stack/templates/rules/kubernetes-resources.yaml


schechev-a commented on June 16, 2024

I also added a custom rule, there is also no cluster label

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  namespace: {{ .Release.Namespace }}
  name: kube-additional-rules
  labels:
    app: {{ include "victoria-metrics-k8s-stack.name" $ }}
spec:
  groups:
  - name: kubernetes-metrics-cpu
    rules:
      - alert: Node-High-LoadAverage
        expr: node_load5 > (count without (cpu, mode) (node_cpu_seconds_total{mode="system"})) * 5
        for: 30m
        labels:
          severity: critical
        annotations:
          #description: Host high CPU load-average (instance {{ "{{" }} $labels.instance }})
          #summary: Host high CPU load (instance {{ "{{" }} $labels.instance }})
      - alert: Node-High-CPULoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          #summary: 'Host high CPU load (instance {{ "{{" }} $labels.instance }})'
          description: CPU load is > 80%\n 
      - alert: Node-Memory-Usage
        expr: (((node_memory_MemTotal-node_memory_MemFree-node_memory_Cached)/(node_memory_MemTotal)*100)) > 75
        for: 2m
        labels:
          severity: critical
        annotations:
          #summary: High memory usage detected"
          #description: "Host memory usage (instance {{ "{{" }} $labels.instance }})"


Haleygo commented on June 16, 2024

Seems OK to me. Can you check the vmalert UI to see whether the alerts got the right labels?
You can open http://{{vmalert-addr}}/vmalert/alerts and check there.


schechev-a commented on June 16, 2024

No labels


Haleygo commented on June 16, 2024

No labels

As I mentioned earlier, the expr contains a by (xx.xx.xx) clause, which drops all other labels, including cluster.
Can you check other rules, like your custom rule

      - alert: Node-Memory-Usage
        expr: (((node_memory_MemTotal-node_memory_MemFree-node_memory_Cached)/(node_memory_MemTotal)*100)) > 75
        for: 2m
        labels:
          severity: critical
        annotations:
          #summary: High memory usage detected"
          #description: "Host memory usage (instance {{ "{{" }} $labels.instance }})"

and see if it has the cluster label.
If not, try querying node_memory_MemTotal directly in vmui or grafana to see whether it carries the cluster label.
If it doesn't, node_memory_MemTotal could be produced by a recording rule; you can check its expr to see where the label gets dropped.


schechev-a commented on June 16, 2024

I don't have data for this request

And if I check other queries without by (X.X.X.X) the data is not shown to me


Haleygo commented on June 16, 2024

Let me try to explain this from the beginning:
vmagent_2 --> vmcluster -> vmalert -> vmalertmanager -> oncall

The vmagent with external_label adds the cluster label to all metrics and sends them to vmcluster. You can check this by querying up; you should see the cluster label.
Metrics are stored in vmcluster, and vmalert evaluates each rule's expr against them. If the query result (which generates the alert messages you receive from alertmanager) has no cluster label, the expr is most likely dropping it. For example, the expr sum(up) by (job) drops every label except job.

So if you want the cluster label in the final result (the alert messages), you need to make sure of two things:

The raw metrics you query carry the cluster label themselves. You can check this by querying them directly, like up or container_cpu_cfs_xxx.
The expr doesn't remove the labels you need. If it does, you need to modify it by adding cluster to the by clause, like sum(increase(container_cpu_cfs_periods_total{}[5m])) by (cluster, container, pod, namespace)
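
As a rough sketch of that second point, a custom VMRule could restate the throttling query from earlier in the thread with cluster kept in both by clauses. The resource name, group name, alert name and the `for` duration below are illustrative assumptions, not the chart's shipped rule:

```yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: cpu-throttling-with-cluster        # hypothetical name
spec:
  groups:
    - name: cpu-throttling                 # hypothetical group name
      rules:
        - alert: CPUThrottlingHighPerCluster   # hypothetical alert name
          # cluster appears in both by() clauses, so it survives the
          # aggregation and ends up on the resulting alert
          expr: |-
            sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (cluster, container, pod, namespace)
              /
            sum(increase(container_cpu_cfs_periods_total{}[5m])) by (cluster, container, pod, namespace)
              > (50 / 100)
          for: 15m                         # assumed duration
          labels:
            severity: warning
```

Both sides of the division need the same by list; otherwise the two result sets no longer match one-to-one.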


schechev-a commented on June 16, 2024

Am I correct that I have to add this manually to each rule?
Is it possible to do this by specifying somewhere in values.yaml, in one place ?

Because on Prometheus I have the same rules, and I receive every alert with a label cluster,
True, I have my own alertmanager on each cluster:)


Haleygo commented on June 16, 2024

Because on Prometheus I have the same rules, and I receive every alert with a label cluster,

I don't think that's true: MetricsQL and PromQL handle operators like sum by in the same way. For sum (xx) by (container, pod, namespace), only those labels remain. You can run the raw expr in both the prometheus ui and vmui to see that there is no difference.

Am I correct that I have to add this manually to each rule?
Is it possible to do this by specifying somewhere in values.yaml, in one place ?

I'm afraid there is no support for customized label in default rules.
You can add that like this example

helm-charts/charts/victoria-metrics-k8s-stack/templates/rules/k8s.rules.yaml

Lines 24 to 32 in 85bf10e

  - name: k8s.rules
    rules:
    - expr: |-
        sum by (cluster, namespace, pod, container) (
          irate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}[5m])
        ) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) (
          1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=""})
        )
      record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate

And since vm syncs its rules from kube-prometheus, we have the very same rules, like here.

Maybe we can add cluster as a default label in the future.


schechev-a commented on June 16, 2024

I'm sorry, but look
I'm making the same query in Prometheus and VmSelect

sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace)
  /
sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace)
  > ( 50 / 100 )

Result:
VMUI:

Alert:

Prometheus:

Alert:

Maybe somewhere the alertmanager adds these labels?
Or do I need to add relabeling?


Haleygo commented on June 16, 2024

You can see no cluster label here too.

Maybe somewhere the alertmanager adds these labels?

Can you share the configuration of your prometheus->alertmanager?

Or do I need to add relabeling?

No. The problem is that when you alert on an expr like sum (xx) by (container, pod, namespace), the result is expected to lose other labels such as cluster, so you need to modify the expr to preserve the labels you need.


schechev-a commented on June 16, 2024

Alertmanager configuration:

alertmanager:
  alertmanagerSpec:
    podAntiAffinity: "hard"
    podAntiAffinityTopologyKey: #
    replicas: 2
  enabled: true
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: #
      alb.ingress.kubernetes.io/scheme: #
      alb.ingress.kubernetes.io/target-type: #
      alb.ingress.kubernetes.io/listen-ports: #
      alb.ingress.kubernetes.io/ssl-redirect: #
      alb.ingress.kubernetes.io/group.name: #
    hosts:
      - alertmanager
    paths:
      path: #
    pathType: Prefix


  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 24h
      receiver: 'oncall_eks_alerts'
      routes:
      - match:
          alertname: Watchdog
        receiver: 'null'
      - match:
          alertname: InfoInhibitor
        receiver: 'null'

    

    templates:
    - '/etc/alertmanager/config/*.tmpl'

    receivers:
    - name: 'null'
    - name: 'oncall_eks_alerts'
      webhook_configs:
        - send_resolved: true
          url: #

  templateFiles:
    template_1.tmpl: |-

      {{ define "slack.mitgo.text" }}
          {{ with index .Alerts 0 -}}
            :chart_with_upwards_trend: *<{{ .GeneratorURL }}|Graph>*
            {{- if .Annotations.runbook }}   :notebook: *<{{ .Annotations.runbook }}|Runbook>*{{ end }}
          {{ end }}
          *Alert details*:
          {{ range .Alerts -}}
            *Alert:* {{ .Annotations.title }}{{ if .Labels.severity }} - `{{ .Labels.severity }}`{{ end }}
          *Description:* {{ .Annotations.description }}
          *Details:*
            {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
            {{ end }}
          {{ end }}
      {{ end }}

      {{ define "slack.mitgo.title" }}
          [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }} for {{ .CommonLabels.job }}
          {{- if gt (len .CommonLabels) (len .GroupLabels) -}}
            {{" "}}(
            {{- with .CommonLabels.Remove .GroupLabels.Names }}
              {{- range $index, $label := .SortedPairs -}}
                {{ if $index }}, {{ end }}
                {{- $label.Name }}="{{ $label.Value -}}"
              {{- end }}
            {{- end -}}
            )
          {{- end }}
      {{ end }}


schechev-a commented on June 16, 2024

To better understand:
Cluster: Prometheus -> Alertmanager(on the same cluster)
-> remoteWrite-> vm-storage


Haleygo commented on June 16, 2024

Can you share the configuration of your prometheus->alertmanager?

I don't think that alertmanager can add the cluster label to alerts, since it shouldn't change the data.

Can you check the cpuThrottlingHigh rule's expr and its alert labels directly in the prometheus ui?


schechev-a commented on June 16, 2024

Yes, no cluster label


Haleygo commented on June 16, 2024

Right.

Prometheus -> Alertmanager(on the same cluster)

So how did you configure prometheus to send alerts to alertmanager? Is there any other component that knows about this cluster label?


schechev-a commented on June 16, 2024

Sorry, I don’t even know where to look for this anymore :)


Haleygo commented on June 16, 2024

So I think there are two problems here:

The rules' exprs drop the cluster label that you want to preserve. To fix that, you need to change the rules as explained in #709 (comment).
The Cluster: Prometheus -> Alertmanager(on the same cluster) pipeline should be checked, since we saw that the cpuThrottlingHigh rule drops the cluster label on the prometheus side, yet the notification gets it somehow.
To debug that, you can trace the alert flow from prometheus to alertmanager. Please refer to https://prometheus.io/docs/prometheus/latest/configuration/configuration/#alertmanager_config and https://medium.com/devops-dudes/prometheus-alerting-with-alertmanager-e1bbba8e6a8e.


hagen1778 commented on June 16, 2024

@schechev-a do you have external_label set in your Prometheus config?

  # The labels to add to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    [ <labelname>: <labelvalue> ... ]

https://prometheus.io/docs/prometheus/latest/configuration/configuration/#configuration-file

If yes, then Prom unconditionally adds the cluster label to everything it emits, including alerts.
You can do the same in vmalert, but since vmalert has access to 2 distinct clusters, it makes no sense there. The cluster label should be respected by the alerting expression instead, e.g. sum by(cluster) in each expression.
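
For reference, a minimal sketch of what such a per-cluster Prometheus setting usually looks like. The label value is an assumption; each cluster's Prometheus would carry its own:

```yaml
# prometheus.yml of the Prometheus running inside one cluster
global:
  external_labels:
    # assumed value; attached to every alert sent to Alertmanager
    # and to every series sent over remote write
    cluster: cluster-1
```

With this in place, a rule that aggregates with by (container, pod, namespace) still produces alerts carrying cluster, which would explain the difference seen in the notifications. The vmalert counterpart is the `-external.label` command-line flag, which likewise stamps one fixed value on all alerts.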

Alternatively, you can have two alerting groups: one for cluster-1 and second for cluster-2. And each group will apply additional label filtering on query time to evaluate expressions against a specific cluster only. Let me know if you'd like to know more details.
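
A hedged sketch of that two-group layout, written as a plain vmalert rules file. It relies on the group-level `params` option, which passes extra URL query args with every rule query, and on the `extra_label` query arg. The group names and cluster values are assumptions based on the labels mentioned earlier in the thread:

```yaml
groups:
  - name: node-alerts-cluster-1              # hypothetical group name
    params:
      extra_label: ["cluster=vm-agent-1"]    # every query in this group only sees cluster 1
    rules:
      - alert: Node-High-CPULoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          cluster: vm-agent-1                # static label, since by(instance) drops cluster

  - name: node-alerts-cluster-2              # hypothetical group name
    params:
      extra_label: ["cluster=vm-agent-2"]
    rules:
      - alert: Node-High-CPULoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          cluster: vm-agent-2
```

The expressions stay untouched; the trade-off is one copy of each group per cluster.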


schechev-a commented on June 16, 2024

Hello @hagen1778,
I'm sorry, I just saw your messages
We decided that we would make custom rules and add the cluster name where necessary

One more question: if, for example, I have 100 clusters, will this have any effect on the load on the Victoria storage, select, insert, vmalert and alertmanager? If I have one alert group, or should I create an alert group for each of the 100 clusters?


Haleygo commented on June 16, 2024

If I have one alert group, or should I create an alert group for each of the 100 clusters?

@schechev-a If you have that many clusters with the same alerting needs, I'd suggest using one alert group with one alerting rule. The differences are mostly in the number of requests:

vmalert sends a query request to vmcluster (vmselect -> vmstorage) for each rule in each group, so 100 alert groups create 100 query requests, which puts more load on the datasource;
vmalert also sends generated metrics (to remoteWrite.url) and alert messages (to notifiers) grouped by alert group;
one group is easier to manage.

So using one alerting rule is the most resource-efficient option. You can also propagate the cluster label to the rule's labels or annotations if needed.
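
One way this could look, as a minimal sketch in a plain vmalert rules file: a single rule covers all clusters, and the annotation reads the cluster label from the query result. The group name and annotation wording are assumptions:

```yaml
groups:
  - name: node-alerts                        # hypothetical group name
    rules:
      - alert: Node-High-LoadAverage
        # without (cpu, mode) keeps cluster, so each firing alert carries it
        expr: node_load5 > (count without (cpu, mode) (node_cpu_seconds_total{mode="system"})) * 5
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "High load average on {{ $labels.instance }} in cluster {{ $labels.cluster }}"
```

When the rule lives inside a Helm template, the braces need escaping, as in the custom rule earlier in the thread ({{ "{{" }} $labels.instance }}).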


schechev-a commented on June 16, 2024

@Haleygo Thank you!


schechev-a commented on June 16, 2024

Maybe you know best practice?
If I have an alert group for 100 clusters, and one alert rule for them
But for example, I want to exclude 10 clusters from the general alert rule.
So far I have only found this solution:
node_load5 > (count without (cpu, mode) (node_cpu_seconds_total{mode="system", cluster_name!~"cluster_label_name", cluster_name!~"cluster_label_name_2"})) * 5

But I can’t specify all 10 clusters in the expression


hagen1778 commented on June 16, 2024

But I can’t specify all 10 clusters in the expression

Why? You can use the following expr: my_metric{cluster!~"(cluster1|cluster2|clusterN)"}.
Alternatively, on the group level you can specify extra_filters param, so it will be applied to all expressions within this group:
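
The original snippet is not preserved in this copy of the thread. The following is only a sketch of how such a group could look, assuming the filter is passed through the group's `params` as the `extra_filters[]` query arg; the group name and cluster names are placeholders:

```yaml
groups:
  - name: node-alerts-most-clusters          # hypothetical group name
    params:
      # assumed spelling of the key; the selector is applied to every
      # expression evaluated within this group
      "extra_filters[]": ['{cluster!~"cluster1|cluster2|clusterN"}']
    rules:
      - alert: Node-High-LoadAverage
        expr: node_load5 > (count without (cpu, mode) (node_cpu_seconds_total{mode="system"})) * 5
        for: 30m
        labels:
          severity: critical
```

This keeps the exclusion list in one place instead of repeating the regex in each expr.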

