Comments (28)

schechev-a avatar schechev-a commented on June 16, 2024 1

Thank you very much, you helped a lot!

from helm-charts.

Haleygo avatar Haleygo commented on June 16, 2024

Hello @schechev-a
Can you see the cluster label when you query cluster2's vmselect directly?

schechev-a avatar schechev-a commented on June 16, 2024

Hello Haleygo

For example query:
sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace)
/
sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace) > ( 50 / 100 )

I get:

{container="cron",namespace="production-name",pod="name_pod"}
avg:0.96115, min:0.94102, max:0.97421, last:0.96542

I did it in:

image

If I add cluster to the by clause, (container, pod, namespace, cluster), I do see this label afterwards.

Which of these settings should I keep so that the cluster label is set?
vmagent.spec:
  externalLabels:
    cluster: vm-agent-1{2,3}
  extraArgs:
    remoteWrite.label: cluster=vm-agent-1{2,3}
  remoteWriteSettings:
    label:
      cluster: vm-agent-1{2,3}

Haleygo avatar Haleygo commented on June 16, 2024
externalLabels:
  cluster: vm-agent-1{2,3}

should be enough.

Can you share one of your alerting rules? The raw series do carry the cluster label, so do your rules also contain a clause like by (container, pod, namespace)? That clause removes every label except those three.

schechev-a avatar schechev-a commented on June 16, 2024

I'm using rules from helm-chart victoriametrics-stack
https://github.com/VictoriaMetrics/helm-charts/blob/master/charts/victoria-metrics-k8s-stack/templates/rules/kubernetes-resources.yaml

schechev-a avatar schechev-a commented on June 16, 2024

I also added a custom rule, and its alerts have no cluster label either:

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  namespace: {{ .Release.Namespace }}
  name: kube-additional-rules
  labels:
    app: {{ include "victoria-metrics-k8s-stack.name" $ }}
spec:
  groups:
  - name: kubernetes-metrics-cpu
    rules:
      - alert: Node-High-LoadAverage
        expr: node_load5 > (count without (cpu, mode) (node_cpu_seconds_total{mode="system"})) * 5
        for: 30m
        labels:
          severity: critical
        annotations:
          #description: Host high CPU load-average (instance {{ "{{" }} $labels.instance }})
          #summary: Host high CPU load (instance {{ "{{" }} $labels.instance }})
      - alert: Node-High-CPULoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          #summary: 'Host high CPU load (instance {{ "{{" }} $labels.instance }})'
          description: CPU load is > 80%\n 
      - alert: Node-Memory-Usage
        expr: (((node_memory_MemTotal-node_memory_MemFree-node_memory_Cached)/(node_memory_MemTotal)*100)) > 75
        for: 2m
        labels:
          severity: critical
        annotations:
          #summary: High memory usage detected"
          #description: "Host memory usage (instance {{ "{{" }} $labels.instance }})"

Haleygo avatar Haleygo commented on June 16, 2024

This looks fine to me. Can you check the vmalert UI to see whether the alerts got the right labels?
You can open http://{{vmalert-addr}}/vmalert/alerts and check:
image

schechev-a avatar schechev-a commented on June 16, 2024

No labels

vmalert_alerts

Haleygo avatar Haleygo commented on June 16, 2024

No labels

vmalert_alerts

As I mentioned here, the expr contains a by (xx.xx.xx) clause, which drops all other labels, such as cluster.
Can you check other rules, like your custom rule

      - alert: Node-Memory-Usage
        expr: (((node_memory_MemTotal-node_memory_MemFree-node_memory_Cached)/(node_memory_MemTotal)*100)) > 75
        for: 2m
        labels:
          severity: critical
        annotations:
          #summary: High memory usage detected"
          #description: "Host memory usage (instance {{ "{{" }} $labels.instance }})"

and see whether its alerts have the cluster label.
If not, try querying node_memory_MemTotal directly in vmui or Grafana to see whether the series has the cluster label.
If it does not, node_memory_MemTotal could be produced by a recording rule; in that case, check its expr to see where the label gets dropped.

schechev-a avatar schechev-a commented on June 16, 2024

I get no data for this query:
image

And when I run other queries without by (X.X.X.X), no data is shown to me

Haleygo avatar Haleygo commented on June 16, 2024

Let me try to explain this from the beginning:
vmagent_2 -> vmcluster -> vmalert -> vmalertmanager -> oncall

  1. A vmagent with external labels configured adds the cluster label to all metrics before sending them to vmcluster. You can verify this by querying up; the result should carry the cluster label.
  2. The metrics are stored in vmcluster, and vmalert evaluates each rule's expr against them. The query result is what becomes the alert message you receive from Alertmanager, so if it has no cluster label, the expr most likely drops it. For example, the expr sum(up) by (job) drops every label except job.

So if you want the cluster label in the final result (the alert messages), you need to make sure of two things:

  1. The raw metrics you query carry the cluster label themselves. You can check this by querying them directly, for example up or container_cpu_cfs_xxx.
  2. The expr does not remove the labels you need.
    If it does, modify it by adding cluster to the by clause, like sum(increase(container_cpu_cfs_periods_total{}[5m])) by (cluster,container, pod, namespace)
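
As a rough sketch of that fix, here is the throttling query from earlier in this thread written as a vmalert rule group with cluster added to both by clauses. The group name, alert name, for duration and severity below are assumptions for illustration only, not values taken from the chart; the part that matters is the two by (...) clauses.

```yaml
# Plain vmalert rule-file format (not a Helm template).
groups:
  - name: cpu-throttling-example        # hypothetical group name
    rules:
      - alert: CPUThrottlingHighExample # hypothetical alert name
        expr: |-
          sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (cluster, container, pod, namespace)
            /
          sum(increase(container_cpu_cfs_periods_total{}[5m])) by (cluster, container, pod, namespace)
            > ( 50 / 100 )
        for: 15m                        # assumed value
        labels:
          severity: info                # assumed value
```

Because cluster is now part of both aggregations, the division matches series per cluster and the resulting alert should carry the cluster label.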

schechev-a avatar schechev-a commented on June 16, 2024

Am I correct that I have to add this manually to each rule?
Is it possible to do this by specifying somewhere in values.yaml, in one place ?

Because on Prometheus I have the same rules, and I receive every alert with a label cluster,
True, I have my own alertmanager on each cluster:)

image

Haleygo avatar Haleygo commented on June 16, 2024

Because on Prometheus I have the same rules, and I receive every alert with a label cluster,

I don't think that's true: MetricsQL and PromQL handle aggregation operators like sum by the same way. For sum (xx) by (container, pod, namespace), only those three labels remain. You can run the raw expr in both the Prometheus UI and vmui to see that there is no difference.

Am I correct that I have to add this manually to each rule?
Is it possible to do this by specifying somewhere in values.yaml, in one place ?

I'm afraid the default rules do not support custom labels.
You can add the label yourself, as in this example:

- name: k8s.rules
  rules:
  - expr: |-
      sum by (cluster, namespace, pod, container) (
        irate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}[5m])
      ) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) (
        1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=""})
      )
    record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate

Since VictoriaMetrics syncs these rules from kube-prometheus, we have the very same rules, like here.

Maybe we can add cluster as a default label in the future.

schechev-a avatar schechev-a commented on June 16, 2024

I'm sorry, but look
I'm making the same query in Prometheus and VmSelect

sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace)
  /
sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace)
  > ( 50 / 100 )

Result:
VMUI:

vmui_query_1

Alert:

alert_vm_vm


Prometheus:

prom_query

Alert:
alert_prom

Maybe somewhere the alertmanager adds these labels?
Or do I need to add relabeling?

Haleygo avatar Haleygo commented on June 16, 2024

image
You can see no cluster label here too.

Maybe somewhere the alertmanager adds these labels?

Can you share the configuration of your prometheus->alertmanager?

Or do I need to add relabeling?

No. The problem is that when you alert on an expr like sum (xx) by (container, pod, namespace), the result is expected to lose all other labels, such as cluster. You need to modify the expr to preserve the labels you need.

schechev-a avatar schechev-a commented on June 16, 2024

Alertmanager configuration:

alertmanager:
  alertmanagerSpec:
    podAntiAffinity: "hard"
    podAntiAffinityTopologyKey: #
    replicas: 2
  enabled: true
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: #
      alb.ingress.kubernetes.io/scheme: #
      alb.ingress.kubernetes.io/target-type: #
      alb.ingress.kubernetes.io/listen-ports: #
      alb.ingress.kubernetes.io/ssl-redirect: #
      alb.ingress.kubernetes.io/group.name: #
    hosts:
      - alertmanager
    paths:
      path: #
    pathType: Prefix


  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 24h
      receiver: 'oncall_eks_alerts'
      routes:
      - match:
          alertname: Watchdog
        receiver: 'null'
      - match:
          alertname: InfoInhibitor
        receiver: 'null'

    

    templates:
    - '/etc/alertmanager/config/*.tmpl'

    receivers:
    - name: 'null'
    - name: 'oncall_eks_alerts'
      webhook_configs:
        - send_resolved: true
          url: #

  templateFiles:
    template_1.tmpl: |-

      {{ define "slack.mitgo.text" }}
          {{ with index .Alerts 0 -}}
            :chart_with_upwards_trend: *<{{ .GeneratorURL }}|Graph>*
            {{- if .Annotations.runbook }}   :notebook: *<{{ .Annotations.runbook }}|Runbook>*{{ end }}
          {{ end }}
          *Alert details*:
          {{ range .Alerts -}}
            *Alert:* {{ .Annotations.title }}{{ if .Labels.severity }} - `{{ .Labels.severity }}`{{ end }}
          *Description:* {{ .Annotations.description }}
          *Details:*
            {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
            {{ end }}
          {{ end }}
      {{ end }}

      {{ define "slack.mitgo.title" }}
          [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }} for {{ .CommonLabels.job }}
          {{- if gt (len .CommonLabels) (len .GroupLabels) -}}
            {{" "}}(
            {{- with .CommonLabels.Remove .GroupLabels.Names }}
              {{- range $index, $label := .SortedPairs -}}
                {{ if $index }}, {{ end }}
                {{- $label.Name }}="{{ $label.Value -}}"
              {{- end }}
            {{- end -}}
            )
          {{- end }}
      {{ end }}

schechev-a avatar schechev-a commented on June 16, 2024

To better understand:
Cluster: Prometheus -> Alertmanager(on the same cluster)
-> remoteWrite-> vm-storage

Haleygo avatar Haleygo commented on June 16, 2024

Can you share the configuration of your prometheus->alertmanager?

I don't think that alertmanager can add the cluster label to alerts, since it shouldn't change the data.

Can you check the cpuThrottlingHigh rule expr and its alert labels directly in the Prometheus UI?
image

schechev-a avatar schechev-a commented on June 16, 2024

Yes, no cluster label

CPU_throtling

Haleygo avatar Haleygo commented on June 16, 2024

Right.

Prometheus -> Alertmanager(on the same cluster)

So how did you configure Prometheus to send alerts to Alertmanager? Is there any other component that knows about this cluster label?

schechev-a avatar schechev-a commented on June 16, 2024

Sorry, I don’t even know where to look for this anymore :)

Haleygo avatar Haleygo commented on June 16, 2024

So I think there are two problems here:

  1. The rule exprs drop the cluster label that you want to preserve. To fix that, you need to change the rules as #709 (comment) explains.
  2. I think the Cluster: Prometheus -> Alertmanager(on the same cluster) pipeline should be checked: we saw that the specific rule cpuThrottlingHigh drops the cluster label on the Prometheus side, yet the notification gets it somehow.
    To debug that, you can trace the alert flow from Prometheus to Alertmanager. Please refer to https://prometheus.io/docs/prometheus/latest/configuration/configuration/#alertmanager_config and https://medium.com/devops-dudes/prometheus-alerting-with-alertmanager-e1bbba8e6a8e.

hagen1778 avatar hagen1778 commented on June 16, 2024

@schechev-a do you have external_labels set in your Prometheus config?

  # The labels to add to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    [ <labelname>: <labelvalue> ... ]

https://prometheus.io/docs/prometheus/latest/configuration/configuration/#configuration-file

If yes, then Prometheus unconditionally adds the cluster label to everything it emits, including alerts.
You can do the same in vmalert, but since vmalert has access to 2 distinct clusters, it makes no sense. Instead, the alerting expression should respect the cluster label, e.g. by using sum by(cluster) in each expression.

Alternatively, you can have two alerting groups: one for cluster-1 and a second for cluster-2. Each group then applies additional label filtering at query time so that its expressions are evaluated against one specific cluster only. Let me know if you'd like to know more details.
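
A minimal sketch of the two-group alternative, assuming the group-level params option of vmalert with the extra_label query argument. The group names and the cluster values cluster-1 and cluster-2 are placeholders; the rule itself is the Node-High-CPULoad rule posted earlier in the thread.

```yaml
# Plain vmalert rule-file format (not a Helm template).
groups:
  - name: node-cpu-cluster-1              # hypothetical group name
    params:
      # Every query in this group is restricted to cluster="cluster-1".
      extra_label: ["cluster=cluster-1"]
    rules:
      - alert: Node-High-CPULoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          cluster: cluster-1              # static label, since avg by(instance) drops cluster
  - name: node-cpu-cluster-2              # hypothetical group name
    params:
      extra_label: ["cluster=cluster-2"]
    rules:
      - alert: Node-High-CPULoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          cluster: cluster-2
```

The filter only narrows which series are queried; the aggregation still removes cluster from the result, which is why each rule sets it again as a static label.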

schechev-a avatar schechev-a commented on June 16, 2024

Hello @hagen1778,
I'm sorry, I just saw your messages
We decided that we would make custom rules and add the cluster name where necessary

One more question: if, for example, I have 100 clusters, will this affect the load on the VictoriaMetrics storage, select, insert, vmalert and alertmanager? Should I have one alert group, or should I create an alert group for each of the 100 clusters?

Haleygo avatar Haleygo commented on June 16, 2024

Should I have one alert group, or should I create an alert group for each of the 100 clusters?

@schechev-a If you have that many clusters with the same alerting needs, I'd suggest using one alert group with one alerting rule. The difference between the two approaches is mostly in the number of requests:

  1. vmalert sends a query request to vmcluster (vmselect -> vmstorage) for each rule in each group, so 100 alert groups create 100 query requests, which puts more load on the datasource;
  2. vmalert also sends the generated metrics (to remoteWrite.url) and alert messages (to notifiers) grouped by alert group;
  3. one group is easier to manage.

So using one alerting rule is the most resource-efficient option. You can also propagate the cluster label to the rule's labels or annotations if needed.
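
For illustration, a single rule covering all clusters might look like the sketch below. It reuses the Node-High-CPULoad rule from earlier in the thread, adds cluster to the by clause, and references the label in an annotation; the group name and the annotation wording are assumptions.

```yaml
# Plain vmalert rule-file format. Inside a Helm template, the {{ ... }}
# placeholders need escaping, as in the custom rule shown earlier.
groups:
  - name: node-cpu-all-clusters   # hypothetical group name
    rules:
      - alert: Node-High-CPULoad
        # cluster is kept in the aggregation, so each alert carries it as a label.
        expr: 100 - (avg by(cluster, instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          description: 'CPU load is > 80% on {{ $labels.instance }} in cluster {{ $labels.cluster }}'
```

One evaluation of this rule covers every cluster, and Alertmanager can still group or route by the cluster label.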

schechev-a avatar schechev-a commented on June 16, 2024

@Haleygo Thank you!

schechev-a avatar schechev-a commented on June 16, 2024

Maybe you know the best practice for this?
I have one alert group for 100 clusters, with one alert rule for all of them.
But, for example, I want to exclude 10 clusters from the general alert rule.
So far I have only found this solution:
node_load5 > (count without (cpu, mode) (node_cpu_seconds_total{mode="system", cluster_name!~"cluster_label_name", cluster_name!~"cluster_label_name_2"})) * 5

But I can’t specify all 10 clusters in the expression

hagen1778 avatar hagen1778 commented on June 16, 2024

But I can’t specify all 10 clusters in the expression

Why? You can use the following expr: my_metric{cluster!~"(cluster1|cluster2|clusterN)"}.
Alternatively, on the group level you can specify extra_filters param, so it will be applied to all expressions within this group:
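
The original example did not survive in this copy of the thread. As a hedged sketch of the idea only: a group-level params entry could carry the exclusion filter so the rule expressions stay unchanged. The exact key spelling (extra_filters here, possibly extra_filters[]) is an assumption to verify against the vmalert documentation, and the group name and cluster names are placeholders.

```yaml
# Plain vmalert rule-file format (not a Helm template).
groups:
  - name: node-load-excluding-some-clusters   # hypothetical group name
    params:
      # Assumed key; intended to apply this series filter to every query in the group.
      extra_filters: ['{cluster!~"cluster1|cluster2|clusterN"}']
    rules:
      - alert: Node-High-LoadAverage
        expr: node_load5 > (count without (cpu, mode) (node_cpu_seconds_total{mode="system"})) * 5
        for: 30m
        labels:
          severity: critical
```

With this layout, the list of excluded clusters lives in one place per group instead of being repeated in every expr.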
