Comments (28)
Thank you very much, you helped a lot!
from helm-charts.
Hello @schechev-a
Can you see the cluster label when you query cluster2's vmselect directly?
from helm-charts.
Hello Haleygo
For example query:
sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace)
/
sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace) > ( 50 / 100 )
I get:
{container="cron",namespace="production-name",pod="name_pod"}
avg:0.96115, min:0.94102, max:0.97421, last:0.96542
I did it in:
If I add cluster to the query, i.e. by (container, pod, namespace, cluster), the label then shows up.
Which of these settings should I keep so that the cluster label is present?
vmagent.spec:
  externalLabels:
    cluster: vm-agent-1{2,3}
  extraArgs:
    remoteWrite.label: cluster=vm-agent-1{2,3}
  remoteWriteSettings:
    label:
      cluster: vm-agent-1,{2,3}
from helm-charts.
externalLabels:
  cluster: vm-agent-1{2,3}
should be enough.
Can you share one of your alerting rules? Since the raw series do have the cluster label, do your rules also contain a clause like by (container, pod, namespace), which removes every label besides those three?
from helm-charts.
I'm using rules from helm-chart victoriametrics-stack
https://github.com/VictoriaMetrics/helm-charts/blob/master/charts/victoria-metrics-k8s-stack/templates/rules/kubernetes-resources.yaml
from helm-charts.
I also added a custom rule, and its alerts have no cluster label either:
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  namespace: {{ .Release.Namespace }}
  name: kube-additional-rules
  labels:
    app: {{ include "victoria-metrics-k8s-stack.name" $ }}
spec:
  groups:
    - name: kubernetes-metrics-cpu
      rules:
        - alert: Node-High-LoadAverage
          expr: node_load5 > (count without (cpu, mode) (node_cpu_seconds_total{mode="system"})) * 5
          for: 30m
          labels:
            severity: critical
          annotations:
            #description: Host high CPU load-average (instance {{ "{{" }} $labels.instance }})
            #summary: Host high CPU load (instance {{ "{{" }} $labels.instance }})
        - alert: Node-High-CPULoad
          expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            #summary: 'Host high CPU load (instance {{ "{{" }} $labels.instance }})'
            description: CPU load is > 80%\n
        - alert: Node-Memory-Usage
          expr: (((node_memory_MemTotal-node_memory_MemFree-node_memory_Cached)/(node_memory_MemTotal)*100)) > 75
          for: 2m
          labels:
            severity: critical
          annotations:
            #summary: High memory usage detected"
            #description: "Host memory usage (instance {{ "{{" }} $labels.instance }})"
from helm-charts.
Seems OK to me. Can you check the vmalert UI to see whether the alerts got the right label?
You can open http://{{vmalert-addr}}/vmalert/alerts and check.
from helm-charts.
No labels
from helm-charts.
No labels
As I mentioned here, the expr has a by (xx.xx.xx) clause, which drops all other labels, such as cluster.
Can you check other rules, like your custom rule
- alert: Node-Memory-Usage
  expr: (((node_memory_MemTotal-node_memory_MemFree-node_memory_Cached)/(node_memory_MemTotal)*100)) > 75
  for: 2m
  labels:
    severity: critical
  annotations:
    #summary: High memory usage detected"
    #description: "Host memory usage (instance {{ "{{" }} $labels.instance }})"
and see whether its alerts have the cluster label?
If not, try querying node_memory_MemTotal directly in vmui or Grafana to see whether it has the cluster label.
If it doesn't, node_memory_MemTotal could be a recording rule; you can check its expr to see where the label gets dropped.
from helm-charts.
I get no data for this query.
And if I run other queries without by (X.X.X.X), no data is shown to me either.
from helm-charts.
Let me try to explain this from the beginning:
vmagent_2 -> vmcluster -> vmalert -> vmalertmanager -> oncall
- vmagent with external_label adds the cluster label to all metrics and sends them to vmcluster. You can check this by querying up; you should see the cluster label.
- Metrics are stored in vmcluster, and vmalert evaluates each rule's expr against them. If the query result (which produces the alert messages you receive from alertmanager) has no cluster label, the expr is most likely dropping it. For example, the expr sum(up) by (job) drops every label except job.
So if you want the cluster label in the final result (the alert messages), you need to make sure of two things:
- the raw metrics you query have the cluster label themselves; you can check this by querying them directly, e.g. up or container_cpu_cfs_xxx
- the expr doesn't remove the label you need.
If it does, you need to modify it by adding cluster to the by clause, e.g. sum(increase(container_cpu_cfs_periods_total{}[5m])) by (cluster,container, pod, namespace)
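As a rough sketch of that change, a custom rule could carry cluster through the aggregation as shown here. The rule name, group name, `for` duration and severity are hypothetical; the expression is the throttling query from earlier in this thread with cluster added to both by clauses, and it assumes the raw series already carry a cluster label.

```yaml
# Hypothetical VMRule; metadata and rule names are for illustration only.
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: cpu-throttling-with-cluster
spec:
  groups:
    - name: cpu-throttling
      rules:
        - alert: CPUThrottlingHighWithCluster
          # cluster is listed in both by() clauses, so it survives the
          # aggregation and is present on the resulting alert.
          expr: |
            sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (cluster, container, pod, namespace)
              /
            sum(increase(container_cpu_cfs_periods_total{}[5m])) by (cluster, container, pod, namespace)
              > (50 / 100)
          for: 15m            # assumed value
          labels:
            severity: warning # assumed value
```

Both sides of the division must list the same labels in by(), otherwise the two result sets no longer match one-to-one.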
from helm-charts.
Am I correct that I have to add this manually to each rule?
Is it possible to do this by specifying somewhere in values.yaml, in one place ?
Because on Prometheus I have the same rules, and I receive every alert with a label cluster,
True, I have my own alertmanager on each cluster:)
from helm-charts.
Because on Prometheus I have the same rules, and I receive every alert with a label cluster,
I don't think that's true. Both MetricsQL and PromQL handle aggregations like sum by the same way: for sum (xx) by (container, pod, namespace), only those labels remain. You can run the raw expr in the Prometheus UI and in vmui to see that there is no difference.
Am I correct that I have to add this manually to each rule?
Is it possible to do this by specifying somewhere in values.yaml, in one place ?
I'm afraid the default rules don't support custom labels.
You can add the label as in this example.
And since VM syncs its rules from kube-prometheus, we have the very same rules, like here.
Maybe we can add cluster as a default label in the future.
from helm-charts.
I'm sorry, but look
I'm making the same query in Prometheus and VmSelect
sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace)
/
sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace)
> ( 50 / 100 )
Result:
VMUI:
Alert:
Prometheus:
Maybe somewhere the alertmanager adds these labels?
Or do I need to add relabeling?
from helm-charts.
You can see there is no cluster label here either.
Maybe somewhere the alertmanager adds these labels?
Can you share the configuration of your prometheus->alertmanager?
Or do I need to add relabeling?
No, the problem is that when you alert on an expr like sum (xx) by (container, pod, namespace), the result is expected to lose other labels such as cluster, so you need to modify the expr to preserve the labels you need.
from helm-charts.
Alertmanager configuration:
alertmanager:
  alertmanagerSpec:
    podAntiAffinity: "hard"
    podAntiAffinityTopologyKey: #
    replicas: 2
  enabled: true
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: #
      alb.ingress.kubernetes.io/scheme: #
      alb.ingress.kubernetes.io/target-type: #
      alb.ingress.kubernetes.io/listen-ports: #
      alb.ingress.kubernetes.io/ssl-redirect: #
      alb.ingress.kubernetes.io/group.name: #
    hosts:
      - alertmanager
    paths:
      path: #
      pathType: Prefix
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 24h
      receiver: 'oncall_eks_alerts'
      routes:
        - match:
            alertname: Watchdog
          receiver: 'null'
        - match:
            alertname: InfoInhibitor
          receiver: 'null'
    templates:
      - '/etc/alertmanager/config/*.tmpl'
    receivers:
      - name: 'null'
      - name: 'oncall_eks_alerts'
        webhook_configs:
          - send_resolved: true
            url: #
  templateFiles:
    template_1.tmpl: |-
      {{ define "slack.mitgo.text" }}
      {{ with index .Alerts 0 -}}
      :chart_with_upwards_trend: *<{{ .GeneratorURL }}|Graph>*
      {{- if .Annotations.runbook }} :notebook: *<{{ .Annotations.runbook }}|Runbook>*{{ end }}
      {{ end }}
      *Alert details*:
      {{ range .Alerts -}}
      *Alert:* {{ .Annotations.title }}{{ if .Labels.severity }} - `{{ .Labels.severity }}`{{ end }}
      *Description:* {{ .Annotations.description }}
      *Details:*
      {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
      {{ end }}
      {{ end }}
      {{ end }}
      {{ define "slack.mitgo.title" }}
      [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }} for {{ .CommonLabels.job }}
      {{- if gt (len .CommonLabels) (len .GroupLabels) -}}
      {{" "}}(
      {{- with .CommonLabels.Remove .GroupLabels.Names }}
      {{- range $index, $label := .SortedPairs -}}
      {{ if $index }}, {{ end }}
      {{- $label.Name }}="{{ $label.Value -}}"
      {{- end }}
      {{- end -}}
      )
      {{- end }}
      {{ end }}
from helm-charts.
To better understand:
Cluster: Prometheus -> Alertmanager(on the same cluster)
-> remoteWrite-> vm-storage
from helm-charts.
Can you share the configuration of your prometheus->alertmanager?
I don't think alertmanager can add the cluster label to alerts, since it shouldn't change the data.
Can you check the cpuThrottlingHigh rule's expr and the labels of its alerts directly in the Prometheus UI?
from helm-charts.
Yes, no cluster label
from helm-charts.
Right.
Prometheus -> Alertmanager(on the same cluster)
So how did you configure Prometheus to send alerts to alertmanager? Is there any other component that knows about this cluster label?
from helm-charts.
Sorry, I don’t even know where to watch this anymore :)
from helm-charts.
So I think there are two problems here:
- The rule exprs drop the cluster label that you want to preserve. To fix that, you need to change the rules as explained in #709 (comment).
- I think the Cluster: Prometheus -> Alertmanager(on the same cluster) pipeline should be checked: we saw that the specific rule cpuThrottlingHigh must drop the cluster label on the Prometheus side, yet the notification gets it somehow.
To debug that, you can check the alert flow from Prometheus to alertmanager. Please refer to https://prometheus.io/docs/prometheus/latest/configuration/configuration/#alertmanager_config and https://medium.com/devops-dudes/prometheus-alerting-with-alertmanager-e1bbba8e6a8e.
from helm-charts.
@schechev-a do you have external_labels set in your Prometheus config?
# The labels to add to any time series or alerts when communicating with
# external systems (federation, remote storage, Alertmanager).
external_labels:
  [ <labelname>: <labelvalue> ... ]
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#configuration-file
If so, Prometheus unconditionally adds the cluster label to everything it emits, including alerts.
You can do the same in vmalert, but since vmalert has access to 2 distinct clusters, it makes no sense. The alerting expression itself should respect the cluster label, e.g. by using sum by(cluster) in each expression.
Alternatively, you can have two alerting groups: one for cluster-1 and second for cluster-2. And each group will apply additional label filtering on query time to evaluate expressions against a specific cluster only. Let me know if you'd like to know more details.
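A minimal sketch of the two-group variant, written as a plain vmalert rules file. It assumes the group-level `params` option with `extra_label` as described in the vmalert docs; the group names and the cluster values `vm-agent-1` and `vm-agent-2` are placeholders, and the expression is the Node-High-CPULoad rule from earlier in the thread. In a VMRule the same groups would presumably sit under spec.groups.

```yaml
# Sketch only: group names and cluster values are assumptions.
groups:
  - name: node-alerts-cluster-1
    params:
      # Sent with every query of this group, restricting it to one cluster.
      extra_label: ["cluster=vm-agent-1"]
    rules:
      - alert: Node-High-CPULoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          # The expr drops cluster, so it is set statically per group.
          cluster: vm-agent-1
  - name: node-alerts-cluster-2
    params:
      extra_label: ["cluster=vm-agent-2"]
    rules:
      - alert: Node-High-CPULoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
          cluster: vm-agent-2
```

With this layout the expressions stay unchanged from the upstream rules; the trade-off is one copy of each group per cluster.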
from helm-charts.
Hello @hagen1778,
I'm sorry, I just saw your messages
We decided that we would make custom rules and add the cluster name where necessary
One more question: if, for example, I have 100 clusters, will this have any effect on the load on the Victoria storage, select, insert, vmalert and alertmanager? If I have one alert group, or should I create an alert group for each of the 100 clusters?
from helm-charts.
If I have one alert group, or should I create an alert group for each of the 100 clusters?
@schechev-a If you have that many clusters with the same alerting needs, I'd suggest using one alert group with one alerting rule. The differences between the two setups are mostly in the number of requests:
- vmalert makes a query request to vmcluster [vmselect->vmstorage] for each rule in each group, so 100 alert groups create 100 query requests, which puts more load on the datasource;
- vmalert also sends generated metrics [to remoteWrite.url] and alert messages [to notifiers] grouped by alert group;
- one group is easier to manage.
So using one alerting rule is the most resource-efficient option. And you can propagate the cluster label to the rule's labels or annotations if needed.
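A sketch of what one shared rule with a propagated cluster label might look like, in plain vmalert rules-file form. The group name and the summary wording are made up for illustration; the expression is the Node-High-CPULoad rule from this thread with cluster added to the by clause.

```yaml
# Sketch only: one group and one rule covering all clusters.
groups:
  - name: node-alerts
    rules:
      - alert: Node-High-CPULoad
        # Keeping cluster in by() yields one alert per (cluster, instance)
        # and leaves the label on the alert itself.
        expr: 100 - (avg by(cluster, instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          # The label can also be referenced from annotation templates.
          summary: 'Host high CPU load (cluster {{ $labels.cluster }}, instance {{ $labels.instance }})'
```

When such a rule is shipped through a Helm chart, the template braces need the same {{ "{{" }} escaping used in the custom rule earlier in this thread.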
from helm-charts.
@Haleygo Thank you!
from helm-charts.
Maybe you know a best practice for this?
Say I have one alert group for 100 clusters, with one alert rule for all of them,
but I want to exclude 10 clusters from that general alert rule.
So far I have only found this solution:
node_load5 > (count without (cpu, mode) (node_cpu_seconds_total{mode="system", cluster_name!~"cluster_label_name", cluster_name!~"cluster_label_name_2"})) * 5
But I can’t specify all 10 clusters in the expression
from helm-charts.
But I can’t specify all 10 clusters in the expression
Why? You can use the following expr: my_metric{cluster!~"(cluster1|cluster2|clusterN)"}.
Alternatively, you can specify the extra_filters param at the group level, so that it is applied to all expressions within this group:
- see https://docs.victoriametrics.com/#prometheus-querying-api-enhancements
- https://docs.victoriametrics.com/vmalert.html#groups
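A hedged sketch of the group-level variant, assuming the group's `params` are passed through as query args and that the datasource accepts the `extra_filters[]` arg described in the first link. The exact key spelling should be checked against the docs for your version; the group name and cluster names are placeholders.

```yaml
# Sketch only: exclude some clusters from every rule in the group.
groups:
  - name: node-alerts-most-clusters
    params:
      # Series selector applied to all queries issued by this group.
      "extra_filters[]": ['{cluster!~"cluster1|cluster2|clusterN"}']
    rules:
      - alert: Node-High-LoadAverage
        expr: node_load5 > (count without (cpu, mode) (node_cpu_seconds_total{mode="system"})) * 5
        for: 30m
        labels:
          severity: critical
```

This keeps the exclusion list in one place instead of repeating the regex in every expression of the group.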
from helm-charts.