dotdc / grafana-dashboards-kubernetes
A set of modern Grafana dashboards for Kubernetes.
License: Apache License 2.0
I have some clusters with Windows nodes enabled. I would like to ask if I can add Windows support, or do you think it is out of scope here?
Unlike kubernetes-mixin, which has a separate dashboard, I would like to add the Windows queries into the existing ones. That's possible by using queries combined with OR, e.g.:
sum(container_memory_working_set_bytes{cluster="$cluster",namespace=~"$namespace", image!="", pod=~"${created_by}.*"}) by (pod)
OR
<WINDOWS Query>
Since I'm running hybrid clusters with multiple operating systems, I would like to submit PRs for Windows pods here. I'm not expecting the maintainers to provide support for Windows, but before starting to work on this, I would like to know whether it would be accepted.
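For illustration, such a combined query might look like the sketch below. The Windows metric name and its labels are assumptions based on windows_exporter's container collector and may differ in a given setup:

# Linux pods (cAdvisor)
sum(container_memory_working_set_bytes{cluster="$cluster", namespace=~"$namespace", image!="", pod=~"${created_by}.*"}) by (pod)
or
# Windows pods (hypothetical metric from windows_exporter; verify the name and labels in your cluster)
sum(windows_container_memory_usage_private_working_set_bytes{cluster="$cluster", namespace=~"$namespace", pod=~"${created_by}.*"}) by (pod)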
Beautiful dashboards. Some of the panels show no data, and I've seen this before (Kubernetes Lens). Reviewing the JSON, the queries reference labels that are not included in the cAdvisor metrics I have. For example, in your Global dashboard:
When I look at CPU Utilization by namespace and inspect the query, it is based on container_cpu_usage_seconds_total. In my Prometheus, that metric does not have an image label; here is a random series from the top of the query results:
container_cpu_usage_seconds_total{cpu="total", endpoint="https-metrics", id="/kubepods/besteffort/pod03202a32-75a1-4a64-8692-1e73fd26eca3", instance="192.168.10.217:10250", job="kubelet", metrics_path="/metrics/cadvisor", namespace="democratic-csi", node="k3s03", pod="democratic-csi-nfs-node-sqxp9", service="kube-prometheus-stack-kubelet"}
I'm using K3s, based on Kubernetes 1.23, on bare metal with containerd, no Docker runtime. I have no idea if this comes from containerd, the kubelet, or cAdvisor, or is just expected as part of life when you don't use the Docker runtime.
If you have any suggestions, it would be much appreciated.
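One quick way to confirm whether the image label is missing (a diagnostic sketch, not part of the dashboards): if the first query returns a count and the second returns nothing, every panel filtering on image!="" will be empty.

# count all cAdvisor CPU series, then only those carrying an image label
count(container_cpu_usage_seconds_total)
count(container_cpu_usage_seconds_total{image!=""})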
Grafana shows an exclamation mark on the variable job.
I've opened a new issue because this one is not about a k3s environment but k8s.
I see some metrics missing, probably because my installation is incomplete.
I've deployed the k8s cluster with two master and three worker nodes. Grafana and Prometheus are deployed with "almost" the default settings.
i5Js@nanoserver:~/K3s/K8s/grafana/grafana-dashboards-kubernetes/dashboards$ k get svc -n grafana
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
grafana ClusterIP <ip> <none> 80/TCP 18h
i5Js@nanoserver:~/K3s/K8s/grafana/grafana-dashboards-kubernetes/dashboards$ k get svc -n prometheus
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
prometheus-alertmanager ClusterIP <ip> <none> 80/TCP 21h
prometheus-kube-state-metrics ClusterIP <ip> <none> 8080/TCP 21h
prometheus-node-exporter ClusterIP <ip> <none> 9100/TCP 21h
prometheus-pushgateway ClusterIP <ip> <none> 9091/TCP 21h
prometheus-server ClusterIP <ip> <none> 80/TCP 21h
I've created the datasource using the prometheus-server IP, and some of the metrics work and some don't:
I'm fairly sure those issues are caused by my environment, because I can see that your dashboards work fine elsewhere, but can you help me troubleshoot?
Thanks,
Currently there are views for:
It would be nice to have a view showing the status of the deployments (number of replicas, ...)
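For reference, a sketch of the kube-state-metrics queries such a view could build on (the panel layout and the $namespace variable are assumptions):

kube_deployment_spec_replicas{namespace=~"$namespace"}
kube_deployment_status_replicas_available{namespace=~"$namespace"}
kube_deployment_status_replicas_unavailable{namespace=~"$namespace"}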
I tested the latest changes, and they are still not right...
Panel CPU Utilization by Node: "expr": "avg by (node) (1-rate(node_cpu_seconds_total{mode=\"idle\"}[$__rate_interval]))",
yields:
This seems to be the total of all nodes? It is not picking up the individual nodes. It should look like:
Panel CPU Utilization by namespace is still dark and uses the old metric: "expr": "sum(rate(container_cpu_usage_seconds_total{image!=\"\"}[$__rate_interval])) by (namespace)",
I did try something like the above, "avg by (namespace) (1-rate(node_cpu_seconds_total{mode=\"idle\"}[$__rate_interval]))",
but that is not right; I only got one namespace listed:
Both Memory Utilization panels, based on container_memory_working_set_bytes, are still dark when I use your unmodified files.
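Two notes that may help here, both hedged guesses from the metrics involved: node_cpu_seconds_total comes from node-exporter and carries no namespace label, so grouping it by namespace cannot work; and if the cAdvisor series lack the image label (as reported earlier for a K3s setup), the image!="" matcher blanks the container panels. A sketch of a per-namespace CPU query without that matcher, restricted to pod cgroups instead:

# assumes kubepods cgroup paths; adjust the id regex to your runtime's layout
sum by (namespace) (rate(container_cpu_usage_seconds_total{id=~"/kubepods.*"}[$__rate_interval]))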
Hi,
I'm using these dashboards with Grafana's light theme (easier on my eyes), and some panels are not displaying properly, e.g.:
This can be fixed by setting the color mode of the panel to None instead of Value.
Kubernetes / Views / Global panel: the text/values should be readable even with the light theme.
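For a stat panel, this corresponds to the colorMode option in the panel JSON, e.g. this fragment (a sketch; "value" is the current setting being replaced):

"options": {
  "colorMode": "none"
}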
Hi,
I'm trying to add the dashboards, but I'm getting this error message for all of them:
"Error updating options: Unexpected token p in JSON at position 4"
Could you please help me?
Many thanks
Some panels use node to filter, and others use a hidden instance variable (label_values(node_uname_info{nodename=~"(?i:($node))"}, instance)). If a node changes its IP, some panels will look normal and others will be missing data.
It should probably show all instances of a node.
RAM Usage Request Gauge
My understanding of requests is that actual usage should closely match them. Being at 90% of the request is not a bad condition; it is a good one. I think GREEN should be +/- 20% of the request value, YELLOW the 20% beyond that on either side, and the rest RED, since being significantly under or over the request is not ideal. As it is now, if you estimate the request perfectly, the gauge shows RED like an error condition, and that is not the case. Only the LIMIT gauge should behave like this (as you get OOM-killed).
I think that is wrong; being stable at 90% of the request should get me a gold star :)
I'm not sure if the CPU Request gauge needs that as well. If so, maybe its GREEN range should be wider?
Resource by container
Could you add the actual usage for CPU and memory between Request/Limits for each? That would be helpful to show where actual usage falls between the two values.
I think CPU Usage by container and Memory Usage by Container should be renamed to by pod, because if you select a pod with multiple containers, you do not get a graph with multiple plot lines, which you would expect if it were by container.
NOTE: I played with adding resource requests and limits as plot lines for CPU Usage by Container and Memory Usage by Container, and it looks good for pods with a single container. But once I selected a pod with multiple containers, and thus multiple requests/limits, it became a confusing mess. I don't have the Grafana skills to isolate them properly, but maybe you have some ideas to make that work right.
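For what it's worth, a truly per-container breakdown would look something like the sketch below (variable names are assumed to match the pod dashboard):

# one plot line per container in the selected pod
sum by (container) (rate(container_cpu_usage_seconds_total{namespace="$namespace", pod="$pod", container!="", image!=""}[$__rate_interval]))
sum by (container) (container_memory_working_set_bytes{namespace="$namespace", pod="$pod", container!="", image!=""})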
This is the way variables are configured in k8s-views-nodes.json:
...
node = label_values(kube_node_info, node)
instance = label_values(node_uname_info{nodename=~"(?i:($node))"}, instance)
In OKE, kube_node_info looks like this:
{__name__="kube_node_info", container="kube-state-metrics", container_runtime_version="cri-o://1.25.1-111.el7", endpoint="http", instance="10.244.0.40:8080", internal_ip="10.0.107.39", job="kube-state-metrics", kernel_version="5.4.17-2136.314.6.2.el7uek.x86_64", kubelet_version="v1.25.4", kubeproxy_version="v1.25.4", namespace="monitoring", node="10.0.107.39", os_image="Oracle Linux Server 7.9", pod="monitoring-kube-state-metrics-6fcd4d745c-txg2k", pod_cidr="10.244.1.0/25", provider_id="ocid1.instance.oc1.sa-saopaulo-1.xxx", service="monitoring-kube-state-metrics", system_uuid="d6462364-95bf-4122-a3ab-xxx"}
And node_uname_info looks like this:
node_uname_info{container="node-exporter", domainname="(none)", endpoint="http-metrics", instance="10.0.107.39:9100", job="node-exporter", machine="x86_64", namespace="monitoring", nodename="oke-cq2bxmvtqca-nsdfwre7l3a-seqv6owhq3a-0", pod="monitoring-prometheus-node-exporter-n6pzv", release="5.4.17-2136.314.6.2.el7uek.x86_64", service="monitoring-prometheus-node-exporter", sysname="Linux", version="#2 SMP Fri Dec 9 17:35:27 PST 2022"}
For this example, node=10.0.107.39, but when I query node_uname_info{nodename=~"(?i:($node))"}, it doesn't return anything, because nodename doesn't match the internal IP address of the node.
As a result, no node metrics are displayed.
Modifying the filter https://github.com/dotdc/grafana-dashboards-kubernetes/blob/master/dashboards/k8s-views-nodes.json#L3747-L3772 to use node_uname_info{instance="$node:9100"} fixes the issue.
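Applied to the variable definition itself, the workaround would look something like this (a sketch; the regex form and node-exporter's default port 9100 are assumptions):

instance = label_values(node_uname_info{instance=~"$node:.*"}, instance)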
The Trivy dashboard has been broken on our clusters since commit 4b52d9c.
Expected: the dashboard continues to work as it did on the commit before. Other dashboards don't seem to have this issue.
Is there any chance this is because of a missing cluster label on the Trivy metrics? Should we configure a specific setting to include this cluster label on the Trivy operator?
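One way to check that hypothesis (a sketch; trivy_image_vulnerabilities is one of the series exposed by trivy-operator, and a cluster label is normally added by the scrape or remote-write configuration rather than by the operator itself):

# empty cluster label values here would confirm the suspicion
count by (cluster) (trivy_image_vulnerabilities)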
3 of 4 of these gauges use the mean rather than the last non-null value. This can cause strangeness like incorrect reporting of current CPU requests and limits. They should also be consistent.
Current:
Last *:
They should probably use "Last *" rather than "Mean" for calculating the value.
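In the stat panel JSON, that corresponds to the reduce calculation, e.g. this fragment (a sketch; "mean" is the current setting being replaced):

"reduceOptions": {
  "calcs": ["lastNotNull"]
}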
The cumulative resource usage in the namespace seems to be 1.25 CPU and 2.5Gi (I changed the two graphs to stack), but it appears as 2.5 CPU and 5Gi respectively.
I imagine the queries need the label selector image!="".
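That would make sense: without image!="", cAdvisor's pod-level cgroup series (which carry an empty image label) are summed together with the per-container series, roughly doubling the totals. A sketch of the corrected query:

sum by (namespace) (rate(container_cpu_usage_seconds_total{image!=""}[$__rate_interval]))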
Thanks for very nice dashboards.
One thing missing is perhaps a "cluster" variable. When running multiple clusters, it is useful to limit the scope to a single cluster: a multi-select variable accepting All, with the queries adding cluster=~"$cluster".
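A sketch of what that could look like (assuming a cluster label is present on your metrics, e.g. added by your agent or remote-write configuration):

cluster = label_values(kube_node_info, cluster)
# and in each query, an additional matcher:
sum(kube_pod_info{cluster=~"$cluster"})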
Popup message in Grafana when opening the dashboards:
Templating
Failed to upgrade legacy queries: Datasource prometheus was not found
The previous version was working fine.
Install VictoriaMetrics as the Prometheus datasource and try to open the namespace dashboard.
Expected: the dashboards work correctly.
There are panels in the Trivy Operator dashboard which do not properly use the Prometheus data source variable.
The global Prometheus data source variable should be applied to all panels.
Here are the places I spotted where the Prometheus data source variable is not used:
https://github.com/dotdc/grafana-dashboards-kubernetes/blob/master/dashboards/k8s-addons-trivy-operator.json#L785
https://github.com/dotdc/grafana-dashboards-kubernetes/blob/master/dashboards/k8s-addons-trivy-operator.json#L882
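For reference, a panel wired to the data source variable would carry a fragment like this in its JSON (a sketch; the variable name ${DS_PROMETHEUS} is an assumption, so use whatever the dashboard's datasource variable is actually called):

"datasource": {
  "type": "prometheus",
  "uid": "${DS_PROMETHEUS}"
}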
Hi,
I'm preparing #79 and I'm having some trouble exporting the JSON file from a Grafana instance.
If I import a dashboard and export it again without any modifications, I get a lot of changes:
For example, this commit does not contain any functional changes, yet shows a lot of changes at the JSON level: jkroepke@706315b
That's how I export the JSON.
What is the recommended way? If the mentioned approach is the correct one, would it be possible to import and export all dashboards to keep my PR as clean as possible? Otherwise, I'd have tons of unrelated changes.
I am using the Trivy dashboard (grafana-dashboards-kubernetes/dashboards/trivy), but I am not getting any values for 'CVE vulnerabilities in All namespace(s)' and 'Other vulnerabilities in All namespace(s)'. I have enabled OPERATOR_METRICS_VULN_ID_ENABLED=true in my Trivy deployment, and I am using the latest versions of the Trivy operator and Prometheus. Could you please help?
1. Install the latest trivy-operator and try to use the Grafana dashboard.
Expected: the dashboard shows CVE values.
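A quick sanity check (a sketch; it assumes OPERATOR_METRICS_VULN_ID_ENABLED=true exposes the per-CVE trivy_vulnerability_id series and that Prometheus is scraping the operator): if this returns nothing, the problem is on the scraping side rather than in the dashboard.

count(trivy_vulnerability_id)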
First of all: amazing dashboards...Thanks a ton :)
The panel "Resources by container" in the "kubernetes-views-pods" uses the metrics
kube_pod_container_resource_requests{namespace="$namespace", pod="$pod", unit="byte"}
kube_pod_container_resource_usage{namespace="$namespace", pod="$pod", unit="byte"}
Unfortunately this leads to unexpected values as the label "resource" in these metrics can have the values "memory" and "ephemeral_storage" and counts them together.
The metrics should probably be:
kube_pod_container_resource_requests{namespace="$namespace", pod="$pod", unit="byte", resource="memory"}
kube_pod_container_resource_usage{namespace="$namespace", pod="$pod", unit="byte", resource="memory"}
I find the lack of multi-select in the pod view very limiting. The view should allow looking at multiple workloads at once.
I'd like the nodes dashboard to show the runtime and system resource usage, as exported by the kubelet.
This requires that the cAdvisor metrics for cgroup slices aren't dropped. For this to work with kube-prometheus-stack, the kubelet ServiceMonitor's cAdvisorMetricRelabelings value needs to be overridden to keep the required series.
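A sketch of such an override in the kube-prometheus-stack Helm values (an empty list keeps everything the chart's default relabelings would otherwise drop; a narrower keep-list is probably wiser in production):

kubelet:
  serviceMonitor:
    # default relabelings drop cgroup-slice series; [] disables the drop
    cAdvisorMetricRelabelings: []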
The default resolution of 30s is too low and renders some dashboards with "No Data". This is likely because I'm using Grafana Mimir, as opposed to a standard Prometheus install.
Changing the resolution from 30s to 1m shows the data as expected.
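In the panel JSON, this maps to the query options' min interval, e.g. (a sketch of the relevant field):

"interval": "1m"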
This worked for me in the past, but I am building a new k3s cluster and I can't install it with the previous documentation: https://github.com/dotdc/grafana-dashboards-kubernetes#install-with-helm-values.
The error I get is a little specific to me since I am using terraform:
│ Error: unable to build kubernetes objects from release manifest: unable to decode "": json: cannot unmarshal number into Go struct field ObjectMeta.metadata.labels of type string
│
│ with module.monitoring.helm_release.prometheus-stack,
│ on ../modules/monitoring/main.tf line 2, in resource "helm_release" "prometheus-stack":
│ 2: resource "helm_release" "prometheus-stack" {
I tried to read https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/values.yaml, but it looks like a few things have changed.
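A hedged guess at the cause: Kubernetes label values must be strings, so a bare numeric label in the Helm values would trigger exactly this unmarshal error. Quoting the value fixes it:

# fails: label values must be strings
grafana_dashboard: 1
# works
grafana_dashboard: "1"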
Hi, and thanks for the good set of Dashboards @dotdc !
I'm having some trouble with the CoreDNS dashboard.
Several graphs and stats don't show any data, displaying the "No Data" placeholder.
I've noticed that the filter for CoreDNS is a job and not a pod.
At least in my EKS cluster, CoreDNS is a daemonset, not a job.
Is there something I could do or change?
Thanks =D !
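One thing worth checking (a sketch; the dashboard filters on the Prometheus job label, which names a scrape config and is unrelated to a Kubernetes Job): see which job value your CoreDNS series actually carry and adjust the dashboard variable accordingly.

count by (job) (coredns_dns_requests_total)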
Hi,
on the "Kubernetes / Views / Namespaces" Dashboard exists a Variable "created_by" that is filled ONLY on dashboard loading. If I change to yesterday, PODs created are not shown. The only thing to be changed is in the variable "properties the refresh from 1 => 2:
"refresh": 1, // Bug
"refresh": 2, // Correct Value
Regards, Philipp
Always
created_by should be "refilled" on every time range change.
In k8s-views-nodes.json, the "FS - Device Errors" query is sum(node_filesystem_device_error) by (mountpoint), which aggregates data from the entire datasource.
{instance="$instance"} should be added to the query.
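That is, a sketch of the corrected expression:

sum by (mountpoint) (node_filesystem_device_error{instance="$instance"})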
The k8s-views-nodes.json dashboard will have many broken panels in specific Kubernetes setups.
This is currently the case on OKE.
Apparently, this happens when the node label from kube_node_info doesn't match the nodename label from node_uname_info.
Here are some extracted metrics from a broken setup where the labels differ.
TL;DR: node="k8s-wrk-002" and nodename="kind-kube-prometheus-stack-worker2".
kube_node_info:
{
__name__="kube_node_info",
container="kube-state-metrics",
container_runtime_version="containerd://1.6.19-46-g941215f49",
endpoint="http",
instance="10.27.3.148:8080",
internal_ip="172.18.0.2",
job="kube-state-metrics",
kernel_version="6.2.12-arch1-1",
kubelet_version="v1.26.3",
kubeproxy_version="v1.26.3",
namespace="monitoring",
node="k8s-wrk-002",
os_image="Ubuntu 22.04.2 LTS",
pod="kube-prometheus-stack-kube-state-metrics-6df68756d8-zvd58",
pod_cidr="10.27.1.0/24",
provider_id="kind://docker/kind-kube-prometheus-stack/kind-kube-prometheus-stack-worker2",
service="kube-prometheus-stack-kube-state-metrics",
system_uuid="8422f117-6154-45bd-97c0-e3dec80a3f60"
}
node_uname_info:
{
__name__="node_uname_info",
container="node-exporter",
domainname="(none)",
endpoint="http-metrics",
instance="172.18.0.2:9100",
job="node-exporter",
machine="x86_64",
namespace="monitoring",
nodename="kind-kube-prometheus-stack-worker2",
pod="kube-prometheus-stack-prometheus-node-exporter-qvn22",
release="6.2.12-arch1-1",
service="kube-prometheus-stack-prometheus-node-exporter",
sysname="Linux",
version="#1 SMP PREEMPT_DYNAMIC Thu, 20 Apr 2023 16:11:55 +0000"
}
This issue will continue the discussion started in #41.
You can use https://github.com/dotdc/kind-lab, which will create a kind cluster with renamed nodes.
# Create the kind cluster
./start.sh
# Export configuration
export KUBECONFIG="$(pwd)/kind-kubeconfig.yml"
# Expose Grafana
kubectl port-forward svc/kube-prometheus-stack-grafana -n monitoring 3000:80
login: admin
password: prom-operator
Open broken dashboard:
http://localhost:3000/d/k8s_views_nodes/kubernetes-views-nodes?orgId=1&refresh=30s
The dashboard should work with a relabel_configs like the one suggested by @Chewie.
The solution should be described in https://github.com/dotdc/grafana-dashboards-kubernetes#known-issues
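A sketch of that kind of relabeling, expressed as kube-prometheus-stack values (the key paths and the choice to overwrite nodename with the Kubernetes node name are assumptions; chart versions differ):

prometheus-node-exporter:
  prometheus:
    monitor:
      relabelings:
        # set nodename on node-exporter targets to the Kubernetes node name,
        # so it matches the node label from kube_node_info
        - action: replace
          sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: nodename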
No response
The metrics kube_node_info and node_uname_info produce different names for nodes, resulting in the Nodes dashboard not working.
E.g.:
node_uname_info:
kube_node_info:
Node exporter version: 1.3.1
Kube state metrics version: 2.5.0
I acknowledge this is not a bug in the dashboards themselves, but rather in the naming standards of the different metric exporters.
However, I just wanted to know whether other AWS EKS users are experiencing the same issue before I start manually editing the dashboards in an attempt to get them working.
Thanks
In a cluster with a lot of pod churn, the high-cardinality pod metrics cause queries to fail due to the large number of series returned. For instance, I doubled the maximum number of returned label sets in VictoriaMetrics to 60k, and I still fail when trying to use the pod dashboard:
2024-04-22T18:17:33.527Z warn VictoriaMetrics/app/vmselect/main.go:231 error in "/api/v1/series?start=1713806220&end=1713809880&match%5B%5D=%7B__name__%3D%22kube_pod_info%22%7D": cannot fetch time series for "filters=[{__name__=\"kube_pod_info\"}], timeRange=[2024-04-22T17:17:00Z..2024-04-22T18:18:00Z]": cannot find metric names: error when searching for metricIDs in the current indexdb: the number of matching timeseries exceeds 60000; either narrow down the search or increase -search.max* command-line flag values at vmselect; see https://docs.victoriametrics.com/#resource-usage-limits
Have a cluster with a lot of pods being created...
I have a fix suggestion that seems to work fine for me. It involves changing the namespace and job variable queries so they don't query "all pods" for labels, like this:
namespace: label_values(kube_namespace_created{cluster="$cluster"},namespace)
job: label_values(kube_pod_info{namespace="$namespace", cluster="$cluster"},job)
On my simple test cluster, I have no issues with Global Network Utilization, but on my production cluster, which does both cluster and host networking, the numbers are crazy:
No way I have sustained rates like that. I think this is related to the metric:
sum(rate(container_network_receive_bytes_total[$__rate_interval]))
If I look at rate(container_network_receive_bytes_total[30s]), I get:
{id="/", interface="cni0", job="kubernetes-cadvisor"} | 2041725438.15131
{id="/", interface="enp1s0", job="kubernetes-cadvisor"} | 4821605692.45648
{id="/", interface="flannel.1", job="kubernetes-cadvisor"} | 337125370.2678834
I'm not sure what to actually look at here. I tried sum(rate(node_network_receive_bytes_total[$__rate_interval])) and I get a reasonable traffic graph:
This is 5 nodes, pretty much at idle. Showing I/O by instance:
Here is BTOP+ on k3s01, running for a bit; it lines up very well with the data above:
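A hedged reading of the numbers above: the id="/" series is the root cgroup, so summing it counts the same traffic once on the physical NIC (enp1s0) and again on the overlay devices (cni0, flannel.1). A sketch of a query restricted to non-virtual interfaces:

# interface regex is an assumption based on the Flannel setup shown above
sum(rate(container_network_receive_bytes_total{id="/", interface!~"cni.*|flannel.*|veth.*|lo"}[$__rate_interval]))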
Tried to import the dashboards in Grafana 7.4.5; getting a "Failed to upgrade legacy queries: e.replace is not a function" error.
On the Kubernetes / Views / Pods dashboard, the Network - Bandwidth panel has the wrong query for Transmitted.
It is:
- sum(rate(container_network_receive_bytes_total{namespace="$namespace", pod="$pod"}[$__rate_interval]))
It should be:
- sum(rate(container_network_transmit_bytes_total{namespace="$namespace", pod="$pod"}[$__rate_interval]))
Currently, "Running Pods" panel uses the expression sum(kube_pod_container_info)
, which sums the containers, but not the pods. I believe the metric kube_pod_info
would be the best for this panel.
Should be updated here: grafana-dashboards-kubernetes/dashboards/k8s-views-global.json, lines 868 to 881 in 793cb68.
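That is, a sketch of the suggested expression (kube_pod_info is one series per pod, so the sum equals the pod count):

sum(kube_pod_info)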
P.S. Thank you for the dashboards, they look awesome!
Based on the iostat definitions quoted below, the CPU modes idle, iowait, and steal should be excluded from the CPU utilization calculation.
Per the iostat man page:
%idle
Show the percentage of time that the CPU or CPUs were idle and the
system did not have an outstanding disk I/O request.
%iowait
Show the percentage of time that the CPU or CPUs were idle during
which the system had an outstanding disk I/O request.
%steal
Show the percentage of time spent in involuntary wait by the
virtual CPU or CPUs while the hypervisor was servicing another
virtual processor.
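A sketch of a utilization query that excludes all three modes (label names assume node-exporter defaults):

# time spent in productive modes divided by total CPU time, per instance
sum by (instance) (rate(node_cpu_seconds_total{mode!~"idle|iowait|steal"}[$__rate_interval]))
/
sum by (instance) (rate(node_cpu_seconds_total[$__rate_interval]))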
First of all, I want to thank you for your effort in creating these amazing Grafana dashboards for K8s. I have deployed the Prometheus Helm chart stack and passed the dashboard provider value to values.yaml. Everything went smoothly except for one issue in /kubernetes/view/pods: the total pod RAM request usage and total RAM limit usage gauges show wrong values, as you can see in the screenshot below. I wonder if someone can help me fix it.
Similar to #46, other dashboards don't work because they match against job="node-exporter". The metrics exist, and I don't see a need for this label?
If I remove the label filter, then everything looks as expected.