
dotdc / grafana-dashboards-kubernetes

2.1K stars · 38 watchers · 310 forks · 487 KB

A set of modern Grafana dashboards for Kubernetes.

License: Apache License 2.0

grafana grafana-dashboard grafana-prometheus prometheus prometheus-metrics dashboard dashboards monitoring monitoring-dashboard grafana-dashboards

grafana-dashboards-kubernetes's People

Contributors

alexintech, chewie, clementnuss, cmergenthaler, danic-git, dotdc, elmariofredo, fcecagno, felipewnp, ffppmm, geekofalltrades, hoangphuocbk, jcpunk, jkroepke, k1rk, kongfei605, marcofranssen, miracle2k, prasadkris, rcattin, reefland, superq, tlemarchand, uhthomas, vladimir-babichev, william-lp


grafana-dashboards-kubernetes's Issues

[enhancement] Windows support

Describe the enhancement you'd like

I have some clusters with Windows nodes. I would like to ask whether I can add Windows support, or do you think it is out of scope here?

Unlike kubernetes-mixin, which has a separate dashboard, I would like to add the Windows queries into the existing ones. That's possible by combining queries with OR, e.g.:

sum(container_memory_working_set_bytes{cluster="$cluster",namespace=~"$namespace", image!="", pod=~"${created_by}.*"}) by (pod)
OR
<WINDOWS Query>

Additional context

Since I'm running multiple hybrid-OS clusters, I would like to submit PRs for Windows pods here. I'm not expecting the maintainers to provide support for Windows. Before starting to work on this, I would like to know whether it would be accepted.

Some metrics are missing.

Beautiful dashboards. Some of the panels show no data, and I've seen this before (Kubernetes LENS). Reviewing the JSON, the queries reference labels that are not included in the cAdvisor metrics that I have. For example, in your Global dashboard:

grafana_missing_metrics

When I look at the CPU Utilization by namespace panel and inspect the JSON, the query is based on container_cpu_usage_seconds_total. In my Prometheus this metric does not have an image label; here is a random series from the top of the query results:

container_cpu_usage_seconds_total{cpu="total", endpoint="https-metrics", id="/kubepods/besteffort/pod03202a32-75a1-4a64-8692-1e73fd26eca3", instance="192.168.10.217:10250", job="kubelet", metrics_path="/metrics/cadvisor", namespace="democratic-csi", node="k3s03", pod="democratic-csi-nfs-node-sqxp9", service="kube-prometheus-stack-kubelet"}

I'm using K3s based on Kubernetes 1.23 on bare metal with containerd, no Docker runtime. I have no idea if this is a containerd, kubelet, or cAdvisor issue, or just expected as part of life when you don't use the Docker runtime.
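A quick way to check whether the image label is present at all on the cAdvisor series (a diagnostic sketch, assuming the kubelet/cAdvisor job and metrics_path labels shown above):

# series with an image label vs. all series, per scrape job
count by (job, metrics_path) (container_cpu_usage_seconds_total{image!=""})
count by (job, metrics_path) (container_cpu_usage_seconds_total)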

If you have any suggestions, they would be much appreciated.

Metrics missing in K8s Environment

I've opened a new issue because this one is not about a K3s environment but about vanilla Kubernetes.

I see some metrics missing, probably because my installation is incomplete.
I've deployed the k8s cluster with two master and three worker nodes. Grafana and Prometheus are deployed with "almost" the default settings.

i5Js@nanoserver:~/K3s/K8s/grafana/grafana-dashboards-kubernetes/dashboards$ k get svc -n grafana
NAME      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
grafana   ClusterIP   <ip>   <none>        80/TCP    18h
i5Js@nanoserver:~/K3s/K8s/grafana/grafana-dashboards-kubernetes/dashboards$ k get svc -n prometheus
NAME                            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
prometheus-alertmanager         ClusterIP   <ip>     <none>        80/TCP     21h
prometheus-kube-state-metrics   ClusterIP   <ip>     <none>        8080/TCP   21h
prometheus-node-exporter        ClusterIP   <ip>     <none>        9100/TCP   21h
prometheus-pushgateway          ClusterIP   <ip>    <none>        9091/TCP   21h
prometheus-server               ClusterIP   <ip>    <none>        80/TCP     21h

I've created the datasource using the prometheus-server IP, and some of the metrics work and some don't:

Screenshot 2022-07-02 at 10 08 38

Screenshot 2022-07-02 at 10 10 16

I'm quite sure these issues are caused by my environment, because I can see that your dashboards work fine, but can you help me troubleshoot?
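One quick check that can help narrow this down (a diagnostic sketch, independent of any specific chart values): panels built on cAdvisor/kubelet metrics will stay empty if those targets are not scraped at all.

# which scrape jobs are present
count by (job) (up)
# cAdvisor container metrics should come from a kubelet/cAdvisor job
count by (job, metrics_path) (container_cpu_usage_seconds_total)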

Thanks,

deployment view

Describe the enhancement you'd like

Currently there are views for:

  • global
  • namespace
  • nodes
  • pods

It would be nice to have a view showing the status of the deployments (number of replicas, ...).
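For reference, a minimal sketch of the kind of kube-state-metrics queries such a view could be built on (assuming the standard kube_deployment_* metrics and the existing $namespace variable):

# desired vs. available replicas per deployment
kube_deployment_spec_replicas{namespace=~"$namespace"}
kube_deployment_status_replicas_available{namespace=~"$namespace"}
# deployments that are not fully available
kube_deployment_spec_replicas{namespace=~"$namespace"} - kube_deployment_status_replicas_available{namespace=~"$namespace"}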

Additional context

No response

Issues with node_cpu_seconds_total

I tested the latest changes, and it's still not right...

The CPU Utilization by Node panel with "expr": "avg by (node) (1-rate(node_cpu_seconds_total{mode=\"idle\"}[$__rate_interval]))" yields:

image

It seems to show the total of all nodes and is not picking up the individual nodes. It should look like this:
image
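A possible explanation: node_cpu_seconds_total comes from node-exporter and usually carries an instance label rather than node, so "avg by (node)" collapses everything into a single series. A sketch that splits per node-exporter instance (assuming no relabelling adds a node label):

avg by (instance) (1 - rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval]))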

The CPU Utilization by namespace panel is still dark and uses the old metric: "expr": "sum(rate(container_cpu_usage_seconds_total{image!=\"\"}[$__rate_interval])) by (namespace)". I did try something like the above, "avg by (namespace) (1-rate(node_cpu_seconds_total{mode=\"idle\"}[$__rate_interval]))", but that is not right either; I only got one namespace listed:

image

Both Memory Utilization Panels are still dark based on container_memory_working_set_bytes when I use your unmodified files.

[bug] some panels not displaying correctly on white background

Describe the bug

Hi,

I'm using these dashboards with Grafana's light theme (easier on my eyes), and some panels are not displaying properly, e.g.:

image

This can be fixed by setting the panel's color mode to None instead of Value:

Screenshot 2022-09-30 at 07 13 22

How to reproduce?

  1. turn on light mode for Grafana.
  2. check the Kubernetes / Views / Global panel

Expected behavior

the text/values should be readable even with the light theme

Additional context

No response

[bug] node dashboard only shows latest instance

Describe the bug

Some panels are using node to filter, and others are using a hidden instance variable ( label_values(node_uname_info{nodename=~"(?i:($node))"}, instance)). If a node changes its IP, then some panels will look normal and others will be missing data.

image

How to reproduce?

  1. Collect node metrics.
  2. Change IP of node.
  3. Observe the node dashboard has some unaffected panels, and others which only show the latest 'instance'.

Expected behavior

It should probably show all instances of a node.

Additional context

No response

View Pods Dashboard Feature Requests / Issues

RAM Usage Request Gauge
My understanding of requests is that actual usage should closely match them. Being at 90% of the request is not a bad condition, it's a good one. I think GREEN should be +/- 20% of the request value, the next 20% on either side yellow, and the rest red, since being significantly under or over the request is not ideal. As it is now, if you estimate the request perfectly it shows RED like an error condition, and that is not the case. Only the LIMIT gauge should behave like this (as you get OOM killed).

image
I think that is wrong; being stable at 90% of the request should get me a gold star :)

I'm not sure if the CPU Request gauge needs that as well. If so, maybe its GREEN range should be wider?


Resource by container
Could you add the actual usage for CPU and Memory between the Request/Limit values for each container? That would help show where actual usage falls between the two.
image
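For reference, a sketch of per-container actual usage queries that could sit between the request and limit columns (using the cAdvisor metrics already referenced elsewhere in these dashboards):

# current CPU usage per container in the selected pod
sum by (container) (rate(container_cpu_usage_seconds_total{namespace="$namespace", pod="$pod", image!=""}[$__rate_interval]))
# current memory working set per container in the selected pod
sum by (container) (container_memory_working_set_bytes{namespace="$namespace", pod="$pod", image!=""})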


I think CPU Usage by container and Memory Usage by Container should be renamed to "by pod": if you select a pod with multiple containers, you do not get a graph with multiple plot lines, which you would expect if it were really by container.


NOTE: I played with adding resource requests and limits as plot lines for CPU Usage by Container and Memory Usage by Container, and it looks good for pods with a single container. But once I selected a pod with multiple containers, and thus multiple requests/limits, it became a confusing mess. I don't have the Grafana skills to isolate them properly, but maybe you have some ideas to make that work right.
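One way to keep those lines distinguishable for multi-container pods is to aggregate by container and use {{container}} in the legend; a sketch, assuming the kube-state-metrics request/limit metrics used elsewhere in this repo:

# one request line per container (legend: {{container}} request)
max by (container) (kube_pod_container_resource_requests{namespace="$namespace", pod="$pod", resource="cpu"})
# one limit line per container (legend: {{container}} limit)
max by (container) (kube_pod_container_resource_limits{namespace="$namespace", pod="$pod", resource="cpu"})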

[bug] Failed to display node metrics

Describe the bug

This is the way variables are configured on k8s-views-nodes.json:

...
node = label_values(kube_node_info, node)
instance = label_values(node_uname_info{nodename=~"(?i:($node))"}, instance)

In OKE, kube_node_info looks like this:

{__name__="kube_node_info", container="kube-state-metrics", container_runtime_version="cri-o://1.25.1-111.el7", endpoint="http", instance="10.244.0.40:8080", internal_ip="10.0.107.39", job="kube-state-metrics", kernel_version="5.4.17-2136.314.6.2.el7uek.x86_64", kubelet_version="v1.25.4", kubeproxy_version="v1.25.4", namespace="monitoring", node="10.0.107.39", os_image="Oracle Linux Server 7.9", pod="monitoring-kube-state-metrics-6fcd4d745c-txg2k", pod_cidr="10.244.1.0/25", provider_id="ocid1.instance.oc1.sa-saopaulo-1.xxx", service="monitoring-kube-state-metrics", system_uuid="d6462364-95bf-4122-a3ab-xxx"}

And node_uname_info looks like this:

node_uname_info{container="node-exporter", domainname="(none)", endpoint="http-metrics", instance="10.0.107.39:9100", job="node-exporter", machine="x86_64", namespace="monitoring", nodename="oke-cq2bxmvtqca-nsdfwre7l3a-seqv6owhq3a-0", pod="monitoring-prometheus-node-exporter-n6pzv", release="5.4.17-2136.314.6.2.el7uek.x86_64", service="monitoring-prometheus-node-exporter", sysname="Linux", version="#2 SMP Fri Dec 9 17:35:27 PST 2022"}

For this example, node=10.0.107.39, but when I query node_uname_info{nodename=~"(?i:($node))"}, it doesn't return anything, because nodename doesn't match the internal IP address of the node.
As a result, no node metrics are displayed.

How to reproduce?

No response

Expected behavior

No response

Additional context

Modifying the filter https://github.com/dotdc/grafana-dashboards-kubernetes/blob/master/dashboards/k8s-views-nodes.json#L3747-L3772 to use node_uname_info{instance="$node:9100"} fixes the issue.
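As a variable query, that workaround would look roughly like this (a sketch, assuming node-exporter listens on port 9100 as in the example above):

instance = label_values(node_uname_info{instance="$node:9100"}, instance)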

[bug] incorrect node count with grafana agent

Describe the bug

The metric up{job="node-exporter"} does not exist with Grafana Agent, so the total number of nodes is reported as 0.

image

How to reproduce?

  1. Use Grafana Agent
  2. Load dashboard

Expected behavior

Should show the total number of nodes (5 in this case).
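One possible alternative for the node-count stat that doesn't depend on up{job="node-exporter"} (a sketch, assuming kube-state-metrics is present):

# count nodes known to kube-state-metrics
count(kube_node_info)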

Additional context

No response

[bug] Trivy Dashboard Templating Failed to upgrade legacy queries Datasource prometheus was not found

Describe the bug

The Trivy dashboard has been broken on our clusters since commit 4b52d9c.

How to reproduce?

No response

Expected behavior

The dashboard should continue to work as it did on the previous commit. Other dashboards don't seem to have this issue.

Additional context

Is there any chance it is because of the missing cluster label on trivy metrics? Should we configure a specific setting to include this cluster label on the trivy operator?

[bug] should use last non-null value, rather than mean

Describe the bug

Three of these four gauges use the mean rather than the last non-null value. This can cause strangeness like incorrect reporting of current CPU requests and limits. They should also be consistent.

Current:

image

Last *:

image

How to reproduce?

  1. Observe the global view
  2. Change some cpu requests and limits
  3. Observe incorrect reporting of cpu requests and limits

Expected behavior

Should probably use "Last *" rather than "Mean" for calculating the value.

Additional context

No response

[bug] Namespace dashboard shows double resource usage

Describe the bug

The cumulative resource usage in the namespace seems to be 1.25 cpu and 2.5Gi (I changed the two graphs to stack), but it appears as 2.5 cpu and 5Gi respectively.

image

I imagine the queries need the label selector image!="".
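For illustration, the kind of filter meant above (a sketch; the exact panel queries may differ):

# exclude the pod-level cgroup series (empty image label) that double-count usage
sum by (pod) (container_memory_working_set_bytes{namespace="$namespace", image!=""})
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="$namespace", image!=""}[$__rate_interval]))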

How to reproduce?

N/A

Expected behavior

N/A

Additional context

N/A

[enhancement] cluster variable support

Thanks for the very nice dashboards.

One thing that seems to be missing is a "cluster" variable. When running multiple clusters, it is useful to limit the scope to a single cluster: a multi-select variable accepting All, with queries adding cluster=~"$cluster".
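A rough sketch of what that could look like (assuming the metrics carry a cluster label, e.g. added via external_labels or remote-write relabelling):

# multi-select variable, with "All" enabled
cluster = label_values(kube_node_info, cluster)
# example query usage
sum(rate(container_cpu_usage_seconds_total{cluster=~"$cluster", image!=""}[$__rate_interval])) by (namespace)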

[bug] All dashboards with the cluster variable are broken in VictoriaMetrics

Describe the bug

Popup message in grafana when opening dashboards:

Templating
Failed to upgrade legacy queries Datasource prometheus was not found

The previous version was working fine.

How to reproduce?

Install VictoriaMetrics as the Prometheus datasource and try to open the namespace dashboard.

Expected behavior

Dashboards work correctly.

Additional context

No response

[bug] Trivy Operator Dashboard: The Prometheus data source variable is not used everywhere

Describe the bug

There are panels in the Trivy Operator dashboard which do not properly use the Prometheus data source variable.

How to reproduce?

  1. Import the dashboard
  2. Change between Prometheus data sources in the global variable filter
  3. See that the "Vulnerability count per image and severity in $namespace namespace" panel does not pick up the Prometheus data source correctly

Expected behavior

The global Prometheus data source variable should be applied to all panels.

Additional context

Here are the places I spotted where the Prometheus data source variable is not used:

https://github.com/dotdc/grafana-dashboards-kubernetes/blob/master/dashboards/k8s-addons-trivy-operator.json#L785
https://github.com/dotdc/grafana-dashboards-kubernetes/blob/master/dashboards/k8s-addons-trivy-operator.json#L882

Question: How I should export dashboard json

Hi,

I'm preparing #79 and I'm having some trouble exporting the JSON file from a Grafana instance.

If I import a dashboard and export it again without any modifications, I get a lot of changes:

For example, this commit does not contain any real changes, only a lot of changes at the JSON level: jkroepke@706315b

That's how I export the JSON:

image

What is the recommended way? If the mentioned approach is the correct one, would it be possible to import and export all dashboards to keep my PR as clean as possible? Otherwise, I'll have tons of unrelated changes.

https://github.com/dotdc/grafana-dashboards-kubernetes/blob/master/dashboards/trivy

Describe the bug

I am using the Trivy dashboard (grafana-dashboards-kubernetes/dashboards/trivy) but I am not getting any values for 'CVE vulnerabilities in All namespace(s)' and 'Other vulnerabilities in All namespace(s)'. I have enabled OPERATOR_METRICS_VULN_ID_ENABLED=true in my Trivy deployment and I am using the latest versions of the Trivy operator and Prometheus. Could you please help?

How to reproduce?

  1. Install the latest trivy-operator and try to use the Grafana dashboard.

Expected behavior

The dashboard should show CVE values.

Additional context

No response

[bug] CPU dashboard can report negative values

Describe the bug

image

How to reproduce?

I don't know

Expected behavior

The dashboard should not produce negative CPU usage values.

Additional context

I adjusted some resource limits, which caused some pods to restart.

[bug] Dashboard kubernetes-views-pods shows unexpected values for memory requests / limits

Describe the bug

First of all: amazing dashboards...Thanks a ton :)

The panel "Resources by container" in the "kubernetes-views-pods" uses the metrics
kube_pod_container_resource_requests{namespace="$namespace", pod="$pod", unit="byte"}
kube_pod_container_resource_usage{namespace="$namespace", pod="$pod", unit="byte"}

Unfortunately, this leads to unexpected values, because the "resource" label in these metrics can have the values "memory" and "ephemeral_storage", and the panel adds them together.

How to reproduce?

No response

Expected behavior

The metrics should probably be:
kube_pod_container_resource_requests{namespace="$namespace", pod="$pod", unit="byte", resource="memory"}
kube_pod_container_resource_usage{namespace="$namespace", pod="$pod", unit="byte", resource="memory"}

Additional context

No response

[enhancement] Add support for monitoring node runtime & system resource usage

Describe the enhancement you'd like

I'd like the nodes dashboard to show the runtime and system resource usage, as are exported by kubelet.

Additional context

This requires that the cAdvisor metrics for cgroup slices aren't being dropped. For this to work with Kube Prometheus Stack the kubelet ServiceMonitor cAdvisorMetricRelabelings value needs to be overridden to keep the required values.
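A quick check for whether the slice-level cAdvisor series are still present (a diagnostic sketch; the exact id values depend on the node's cgroup layout):

# should return the cgroup slices if they are not dropped at scrape time
count by (id) (container_memory_working_set_bytes{id=~"/(system|kubepods).slice"})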

[bug] default resolution is too low

Describe the bug

The default resolution of 30s is too low and renders some dashboards with "No Data". This is likely because I'm using Grafana Mimir, as opposed to a standard Prometheus install.

image

How to reproduce?

  1. Collect metrics with Grafana Mimir.
  2. Load the dashboard.

Expected behavior

Changing the resolution from 30s to 1m shows the data as expected.

image

Additional context

No response

[bug] `kube-prometheus-stack` installation steps broken

This worked for me in the past, but I am building a new k3s cluster and I can't install it with the previous documentation: https://github.com/dotdc/grafana-dashboards-kubernetes#install-with-helm-values.

The error I get is a little specific to me since I am using terraform:

│ Error: unable to build kubernetes objects from release manifest: unable to decode "": json: cannot unmarshal number into Go struct field ObjectMeta.metadata.labels of type string
│
│   with module.monitoring.helm_release.prometheus-stack,
│   on ../modules/monitoring/main.tf line 2, in resource "helm_release" "prometheus-stack":
│    2: resource "helm_release" "prometheus-stack" {

I tried to read https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/values.yaml, but it looks like a few things have changed.

How to reproduce?

No response

Expected behavior

No response

Additional context

No response

[bug] CoreDNS Dashboard No Data

Describe the bug

Hi, and thanks for the good set of Dashboards @dotdc !

I'm having some trouble with the CoreDNS dashboard.

Several graphs and statuses don't show any data, displaying the "No Data" placeholder.

I've noticed that the filter for CoreDNS is a job and not a pod.

At least in my EKS, the CoreDNS is a daemonset and not a job.

image

Is there something I could do or change?
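One quick way to find the right label value on EKS (a diagnostic sketch, assuming the CoreDNS metrics are scraped at all):

# shows which job/namespace the CoreDNS metrics are exported under
count by (job, namespace) (coredns_dns_requests_total)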

Thanks =D !

How to reproduce?

No response

Expected behavior

No response

Additional context

No response

[bug] created_by variable is not refreshed on Time Range Change

Describe the bug

Hi,
on the "Kubernetes / Views / Namespaces" Dashboard exists a Variable "created_by" that is filled ONLY on dashboard loading. If I change to yesterday, PODs created are not shown. The only thing to be changed is in the variable "properties the refresh from 1 => 2:

        "refresh": 1, // Bug
        "refresh": 2, // Correct Value

Regards Philipp

How to reproduce?

Always

Expected behavior

created_by should be "refilled" on every Time Range Change

Additional context

No response

[bug] "FS - Device Errors" query in Nodes dashboard is not scoped

Describe the bug

In k8s-views-nodes.json, the "FS - Device Errors" query is sum(node_filesystem_device_error) by (mountpoint), which aggregates data from the entire datasource.

How to reproduce?

No response

Expected behavior

{instance="$instance"} should be added to the query.
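A sketch of the scoped query, simply combining the two pieces above:

sum(node_filesystem_device_error{instance="$instance"}) by (mountpoint)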

Additional context

No response

[bug] broken panels on k8s-views-nodes in specific cases

Describe the bug

The k8s-views-nodes.json dashboard will have many broken panels in specific Kubernetes setups.
This is currently the case on OKE.

Apparently, this happens when the node label from kube_node_info doesn't match the nodename label from node_uname_info.

Here are some extracted metrics from a broken setup where the labels differ.

TL;DR: node="k8s-wrk-002" and nodename="kind-kube-prometheus-stack-worker2".

kube_node_info:

{
    __name__="kube_node_info",
    container="kube-state-metrics",
    container_runtime_version="containerd://1.6.19-46-g941215f49",
    endpoint="http", 
    instance="10.27.3.148:8080", 
    internal_ip="172.18.0.2", 
    job="kube-state-metrics", 
    kernel_version="6.2.12-arch1-1", 
    kubelet_version="v1.26.3", 
    kubeproxy_version="v1.26.3", 
    namespace="monitoring",
    node="k8s-wrk-002",
    os_image="Ubuntu 22.04.2 LTS",
    pod="kube-prometheus-stack-kube-state-metrics-6df68756d8-zvd58",
    pod_cidr="10.27.1.0/24",
    provider_id="kind://docker/kind-kube-prometheus-stack/kind-kube-prometheus-stack-worker2", 
    service="kube-prometheus-stack-kube-state-metrics", 
    system_uuid="8422f117-6154-45bd-97c0-e3dec80a3f60"
}

node_uname_info:

{
    __name__="node_uname_info", 
    container="node-exporter", 
    domainname="(none)", 
    endpoint="http-metrics", 
    instance="172.18.0.2:9100", 
    job="node-exporter", 
    machine="x86_64", 
    namespace="monitoring", 
    nodename="kind-kube-prometheus-stack-worker2", 
    pod="kube-prometheus-stack-prometheus-node-exporter-qvn22", 
    release="6.2.12-arch1-1", 
    service="kube-prometheus-stack-prometheus-node-exporter", 
    sysname="Linux", 
    version="#1 SMP PREEMPT_DYNAMIC Thu, 20 Apr 2023 16:11:55 +0000"
}

This issue will continue the discussion started in #41

@fcecagno @Chewie

How to reproduce?

You can use https://github.com/dotdc/kind-lab, which will create a kind cluster with renamed nodes.

# Create the kind cluster
./start.sh

# Export configuration
export KUBECONFIG="$(pwd)/kind-kubeconfig.yml"

# Expose Grafana
kubectl port-forward svc/kube-prometheus-stack-grafana -n monitoring 3000:80

Open http://localhost:3000

login: admin
password: prom-operator

Open broken dashboard:

http://localhost:3000/d/k8s_views_nodes/kubernetes-views-nodes?orgId=1&refresh=30s

Expected behavior

The dashboard should work with a relabel_configs like the one suggested by @Chewie.
The solution should be documented in https://github.com/dotdc/grafana-dashboards-kubernetes#known-issues

Additional context

No response

[bug] Node metrics names on AWS EKS nodes mismatch

Describe the bug

The metrics for kube_node_info & node_uname_info produce different names for nodes, resulting in the Node dashboard not working.

Eg:

node_uname_info:

  • nodename="ip-10-10-11-100.ec2.internal"

kube_node_info

  • node="ip-10-10-10-110.us-east-2.compute.internal"

Node exporter version: 1.3.1
Kube state metrics version: 2.5.0

I acknowledge this is not a bug in the dashboard itself but rather a result of the different naming standards of the metric exporters.

However, I just wanted to know whether other AWS EKS users are experiencing the same issue before I start manually editing the dashboard to get it working.

Thanks

How to reproduce?

No response

Expected behavior

No response

Additional context

No response

[bug] Suggest lower-cardinality variables for the pod dashboard

Describe the bug

In a cluster with a lot of pod churn, the high-cardinality pod metrics cause queries to fail due to the large number of series returned. For instance, I doubled the maximum number of returned label sets in VictoriaMetrics to 60k and I still get failures when trying to use the pod dashboard:

2024-04-22T18:17:33.527Z	warn	VictoriaMetrics/app/vmselect/main.go:231	error in "/api/v1/series?start=1713806220&end=1713809880&match%5B%5D=%7B__name__%3D%22kube_pod_info%22%7D": cannot fetch time series for "filters=[{__name__=\"kube_pod_info\"}], timeRange=[2024-04-22T17:17:00Z..2024-04-22T18:18:00Z]": cannot find metric names: error when searching for metricIDs in the current indexdb: the number of matching timeseries exceeds 60000; either narrow down the search or increase -search.max* command-line flag values at vmselect; see https://docs.victoriametrics.com/#resource-usage-limits

How to reproduce?

Have a cluster with a lot of pods being created...

Expected behavior

No response

Additional context

I have a fix suggestion that seems to work fine for me. It involves changing the namespace and job queries to not query "all pods" for labels. Like this:

namespace: label_values(kube_namespace_created{cluster="$cluster"},namespace)
job: label_values(kube_pod_info{namespace="$namespace", cluster="$cluster"},job)

[bug] Global Network Utilization

Describe the bug

On my simple test cluster, I have no issues with the Global Network Utilization, but on my production cluster, which does cluster and host networking, the numbers are crazy:

image

No way I have sustained rates like that. I think this is related to the metric:

sum(rate(container_network_receive_bytes_total[$__rate_interval]))

If I look at rate(container_network_receive_bytes_total[30s]), I get:

{id="/", interface="cni0", job="kubernetes-cadvisor"} | 2041725438.15131
{id="/", interface="enp1s0", job="kubernetes-cadvisor"} | 4821605692.45648
{id="/", interface="flannel.1", job="kubernetes-cadvisor"} | 337125370.2678834

I'm not sure what to actually look at here. I tried sum(rate(node_network_receive_bytes_total[$__rate_interval])) and I get a reasonable traffic graph:

image

This is 5 nodes, pretty much at idle. Showing I/O by instance:

image

Here is BTOP+ on k3s01 running for a bit; it lines up very well with the data above:
image
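Building on the node_network_receive_bytes_total attempt above, one possible mitigation is to also exclude the overlay/virtual devices so their traffic isn't counted twice (a sketch, assuming the cni0/flannel.1/veth naming shown above):

# only count physical uplinks; adjust the regex to your interface naming
sum(rate(node_network_receive_bytes_total{device!~"(cni|flannel|veth|lo).*"}[$__rate_interval]))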

How to reproduce?

No response

Expected behavior

No response

Additional context

No response

[bug] Wrong query on the Network - Bandwidth panel

Describe the bug

On the Kubernetes / Views / Pods dashboard, the Network - Bandwidth panel uses the wrong query for Transmitted.

It is

- sum(rate(container_network_receive_bytes_total{namespace="$namespace", pod="$pod"}[$__rate_interval]))

Should be

- sum(rate(container_network_transmit_bytes_total{namespace="$namespace", pod="$pod"}[$__rate_interval]))

How to reproduce?

No response

Expected behavior

No response

Additional context

https://github.com/dotdc/grafana-dashboards-kubernetes/blob/master/dashboards/k8s-views-pods.json#L1417

Running pods panel in Global dashboard

Currently, "Running Pods" panel uses the expression sum(kube_pod_container_info), which sums the containers, but not the pods. I believe the metric kube_pod_info would be the best for this panel.

Should be updated here:

"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "${datasource}"
},
"expr": "sum(kube_pod_container_info)",
"interval": "",
"legendFormat": "",
"refId": "A"
}
],
"title": "Running Pods",
"type": "stat"

P.S. Thank you for the dashboards, they look awesome!

[bug] exclude iowait, steal, idle from CPU usage

Describe the bug

Based on

The CPU modes idle, iowait, steal should be excluded from the CPU utilization.

How to reproduce?

No response

Expected behavior

No response

Additional context

Per the iostat man page:

%idle
Show the percentage of time that the CPU or CPUs were idle and the
system did not have an outstanding disk I/O request.

%iowait
Show the percentage of time that the CPU or CPUs were idle during
which the system had an outstanding disk I/O request.

%steal
Show the percentage of time spent in involuntary wait by the
virtual CPU or CPUs while the hypervisor was servicing another
virtual processor.
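For reference, a sketch of a per-node utilization expression that excludes those modes (one possible formulation, not necessarily the one used in the dashboards):

sum by (instance) (rate(node_cpu_seconds_total{mode!~"idle|iowait|steal"}[$__rate_interval])) / sum by (instance) (rate(node_cpu_seconds_total[$__rate_interval]))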

Total pod RAM request usage & Total Pod RAM limit usage gauge is showing wrong value

Describe the bug

First of all, I want to thank you for your effort in creating amazing Grafana dashboards for K8s. I have deployed the Prometheus Helm chart stack and passed the dashboard provider values in values.yaml, and everything went smoothly except for one issue I am facing in /kubernetes/view/pods: the Total pod RAM request usage and Total pod RAM limit usage gauges show wrong values, as you can see in the screenshots below. I wonder if someone can help me fix it.

image

image
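It may be worth checking whether the gauges also include ephemeral storage (see the related "unit=byte" issue above); a diagnostic sketch comparing the two:

# memory-only requests for the selected pod
sum(kube_pod_container_resource_requests{namespace="$namespace", pod="$pod", resource="memory"})
# all byte-unit requests (memory + ephemeral storage), what an unfiltered panel would show
sum(kube_pod_container_resource_requests{namespace="$namespace", pod="$pod", unit="byte"})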

How to reproduce?

No response

Expected behavior

No response

Additional context

No response
