Comments (19)
Hello sunya-ch and thanks for giving this some attention!
- Pod's metrics are correctly reported with expected pods from expected nodes
- I can see pod_energy_stat and other expected metrics in Prometheus, thus confirming that they are being sent
- Prometheus address is set to 0.0.0.0:9102
The "grafana dashboard should be updated" note is what it boils down to, I think.
I can see that the dashboards use the "pod_cpu_energy_total", "pod_dram_energy_total" and "pod_energy_total" metrics (and others that differ from the list specified above), which I can also find in Prometheus. Both the names used by the Grafana dashboards and the new ones can be found there, and the new ones are being reported to Prometheus.
Is my understanding correct that there has been a metric name change in the meantime, and that the Grafana dashboards found in grafana-dashboards are therefore incompatible with the new metric names?
If so, thanks for getting this sorted out, or rather, thanks for helping to flesh out the issue :)
from kepler.
@Feelas the message "dial unix /tmp/estimator.sock: connect: no such file or directory" is benign. The short story is that the estimator sidecar is not started yet (this is being worked on in #104 and the estimator repo). Upcoming PRs will start up the estimator sidecar and create the socket.
Thanks for testing!
@Feelas thanks for the detailed test! If you can submit a PR on the grafana name change, that'll be great.
cc @sunya-ch
I just verified that rolling back to the previously working image (sha256:4ad0c2f56538c383f1b3a90ccc756fcf937f0a436fafb88a23b9a780164f7be9) gets the metric gathering process working again, so I think that it is not a local issue.
thank you for confirming this @Feelas!
The estimator socket feature is a work in progress. We'll test this out more thoroughly and keep you posted.
Thank you for confirming that :) Is it expected that the aforementioned version will log no metrics to Prometheus, and should we stick to the estimator-less version for now?
@Feelas The metrics are still logged.
But if you want to give it a spin, please run the previous (aka latest) kepler image, and patch the deployment to kick off the estimator, that'll make the warning message go away.
kubectl patch -n monitoring daemonset kepler-exporter --patch-file https://raw.githubusercontent.com/sustainable-computing-io/kepler-estimator/main/deploy/patch.yaml
(based on this instruction)
Thank you for confirming that :) Is it expected that the aforementioned version will log no metrics to Prometheus, and should we stick to the estimator-less version for now?
Only dynamic power will always be exported as 0 (not estimated).
The hardware counters, cgroups and pod package power are expected to be exported to Prometheus if they are available.
If they are not, we have to investigate the issue.
I suggest using the updated version, because the previous one collects cgroup metrics incorrectly and does not handle the overflow issue for some metrics.
Please confirm the following points:
- Are the pod's metrics reported in the Kepler log with the detected pod name/namespace?
- Names of the Prometheus metrics
Note: the grafana dashboard should be updated.
- Prometheus address passed to the Kepler command (--address 0.0.0.0:9102)
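The metric-name check from the list above can be sketched against the Prometheus HTTP API, which exposes all known metric names under /api/v1/label/__name__/values. This is only a minimal sketch: the Prometheus server URL is an assumption (it is not the Kepler --address), and the helper names are made up for illustration.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

# Metric names mentioned in this thread; adjust them to your Kepler version.
WANTED = [
    "pod_energy_stat",        # newer name
    "pod_energy_total",       # name still used by the dashboards
    "pod_cpu_energy_total",
    "pod_dram_energy_total",
]


def names_from_response(payload: dict) -> set:
    """Extract metric names from a /api/v1/label/__name__/values response."""
    if payload.get("status") != "success":
        raise ValueError(f"unexpected Prometheus response: {payload!r}")
    return set(payload.get("data", []))


def fetch_metric_names(base_url: str) -> set:
    """Query a Prometheus server (base_url is an assumption, e.g. a port-forward)."""
    url = urljoin(base_url, "/api/v1/label/__name__/values")
    with urlopen(url, timeout=10) as resp:
        return names_from_response(json.load(resp))


def report(present: set, wanted: list) -> dict:
    """Map each wanted metric name to whether Prometheus knows about it."""
    return {name: name in present for name in wanted}
```

With a reachable server, something like report(fetch_metric_names("http://localhost:9090"), WANTED) would show which of the old and new names are present.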
Thank you so much for your kind testing. It's very good to see the expected behaviour there :)
And yes, your understanding is perfectly correct 👍
Good to hear then : )
Going back to the issue topic: the estimator.sock issue has been well explained (as WIP and expected), and I think we can close this so that the ticket is not kept open unnecessarily.
sounds great @Feelas
Let's keep this issue open until the dashboard and the metrics are consistent.
Not going to commit right now since I don't know whether there will be time to address this, but will look into it.
Quick question: is there a direct replacement for previous "pod_energy_total" & "pod_energy_current", which were a sum of cpu+dram? I see there are many more metrics exported right now and summing them up inside the dashboard would be cumbersome.
Thank you for pointing this out.
Package (pkg) energy includes cpu, dram, and uncore energy, all of which are reported by RAPL.
The pod package energy computed from RAPL package power is exported as pod_<curr|total>_energy_in_pkg_millijoule.
However, we might add another metric to replace pod_energy_total: package energy + GPU energy + the remaining part (node - package - GPU, if node energy is available). Currently, this value is reported as a value of pod_energy_stat.
Should we separate it out as a new metric?
By the way, I didn't add this metric in the first place because it could be inconsistent between systems where node energy is available and systems where it is not.
kepler/pkg/collector/collector.go
Line 285 in 267dd31
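The proposed combined value can be written out as a small function to make the consistency concern concrete. This is a sketch of the formula as described in the comment above, not the code in collector.go; the function name, the millijoule unit for all inputs, and the clamping of a negative remainder to zero are assumptions.

```python
from typing import Optional


def pod_total_energy(pkg_mj: float, gpu_mj: float = 0.0,
                     node_mj: Optional[float] = None) -> float:
    """Return package + GPU + other energy for one pod.

    "other" is node - package - GPU and is only known when node energy
    is available; otherwise it is treated as 0.
    """
    other = 0.0
    if node_mj is not None:
        # Assumption: a negative remainder (measurement noise) is clamped to 0.
        other = max(node_mj - pkg_mj - gpu_mj, 0.0)
    return pkg_mj + gpu_mj + other
```

For the same package and GPU readings, a system with node energy available can report a larger total than one without it, which is the inconsistency mentioned above.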
So it seems that "pod_energy_total" & "pod_energy_current" are almost the same as "pod_<curr|total>_energy_in_pkg_millijoule", and the latter could theoretically be used as a replacement, correct?
I am not sure about the original purpose of pod_energy_total and pod_energy_current, but if they include only the energy from the package (mainly core+dram), my answer is yes.
Taking a look at commit de3584a, they seem to have historically been calculated as core+dram.
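If the package metrics are used as a replacement, the dashboard queries could be updated mechanically. The sketch below assumes the usual Grafana dashboard JSON layout (panels with targets that carry an expr field) and expands pod_<curr|total>_energy_in_pkg_millijoule into two concrete names; the mapping and the function names are assumptions for illustration, and units or scaling in the panels may still need a manual check.

```python
import copy
import re

# Assumed mapping, based on the discussion above.
RENAMES = {
    "pod_energy_total": "pod_total_energy_in_pkg_millijoule",
    "pod_energy_current": "pod_curr_energy_in_pkg_millijoule",
}

# Match whole metric names only, so pod_cpu_energy_total is left alone.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def rename_expr(expr: str) -> str:
    """Replace old metric names inside one PromQL expression."""
    return _PATTERN.sub(lambda m: RENAMES[m.group(1)], expr)


def rename_dashboard(dashboard: dict) -> dict:
    """Return a copy of a dashboard dict with renamed metrics in all targets."""
    updated = copy.deepcopy(dashboard)
    stack = list(updated.get("panels", []))
    while stack:
        panel = stack.pop()
        stack.extend(panel.get("panels", []))  # row panels can nest panels
        for target in panel.get("targets", []):
            if "expr" in target:
                target["expr"] = rename_expr(target["expr"])
    return updated
```

The input dashboard is left untouched, so the result can be compared against the original before it is saved.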
The dashboard picks up the latest metric names now.