Describe the bug After rolling over the daemonset to the latest i

dial error: dial unix /tmp/estimator.sock: connect: no such file or directory (kepler issue, closed)

Feelas commented on May 30, 2024

dial error: dial unix /tmp/estimator.sock: connect: no such file or directory

from kepler.

Comments (19)

Feelas commented on May 30, 2024

Hello sunya-ch and thanks for giving this some attention!

- Pod metrics are correctly reported, with the expected pods from the expected nodes.
- I can see pod_energy_stat and the other expected metrics in Prometheus, which confirms that they are being sent.
- The Prometheus address is set to 0.0.0.0:9102.

The "grafana dashboard should be updated" note is what it boils down to, I think.
The dashboards use the "pod_cpu_energy_total", "pod_dram_energy_total" and "pod_energy_total" metrics (and others that differ from the list above), which I can also find in Prometheus. Both the names used by the Grafana dashboards and the new ones are present there, and the new ones are being reported to Prometheus.

Is my understanding correct that the metric names have changed in the meantime, so the Grafana dashboards found in grafana-dashboards are incompatible with the new metric names?

If that is so, thanks for getting this sorted out, I mean, thanks for helping flesh out the issue :)

rootfs commented on May 30, 2024

@Feelas the message "dial unix /tmp/estimator.sock: connect: no such file or directory" is benign. The short story is that the estimator sidecar is not started yet (this is being worked on in #104 and the estimator repo). Upcoming PRs will start the estimator sidecar and create the socket.
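
To illustrate why the message is harmless, here is a minimal Go sketch. It is not Kepler's actual code: it only assumes that the exporter dials the socket path from the log message and carries on without an estimate when nothing is listening there. The helper name `estimatorAvailable` is hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// estimatorSocket is the path taken from the log message in this issue.
const estimatorSocket = "/tmp/estimator.sock"

// estimatorAvailable is a hypothetical helper. It tries to connect to a unix
// socket and reports whether something is listening on it. A failed dial is
// logged and treated as "no estimator", not as a fatal error.
func estimatorAvailable(socketPath string) bool {
	conn, err := net.DialTimeout("unix", socketPath, time.Second)
	if err != nil {
		// On Linux, a missing socket file produces an error of the form
		// "dial unix <path>: connect: no such file or directory".
		fmt.Println("dial error:", err)
		return false
	}
	conn.Close()
	return true
}

func main() {
	if !estimatorAvailable(estimatorSocket) {
		// Assumption: without an estimator the exporter keeps running and
		// simply has no estimated value to report.
		fmt.Println("estimator not running; continuing without estimates")
		return
	}
	fmt.Println("estimator reachable")
}
```

Once a sidecar creates the socket, the same dial succeeds and the message disappears.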

Thanks for testing!

rootfs commented on May 30, 2024

@Feelas thanks for the detailed test! If you can submit a PR on the grafana name change, that'll be great.

rootfs commented on May 30, 2024

cc @sunya-ch

Feelas commented on May 30, 2024

I just verified that rolling back to the previously working image (sha256:4ad0c2f56538c383f1b3a90ccc756fcf937f0a436fafb88a23b9a780164f7be9) gets metric gathering working again, so I think it is not a local issue.

rootfs commented on May 30, 2024

thank you for confirming this @Feelas!

The estimator socket feature is a work-in-progress. We'll test this out more thoroughly and keep you posted

Feelas commented on May 30, 2024

Thank you for confirming that :) Is it expected that the aforementioned version logs no metrics to Prometheus, and should we stick to the estimator-less version for now?

rootfs commented on May 30, 2024

@Feelas The metrics are still logged.

But if you want to give it a spin, please run the previous (aka latest) kepler image and patch the deployment to kick off the estimator; that will make the warning message go away.

kubectl patch -n monitoring daemonset kepler-exporter --patch-file https://raw.githubusercontent.com/sustainable-computing-io/kepler-estimator/main/deploy/patch.yaml

(based on this instruction)

sunya-ch commented on May 30, 2024

> Thank you for confirming that :) Is it expected that the aforementioned version logs no metrics to Prometheus, and should we stick to the estimator-less version for now?

Only the dynamic power will always be exported as 0 (not estimated).
The hardware counters, cgroups and pod package power are expected to be exported to Prometheus if they are available.
Otherwise, we have to investigate the issue.

I suggest using the updated version, because the previous one collects cgroup metrics incorrectly and does not handle overflow on some metrics.

Please confirm the following points:

1. Are the pod's metrics reported in the kepler log with the detected pod name/namespace?
2. Names of the Prometheus metrics:
   - pod_energy_stat
   - pod_<curr|total>_energy_in_<core|dram|uncore|gpu|other|pkg>_millijoule
   Note: the grafana dashboard should be updated.
3. Is the Prometheus address passed to the Kepler command (--address 0.0.0.0:9102)?
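
For anyone updating dashboards, the name pattern in point 2 can be expanded into concrete metric names. The Go sketch below only enumerates the combinations spelled out by the pattern; whether each one is actually exported depends on what the node makes available. The function name `podEnergyMetricNames` is hypothetical.

```go
package main

import "fmt"

// podEnergyMetricNames expands the pattern
// pod_<curr|total>_energy_in_<core|dram|uncore|gpu|other|pkg>_millijoule
// into the full list of names it describes.
func podEnergyMetricNames() []string {
	windows := []string{"curr", "total"}
	components := []string{"core", "dram", "uncore", "gpu", "other", "pkg"}

	names := make([]string, 0, len(windows)*len(components))
	for _, w := range windows {
		for _, c := range components {
			names = append(names, fmt.Sprintf("pod_%s_energy_in_%s_millijoule", w, c))
		}
	}
	return names
}

func main() {
	for _, name := range podEnergyMetricNames() {
		fmt.Println(name)
	}
}
```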

sunya-ch commented on May 30, 2024

> Hello sunya-ch and thanks for giving this some attention!
>
> - Pod metrics are correctly reported, with the expected pods from the expected nodes.
> - I can see pod_energy_stat and the other expected metrics in Prometheus, which confirms that they are being sent.
> - The Prometheus address is set to 0.0.0.0:9102.
>
> The "grafana dashboard should be updated" note is what it boils down to, I think. The dashboards use the "pod_cpu_energy_total", "pod_dram_energy_total" and "pod_energy_total" metrics (and others that differ from the list above), which I can also find in Prometheus. Both the names used by the Grafana dashboards and the new ones are present there, and the new ones are being reported to Prometheus.
>
> Is my understanding correct that the metric names have changed in the meantime, so the Grafana dashboards found in grafana-dashboards are incompatible with the new metric names?
>
> If that is so, thanks for getting this sorted out, I mean, thanks for helping flesh out the issue :)

Thank you so much for your kind testing. It's very good to see the expected behaviour there :)

And yes, your understanding is perfectly correct 👍

Feelas commented on May 30, 2024

Good to hear then : )

Going back to the issue topic: the estimator.sock message has been well explained (as WIP and expected), so I think we can close this rather than keep the ticket open unnecessarily.

rootfs commented on May 30, 2024

sounds great @Feelas

rootfs commented on May 30, 2024

keep this issue open till the dashboard and metrics are consistent

Feelas commented on May 30, 2024

I'm not going to commit to this right now since I don't know whether I'll have time to address it, but I will look into it.

Quick question: is there a direct replacement for the previous "pod_energy_total" & "pod_energy_current", which were a sum of cpu+dram? I see there are many more metrics exported right now, and summing them up inside the dashboard would be cumbersome.

sunya-ch commented on May 30, 2024

> I'm not going to commit to this right now since I don't know whether I'll have time to address it, but I will look into it.
>
> Quick question: is there a direct replacement for the previous "pod_energy_total" & "pod_energy_current", which were a sum of cpu+dram? I see there are many more metrics exported right now, and summing them up inside the dashboard would be cumbersome.

Thank you for pointing this out.
pkg energy includes cpu, dram, and uncore, which are reported by RAPL.
The pod package energy computed from RAPL package power is pod_<curr|total>_energy_in_pkg_millijoule.

However, we might add another metric, package energy + GPU energy + the other part (node - package - GPU, if node energy is available), to replace pod_energy_total. Currently, this value is reported as the value of pod_energy_stat.
Should we separate it out as a new metric?
By the way, I didn't add this metric in the first place because it could be inconsistent between systems where node energy is available and systems where it is not.

kepler/pkg/collector/collector.go

Line 285 in 267dd31

float64(v.EnergyInPkg.Aggr+v.EnergyInGPU.Aggr+v.EnergyInOther.Aggr),
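
As a rough illustration of that sum, here is a self-contained Go sketch. The types `energyCounter` and `podEnergy` are simplified stand-ins, not the collector's real definitions, and `otherEnergy` is a hypothetical helper that follows the "node - package - GPU, if node energy is available" description above. The values are assumed to be in millijoules.

```go
package main

import "fmt"

// energyCounter is a simplified stand-in (assumption) that models only the
// Aggr field used in the quoted line.
type energyCounter struct {
	Aggr uint64 // accumulated energy, assumed to be in millijoules
}

// podEnergy is a simplified stand-in (assumption) for the per-pod record.
type podEnergy struct {
	EnergyInPkg   energyCounter
	EnergyInGPU   energyCounter
	EnergyInOther energyCounter
}

// otherEnergy returns node - pkg - gpu when node energy is available, and 0
// otherwise. It also returns 0 instead of underflowing when the node reading
// is smaller than pkg + gpu.
func otherEnergy(node, pkg, gpu uint64, nodeAvailable bool) uint64 {
	if !nodeAvailable || node < pkg+gpu {
		return 0
	}
	return node - pkg - gpu
}

// totalPodEnergy mirrors the quoted expression: pkg + GPU + other.
func totalPodEnergy(v podEnergy) float64 {
	return float64(v.EnergyInPkg.Aggr + v.EnergyInGPU.Aggr + v.EnergyInOther.Aggr)
}

func main() {
	pkg, gpu, node := uint64(1000), uint64(200), uint64(1500)
	v := podEnergy{
		EnergyInPkg:   energyCounter{Aggr: pkg},
		EnergyInGPU:   energyCounter{Aggr: gpu},
		EnergyInOther: energyCounter{Aggr: otherEnergy(node, pkg, gpu, true)},
	}
	fmt.Println(totalPodEnergy(v))
}
```

The sketch also shows the inconsistency mentioned above: on a system without node energy the "other" part is 0, so the same workload yields a smaller total.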

Feelas commented on May 30, 2024

So it seems that "pod_energy_total" & "pod_energy_current" are almost the same as "pod_<curr|total>_energy_in_pkg_millijoule", and these could theoretically be used as a replacement, correct?

sunya-ch commented on May 30, 2024

> So it seems that "pod_energy_total" & "pod_energy_current" are almost the same as "pod_<curr|total>_energy_in_pkg_millijoule", and these could theoretically be used as a replacement, correct?

I am not sure about the original purpose of pod_energy_total and pod_energy_current, but if they include only the energy from the package (mainly core+dram), my answer is yes.

Feelas commented on May 30, 2024

Taking a look at commit de3584a, it seems to have historically been calculated as core+dram.
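
If someone picks up the dashboard update, the replacement discussed above could be applied to dashboard query strings along these lines. This is only a sketch in Go: the helper name `rewriteDashboardExpr` is hypothetical, the mapping covers just the two names discussed here, and the package metrics are an approximate replacement (they also include uncore and are in millijoules, so units and scaling should be checked).

```go
package main

import (
	"fmt"
	"strings"
)

// legacyToPkg maps the two old dashboard metric names discussed in this
// thread to their package-energy counterparts. This is an approximate
// replacement, not an exact equivalence.
var legacyToPkg = strings.NewReplacer(
	"pod_energy_total", "pod_total_energy_in_pkg_millijoule",
	"pod_energy_current", "pod_curr_energy_in_pkg_millijoule",
)

// rewriteDashboardExpr replaces the old names inside a query expression.
// It is plain substring replacement, so a longer metric name that happens to
// contain one of the old names would be rewritten as well.
func rewriteDashboardExpr(expr string) string {
	return legacyToPkg.Replace(expr)
}

func main() {
	fmt.Println(rewriteDashboardExpr("sum(pod_energy_total)"))
	fmt.Println(rewriteDashboardExpr("sum(pod_energy_current)"))
}
```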

rootfs commented on May 30, 2024

The dashboard picks up the latest metric names now.
