
Comments (19)

Feelas commented on May 30, 2024

Hello sunya-ch and thanks for giving this some attention!

  • Pod's metrics are correctly reported with expected pods from expected nodes
  • I can see pod_energy_stat and other expected metrics in Prometheus, thus confirming that they are being sent
  • Prometheus address is set to 0.0.0.0:9102

The "grafana dashboard should be updated" note is what it boils down to, I think.
I can see that the dashboards use the "pod_cpu_energy_total", "pod_dram_energy_total" and "pod_energy_total" metrics (and others that differ from the list specified above), which I can also find in Prometheus. Both the Grafana-defined names and the new ones can be found there, and the new ones are being reported to Prometheus.

Is my understanding correct that the metric names have changed in the meantime, and so the Grafana dashboards found in grafana-dashboards are incompatible with the new metric names?

If that is so, thanks for getting this sorted out, I mean, thanks for helping flesh out the issue :)

from kepler.

rootfs commented on May 30, 2024

@Feelas the message "dial unix /tmp/estimator.sock: connect: no such file or directory" is benign. In short, the estimator sidecar is not started yet (this is being worked on in #104 and in the estimator repo). Upcoming PRs will start the estimator sidecar and create the socket.
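For readers wondering why a missing socket can be treated as harmless, here is a minimal Go sketch of a client that dials a Unix socket and reports "not started yet" instead of failing when the socket file does not exist. The function name `estimatorAvailable` is hypothetical and this is not Kepler's actual code; only the socket path comes from the log message above.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"net"
	"time"
)

// estimatorAvailable reports whether something is listening on sockPath.
// A missing socket file is treated as "sidecar not started yet", not as
// an error. (Hypothetical helper, for illustration only.)
func estimatorAvailable(sockPath string) (bool, error) {
	conn, err := net.DialTimeout("unix", sockPath, time.Second)
	if err != nil {
		// ENOENT is what produces "connect: no such file or directory".
		if errors.Is(err, fs.ErrNotExist) {
			return false, nil
		}
		return false, err
	}
	conn.Close()
	return true, nil
}

func main() {
	ok, err := estimatorAvailable("/tmp/estimator.sock")
	fmt.Println("estimator available:", ok, "error:", err)
}
```

Any other dial error (for example, permission denied) is still returned to the caller, so only the "socket not created yet" case is silenced.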

Thanks for testing!

rootfs commented on May 30, 2024

@Feelas thanks for the detailed test! If you can submit a PR on the grafana name change, that'll be great.

rootfs commented on May 30, 2024

cc @sunya-ch

Feelas commented on May 30, 2024

I just verified that rolling back to the previously working image (sha256:4ad0c2f56538c383f1b3a90ccc756fcf937f0a436fafb88a23b9a780164f7be9) gets the metric gathering process working again, so I think it is not a local issue.

rootfs commented on May 30, 2024

thank you for confirming this @Feelas!

The estimator socket feature is a work in progress. We'll test it more thoroughly and keep you posted.

Feelas commented on May 30, 2024

Thank you for confirming that :) Is it expected that the aforementioned version will log no metrics to Prometheus, and should we stick to the estimator-less version for now?

rootfs commented on May 30, 2024

@Feelas The metrics are still logged.

But if you want to give it a spin, please run the previous (aka latest) kepler image and patch the deployment to kick off the estimator; that will make the warning message go away.

kubectl patch -n monitoring daemonset kepler-exporter --patch-file https://raw.githubusercontent.com/sustainable-computing-io/kepler-estimator/main/deploy/patch.yaml

(based on these instructions)

sunya-ch commented on May 30, 2024

Thank you for confirming that :) Is it expected that the aforementioned version will log no metrics to Prometheus, and should we stick to the estimator-less version for now?

Only dynamic power will always be exported as 0 (not estimated).
The hardware counters, cgroups and pod package power are expected to be exported to Prometheus if they are available.
Otherwise, we have to investigate the issue.

I suggest using the updated version, because the previous one collects cgroup metrics incorrectly and does not handle overflow for some metrics.

Please confirm the following points:

  • Are the pod's metrics reported in the Kepler log with the detected pod name/namespace?
  • Names of the Prometheus metrics
    • pod_energy_stat
    • pod_<curr|total>_energy_in_<core|dram|uncore|gpu|other|pkg>_millijoule
      [Screenshot 2022-08-25 at 23 05 36]

    Note: the grafana dashboard should be updated.

  • Prometheus address passed to the Kepler command (--address 0.0.0.0:9102)
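To make the metric name pattern in the list above concrete, here is a small Go sketch that expands it into the individual names one would search for in Prometheus. It assumes underscores separate every part, as in pod_<curr|total>_energy_in_pkg_millijoule; the helper name `podEnergyMetricNames` is hypothetical and not part of Kepler.

```go
package main

import "fmt"

// podEnergyMetricNames expands the pattern
// pod_<curr|total>_energy_in_<component>_millijoule into concrete names.
// (Hypothetical helper, for illustration only.)
func podEnergyMetricNames() []string {
	modes := []string{"curr", "total"}
	components := []string{"core", "dram", "uncore", "gpu", "other", "pkg"}

	names := make([]string, 0, len(modes)*len(components))
	for _, m := range modes {
		for _, c := range components {
			names = append(names, fmt.Sprintf("pod_%s_energy_in_%s_millijoule", m, c))
		}
	}
	return names
}

func main() {
	for _, n := range podEnergyMetricNames() {
		fmt.Println(n)
	}
}
```

Whether every one of these is actually exported depends on which sources are available on the node, so a missing name is not necessarily a bug.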

sunya-ch commented on May 30, 2024

Hello sunya-ch and thanks for giving this some attention!

  • Pod's metrics are correctly reported with expected pods from expected nodes
  • I can see pod_energy_stat and other expected metrics in Prometheus, thus confirming that they are being sent
  • Prometheus address is set to 0.0.0.0:9102

The "grafana dashboard should be updated" note is what it boils down to, I think. I can see that the dashboards use the "pod_cpu_energy_total", "pod_dram_energy_total" and "pod_energy_total" metrics (and others that differ from the list specified above), which I can also find in Prometheus. Both the Grafana-defined names and the new ones can be found there, and the new ones are being reported to Prometheus.

Is my understanding correct that the metric names have changed in the meantime, and so the Grafana dashboards found in grafana-dashboards are incompatible with the new metric names?

If that is so, thanks for getting this sorted out, I mean, thanks for helping flesh out the issue :)

Thank you so much for your kind testing. It's very good to see the expected behaviour there :)

And yes, your understanding is perfectly correct 👍

Feelas commented on May 30, 2024

Good to hear then : )

Going back to the issue topic, the estimator.sock issue has been well explained (as WIP & expected), and I think we can close this so the ticket does not stay open unnecessarily.

rootfs commented on May 30, 2024

sounds great @Feelas

rootfs commented on May 30, 2024

Let's keep this issue open until the dashboard and metrics are consistent.

Feelas commented on May 30, 2024

Not going to commit right now since I don't know whether there will be time to address this, but will look into it.

Quick question: is there a direct replacement for previous "pod_energy_total" & "pod_energy_current", which were a sum of cpu+dram? I see there are many more metrics exported right now and summing them up inside the dashboard would be cumbersome.

sunya-ch commented on May 30, 2024

Not going to commit right now since I don't know whether there will be time to address this, but will look into it.

Quick question: is there a direct replacement for previous "pod_energy_total" & "pod_energy_current", which were a sum of cpu+dram? I see there are many more metrics exported right now and summing them up inside the dashboard would be cumbersome.

Thank you for pointing this out.
pkg energy includes cpu, dram, and uncore, which are reported by RAPL.
The pod package energy computed from RAPL package power is pod_<curr|total>_energy_in_pkg_millijoule.

However, we might add another metric, package energy + GPU energy + the other part (node - package - GPU, if node energy is available), to replace pod_energy_total. Currently, this value is reported as a value of pod_energy_stat.
Should we separate it out as a new metric?
By the way, I didn't add this metric in the first place because it could be inconsistent between systems where node energy is available and systems where it is not.

float64(v.EnergyInPkg.Aggr+v.EnergyInGPU.Aggr+v.EnergyInOther.Aggr),
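As a rough illustration of the sum in the line above, here is a self-contained Go sketch of "package + GPU + other", where the "other" part is node - package - GPU only when node energy is available. The type `podEnergy` and the function `otherEnergy` are hypothetical stand-ins, not Kepler's real types, and the numbers are made up.

```go
package main

import "fmt"

// podEnergy holds hypothetical aggregated readings in millijoules.
type podEnergy struct {
	Pkg   uint64
	GPU   uint64
	Other uint64
}

// otherEnergy returns node - pkg - gpu when node energy is available.
// It returns 0 when node energy is missing, and also clamps to 0 so the
// unsigned subtraction cannot wrap around.
func otherEnergy(node, pkg, gpu uint64, nodeAvailable bool) uint64 {
	if !nodeAvailable || node < pkg+gpu {
		return 0
	}
	return node - pkg - gpu
}

// total mirrors the package + GPU + other sum discussed above.
func (e podEnergy) total() float64 {
	return float64(e.Pkg + e.GPU + e.Other)
}

func main() {
	e := podEnergy{Pkg: 600, GPU: 100}

	e.Other = otherEnergy(1000, e.Pkg, e.GPU, true)
	fmt.Println("with node energy:", e.total())

	e.Other = otherEnergy(0, e.Pkg, e.GPU, false)
	fmt.Println("without node energy:", e.total())
}
```

The two calls in `main` show the inconsistency mentioned above: the same pod reports a different total depending on whether node energy can be measured.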

Feelas commented on May 30, 2024

So it seems that "pod_energy_total" & "pod_energy_current" are almost the same as "pod_<curr|total>_energy_in_pkg_millijoule", and these could theoretically be used as a replacement, correct?

sunya-ch commented on May 30, 2024

So it seems that "pod_energy_total" & "pod_energy_current" are almost the same as "pod_<curr|total>_energy_in_pkg_millijoule", and these could theoretically be used as a replacement, correct?

I am not sure about the original purpose of pod_energy_total and pod_energy_current, but if they include only the energy from the package (mainly core+dram), my answer is yes.

Feelas commented on May 30, 2024

Taking a look at commit de3584a, it seems to have historically been calculated as core+dram.
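If the package metrics are used as the replacement, a dashboard query could be rewritten roughly as in the Go sketch below. The mapping only covers the two names discussed here and is an approximation: the old values were core+dram, while the package metric also includes uncore, and the units may differ. The function name `renameQuery` and the sample query are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// oldToNew maps the previous metric names to their approximate
// replacements, as discussed in this thread. (Assumed mapping.)
var oldToNew = strings.NewReplacer(
	"pod_energy_total", "pod_total_energy_in_pkg_millijoule",
	"pod_energy_current", "pod_curr_energy_in_pkg_millijoule",
)

// renameQuery rewrites old metric names inside a dashboard query string.
func renameQuery(query string) string {
	return oldToNew.Replace(query)
}

func main() {
	fmt.Println(renameQuery("sum(rate(pod_energy_total[1m]))"))
}
```

Names such as pod_cpu_energy_total are left untouched by this mapping, since it is not clear from the thread which new metric corresponds to each of them.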

rootfs commented on May 30, 2024

The dashboard picks up the latest metric names now.
