sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption based on these stats, and exports the estimates as Prometheus metrics.

Home Page: https://sustainable-computing.io

License: Apache License 2.0

Makefile 0.20% C 91.72% Dockerfile 0.04% Go 7.30% Shell 0.53% Ruby 0.02% Python 0.19%
kubernetes sustainability ebpf prometheus-exporter energy-consumption energy-monitor energy-efficiency prometheus cloud-native machine-learning

kepler's Introduction


Kepler

Kepler (Kubernetes Efficient Power Level Exporter) uses eBPF to probe energy-related system stats and exports them as Prometheus metrics.

As a CNCF Sandbox project, Kepler follows the CNCF Code of Conduct.

Architecture

Kepler Exporter exposes a variety of metrics about the energy consumption of Kubernetes components such as Pods and Nodes.

(Architecture diagram)

Install Kepler

Instructions to install Kepler can be found in the Kepler docs.

Visualise Kepler metrics with Grafana

To visualise the power consumption metrics made available by the Kepler Exporter, import the pre-generated Kepler Dashboard into Grafana: Sample Grafana dashboard

Contribute to Kepler

Interested in contributing to Kepler? Follow the Contributing Guide to get started!

Talks & Demos

A full list of talks and demos about Kepler can be found here.

Community Meetings

Please join the biweekly community meetings. The meeting calendar and agenda can be found here.

License

With the exception of eBPF code, everything is distributed under the terms of the Apache License (version 2.0).

eBPF

All eBPF code is distributed under either:

  • GPL-2.0-only
  • (GPL-2.0-only OR BSD-2-Clause)

The exact license text varies by file. Please see the SPDX-License-Identifier header in each file for details.

Files that originate from the authors of Kepler use (GPL-2.0-only OR BSD-2-Clause). Files generated from the Linux kernel, e.g. vmlinux.h, use GPL-2.0-only.

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this project by you, as defined in the GPL-2 license, shall be dual licensed as above, without any additional terms or conditions.

kepler's People

Contributors

bgrasnick, dave-tucker, dependabot[bot], feelas, huoqifeng, husky-parul, jiangphcn, jichenjc, jiere, kaiyiliu1234, kenplusplus, leizhou-97, marceloamaral, maryamtahhan, mcalman, metacosm, omahs, rootfs, ruomengh, sallyom, samyuan1990, sthaha, sunya-ch, sustainable-computing-bot, tiwatsuka, vimalk78, vprashar2929, wangchen615, williamcaban, yasuenag


kepler's Issues

Kepler on s390 platform

Is your feature request related to a problem? Please describe.
As an s390 application developer or SRE, I'd like to get energy consumption measurements on s390, so that I can have real-time monitoring of the power consumption of my environments on s390.

Describe the solution you'd like
I would like to have Kepler running on s390x platform and measure the power consumption.

Describe alternatives you've considered
Currently, there is no alternative on s390.

Additional context
We need to check whether there are any compatibility issues with Kepler running on the s390x platform, and fix them if so.

consolidate installation instructions and prerequisites

Opening this issue for tracking purposes

  • Consolidate installation instructions from the main README.md with the README.md under manifests
  • Clearly identify prerequisites and OS dependencies (e.g. cgroup v2, kernel-headers, etc)
  • Specify the K8s versions for which it is known to work

[RFE] Correlate Pod and Intel ACC100 FEC power consumption

Is your feature request related to a problem? Please describe.
Telco 5G RAN workloads have high power consumption. Today there is no consistent way to collect metrics that show the actual power consumption of these accelerators (e.g. Intel ACC100) based on the usage of each Pod, much less how it relates to the power consumption of a specialized CNF like the Distributed Unit (DU) workload.

Describe the solution you'd like

  • Using metrics collected from the Intel ACC100 FEC and node-level information, correlate/map the power consumption of hardware accelerators (e.g. Intel ACC100) to the specific DU workloads using them, and determine how much of the power consumption each Pod represents.

Additional context
Intel Operator for Wireless FEC Accelerators
https://catalog.redhat.com/software/operators/detail/6001a748e4e3f23b0b6ad765

Accelerate unit_test_with_bcc

Is your feature request related to a problem? Please describe.
Accelerate unit_test_with_bcc

Describe the solution you'd like
Currently the test spends a lot of time preparing bcc on Ubuntu.

Describe alternatives you've considered
Since we already build bcc into the Kepler base container image, this could be optimized by running the tests inside supported containers.

Additional context
@SamYuan1990 @cooktheryan

Deploy prometheus operator as a part of integration test

Is your feature request related to a problem? Please describe.
So far we only run a deployment test for the Kepler pod itself. As a next step, we would like a deployment that includes the Prometheus operator, and to run an integration test against it.

Describe the solution you'd like
Prometheus operator
https://github.com/prometheus-operator/prometheus-operator#quickstart

VERSION="v0.59.1"

kubectl create -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/${VERSION}/bundle.yaml

Describe alternatives you've considered
N/A

Additional context
N/A

Why use HTTP to get pods instead of the native K8s call?

Is your feature request related to a problem? Please describe.
It looks like we are using the following to get pod info. It works, but why don't we use the K8s-native way, as that seems widely used in the K8s world?

podUrl = "https://" + nodeName + ":" + port + "/pods"
metricsUrl = "https://" + nodeName + ":" + port + "/metrics/resource"

something like

return client.CoreV1().Pods("").Watch(context.TODO(), metav1.ListOptions{})


Track PID/Cgroup through /proc.

Currently the PID/CgroupID mapping is done through /sys. This works for Cgroup V2, but not V1.

On Cgroup V1 systems, accessing /proc is needed to convert a PID/CgroupID to a Pod name.

Prometheus not able monitor the metrics from kepler namespace by default

Describe the bug
In a commit, the namespace was changed from monitoring => kepler. This breaks Prometheus's default discovery of metrics in the monitoring namespace.

To Reproduce
Steps to reproduce the behavior:

  1. Follow the readme to deploy Kepler according to https://github.com/sustainable-computing-io/kepler/blob/main/manifests/kubernetes/deployment.yaml
  2. Enable service monitoring via https://github.com/sustainable-computing-io/kepler/blob/main/manifests/kubernetes/keplerExporter-serviceMonitor.yaml
  3. On the prometheus-k8s service web UI at http://<prometheus_service_ip>:9090/, there is no active target found for serviceMonitor/monitoring/kepler-exporter/0 (0 / 42 active targets)

Expected behavior
The ServiceMonitor should find a target for kepler-exporter among the Prometheus Service Monitor targets.


Update installation manifests to include a serviceaccount access token secret

Describe the bug
Manifests for deploying Kepler require an update w.r.t. the Grafana serviceaccount access tokens.
As of recent Kubernetes versions, serviceaccounts no longer automatically generate access tokens; a token secret is now required. I'll submit an update.

https://docs.openshift.com/container-platform/4.11/authentication/using-service-accounts-in-applications.html#auto-generated-sa-token-secrets_using-service-accounts

Resolve Container ID by both PID and Cgroup ID

There are two ways to associate a process with its container ID:

  • by reading the process's /proc/pid/cgroup and finding the container ID there. This requires exposing the host's /proc filesystem to the Kepler container.
  • by taking the process's Cgroup ID and resolving it in the cgroup FS at /sys/fs/cgroup. This doesn't require the host's /proc.

The Cgroup ID is the approach currently used in practice. However, Cgroup ID resolution requires Cgroup V2 support. For environments without Cgroup V2 enabled, a fallback to PID resolution is needed.

@marceloamaral

dial error: dial unix /tmp/estimator.sock: connect: no such file or directory

Describe the bug
After rolling the daemonset over to the latest image on the quay.io registry (sha256:01a86339a8acb566ddcee848640ed4419ad0bffac98529e9b489a3dcb1e671f5), the message from the title is shown constantly. Example output of the problem:

2022/08/25 12:30:53 Kubelet Read: map[<pod-list-trimmed>]
2022/08/25 12:30:53 dial error: dial unix /tmp/estimator.sock: connect: no such file or directory
energy from pod (0 processes): name: <some-pod> namespace: <some-namespace>

Is estimator.sock expected to be missing in the current state of the project?

Each node is reporting the same error.
As a side note, since then the nodes have not been logging any new Kepler metrics to Prometheus. I'm in no position to claim these issues are connected, and the missing metrics might be some other local issue, but there's that.

To Reproduce
Steps to reproduce the behavior:

  1. Run kepler on OpenShift 4.11
  2. Check kepler-exporter container logs for presence of '/tmp/estimator.sock: connect: no such file or directory'

Expected behavior
/tmp/estimator.sock error is not reported.

Desktop (please complete the following information):

  • OS: RedHat CoreOS 4.11

Possible wrong spelling on variable name

It looks to me like NODE_ENERGY_STAT_METRRIC = "node_energy_stat" on line 29

NODE_ENERGY_STAT_METRRIC = "node_energy_stat"

might be a spelling error (METRRIC instead of METRIC).

Can anyone confirm? The same spelling is also used on the following files:

NODE_ENERGY_STAT_METRRIC,

val, err = convertPromToValue(body, NODE_ENERGY_STAT_METRRIC)

running UT at /kepler/pkg/pod_lister always Success

Describe the bug
No matter what the condition is, the test always succeeds, due to:

failed to get response: failed to read from "/var/run/secrets/kubernetes.io/serviceaccount/token": open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directorytesting: warning: no tests to run
PASS


panic: inconsistent label cardinality: expected 21 label values but got 20

Describe the bug
Following https://github.com/sustainable-computing-io/kepler, after running kubectl apply -f manifests/ from ~/kube-prometheus, I saw this error.

Something seems inconsistent:

panic: inconsistent label cardinality: expected 21 label values but got 20 in []string{"system_processes", "system", "containerd", "388246", "2674068", "0", "0", "0", "0", "0", "0", "3", "151", "17428480", "1070764032", "0", "0", "0", "0", "0"}

goroutine 101 [running]:
github.com/prometheus/client_golang/prometheus.MustNewConstMetric(...)
        /opt/app-root/src/github.com/sustainable-computing-io/kepler/vendor/github.com/prometheus/client_golang/prometheus/value.go:107
github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).Collect(0xc000400710, 0xc0000fdf60?)
        /opt/app-root/src/github.com/sustainable-computing-io/kepler/pkg/collector/collector.go:315 +0x1a76
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
        /opt/app-root/src/github.com/sustainable-computing-io/kepler/vendor/github.com/prometheus/client_golang/prometheus/registry.go:446 +0xfb
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
        /opt/app-root/src/github.com/sustainable-computing-io/kepler/vendor/github.com/prometheus/client_golang/prometheus/registry.go:538 +0xb0b


Adding Kepler to GitOps pipeline


kepler-exporter-lvr6f in CrashLoopBackOff

Installed following https://www.youtube.com/watch?v=P5weULiBl60
on a kind cluster; got the following error:

# kubectl logs kepler-exporter-lvr6f -n monitoring
2022/09/05 03:33:36 InitSliceHandler: &{map[] /sys/fs/cgroup/cpu /sys/fs/cgroup/memory /sys/fs/cgroup/blkio}
cpu architecture Haswell, dram in GB 62
use power estimate to obtain power
2022/09/05 03:33:36 Available counter metrics: [cpu_cycles cpu_instr cache_miss]
2022/09/05 03:33:36 Available cgroup metrics: []
2022/09/05 03:33:36 Available kubelet metrics: [container_cpu_usage_seconds_total container_memory_working_set_bytes]
2022/09/05 03:33:36 set coreMetricIndex = 1
2022/09/05 03:33:36 set generalMetricIndex = 1
2022/09/05 03:33:36 set dramMetricIndex = 3
config EnabledEBPFCgroupID enabled:  true
config getKernelVersion:  4.15
config set EnabledEBPFCgroupID to  false
modprobe: FATAL: Module kheaders not found in directory /lib/modules/4.15.0-20-generic
chdir(/lib/modules/4.15.0-20-generic/build): No such file or directory
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x812ad6]

goroutine 1 [running]:
github.com/iovisor/gobpf/bcc.(*Module).Load(0x0, {0xd3706f, 0xc}, 0x2?, 0x2?, 0xc0000ddc58?)
        /opt/app-root/src/github.com/sustainable-computing-io/kepler/vendor/github.com/iovisor/gobpf/bcc/module.go:202 +0x36
github.com/iovisor/gobpf/bcc.(*Module).LoadTracepoint(...)
        /opt/app-root/src/github.com/sustainable-computing-io/kepler/vendor/github.com/iovisor/gobpf/bcc/module.go:182
github.com/sustainable-computing-io/kepler/pkg/attacher.loadModule({0x15c0200?, 0xd35bc9?, 0xc000599a40?}, {0xc000237b00, 0x2, 0x2})
        /opt/app-root/src/github.com/sustainable-computing-io/kepler/pkg/attacher/bcc_attacher.go:63 +0x94
github.com/sustainable-computing-io/kepler/pkg/attacher.AttachBPFAssets()
        /opt/app-root/src/github.com/sustainable-computing-io/kepler/pkg/attacher/bcc_attacher.go:104 +0x25d
github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).Attach(0xc000227eb0)
        /opt/app-root/src/github.com/sustainable-computing-io/kepler/pkg/collector/collector.go:96 +0x25
main.main()
        /opt/app-root/src/github.com/sustainable-computing-io/kepler/cmd/exporter.go:67 +0x23a

cover: open coverage.out: no such file or directory

Describe the bug
go coverage CI error

--- PASS: TestPodLoader (0.00s)
PASS
coverage: 24.6% of statements
ok  	github.com/sustainable-computing-io/kepler/pkg/podlister	0.052s
?   	github.com/sustainable-computing-io/kepler/pkg/power/acpi	[no test files]
?   	github.com/sustainable-computing-io/kepler/pkg/power/gpu	[no test files]
?   	github.com/sustainable-computing-io/kepler/pkg/power/rapl	[no test files]
?   	github.com/sustainable-computing-io/kepler/pkg/power/rapl/source	[no test files]
cover: open coverage.out: no such file or directory
Error: Process completed with exit code 1.

To Reproduce
This is in this run result


CI needs more instrumentation

Describe the bug
CI is getting flaky

+ ./hack/cluster-deploy.sh
waiting for cluster-clean to finish
+ source cluster-up/common.sh
++ set -e
++ '[' kubernetes = kind ']'
Deploying manifests...
+ CLUSTER_PROVIDER=kubernetes
+ MANIFESTS_OUT_DIR=_output/manifests/kubernetes/generated
+ main pipefail
+ '[' '!' -d _output/manifests/kubernetes/generated ']'
+ echo 'Deploying manifests...'
+ kubectl apply -f _output/manifests/kubernetes/generated
namespace/kepler created
clusterrole.rbac.authorization.k8s.io/kepler-clusterrole created
clusterrolebinding.rbac.authorization.k8s.io/kepler-clusterrole-binding created
serviceaccount/kepler-sa created
daemonset.apps/kepler-exporter created
service/kepler-exporter created
servicemonitor.monitoring.coreos.com/kepler-exporter created
+ kubectl rollout status daemonset kepler-exporter -n kepler --timeout 60s
Waiting for daemon set "kepler-exporter" rollout to finish: 0 of 1 updated pods are available...
error: timed out waiting for the condition
make: *** [Makefile:165: cluster-sync] Error 1
Error: Process completed with exit code 2.

To Reproduce
Recent CI results here

Developer Guideline

This task is to provide a developer guideline with:

  1. Code of Conduct
  2. Development environment
  3. Architecture overview

some pod energy not reported

Describe the bug

Ubuntu environment (kind running on an Ubuntu host)

# cat /proc/18005/cgroup
1:cpuacct:/
0::/system.slice/docker-3f3488e110e93031b1528ecfb2bc5332eae24cf8e578b38e660755c894768196.scope/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-podd59c0a68_38a6_49a9_8228_f58bde917e74.slice/cri-containerd-dc39c4066a25edda33a0c58d85a96cf27bbb7d3d429a9a7e3f243e9c3507c1b8.scope


But in the logs:

pid 18005 cgroup 0 cmd kindnetd

2022/09/20 01:30:32 failed to resolve pod for cGroup ID 0: failed to open cgroup description file for pid 18005: open /proc/18005/cgroup: no such file or directory, set podName=system_processes


There is no info for this pod in the final output:

energy from pod (1 processes): name: local-path-provisioner-9cd9bd544-whx5v namespace: local-path-storage
        cgrouppid: 0 pid: 1891 comm: uwsgi
        ePkg (mJ): 14 (14) (eCore: 14 (14) eDram: 176 (176) eUncore: 0 (0)) eGPU (mJ): 0 (0) eOther (mJ): 0 (0)
        eDyn (mJ): 0 (0)
        avgFreq: 0.00
        CPUTime:  0 (0)
        counters: map[cache_miss:0 (0) cpu_cycles:0 (0) cpu_instr:0 (0)]
        cgroupfs: map[]
        kubelets: map[container_cpu_usage_seconds_total:0 (437) container_memory_working_set_bytes:0 (10362880)]

energy from pod (140 processes): name: system_processes namespace: system
        cgrouppid: 0 pid: 1319 comm: kcs-term
        ePkg (mJ): 14 (6023) (eCore: 14 (6023) eDram: 176 (78436) eUncore: 0 (0)) eGPU (mJ): 0 (0) eOther (mJ): 0 (0)
        eDyn (mJ): 0 (0)
        avgFreq: 0.00
        CPUTime:  246000 (295147978162387)
        counters: map[cache_miss:0 (0) cpu_cycles:0 (0) cpu_instr:0 (0)]
        cgroupfs: map[]
        kubelets: map[container_cpu_usage_seconds_total:1 (47506) container_memory_working_set_bytes:266240 (1093697536)]

energy from pod (2 processes): name: coredns-6d4b75cb6d-5kzhk namespace: kube-system
        cgrouppid: 0 pid: 2127 comm: python3
        ePkg (mJ): 14 (6023) (eCore: 14 (6023) eDram: 176 (78436) eUncore: 0 (0)) eGPU (mJ): 0 (0) eOther (mJ): 0 (0)
        eDyn (mJ): 0 (0)
        avgFreq: 0.00
        CPUTime:  1005 (307397)
        counters: map[cache_miss:0 (0) cpu_cycles:0 (0) cpu_instr:0 (0)]
        cgroupfs: map[]
        kubelets: map[container_cpu_usage_seconds_total:0 (2658) container_memory_working_set_bytes:0 (19816448)]

energy from pod (1 processes): name: kube-controller-manager-kind-control-plane namespace: kube-system
        cgrouppid: 0 pid: 688 comm: multipathd
        ePkg (mJ): 14 (6023) (eCore: 14 (6023) eDram: 176 (78436) eUncore: 0 (0)) eGPU (mJ): 0 (0) eOther (mJ): 0 (0)
        eDyn (mJ): 0 (0)
        avgFreq: 0.00
        counters: map[cache_miss:0 (0) cpu_cycles:0 (0) cpu_instr:0 (0)]
        cgroupfs: map[]
        kubelets: map[container_cpu_usage_seconds_total:0 (15764) container_memory_working_set_bytes:0 (52715520)]

node energy (mJ):
        ePkg: 53 (eCore: 53 eDram: 701 eUncore: 0) eGPU: 0 eOther: 0



Add OpenTelemetry converter to export Prometheus metrics


Fix golint issues highlighted by golangci-lint

Is your feature request related to a problem? Please describe.
PR #168 introduces Go linters to improve the code quality by highlighting problems.

After executing golangci-lint locally, it identified 125 issues in the current code, as you can see here.

Describe the solution you'd like
Fix the issues in the code.

Additional context
This PR is part of the effort to test the code quality, as described in issue #161.

Kepler Documentation

Describe the solution you'd like
Add documentation on:

  • About Kepler
  • Describe the modules, bpf, model server, collector, power, attacher etc
  • Run on Kubernetes
  • Run on OpenShift
  • Run locally

Doc repo

Enhance CI with basic integration test

Is your feature request related to a problem? Please describe.
So far, we only verify with until kubectl get svc --all-namespaces ...
To make this better, we should have a test suite that targets the running service.

Describe the solution you'd like
as above.

Describe alternatives you've considered
N/A

Additional context
N/A

microshift as an environment for integration test

Is your feature request related to a problem? Please describe.
Add MicroShift, a minimal version of OpenShift, as an integration test environment.

Describe the solution you'd like
Since GitHub Actions provides Ubuntu agents, we should try the script below to see whether we can use MicroShift as a K8s cluster for integration tests.
https://github.com/thinkahead/microshift/blob/main/install-ubuntu22.04.sh

Describe alternatives you've considered
N/A

Additional context
N/A

docker(containerd): not able to read container, plan to support?

Describe the bug
https://github.com/sustainable-computing-io/kepler/blob/main/pkg/pod_lister/resolve_container.go#L202

The code expects pod/crio cgroup paths. This works on RHEL:

1:name=systemd:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb4e8e0dc_caf3_4a43_bf50_f84ef5ef7a8f.slice/crio-conmon-927cffdca4f374600fe49a62bfd7f2e73ea6d37cdafb1bf8f89afb64d7e6c6a5.scope

But running this on docker (containerd), I got the following:

# cat /proc/1959/cgroup
0::/system.slice/containerd.service

# crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                      ATTEMPT             POD ID              POD
db90aabe3ba00       295c7be079025       About an hour ago   Running             nginx                     0                   6bdfe6263b2e1       nginx-deployment-6595874d85-h7csj
677824b3d4028       295c7be079025       About an hour ago   Running             nginx                     0                   caf965c196f45       nginx-deployment-6595874d85-wdtm6


Add lint and memory leak detection as a basic test case in CI

Is your feature request related to a problem? Please describe.
Before the unit tests run, we should have a lint test and memory leak detection as part of CI.

Describe the solution you'd like
go fmt, vet, goimports ....

go build -gcflags="-m -l" ./... | grep "escapes to heap" || true

Describe alternatives you've considered
N/A

Additional context
N/A

Documentation issue: kube-prometheus/manifests/kubernetes does not exist

Describe the bug

Hopefully this is quick to sort out.
README.md advises the following steps:

# cd kube-prometheus
# kubectl apply --server-side -f manifests/setup
# until kubectl get servicemonitors --all-namespaces ; do date; sleep 1; echo ""; done
# kubectl apply -f manifests/kubernetes/

However, kube-prometheus/manifests/kubernetes does not exist.

To Reproduce
Steps to reproduce the behavior:

git clone https://github.com/prometheus-operator/kube-prometheus
cd kube-prometheus
kubectl apply --server-side -f manifests/setup
until kubectl get servicemonitors --all-namespaces ; do date; sleep 1; echo ""; done
kubectl apply -f manifests/kubernetes/

An error will be reported because manifests/kubernetes/ does not exist.

Additional context
I've looked through the commit history for https://github.com/prometheus-operator/kube-prometheus in case files were just moved, but I can't see that manifests/kubernetes/ ever existed.

add badges in readme

Is your feature request related to a problem? Please describe.
N/A

Describe the solution you'd like

  • go doc
  • go test coverage
  • CI

Describe alternatives you've considered
N/A

Additional context
N/A

No CPU energy consumption readings for Intel 8255C in a KVM cluster

Hi,

I deployed Kepler in OpenShift 4.10.9 and Kepler is only reporting 0 values for energy consumption.

The cluster is deployed on KVM, provided by Red Hat Enterprise Linux release 8.5 (Ootpa)

Here are some logs I observed in the Kepler pod:

2022/04/14 14:42:22 	energy from pod: name: infrastructure_operator_6655bb796d_qwrhj namespace: open-cluster-management
	eCore: 0 eDram: 0
	CPUTime: 26 (0.000000)
	cycles: 72397 (0.007132)
	misses: 487 (0.015755)
	avgCPUFreq: 0 LastCPUFreq 0
	pid: 143846 comm: assisted-servic
2022/04/14 14:42:22 	energy from pod: name: cluster_manager_registration_controller_68c5c9f9cc_g54sq namespace: open-cluster-management-hub
	eCore: 0 eDram: 0
	CPUTime: 28 (0.000000)
	cycles: 76463 (0.007533)
	misses: 455 (0.014720)
	avgCPUFreq: 0 LastCPUFreq 0
	pid: 16256 comm: registration
2022/04/14 14:42:22 	energy from pod: name: multicluster_operators_standalone_subscription_7c7bfdd85c_k757x namespace: open-cluster-management
	eCore: 0 eDram: 0
	CPUTime: 3 (0.000000)
	cycles: 8482 (0.000836)
	misses: 41 (0.001326)
	avgCPUFreq: 0 LastCPUFreq 0
	pid: 146882 comm: multicluster-op
2022/04/14 14:42:22 	energy from pod: name: cluster_manager_registration_webhook_75cd6479c9_wzkrr namespace: open-cluster-management-hub
	eCore: 0 eDram: 0
	CPUTime: 6 (0.000000)
	cycles: 16846 (0.001660)
	misses: 64 (0.002070)
	avgCPUFreq: 0 LastCPUFreq 0
	pid: 17244 comm: registration

Pod eCore, eUncore is reported as 0, pod in_core metrics reported as 0

Describe the bug
In the kepler-exporter pod logs, eCore is reported as 0 for all pods.
The pod_<curr|total>_energy_in_core_millijoule metrics are correctly sent to Prometheus, but are also 0 all of the time.

To Reproduce

  1. Run kepler-exporter
  2. Check values of eCore metrics and pod_<curr|total>_energy_in_core_millijoule metrics inside Prometheus.

Expected behavior
eCore metrics are not reported as 0, pod_<curr|total>_energy_in_core_millijoule metrics report proper values.

Additional information
Kepler-exporter is run from image sha256:819a0b056f86a754c3b58ef31c3dd2fbcf279dcb02caf7e3bfd8a471683081a6

getKernelVersion doesn't work at all

Describe the bug
getKernelVersion fails because the sysinfo output cannot be unmarshaled into map[string]map[string]string:

https://github.com/sustainable-computing-io/kepler/blob/main/pkg/config/config.go#L65

To reproduce, paste the following into https://go.dev/play/ and run it:

package main

import (
	"encoding/json"
	"fmt"

	"github.com/zcalusic/sysinfo"
)

func main() {

	var si sysinfo.SysInfo

	si.GetSysInfo()

	data, err := json.MarshalIndent(&si, "", "  ")
	if err == nil {
		var result map[string]map[string]string
		if err = json.Unmarshal(data, &result); err != nil {
			fmt.Println("----")
			fmt.Println(err)
			fmt.Println("----")
		}
	}
	fmt.Println("done")
}

----
json: cannot unmarshal number into Go value of type string
----
done

Expected behavior
The sysinfo output unmarshals without error, so getKernelVersion returns the kernel version.

More hwmon data.

Currently, the power meter is read via hwmon. More hwmon data is needed, including CPU temperature, cooling status, and other power draws.
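For reference, hwmon exposes readings as sysfs files in milli-units (for example, tempN_input is in millidegrees Celsius), so extending the reader could look roughly like the sketch below. Function names are hypothetical, not Kepler's actual API; on systems without /sys/class/hwmon the scan simply finds nothing.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseMilli converts an hwmon *_input reading (milli-units, e.g.
// millidegrees Celsius for tempN_input) into a float64 unit value.
func parseMilli(raw string) (float64, error) {
	v, err := strconv.ParseFloat(strings.TrimSpace(raw), 64)
	if err != nil {
		return 0, err
	}
	return v / 1000.0, nil
}

func main() {
	// Scan every hwmon chip for temperature inputs; in containers without
	// /sys mounted, the glob matches nothing and this prints nothing.
	chips, _ := filepath.Glob("/sys/class/hwmon/hwmon*")
	for _, chip := range chips {
		inputs, _ := filepath.Glob(filepath.Join(chip, "temp*_input"))
		for _, in := range inputs {
			data, err := os.ReadFile(in)
			if err != nil {
				continue
			}
			if c, err := parseMilli(string(data)); err == nil {
				fmt.Printf("%s: %.1f C\n", in, c)
			}
		}
	}
}
```

The same milli-unit convention applies to powerN_input (microwatts in that case differ; power is in microwatts per the hwmon ABI), so each new sensor family should be checked against the kernel hwmon sysfs documentation before reuse of a shared parser.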

Cannot get pod energy information

Hi, I'm trying to use Kepler in my k8s cluster. It was deployed on one node (node1) together with Prometheus and Grafana.
There are many pods running on this node. I expected energy for all pods to be displayed in the Grafana dashboard; however, I can only see one pod_energy_stat record in Prometheus, with pod_name="system_processes" and pod_namespace="system", and this pod/namespace doesn't even exist in my cluster. Do you have any clue as to what the issue is?

cpu time becomes zero after long run

Describe the bug
After running for extended hours, curr_cpu_time becomes zero and never recovers.

{
  "__name__": "node_energy_stat",
  "container": "kepler-exporter",
  "cpu_architecture": "Cascade Lake",
  "curr_cache_misses": "10019190",
  "curr_cpu_cycles": "4005839184",
  "curr_cpu_instructions": "5068047632",
  "curr_cpu_time": "0.000000",              <----------- this never recovers
  "curr_energy_in_core": "154835.000000",
  "curr_energy_in_dram": "5.000000",
  "curr_energy_in_gpu": "0.000000",
  "curr_energy_in_other": "0.000000",
  "curr_resident_memory": "10457391104.000000",
  "endpoint": "http",
  "instance": "xxxxx",
  "job": "kepler-exporter",
  "namespace": "monitoring",
  "node_name": "xxxx",
  "pod": "kepler-exporter-9k75l",
  "service": "kepler-exporter"
}

To Reproduce
Run kepler-exporter for an extended period (several hours) and watch curr_cpu_time in the node_energy_stat metric.

Expected behavior
curr_cpu_time keeps reporting non-zero values for active workloads.

Testing

Unit and integration tests

Consolidate MachineConfig

Is your feature request related to a problem? Please describe.
An enhancement

Describe the solution you'd like
As @williamcaban found, the two MachineConfig resources can be consolidated.

Dashboard issues - Datasource not found, OpenShift installation URLs are incorrect

Describe the bug
Importing the Grafana dashboard using the process found in manifests/openshift/dashboard/04-grafana-dashboard.yaml fails due to changes introduced in #116: the URL inside it no longer exists.

When the dashboard is imported by hand to avoid the previous issue, Grafana throws several datasource errors because the dashboard is non-standard (it has hardcoded Prometheus UIDs). In PR #118 I have proposed the necessary changes, but these may only be a subset. It would be good to test this further with repeated deployments of the dashboard.
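A portable dashboard would declare the datasource as an input variable instead of a hardcoded UID. An illustrative fragment in Grafana's dashboard-export JSON format (not the actual #118 diff):

```json
{
  "__inputs": [
    {
      "name": "DS_PROMETHEUS",
      "label": "Prometheus",
      "type": "datasource",
      "pluginId": "prometheus"
    }
  ],
  "panels": [
    {
      "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" }
    }
  ]
}
```

On import, Grafana then prompts for (or substitutes) the target instance's own Prometheus datasource instead of failing on a UID that only existed in the exporting instance.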

To Reproduce

  1. Try importing dashboard by running ./manifests/openshift/03-grafana-datasource-define.sh & oc create -f manifests/openshift/04-grafana-dashboard.yaml
  2. Import grafana-dashboard/Kepler-Exporter.json by hand to Grafana
  3. Review errors which are thrown by Grafana when trying to populate Dashboard options and other data

Expected behavior

  • It is possible to create the dashboard using 03-grafana-datasource-define.sh & oc create -f manifests/openshift/04-grafana-dashboard.yaml commands
  • Loading the dashboard in a generic deployment works properly without any errors

Screenshots
kepler-errors

Desktop

  • OpenShift 4.11 cluster

Kepler on OpenShift, energy estimates are all zeros

I deployed with the following commands on OpenShift
(from PR #128 manifests/openshift/README)

oc apply --kustomize $(pwd)/manifests/openshift/cluster-prereqs
# The cluster-prereqs modifies all nodes to enable cgroupsv2. This takes a long time
# Each node is decommissioned and rebooted - may take ~20 minutes.

# Check before proceeding that all nodes are Ready and Schedulable
oc get nodes

oc apply --kustomize $(pwd)/manifests/openshift/kepler
# Check that kepler pods are up and running before proceeding

# The following script applies the kustomize files in $(pwd)/manifests/openshift/dashboard
$(pwd)/deploy-grafana.sh

All resources are healthy, but energy estimates are all showing zeros, like so
grafana

In Grafana dashboard the "CPU" metric is sometimes bigger than "pkg" metric

Describe the bug
The CPU metric from the same timeframe is sometimes bigger than the pkg metric on a platform where (as discussed in #120) Core is derived from Pkg. On the attached screenshot, the green metric is Pkg (labeled as "Total") and the yellow one is Core; there are numerous periods when Core is higher than Pkg, despite Core being derived from Pkg.

This is possibly an artifact of the Grafana queries and processing done when fetching the data, as querying the same data directly from Prometheus through a different UI shows no crossings between the two graphs.

To Reproduce
No specific trigger scenario, observe workload metrics in Grafana.

Expected behavior
'Core' metric is not higher than 'Pkg' in Grafana observable metrics.

Screenshots
Total being lower than Core in Grafana in the specific time period; there are crossings between both graphs:
Screenshot (104)
Pkg not crossing with core on different graph (fetched through OpenShift -> Observe -> Metrics query): Blue is curr_energy_in_core, yellow is curr_energy_in_pkg.
prometheus-core-pkg-export

Desktop

  • OpenShift 4.11 cluster

add DCO check?

Is your feature request related to a problem? Please describe.
I am not sure if we should enable https://github.com/dcoapp/app to ensure everyone signs off their commits in PRs.

Describe the solution you'd like
N/A

Describe alternatives you've considered
N/A

Additional context
N/A

Conflict between Kepler-exporter with OVN Kubernetes CNI ovn-master

Describe the bug

Kepler cannot run on the OpenShift control plane nodes when OVN Kubernetes is the CNI. OVN Kubernetes ovn-master DaemonSet already uses hostPort 9102 as seen in the following extract:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    kubernetes.io/description: |
      This daemonset launches the ovn-kubernetes controller (master) networking components.
...
spec:
  selector:
    matchLabels:
      app: ovnkube-master
...
        ports:
        - containerPort: 9102
          hostPort: 9102
          name: https
          protocol: TCP

This is the same port the kepler-exporter DaemonSet tries to use:

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kepler-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: kepler-exporter
...
        ports:
        - containerPort: 9102
          hostPort: 9102
          name: http

As a result, the kepler-exporter pods remain in the Pending state:

kepler-exporter-8gkfs   0/2     Pending   0          15h   <none>          <none>   <none>           <none>
kepler-exporter-8ql8n   0/2     Pending   0          26m   <none>          <none>   <none>           <none>
kepler-exporter-nqxpx   0/2     Pending   0          15h   <none>          <none>   <none>           <none>

and the events show:

7m7s        Warning   FailedScheduling         pod/kepler-exporter-8gkfs   0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match Pod's node affinity/selector.
27m         Warning   FailedScheduling         pod/kepler-exporter-8ql8n   0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match Pod's node affinity/selector.
27m         Warning   ErrorAddingLogicalPort   pod/kepler-exporter-8ql8n   failed to ensurePod monitoring/kepler-exporter-8ql8n since it is not yet scheduled
7m7s        Warning   FailedScheduling         pod/kepler-exporter-8ql8n   0/6 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 5 node(s) didn't match Pod's node affinity/selector.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy Kepler to an OpenShift control plane using OVN Kubernetes as the CNI.

Additional context

  • OpenShift version 4.10.15

Looking at this issue, it might be possible to solve it by running Kepler privileged (which it needs) but NOT in hostNetwork mode.
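A sketch of what such a change could look like, as a strategic-merge patch on the DaemonSet (the container name and field layout here are assumptions, not the actual Kepler manifests):

```yaml
# Illustrative patch: keep the exporter privileged, but disable hostNetwork
# and drop the conflicting hostPort so 9102 is only bound inside the pod
# network namespace; Prometheus would then scrape via the Service.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kepler-exporter
  namespace: monitoring
spec:
  template:
    spec:
      hostNetwork: false
      containers:
        - name: kepler-exporter
          ports:
            - containerPort: 9102
              name: http
              # hostPort intentionally omitted to avoid the 9102 clash
```

Without hostPort, the scheduler no longer needs a free host port, so the DaemonSet can land on control-plane nodes alongside ovnkube-master.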

03-grafana-datasource-define.sh uses deprecated 'oc serviceaccounts get-token'

Describe the bug
As can be seen in the OpenShift 4.11 release notes, the dashboard creation command found in manifests/openshift/dashboard/03-grafana-datasource-define.sh uses the deprecated oc serviceaccounts get-token grafana-serviceaccount -n monitoring command. As a result, the command does not return a valid token, which causes the Grafana "Prometheus" datasource to not be created.

Example output from the command (using oc CLI 4.11):

[root@workstation dashboard]$ oc serviceaccounts get-token grafana-serviceaccount -n monitoring
Command "get-token" is deprecated, and will be removed in the future version. Use oc create token instead.
error: could not find a service account token for service account "grafana-serviceaccount"

To Reproduce
Steps to reproduce the behavior:

  1. Use an OpenShift 4.11 cluster
  2. Run manifests/openshift/dashboard/03-grafana-datasource-define.sh script and note the output of the script.

Expected behavior
Token is created successfully and 03-grafana-datasource-define.sh works properly, creating a Prometheus datasource.

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OpenShift 4.11 cluster
  • oc tool 4.11.0-rc.1 (any version of 4.11 oc cli can be used)

Additional context
I used the straight replacement command oc create token grafana-serviceaccount --duration 31536000s -n monitoring to create a 365-day token for the dashboard. I am not sure whether this is the right duration, since I never used the old get-token command.

Same eCore being reported for workloads running on different frequency cores

Describe the discussion topic
Something I noticed today while running another batch of tests. Let's say I have three different pods, each scheduled on a CPU with a different frequency setup. What I see in the logs is similar to this (grepped down to the bare minimum):

--
energy from pod (12 processes): name: fastest-pod namespace: some-namespace
        cgrouppid: 4295012128 pid: 681667144608761 comm: some-process
        ePkg (mJ): 38117 (2554154) (eCore: 35297 (2369735) eDram: 2820 (184419) eUncore: 0 (0)) eGPU (mJ): 0 (0) eOther (mJ): 0 (0)
        eDyn (mJ): 0 (0)
--
energy from pod (12 processes): name: normal-pod namespace: some-namespace
        cgrouppid: 4295011720 pid: 681229057944467 comm: some-process
        ePkg (mJ): 37950 (2536826) (eCore: 35297 (2369735) eDram: 2653 (167091) eUncore: 0 (0)) eGPU (mJ): 0 (0) eOther (mJ): 0 (0)
        eDyn (mJ): 0 (0)
--
energy from pod (12 processes): name: slowest-pod namespace: some-namespace
        cgrouppid: 4295011924 pid: 681413741538238 comm: some-process
        ePkg (mJ): 40101 (2657345) (eCore: 35297 (2369735) eDram: 4804 (287610) eUncore: 0 (0)) eGPU (mJ): 0 (0) eOther (mJ): 0 (0)
        eDyn (mJ): 0 (0)

Since the fastest pod is using 4x the CPU frequency of slowest-pod and about 25% more than normal-pod, does it make sense for the eCore delta to be the same value for each pod?

The same behaviour continues with each refresh of the metrics. In the end, I get a very similar total energy for each of the pods, even though one of them has done 4x less work due to its much lower frequency. With previous images, I was able to observe differing eCore deltas between workloads using different frequencies.

This is using Kepler image sha256:19b72e21aa1a84f16dc35a29b8ad17a8b28097b195b75a661196657dec12da90.

Is the current behaviour the correct one?
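Not knowing Kepler's internals, one would expect the core energy delta to be attributed in proportion to a per-pod activity counter (such as CPU cycles), which would give each of these pods a different share rather than an identical eCore value. A minimal sketch of such proportional attribution (all names are hypothetical, not Kepler's actual code):

```go
package main

import "fmt"

// attribute splits a measured core energy delta (mJ) across pods in
// proportion to an activity counter (e.g. CPU cycles) instead of evenly.
func attribute(totalMilliJoules float64, cycles map[string]uint64) map[string]float64 {
	var sum uint64
	for _, c := range cycles {
		sum += c
	}
	out := make(map[string]float64, len(cycles))
	if sum == 0 {
		return out // no activity observed; nothing to attribute
	}
	for pod, c := range cycles {
		out[pod] = totalMilliJoules * float64(c) / float64(sum)
	}
	return out
}

func main() {
	// A pod running at 4x the frequency should accumulate roughly 4x the
	// cycles, and so receive a larger share of the core energy delta.
	shares := attribute(35297, map[string]uint64{
		"fastest-pod": 4000,
		"normal-pod":  3000,
		"slowest-pod": 1000,
	})
	fmt.Println(shares)
}
```

If the reported eCore is instead the raw per-package RAPL reading copied to every pod, all pods on the same package would show the identical value seen in the logs above.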

Additional context
As a side note (it might possibly help): with current images I'm seeing about 15-50% of the energy previously reported when running on older images.

cc @sunya-ch

ArgoCD setup for Kepler on OperateFirst

