nvidia / gpu-monitoring-tools
Tools for monitoring NVIDIA GPUs on Linux
License: Apache License 2.0
I've deployed dcgm-exporter on Kubernetes on a GPU node (https://github.com/NVIDIA/gpu-monitoring-tools/blob/master/exporters/prometheus-dcgm/k8s/node-exporter/gpu-node-exporter-daemonset.yaml), and it failed with this error:
Failed to get unit file state for nvidia-fabricmanager.service: Unknown error 1540613216
Starting NVIDIA host engine...
Collecting metrics at /run/prometheus/dcgm.prom every 1000ms...
Stopping NVIDIA host engine...
Unable to terminate host engine, it may not be running.
/usr/local/bin/dcgm-exporter: line 154: kill: (38698) - No such process
Done
Here is the output of nvidia-smi:
# ./nvidia-smi
Tue Jan 14 11:53:22 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:04.0 Off | 0 |
| N/A 50C P8 30W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
thx
I tried to install the helm charts from https://nvidia.github.io/gpu-monitoring-tools/helm-charts on my k8s cluster; there seems to be a compatibility problem with Kubernetes 1.16.
Is there a solution?
../gpu-monitoring-tools/bindings/go/samples/dcgm/restApi > go build && ./restApi
2018/08/17 11:33:16 Running http server on localhost:8070
2018/08/17 11:33:31 error: localhost:8070/dcgm/device/info/id/GPUID: strconv.ParseUint: parsing "GPUID": invalid syntax
2018/08/17 11:33:46 error: localhost:8070/dcgm/device/info/id/GPUID/json: strconv.ParseUint: parsing "GPUID": invalid syntax
2018/08/17 11:33:55 error: localhost:8070/dcgm/process/info/pid/PID: strconv.ParseUint: parsing "PID": invalid syntax
2018/08/17 11:34:05 error: localhost:8070/dcgm/health/id/GPUID: strconv.ParseUint: parsing "GPUID": invalid syntax
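For context, those four errors are just Go's strconv.ParseUint rejecting the literal placeholders GPUID and PID left in the URLs; substituting a real numeric id makes the endpoints parse. A minimal sketch (parseID is a hypothetical stand-in for the handler's parsing, not the actual restApi code):

```go
package main

import (
	"fmt"
	"strconv"
)

// parseID mimics how a REST handler might parse the trailing path
// segment of /dcgm/device/info/id/<id> into a numeric GPU id.
func parseID(segment string) (uint64, error) {
	return strconv.ParseUint(segment, 10, 32)
}

func main() {
	// Literal placeholder, as in the reported URLs: fails to parse.
	if _, err := parseID("GPUID"); err != nil {
		fmt.Println("error:", err)
	}
	// A real numeric id: parses fine.
	id, _ := parseID("0")
	fmt.Println("parsed id:", id)
}
```

So the fix on the client side is to replace GPUID/PID in the sample URLs with actual numbers, e.g. localhost:8070/dcgm/device/info/id/0.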
I am trying to get the helm charts to work on Kubernetes 1.17. This doesn't work because there are changes to the templates.
Can you please let me know how I can get access to the helm charts' configuration files from the repo?
After trying to get the bare-metal install of prometheus-dcgm working, I have a little feedback:
One of the prerequisites listed on the prometheus-dcgm README should be NVIDIA datacenter-gpu-manager. If you land on the prometheus-dcgm page directly instead of going through the root gpu-monitoring-tools page, it's easy to miss. It would also be nice to expose the repository containing datacenter-gpu-manager instead of forcing the user through the login wall to download it; this makes automation much more practical.
Another prerequisite that should be mentioned is node_exporter (and its configuration). Though it's obvious when looking at the docker-compose.yml, this information should be front and center in the README, so one knows this isn't a standalone monitoring daemon but requires node_exporter to expose its metrics to the prometheus daemon.
Related, the dcgm-exporter script presumes the user has installed the correct prerequisites. It would be nice if there were a simple check like:
if [ ! -x /usr/bin/dcgmi ]; then
    >&2 echo "ERROR: You're missing the dcgmi binary. Install the datacenter-gpu-manager package"
    exit 1
fi
if [ ! -x /usr/bin/nv-hostengine ]; then
    >&2 echo "ERROR: You're missing the nv-hostengine binary. Install the datacenter-gpu-manager package"
    exit 1
fi
The prometheus-dcgm.service file should have the -e argument like the docker container has; otherwise the service just restarts all day long, since without -e there is no nv-hostengine to let the shell script daemonize. Like so:
ExecStart=/usr/local/bin/dcgm-exporter -e
DCGM Diagnostics is very useful for us to detect GPU errors, but I cannot find bindings for this feature.
Can somebody support it?
Thanks!
I'd like to collect node/pod GPU metrics. When applying pod-gpu-node-exporter-daemonset.yaml, I found several bugs, and have a suggestion:
The output of nvidia-dcgm-exporter can NOT be integrated into node-exporter. The root cause: nvidia-dcgm-exporter redirects its metrics to /run/prometheus/dcgm.prom, but node-exporter is set to collect from /run/dcgm, so of course you get NOTHING.
The pod exporter does NOT work and throws an error:
failed to get devices Pod information: failure connecting to /var/lib/kubelet/pod-resources/kubelet.sock: context deadline exceeded
I checked /var/lib/kubelet/pod-resources on my node, but it's an empty folder. I am using k8s 1.14, where kubelet.sock is located under /var/lib/kubelet/device-plugins.
Following this issue, I modified the hostPath to the folder containing kubelet.sock, but then the pod exporter raised a new exception:
failed to get devices Pod information: failure getting pod resources rpc error: code = Unimplemented desc = unknown service v1alpha1.PodResourcesLister
I assume this is due to an incompatible k8s protobuf, so could you provide a corresponding docker image?
Both exporters report the same key, dcgm_gpu_utilization, which is not good practice, as I cannot simply distinguish them. You should give them different keys, such as dcgm_gpu_utilization for node level and pod_gpu_utilization for pod level.
Hi,
when I use $ go build && ./deviceInfo, there are some errors:
[root@gpu07 processInfo]# pwd
/opt/go/src/github.com/gpu-monitoring-tools/bindings/go/samples/dcgm/processInfo
[root@gpu07 processInfo]# go build .
# github.com/NVIDIA/gpu-monitoring-tools/bindings/go/dcgm
../../../../../../NVIDIA/gpu-monitoring-tools/bindings/go/dcgm/device_info.go:99: cannot use &values[0] (type *_Ctype_struct___9) as type *_Ctype_struct___11 in argument to func literal
../../../../../../NVIDIA/gpu-monitoring-tools/bindings/go/dcgm/device_status.go:129: cannot use &values[0] (type *_Ctype_struct___9) as type *_Ctype_struct___11 in argument to func literal
[root@gpu07 processInfo]# go version
go version go1.10 linux/amd64
[root@gpu07 processInfo]#
The code at https://github.com/NVIDIA/gpu-monitoring-tools/blob/b70474fb8511ed7d9af02d8306c11b9828da3b66/bindings/go/nvml/bindings.go#L600 has a logic bug.
Look at the source code:
600 func (h handle) getDisplayInfo() (display Display, err error) {
601 var mode, isActive C.nvmlEnableState_t
602
603 r := C.nvmlDeviceGetDisplayActive(h.dev, &mode)
604 if r == C.NVML_ERROR_NOT_SUPPORTED {
605 return
606 }
607
608 if r != C.NVML_SUCCESS {
609 return display, errorString(r)
610 }
611
612 r = C.nvmlDeviceGetDisplayMode(h.dev, &isActive)
613 if r == C.NVML_ERROR_NOT_SUPPORTED {
614 return
615 }
616 if r != C.NVML_SUCCESS {
617 return display, errorString(r)
618 }
619 display = Display{
620 Mode: ModeState(mode),
621 Active: ModeState(isActive),
622 }
623 return
624 }
The variables mode and isActive should be swapped to match their real meanings. Also, ModeState's String method is not correct:
25 const (
26 Enabled ModeState = iota
27 Disabled
28 )
29
30 func (m ModeState) String() string {
31 switch m {
32 case Enabled:
33 return "Enabled"
34 case Disabled:
35 return "Disabled"
36 }
37 return "N/A"
38 }
The constant Enabled should be 1, but the source defines it as 0 at line 26.
I tested getDisplayInfo on real hardware: nvidia-smi shows display mode 'Disabled', but getDisplayInfo tells me 'Enabled'.
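For reference, this matches NVML's C enum nvmlEnableState_t, where NVML_FEATURE_DISABLED is 0 and NVML_FEATURE_ENABLED is 1. A corrected sketch of the constants (a standalone illustration, not a patch to the binding):

```go
package main

import "fmt"

// ModeState mirrors NVML's nvmlEnableState_t, where
// NVML_FEATURE_DISABLED = 0 and NVML_FEATURE_ENABLED = 1.
// The binding's original declaration (Enabled = iota) inverted this.
type ModeState uint

const (
	Disabled ModeState = iota // 0, matches NVML_FEATURE_DISABLED
	Enabled                   // 1, matches NVML_FEATURE_ENABLED
)

func (m ModeState) String() string {
	switch m {
	case Enabled:
		return "Enabled"
	case Disabled:
		return "Disabled"
	}
	return "N/A"
}

func main() {
	// A raw 0 from the C API now correctly reads as Disabled.
	fmt.Println(ModeState(0), ModeState(1))
}
```

With this ordering, a raw value received from the C side maps to the label nvidia-smi would show.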
I have seen the help message "# HELP dcgm_pcie_rx_throughput Total number of bytes received through PCIe RX (in KB)". I took it to mean that dcgm_pcie_rx_throughput is the cumulative number of bytes received since the GPU came up. But I found that its value does not grow monotonically. So what is the real meaning of dcgm_pcie_rx_throughput?
I need some help.
My node-exporter daemonset was running fine for approx. 12 hours before being terminated due to OOMKilled. It was using nvidia/dcgm-exporter:1.4.6, and I tried different versions of the node-exporter image, e.g. quay.io/prometheus/node-exporter:v0.16.0, v0.17.0 and v0.18.1, and disabled unnecessary collectors (e.g. wifi), but I still have the same issue.
I noticed on Kibana that a GPU metric (dcgm_gpu_temp, shown below) somehow keeps growing:
Any idea what's going on and how to debug? Thanks.
Is it possible to access the GPU metrics dashboard from outside the cluster?
Following the guide in DCGM exporter and the demo here, I find that it only exposes GPU usage at the node level.
My question is: can it expose GPU usage at the container level?
If so, how can we access that?
Hi,
I have an issue using the collector on a DGX server.
These two metrics are empty for every GPU (no value inserted at the end of the line in the .prom file):
dcgm_nvlink_replay_error_count_total
dcgm_nvlink_recovery_error_count_total
Some info on the system:
GPU: Tesla V100-SXM2-16GB
Driver Version: 384.125
On another server with the same package but different hardware (Tesla P100-SXM2-16GB), the metrics are all fine.
I find the PID by using nvidia-smi; it shows:
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      5611      C   python3                                    5073MiB  |
|    0      9334      C   python                                     1805MiB  |
+-----------------------------------------------------------------------------+
Then
dcgmi stats -p 5611 -v
always results in:
Error: Unable to get information for process with PID 5611. Return: No data is available.
When I build a project using gpu-monitoring-tools on Darwin, it raises an error:
ld: unknown option: --unresolved-symbols=ignore-in-object-files
I found https://github.com/NVIDIA/gpu-monitoring-tools/blob/master/bindings/go/nvml/bindings.go#L5:
--unresolved-symbols=ignore-in-object-files
The above is not valid for the darwin OS.
I use
// #cgo linux LDFLAGS: -ldl -Wl,--unresolved-symbols=ignore-in-object-files
// #cgo darwin LDFLAGS: -ldl -Wl,-undefined,dynamic_lookup
to solve my problem, but I am not sure it works for others.
I'm running the prometheus exporter nodes as part of my K8S cluster.
All of the dcgm metrics have a label named "instance", which is basically the IP address of the node (e.g. instance="172.21.4.101:9100").
I'm having trouble creating a rather complex PromQL query that "joins" labels from kube-state-metrics (e.g. kube_node_labels).
kube-state-metrics has an instance label as well, but unfortunately it holds only the K8S cluster IP and not the actual node IP; some metrics have a node label which is the hostname.
Is it possible for the dcgm exporter to export a hostname/nodename label in addition to instance?
Thanks
I installed datacenter-gpu-manager-1.6.3-1.x86_64.rpm on CentOS 7 and wrote some code in Go 1.12.7. When building the project, two errors are encountered:
dcgm/device_info.go:99:141: cannot use &values[0] (type *_Ctype_struct___9) as type *_Ctype_struct___11 in assignment
dcgm/device_status.go:129:144: cannot use &values[0] (type *_Ctype_struct___9) as type *_Ctype_struct___11 in assignment
Any help is appreciated
Hey,
I installed the daemon set in my EKS cluster for GPU nodes only, and I have already labelled my GPU node as described in the instructions.
yaml file: gpu-node-exporter-daemonset.yaml
However, I found there are 0 pods running for this daemonset.
The GPU node uses an AMI with the below prerequisites:
NVIDIA drivers
The nvidia-docker2 package
The nvidia-container-runtime (as the default runtime)
https://docs.aws.amazon.com/eks/latest/userguide/gpu-ami.html
I installed the NVIDIA device plugin for Kubernetes with below commands:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta/nvidia-device-plugin.yml
Could you please help look into this?
Thanks.
Chengning
We'd like to deploy NVML-based monitoring tools to each task container, providing GPU information for ML engineers to take performance analysis.
However, if the PID namespace of the task container is isolated from the host machine's, we found that, even when deployed inside the container, NVML (nvmlDeviceGetComputeRunningProcesses) returns the PID(s) as seen on the host machine. That makes the subsequent processing difficult, because only the container PID namespace is visible to users (ML engineers).
Is there any solution to overcome this PID namespace isolation? Or does NVML have any plan to extend nvmlDeviceGetComputeRunningProcesses so that it can return PIDs in the container PID namespace?
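One possible workaround, assuming the monitoring process can read the host's /proc and the kernel exposes the NSpid field (Linux 4.1+), is to translate host PIDs via /proc/<pid>/status: the NSpid line lists the process's PID in each nested PID namespace, host first, innermost last. A hedged sketch; nsPIDs is a hypothetical helper, shown here parsing a sample status snippet rather than a live /proc file:

```go
package main

import (
	"fmt"
	"strings"
)

// nsPIDs extracts the PID chain from the NSpid line of a
// /proc/<pid>/status file. The first entry is the host PID and the
// last entry is the PID inside the innermost (container) namespace,
// which is what nvmlDeviceGetComputeRunningProcesses callers would
// need to map host PIDs back to container PIDs.
func nsPIDs(status string) []string {
	for _, line := range strings.Split(status, "\n") {
		if strings.HasPrefix(line, "NSpid:") {
			return strings.Fields(line)[1:]
		}
	}
	return nil
}

func main() {
	sample := "Name:\tpython3\nNSpid:\t5611\t42\n"
	pids := nsPIDs(sample)
	fmt.Println("host PID:", pids[0], "container PID:", pids[len(pids)-1])
}
```

In a real deployment the monitor would read /proc/<hostPID>/status for each PID NVML returns, which requires visibility of the host's /proc (e.g. a hostPath mount).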
dcgm-exporter on a node with hung GPUs results in a hung call to nvidia-smi in this awk mashup:
The only alertable/actionable thing in this scenario is stale metrics in node exporter, which are not immediate, or necessarily reliable.
Additionally, sometimes nvidia-smi exits immediately in a crashed GPU scenario indicating a GPU "fell off the bus", and the host requires reboot. In this scenario, nvidia-smi does not return any data to dcgm-exporter, and it simply stops reporting metrics.
dcgm-exporter should be GPU fault tolerant and also expose GPU health status as a metric itself.
I tried to get the status of 8 devices by calling func (d *Device) Status() (status *DeviceStatus, err error) in a loop, and it took 1 second to complete. Since I only care about MemoryInfo, I think it would be better to support more fine-grained methods.
I've deployed dcgm-exporter on Kubernetes on a GPU node (https://github.com/NVIDIA/gpu-monitoring-tools/blob/master/exporters/prometheus-dcgm/k8s/node-exporter/gpu-node-exporter-daemonset.yaml).
After @guptaNswati's fix, I redeployed the gpu exporter image, and now it fails with the following error:
Starting NVIDIA host engine...
Collecting metrics at /run/prometheus/dcgm.prom every 1000ms...
Stopping NVIDIA host engine...
Unable to terminate host engine, it may not be running.
/usr/local/bin/dcgm-exporter: line 167: kill: (13453) - No such process
Done
Hi, I think dcgm-exporter should not fail on CPU-only nodes. My use case is the following: I am using the standard https://github.com/coreos/prometheus-operator and want to use the same node-exporter yaml for all Kubernetes nodes:
Starting NVIDIA host engine...
Collecting metrics at /run/prometheus/dcgm.prom every 1000ms...
/usr/local/bin/dcgm-exporter: line 55: nvidia-smi: command not found
Stopping NVIDIA host engine...
Unable to terminate host engine, it may not be running.
/usr/local/bin/dcgm-exporter: line 143: kill: (22212) - No such process
Done
The nvidia/dcgm-exporter:1.4.6 image does not have nvidia-smi installed.
Its Dockerfile starts from an ubuntu 16.04 base image and does not install nvidia-utils.
I ran an nvidia/cuda container, installed the utils, and successfully ran nvidia-smi.
Could you please update the Dockerfile or provide a guide on how to build the image?
On a Tesla T4 with ECC mode enabled, the nvml binding does not report the ECC error count correctly.
I get a nil value with the error NVML_ERROR_NOT_SUPPORTED, although the nvidia-smi command-line tool works without problems.
Metrics for gpu0 appear underneath their respective "# TYPE" lines as expected, but all other GPUs' metrics are grouped together at the bottom under "# TYPE dcgm_fb_used gauge".
See output:
https://gist.github.com/thim22/7e81f30796cfad2012262f8abb0929f4
Is there some example that shows how to configure the monitoring dashboard using the tools TRTIS+Prometheus+Grafana?
My goal is to build a Monitoring Dashboard where I can monitor my data center with models deployed on TRTIS. Question: should I build a DCGM+Prometheus+Grafana Monitoring Dashboard or a TRTIS+Prometheus+Grafana Monitoring Dashboard? Which one will provide broader metrics of my data center and less overhead?
After deploying on Kubernetes following the instructions.
Thanks
I applied https://github.com/NVIDIA/gpu-monitoring-tools/blob/master/exporters/prometheus-dcgm/k8s/node-exporter/gpu-node-exporter-daemonset.yaml, and the pod logs show a failure:
root@XP005:/home/gpu-monitoring-tools# kubectl logs -f node-exporter-c99c5 -c nvidia-dcgm-exporter
Starting NVIDIA host engine...
Failed to start host engine server
Collecting metrics at /run/prometheus/dcgm.prom every 1000ms...
Stopping NVIDIA host engine...
Host engine successfully terminated.
Done
Hi, I use gpu-monitoring-tools to collect my NVIDIA GPU metrics. I want to get the total memory, like:
# nvidia-smi --format=csv --query-gpu=memory.total,memory.used,memory.free,name
memory.total [MiB], memory.used [MiB], memory.free [MiB], name
7840 MiB, 20 MiB, 7820 MiB, Tesla P4
I see that deviceGetMemoryInfo does not return mem.total in bindings/go/nvml/bindings.go. Should this value be added?
In addition, when I run nvidia-smi -q, it returns the information below:
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Enabled
Accounting Mode : Enabled
Accounting Mode Buffer Size : 4000
How can I use this library to get this information? Thanks.
cavan@cavan:~/gopath/chesscloud/src/github.com/NVIDIA/gpu-monitoring-tools/bindings/go/samples/nvml/deviceInfo$ ls
main.go
cavan@cavan:~/gopath/chesscloud/src/github.com/NVIDIA/gpu-monitoring-tools/bindings/go/samples/nvml/deviceInfo$ gb
cavan@cavan:~/gopath/chesscloud/src/github.com/NVIDIA/gpu-monitoring-tools/bindings/go/samples/nvml/deviceInfo$ ls
deviceInfo main.go
cavan@cavan:~/gopath/chesscloud/src/github.com/NVIDIA/gpu-monitoring-tools/bindings/go/samples/nvml/deviceInfo$ ./deviceInfo
./deviceInfo: symbol lookup error: ./deviceInfo: undefined symbol: nvmlDeviceGetCount_v2
Trying to deploy
https://github.com/NVIDIA/gpu-monitoring-tools/blob/master/exporters/prometheus-dcgm/dcgm-exporter-daemonset.yaml
daemonset "dcgm-exporter-daemonset" created
But: curl: (7) Failed to connect to 10.233.109.167 port 9100: Connection refused
Could you please provide the correct yaml file for Kubernetes metrics?
Hi, all
I wrote a tool to get NVIDIA GPU metrics. The NVIDIA driver is already installed on my host and works well, but the host does not have the nvidia-docker runtime installed.
Now I want to run my application in docker without the nvidia-docker runtime, and I want to know which libs should be mounted into docker.
Thanks.
DCGM metrics are not showing via curl xxxx:9100/metrics; only node_* metrics are showing.
When I cat /run/prometheus/dcgm I do see what I want to collect.
I really have no idea what I'm doing; I'm new to this. I'm trying to scrape DCGM stats into Prometheus on another server. I'm working in an air-gapped environment, so I'm finding it hard to piece things together, and I'm new enough to docker. I can see my GPUs, so I think it's running:
# docker exec nvidia-dcgm-exporter dcgmi discovery -i a -v | grep -c 'GPU ID:'
4
# nvidia-smi -L | wc -l
4
dcgm_power_usage{gpu="3",uuid="GPU-807436aXXXYYYZZZZZZZ0547d1624eae"} 24.196
I think I need the output of curl xxxx:9100/metrics to display all the dcgm_ metrics?
Any help appreciated
Right now, gpu-monitoring-tools doesn't have a Prometheus server. It relies on node exporter to read the static file /run/prometheus/dcgm.prom.
Most users probably already have node exporter installed. I am curious whether we should decouple these two components and create a Prometheus server directly.
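A standalone exporter along these lines can be sketched with just the Go standard library; renderMetrics and the handler below are hypothetical placeholders for a real DCGM/NVML poll, not existing code in this repo:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// renderMetrics stands in for a real DCGM/NVML query; the values
// here are hypothetical static samples.
func renderMetrics() string {
	return "# TYPE dcgm_gpu_temp gauge\n" +
		"dcgm_gpu_temp{gpu=\"0\"} 43\n"
}

// metricsHandler serves the Prometheus text exposition format, so
// the exporter could be scraped directly instead of going through
// node-exporter's textfile collector.
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprint(w, renderMetrics())
}

func main() {
	// Exercise the handler with an in-process test server; a real
	// exporter would call http.ListenAndServe on a port of its own
	// (9400 is one hypothetical choice) instead.
	srv := httptest.NewServer(http.HandlerFunc(metricsHandler))
	defer srv.Close()
	resp, err := http.Get(srv.URL + "/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Print(string(body))
}
```

Prometheus would then scrape this endpoint directly, with no node-exporter dependency and no shared textfile on disk.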
For visibility and tracking.
In exporters/prometheus-dcgm/dcgm-exporter/dcgm-exporter we can see that the nvlink_bandwidth_total counter is disabled. Can anyone explain why, and what is blocking this counter from being enabled?
In the dmon example, memory utilization and GPU utilization are reported. For example, in the following output, memory utilization is 1%, but how is this value defined? Is it the percentage of bandwidth used during the last second? And is there any reference document?
$ go build && ./dmon
# sample output
Started host engine version 1.4.3 using socket path: /tmp/dcgmrxvqro.socket
# gpu pwr temp sm mem enc dec mclk pclk
# Idx W C % % % % MHz MHz
0 43 48 0 1 0 0 3505 936
0 43 48 0 1 0 0 3505 936
I'm running a bare metal k8s cluster (kube_version: v1.14.1), with nvidia-gpu-device-plugin pods on each one of my GPU servers to enable GPU resources allocation. Right now I have two GPU enabled servers - (let's call them servers A and B).
I've deployed gpu-monitoring-tools using the Helm chart available.
As stated in the documentation, gpu-monitoring-tools only works if the default runtime is "nvidia", and because of that I've installed nvidia-docker2 on Server A and changed the /etc/docker/daemon.json accordingly.
But then the following issue arose...
Now when I schedule a pod via kubectl with GPU resources requests, the gpu isolation settings are not enforced for Server A. As a result the scheduled pod has visibility to all GPUs available on that server (two), as shown below:
gpu-pod at Server A, with 'nvidia' as Default Runtime:
root@gpu-pod:/# nvidia-smi
Thu Jul 18 10:08:20 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.74 Driver Version: 418.74 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:0A:00.0 Off | N/A |
| 0% 38C P8 20W / 300W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:41:00.0 Off | N/A |
| 35% 35C P8 25W / 260W | 0MiB / 10981MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
gpu-pod at Server B, with runc as Default Runtime:
root@gpu-pod:/# nvidia-smi
Thu Jul 18 10:04:40 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.74 Driver Version: 418.74 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:63:00.0 Off | N/A |
| 0% 29C P8 8W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
gpu-pod.yaml - Kubernetes manifest used
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: tensorflow/tensorflow:1.13.2-gpu-py3
    command: ["sleep"]
    args: ["100000"]
    resources:
      limits:
        nvidia.com/gpu: 1
  nodeSelector:
    kubernetes.io/hostname: serverA
    #kubernetes.io/hostname: serverB
Has anyone experienced this problem previously? Any help would be much appreciated!
I am trying to figure out how to add field #4 here: https://github.com/NVIDIA/gpu-monitoring-tools/blob/master/exporters/prometheus-dcgm/dcgm-exporter/dcgm-exporter#L43
and then how to call it like here: https://github.com/NVIDIA/gpu-monitoring-tools/blob/master/exporters/prometheus-dcgm/dcgm-exporter/dcgm-exporter#L70
I am sure I am doing something wrong, and I apologize for what may be a super easy question.
Currently, the dcgm exporter runs the dcgmi dmon command to populate metrics. I am wondering whether it would be OK to move to golang and use the nvml library to get these metrics instead. This would be easier for development and maintenance in the long run.
The nvml README's "processInfo" example shows a PID with "sm", but the sample code does not:
https://github.com/NVIDIA/gpu-monitoring-tools/blob/master/bindings/go/samples/nvml/processInfo/main.go
I am unable to build go software against the nvml go bindings due to the following error:
/tmp/go-build280384238/b001/exe/build: symbol lookup error: /tmp/go-build280384238/b001/exe/build: undefined symbol: nvmlDeviceGetCount_v2
Can anyone provide guidance on the explicit packages and requirements needed to compile this code?
I am installing the packages cuda-nvml-dev-8-0 and nvidia-cuda-dev.
One thing worth mentioning is that the build occurs within a container, so I am looking to install the development libraries etc. without having running hardware.
Hi there, I'd like to do some DIY on the pod-gpu-metrics-exporter image, so I tried to rebuild the Dockerfile under gpu-monitoring-tools/exporters/prometheus-dcgm/k8s/pod-gpu-metrics-exporter/. However, it gets stuck at the 4th step of the Dockerfile, RUN go install -v pod-gpu-metrics-exporter, with this error:
# pod-gpu-metrics-exporter
./server.go:17:2: podResourcesMaxSize redeclared in this block
previous declaration at ./kubelet_server.go:17:38
./server.go:20:6: connectToServer redeclared in this block
previous declaration at ./kubelet_server.go:20:56
./server.go:26:19: connectToServer.func1 redeclared in this block
previous declaration at ./kubelet_server.go:26:19
./server.go:36:6: getListOfPods redeclared in this block
previous declaration at ./kubelet_server.go:36:79
The command '/bin/sh -c go install -v pod-gpu-metrics-exporter' returned a non-zero code: 2
Could you please take a look when available? Huge THANKS in advance.
hi,
When I use the bindings/go/samples/dcgm/deviceInfo sample to get information about a process, I get the following error:
runtime: bad pointer in frame main.main at 0xc0000edb98: 0x1
fatal error: invalid pointer found on stack
but when I use dcgmi stats --pid 22574 -v, I can get the correct result.
I am using cuda8 and the GPU device information is as follows:
Driver Version : 390.12
DCGMSupported : Yes
UUID : GPU-fe260d42-a1b6-2e82-5e69-c1029b1bef56
Brand : Tesla
Model : Tesla M40 24GB
I ran the command
docker run --runtime=nvidia --rm --name=nvidia-dcgm-exporter nvidia/dcgm-exporter
and I get this error. I think it still passes the --display option:
'''
docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused "process_linux.go:413: running prestart hook 1 caused \"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --compat32 --graphics --utility --video --display --pid=10963 /home1/docker/overlay2/b4d42f8cac3866227520ba83541a4e57ff52533bdfd21a6fc25194a2ae6380db/merged]\\nnvidia-container-cli configure: unrecognized option '--display'\\nTry nvidia-container-cli configure --help' or
nvidia-container-cli configure\\n--usage' for more information.\\n\""": unknown.
'''
nvidia-container-cli version:
version: 1.0.0
build date: 2018-01-11T00:23+0000
build revision: 4a618459e8ba522d834bb2b4c665847fae8ce0ad
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-16)
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
nvidia-docker version
NVIDIA Docker: 2.0.3
Client:
Version: 18.09.7
API version: 1.39
Go version: go1.10.8
Git commit: 2d0083d
Built: Thu Jun 27 17:56:06 2019
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 18.09.7
API version: 1.39 (minimum version 1.12)
Go version: go1.10.8
Git commit: 2d0083d
Built: Thu Jun 27 17:26:28 2019
OS/Arch: linux/amd64
Experimental: false
dcgm-exporter uses shell scripts to write metrics to a static file (/run/prometheus/dcgm.prom), which exposes statistics on local disk. Node-exporter reads this file via --collector.textfile.directory.
The disadvantage of this pattern is that dcgm depends on node exporter to collect GPU metrics. A lot of users may already have node exporter set up for their non-accelerator workloads. I would suggest hosting a prom server and using the dcgm bindings to get metrics separately.
I can help with the work.
I deployed node-device-exporter-daemonset.yaml, but it did not find any pod info with GPUs. I found that I had not set KubeletPodResources, but I also could not find the path /etc/default/kubelet on the deployed node. Should I create it and then add KUBELET_EXTRA_ARGS=--feature-gates=KubeletPodResources=true? Or is it related to the Kubernetes version? My Kubernetes version is v1.10.
Thank you!