nvidia / gpu-monitoring-tools Goto Github PK

View Code? Open in Web Editor NEW

1.0K 35.0 305.0 6.44 MB

Tools for monitoring NVIDIA GPUs on Linux

License: Apache License 2.0

Go 26.41% Makefile 0.31% Dockerfile 0.06% Shell 0.71% HCL 0.01% Mustache 0.20% C 72.30%

gpu-monitoring-tools's People

Contributors

Stargazers

Watchers

Forkers

yan234280533 andfoy guptanswati nikolayvoronchikhin mistshi neverlock karlmutch swiftdiaries sharanyad fmoctezuma itnuri dashpole bhartiagrawalnvidia alienflash01 dereklstinson fauzansasmita yiching 694982827 zmoon111 qieqieplus anight volume-ji sjug gaomochi simbazad klueska junfugithub bnulwh ligenvidia lexuszhi1990 connectionmaster zvonkok rockyfun opsnull dewey363 jetmuffin ceizner xjas guang7niu supremind lichaoguang973 matthewygf tchfit houjun41544 k8s-tools-collection run-ai yashiang1986 brockchen batermj dankamongmen cheyang ma2331550908 sb96324 jsenon isgasho xp10102232 takmatsu infrastructure-kubernetes hholst80 yzs981130 sunnyregion wsxiaozhang bashimao kangwoo king-jingxiang evandhoffman rirl jeffwan tklebanoff junsheng-wu leezake jacobsy itayvallach edward0128 wxypro mhdbs tomeyday hzome chrijonesnv orygin zhangxingdeppon ultradio hama1080 amruta-bandhu-chaudhury wenlian kubertest visionthinking zhanghw94 krzemienski aruninnanje will-do jadeluo yan3nian bokibi awadmhamad hayunjong83 airlian jinwoongkim virtable zheshimajia

gpu-monitoring-tools's Issues

dgcm-exporter failed on GPU node

I've deployed dgcm-exporter on kubernetes on gpu node (https://github.com/NVIDIA/gpu-monitoring-tools/blob/master/exporters/prometheus-dcgm/k8s/node-exporter/gpu-node-exporter-daemonset.yaml) , and it failed with error:

Failed to get unit file state for nvidia-fabricmanager.service: Unknown error 1540613216
Starting NVIDIA host engine...
Collecting metrics at /run/prometheus/dcgm.prom every 1000ms...
Stopping NVIDIA host engine...
Unable to terminate host engine, it may not be running.
/usr/local/bin/dcgm-exporter: line 154: kill: (38698) - No such process
Done

Here is output of nvidia-smi:

# ./nvidia-smi 
Tue Jan 14 11:53:22 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P8    30W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

is there a dashboard json file for grafana?

thx

Compatibility with K8N 1.16

I tried to install the https://nvidia.github.io/gpu-monitoring-tools/helm-charts
on my k8n cluster seems to be a compatibility problem with K8N 1.16
is there a solution?

../gpu-monitoring-tools/bindings/go/samples/dcgm/restApi > go build && ./restApi
2018/08/17 11:33:16 Running http server on localhost:8070
2018/08/17 11:33:31 error: localhost:8070/dcgm/device/info/id/GPUID: strconv.ParseUint: parsing "GPUID": invalid syntax
2018/08/17 11:33:46 error: localhost:8070/dcgm/device/info/id/GPUID/json: strconv.ParseUint: parsing "GPUID": invalid syntax
2018/08/17 11:33:55 error: localhost:8070/dcgm/process/info/pid/PID: strconv.ParseUint: parsing "PID": invalid syntax
2018/08/17 11:34:05 error: localhost:8070/dcgm/health/id/GPUID: strconv.ParseUint: parsing "GPUID": invalid syntax

Getting helm chart to work on kubernetes 1.17

I am trying to get the helm charts to work on kubernetes 1.17. This doesn's work because there are changes to the template.
Can you please let me know how i can get access to the helm charts configuration files from repo?

Improve the bare-metal install experience of prometheus-dcgm

After trying to get the bare-metal install of prometheus-dcgm working, I have a little feedback:

README Prerequisites

One of the prerequisites listed on the prometheus-dcgm README should be NVIDIA datacenter-gpu-manager. If you end up just at the prometheus-dcgm page instead of going to the root gpu-monitoring-tools page, it's easy to miss. It would also be nice to expose the repository containing the datacenter-gpu-manager instead of forcing the user to go through the loginwall to download it. This makes automation much more practical.

Another prerequisite that should be mentioned is node_exporter (and the configuration of). Though it's obvious when looking at the docker-compose.yml this information should be front and center in the README so one knows that this isn't a standalone monitoring daemon, but requires node_exporter to expose it's information to the prometheus daemon.

dcgm-exporter bash script

Related, the dcgm-exporter script presumes the user has installed the correct prerequisites. It would be nice if there was something like a simple:

if [ -x /usr/bin/dcgmi ];then
  >&2 echo "ERROR: You're missing the dcgmi binary. Install the datacenter-gpu-manager package"
  exit 1
fi
if [ -x /usr/bin/nv-hostengine];then
  >&2 echo "ERROR: You're missing the nv-hostengine binary. Install the datacenter-gpu-manager package"
  exit 1
fi

prometheus-dcgm.service file

The prometheus-dcgm.service file should have the -e arguement like the docker container has, otherwise the service just restarts all day long without nv-hostengine to get the shell script to daemonize. Like so

ExecStart=/usr/local/bin/dcgm-exporter -e

Support for Diagnostics

DCGM Diagnostics is very useful for us to detect GPU error.

But I can not found the bindings of this feature.

Can somebody support it?

Thx~

Several bugs for node-exporter/pod-gpu-node-exporter-daemonset.yaml

I'd like to collect node/pod metrics for GPU, and when applying pod-gpu-node-exporter-daemonset.yaml, I find several bugs and suggestion

the output of nvidia-dcgm-exporter can NOT be integrated into node-exporter, the root cause is: the metrics of nvidia-dcgm-exporter is redirect to /run/prometheus/dcgm.prom, but in node-exporter it is set to collect from /run/dcgm , so of course you can get NOTHING
pod exporter does NOT work and throw an error says

failed to get devices Pod information: failure connecting to /var/lib/kubelet/pod-resources/kubelet.sock: context deadline exceeded

I check the /var/lib/kubelet/pod-resources under my node but it's an empty folder. I am using k8s 1.14 where kubelet.sock is located under /var/lib/kubelet/device-plugins.
Then following this issue, I modify the hostpath to the folder of kubelet.sock, while the pod-exporter raise a new exception:

failed to get devices Pod information: failure getting pod resources rpc error: code = Unimplemented desc = unknown service v1alpha1.PodResourcesLister

I assume this is due to incompatible k8s protobuf, so may you provide the corresponding docker image?

From the guide, I assume the key of GPU metrics between node-exporter and pod-exporter is totally seem, such as dcgm_gpu_utilization, which is not a good pratice as I can not simply distinguish them. You should give them different key, like dcgm_gpu_utilization for node level, and pod_gpu_utilization for pod level

DCGM go bindings failed build

Hi,
when I use $ go build && ./deviceInfo ,there are some errors:

[root@gpu07 processInfo]# pwd
/opt/go/src/github.com/gpu-monitoring-tools/bindings/go/samples/dcgm/processInfo
[root@gpu07 processInfo]# go build .
# github.com/NVIDIA/gpu-monitoring-tools/bindings/go/dcgm
../../../../../../NVIDIA/gpu-monitoring-tools/bindings/go/dcgm/device_info.go:99: cannot use &values[0] (type *_Ctype_struct___9) as type *_Ctype_struct___11 in argument to func literal
../../../../../../NVIDIA/gpu-monitoring-tools/bindings/go/dcgm/device_status.go:129: cannot use &values[0] (type *_Ctype_struct___9) as type *_Ctype_struct___11 in argument to func literal
[root@gpu07 processInfo]# go version
go version go1.10 linux/amd64
[root@gpu07 processInfo]#

[bug] GPU display mode wrong

The code of https://github.com/NVIDIA/gpu-monitoring-tools/blob/b70474fb8511ed7d9af02d8306c11b9828da3b66/bindings/go/nvml/bindings.go#L600 has logic bug.

look at the src code:

600 func (h handle) getDisplayInfo() (display Display, err error) {                  
601   var mode, isActive C.nvmlEnableState_t                                         
602                                                                                  
603   r := C.nvmlDeviceGetDisplayActive(h.dev, &mode)                                
604   if r == C.NVML_ERROR_NOT_SUPPORTED {                                           
605     return                                                                       
606   }                                                                              
607                                                                                  
608   if r != C.NVML_SUCCESS {                                                       
609     return display, errorString(r)                                               
610   }                                                                              
611                                                                                  
612   r = C.nvmlDeviceGetDisplayMode(h.dev, &isActive)                               
613   if r == C.NVML_ERROR_NOT_SUPPORTED {                                           
614     return                                                                       
615   }                                                                              
616   if r != C.NVML_SUCCESS {                                                       
617     return display, errorString(r)                                               
618   }                                                                              
619   display = Display{                                                             
620     Mode:   ModeState(mode),                                                     
621     Active: ModeState(isActive),                                                 
622   }                                                                              
623   return                                                                         
624 }

The var mode and isActive should be exchanged for its real meaning. The ModeState's String method is not correct:

 25 const (                                                                          
 26   Enabled ModeState = iota                                                       
 27   Disabled                                                                       
 28 )                                                                                
 29                                                                                  
 30 func (m ModeState) String() string {                                             
 31   switch m {                                                                     
 32   case Enabled:                                                                  
 33     return "Enabled"                                                             
 34   case Disabled:                                                                 
 35     return "Disabled"                                                            
 36   }                                                                              
 37   return "N/A"                                                                   
 38 }

The const Enabled shall be 1 but the src code defines it to 0 at line 26.

I tested the func of getDisplayInfo at my real world, nvidia-smi shows display mode 'disabled', but the func getDisplayInfo tells me 'enabled'.

what is the meaning of "dcgm_pcie_rx_throughput"?

 I have seen the help message "# HELP dcgm_pcie_rx_throughput Total number of bytes received through PCIe RX (in KB)". I think that it means the dcgm_pcie_rx_throughput is the sum of bytes received since the GPU up. But , i found that the value of dcgm_pcie_rx_throughput is not monotonous growth. So, what's the real meaning of dcgm_pcie_rx_throughput?
 I need some help.

node-exporter OOMKilled

My node-exporter daemonset was running fine for approx. 12 hours before terminated due to OOMKilled. It was using nvidia/dcgm-exporter:1.4.6 and I tried different version of node-exporter images, e.g. quay.io/prometheus/node-exporter:v0.16.0, v0.17.0 and v0.18.1. and disabled unnecessary collectors (e.g. wifi) but still have the same issue.

I noticed that on Kibana GPU metric (dcgm_gpu_temp shown below) somehow keeps growing:

Any idea what's going on and how to debug? Thanks.

GPU metrics Dashboard outside cluster access

is it possible to access gpu metrics dashboard access outside cluster??

Can DCGM exporter exposes container level GPU usage?

Following the guide in DCGM exporter and demo in here, I only find that it does can expose GPU usage on node level.
My question is, can it expose GPU usage on container level ?
If so , how can we access that?

Error parsing dcgm.prom

Hi,
I have an issue using the collector on a dgx server.
These two metrics are empty for every gpu(no value inserted at the end of line in the .prom file):
dcgm_nvlink_replay_error_count_total
dcgm_nvlink_recovery_error_count_total

Some info on the system:
GPU: Tesla V100-SXM2-16GB
Driver Version: 384.125

On another server with the same package but different hardware (Tesla P100-SXM2-16GB) The metrics are all fine.

dcgm processInfo return "No data is available"

I find the pid by using nvidia-smi,it show
+-----------------------------------------------------------------------------+ | Processes: GPU Memory | | GPU PID Type Process name Usage | |=============================================================================| | 0 5611 C python3 5073MiB | | 0 9334 C python 1805MiB | +-----------------------------------------------------------------------------+

Then

dcgmi stats -p 5611 -v

always result

Error: Unable to get information for process with PID 5611. Return: No data is available.

ld: unknown option: --unresolved-symbols=ignore-in-object-files

When I build project using gpu-monitoring-tools on Darwin, there raises error:

ld: unknown option: --unresolved-symbols=ignore-in-object-files

I find https://github.com/NVIDIA/gpu-monitoring-tools/blob/master/bindings/go/nvml/bindings.go#L5.

--unresolved-symbols=ignore-in-object-files

Above not for darwin os.
I use

// #cgo linux LDFLAGS: -ldl -Wl,--unresolved-symbols=ignore-in-object-files
// #cgo darwin LDFLAGS: -ldl -Wl,-undefined,dynamic_lookup

to solve my problem, but I not sure that works for others.

K8S incompatibility- Hostname in metrics

I'm running the prometheus exporter nodes as part of my K8S cluster.
All of the dcgm metrics have an element with the name"instance" which is basically the IP address of the node (e.g instance="172.21.4.101:9100")

I'm having trouble when trying to create a rather complex promql query that "joins" elements from kube_state_metrics (e.g. kube_node_lables).
the kube_state_metrics has an instance element as well but unfortunately it holds only the K8S cluster ip and not the actual node IP. in some metrics there is a node element which is the hostname.

Is it possible that the dcgm exporter will export the hostname/nodename element in addition to instance?

Thanks

The DCGM go bindings library build error

I install datacenter-gpu-manager-1.6.3-1.x86_64.rpm in centos7, and write some code in golang1.12.7. And then build the project, two errors are encountered:

dcgm/device_info.go:99:141: cannot use &values[0] (type *_Ctype_struct___9) as type *_Ctype_struct___11 in assignment
dcgm/device_status.go:129:144: cannot use &values[0] (type *_Ctype_struct___9) as type *_Ctype_struct___11 in assignment

Any help is appreciated

GPU metrics node exporter doesn't work in EKS

Hey,
I installed the daemon set in my EKS cluster for GPU node only, and I already labelled my GPU node as described in the instructions.
yaml file: gpu-node-exporter-daemonset.yaml
However I found there is 0 pod which is running for this daemonset.
The gpu node is using ami with below prerequisites:
NVIDIA drivers
The nvidia-docker2 package
The nvidia-container-runtime (as the default runtime)
https://docs.aws.amazon.com/eks/latest/userguide/gpu-ami.html

I installed the NVIDIA device plugin for Kubernetes with below commands:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta/nvidia-device-plugin.yml

Could you please help look into this?

Thanks.
Chengning

container PID namespace isolation with NVML

We'd like to deploy NVML-based monitoring tools to each task container, providing GPU information for ML engineers to take performance analysis.

However, if the PID namespace of the task container is isolated from the host machine's, we found that, even deployed within the container, the NVML (func nvmlDeviceGetComputeRunningProcesses) gives the PID(s) on the host machine. That makes the following info processing difficult because only the container PID namespace is visible to users (ML engineers).

Is there any solution to overcome this pid namespace isolation? Or does NVML has any plan to extend nvmlDeviceGetComputeRunningProcesses so that it can return pid in the container PID namespace?

dcgm[-exporter] should detect crashed/hung GPU's / not be dependent CLI tools

dcgm-exporter on a node with hung GPU's results in a hung call to nvidia-smi in this awk mashup:

https://github.com/NVIDIA/gpu-monitoring-tools/blob/master/exporters/prometheus-dcgm/dcgm-exporter/dcgm-exporter#L51-L62

The only alertable/actionable thing in this scenario is stale metrics in node exporter, which are not immediate, or necessarily reliable.

Additionally, sometimes nvidia-smi exits immediately in a crashed GPU scenario indicating a GPU "fell off the bus", and the host requires reboot. In this scenario, nvidia-smi does not return any data to dcgm-exporter, and it simply stops reporting metrics.

dcgm-exporter should be GPU fault tolerant and also expose GPU health status as a metric itself.

nvml.Device should support more method for detail request

at nvml.go#L433L433

I try get 8 devices' status by call func (d *Device) Status() (status *DeviceStatus, err error) in range, and cost 1 second to complete it

and I only care about MemoryInfo, I think it's better support more detail method

/usr/local/bin/dcgm-exporter: line 167: kill: (13453) - No such process

I've deployed dgcm-exporter on kubernetes on gpu node (https://github.com/NVIDIA/gpu-monitoring-tools/blob/master/exporters/prometheus-dcgm/k8s/node-exporter/gpu-node-exporter-daemonset.yaml) .

@guptaNswati did fix, I've redeployed gpu exporter image, and now it fails with next error:

Starting NVIDIA host engine...
Collecting metrics at /run/prometheus/dcgm.prom every 1000ms...
Stopping NVIDIA host engine...
Unable to terminate host engine, it may not be running.
/usr/local/bin/dcgm-exporter: line 167: kill: (13453) - No such process
Done

dcgm-exporter failed on CPU node

Hi, I think dcgm-exporter should not fail on CPU nodes. My usecase is following: I am using standard https://github.com/coreos/prometheus-operator and want to use same node-exporter yaml for all kubernetes nodes

Starting NVIDIA host engine...
Collecting metrics at /run/prometheus/dcgm.prom every 1000ms...
/usr/local/bin/dcgm-exporter: line 55: nvidia-smi: command not found
Stopping NVIDIA host engine...
Unable to terminate host engine, it may not be running.
/usr/local/bin/dcgm-exporter: line 143: kill: (22212) - No such process
Done```

nvidia-smi not found

Image nvidia/dcgm-exporter:1.4.6 does not have nvidia-smi installed in it.

Dockerfile starts from a ubuntu 16.04 base image, and does not install nvidia-utils.

I ran a nvidia/cuda container, installed the utils and successfully ran nvidia-smi.

could you please update the Dockerfile or make a guide on how to build the image?

add hostname to dcgm_* metrics

Memory.ECCErrors is null in nvml binding

On Tesla T4, with ECC mode enabled, the nvml binding does not report ecc error count correctly.
I got nil value with error NVML_ERROR_NOT_SUPPORTED, however nvidia-smi command line tool works without problem

Exporter output appears out of order

Metrics for gpu0 appear underneath their respective "# TYPE" line as expected but all other gpu's metrics are grouped together at the bottom under "# TYPE dcgm_fb_used gauge"

See output:
https://gist.github.com/thim22/7e81f30796cfad2012262f8abb0929f4

exporters/prometheus-TRTIS

Is there some example that shows how to configure the monitoring dashboard using the tools TRTIS+Prometheus+Grafana?

My goal is to build a Monitoring Dashboard where I can monitor my data center with models deployed on TRTIS. Question: should I build a DCGM+Prometheus+Grafana Monitoring Dashboard or a TRTIS+Prometheus+Grafana Monitoring Dashboard? Which one will provide broader metrics of my data center and less overhead?

"/usr/local/bin/dcgm-exporter: line 55: nvidia-smi: command not found" in DCGM exporter

After deploy on kubernetes following instructions.

Thanks

node-exporter

I apply this:https://github.com/NVIDIA/gpu-monitoring-tools/blob/master/exporters/prometheus-dcgm/k8s/node-exporter/gpu-node-exporter-daemonset.yaml , show failed from pod logs

root@XP005:/home/gpu-monitoring-tools# kubectl logs -f node-exporter-c99c5 -c nvidia-dcgm-exporter
Starting NVIDIA host engine...
Failed to start host engine server
Collecting metrics at /run/prometheus/dcgm.prom every 1000ms...
Stopping NVIDIA host engine...
Host engine successfully terminated.
Done

How to get mem.total with nvml

Hi, i use gpu-monitoring-tools to collect my nvidia gpu metrics, i want to get memory total, like

# nvidia-smi --format=csv --query-gpu=memory.total,memory.used,memory.free,name
memory.total [MiB], memory.used [MiB], memory.free [MiB], name
7840 MiB, 20 MiB, 7820 MiB, Tesla P4

i see deviceGetMemoryInfo not return mem.total in bindings/go/nvml/bindings.go. Should it add this value?

In addition, i run nvidia-smi -q, return below information:

Display Mode                    : Enabled
Display Active                  : Disabled
Persistence Mode                : Enabled
Accounting Mode                 : Enabled
Accounting Mode Buffer Size     : 4000

how can i use this library to get these information? Thanks.

nvml sample cannot work

cavan@cavan:~/gopath/chesscloud/src/github.com/NVIDIA/gpu-monitoring-tools/bindings/go/samples/nvml/deviceInfo$ ls
main.go
cavan@cavan:~/gopath/chesscloud/src/github.com/NVIDIA/gpu-monitoring-tools/bindings/go/samples/nvml/deviceInfo$ gb
cavan@cavan:~/gopath/chesscloud/src/github.com/NVIDIA/gpu-monitoring-tools/bindings/go/samples/nvml/deviceInfo$ ls
deviceInfo  main.go
cavan@cavan:~/gopath/chesscloud/src/github.com/NVIDIA/gpu-monitoring-tools/bindings/go/samples/nvml/deviceInfo$ ./deviceInfo 
./deviceInfo: symbol lookup error: ./deviceInfo: undefined symbol: nvmlDeviceGetCount_v2

Have you tested kubernetes dcgm-exporter-daemonset.yaml?

Trying to deploy
https://github.com/NVIDIA/gpu-monitoring-tools/blob/master/exporters/prometheus-dcgm/dcgm-exporter-daemonset.yaml

daemonset "dcgm-exporter-daemonset" created
But curl: (7) Failed to connect to 10.233.109.167 port 9100: Connection refused

Could you please provide correct yaml file for Kubernetes metrics

How to Load NVML library in docker without nvidia-docker runtime

Hi, all

I write a tool to get nvidia gpu metrics, i already install nvidia device on my host and works well, but this host not install nvidia-docker runtime.

Now i want to run my application in docker without nvidia-docker runtime, i want to know which libs should be mount in docker.

Thanks.

DCGM metrics not showing via curl xxxx:9100/metrics

DCGM metrics not showing via curl xxxx:9100/metrics, only node_*_ type metrics showing.

When I cat /run/prometheus/dcgm I do see what I want to collect.

Really no idea what I'm doing. Im new to this. Trying to scrape DCGM stats into prometheus on another server. Im working in an air gapped environment, so finding it hard to piece together, Im new enough to docker. I can see my GPUS so I think its running

#  docker exec nvidia-dcgm-exporter dcgmi discovery -i a -v | grep -c 'GPU ID:'
4
# nvidia-smi -L | wc -l
4

dcgm_power_usage{gpu="3",uuid="GPU-807436aXXXYYYZZZZZZZ0547d1624eae"} 24.196

I think I need to output curl xxxx:9100/metrics to display all the dcgm_ metrics ?

Any help appreciated

Decouple gpu-monitoring-tools with node exporter

Right now, gpu-monitoring-tools doesn't have a prometheus server. It relies on node exporter to read static file /run/prometheus/dcgm.prom.

Most of the users probably already install node exporters. I am curious if we should decouple these two components and create prometheus server directly.

nvlink bandwidth metrics

For visibility and tracking.

In exporters/prometheus-dcgm/dcgm-exporter/dcgm-exporter we can see that the nvlink_bandwidth_total counter is disabled. Can anyone give an explanation to why and and what is blocking from enabling this counter?

How the memory and GPU utilization defined

In the dmon example, memory utilization and GPU utilization are calculated. For example, in the following example, memory utilization is 1%, but how is this value defined? Is it the percentage of bandwidth used during the last second ? And is there any reference document?

$ go build && ./dmon

# sample output

Started host engine version 1.4.3 using socket path: /tmp/dcgmrxvqro.socket
# gpu   pwr  temp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     %     %     %     %   MHz   MHz
    0    43    48     0     1     0     0  3505   936
    0    43    48     0     1     0     0  3505   936

GPU isolation not working after setting default runtime to nvidia

I'm running a bare metal k8s cluster (kube_version: v1.14.1), with nvidia-gpu-device-plugin pods on each one of my GPU servers to enable GPU resources allocation. Right now I have two GPU enabled servers - (let's call them servers A and B).

I've deployed gpu-monitoring-tools using the Helm chart available.
As stated in the documentation, gpu-monitoring-tools only works if the default runtime is "nvidia", and because of that I've installed nvidia-docker2 on Server A and changed the /etc/docker/daemon.json accordingly.

But then the following issue arised...

Now when I schedule a pod via kubectl with GPU resources requests, the gpu isolation settings are not enforced for Server A. As a result the scheduled pod has visibility to all GPUs available on that server (two), as shown below:

gpu-pod at Server A, with 'nvidia' as Default Runtime:

root@gpu-pod:/# nvidia-smi 
Thu Jul 18 10:08:20 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.74       Driver Version: 418.74       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:0A:00.0 Off |                  N/A |
|  0%   38C    P8    20W / 300W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:41:00.0 Off |                  N/A |
| 35%   35C    P8    25W / 260W |      0MiB / 10981MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

gpu-pod at Server B, with runc as Default Runtime:

root@gpu-pod:/# nvidia-smi 
Thu Jul 18 10:04:40 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.74       Driver Version: 418.74       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:63:00.0 Off |                  N/A |
|  0%   29C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

gpu-pod.yaml - Kubernetes manifest used

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: tensorflow/tensorflow:1.13.2-gpu-py3
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    kubernetes.io/hostname: serverA
    #kubernetes.io/hostname: serverB

Has anyone experienced this problem previously? Any help would be much appreciated!

Help pulling field #4

I am trying to figure out how to add field #4 here: https://github.com/NVIDIA/gpu-monitoring-tools/blob/master/exporters/prometheus-dcgm/dcgm-exporter/dcgm-exporter#L43

and then how to call like here: https://github.com/NVIDIA/gpu-monitoring-tools/blob/master/exporters/prometheus-dcgm/dcgm-exporter/dcgm-exporter#L70

I am sure I am not doing something right and I apologize for what may be a super easy question

Replace shell command with nvml library to populate GPU metrics

Currently, dcgm exporter run dcgmi dmon command to populate metrics. I am thinking if it's ok to move to golang and use nvml library to get these metrics instead. This would be easier for development and maintenance in long run.

The nvml README "processInfo" show pid with "sm",the sample code not show

The nvml README "processInfo" show pid with "sm",the sample code not show
https://github.com/NVIDIA/gpu-monitoring-tools/blob/master/bindings/go/samples/nvml/processInfo/main.go

Build instructions for nvml go bindings needed

I am unable to build go software against the nvml go bindings due to the following error:

/tmp/go-build280384238/b001/exe/build: symbol lookup error: /tmp/go-build280384238/b001/exe/build: undefined symbol: nvmlDeviceGetCount_v2

Can anyone provide guidance on the explicit packages and requirements to compile this code.

I am installing the following packages cuda-nvml-dev-8-0 and nvidia-cuda-dev

One thing worth mentioning is that the build is occurring within a container and so I am looking to install the development libraries etc without having running hardware.

Have trouble in making docker image of pod-gpu-metrics-exporter

Hi there, I'd like to do some diy on pod-gpu-metrics-exporter image, so I tried to rebuild the dockfile under the path of gpu-monitoring-tools/exporters/prometheus-dcgm/k8s/pod-gpu-metrics-exporter/. However it stuck with the 4th step of dockerfile RUN go install -v pod-gpu-metrics-exporter with this error occurred

# pod-gpu-metrics-exporter
./server.go:17:2: podResourcesMaxSize redeclared in this block
	previous declaration at ./kubelet_server.go:17:38
./server.go:20:6: connectToServer redeclared in this block
	previous declaration at ./kubelet_server.go:20:56
./server.go:26:19: connectToServer.func1 redeclared in this block
	previous declaration at ./kubelet_server.go:26:19
./server.go:36:6: getListOfPods redeclared in this block
	previous declaration at ./kubelet_server.go:36:79
The command '/bin/sh -c go install -v pod-gpu-metrics-exporter' returned a non-zero code: 2

could you please take a look at it if available? huge THANKS in advance

runtime: bad pointer in frame main.main at 0xc0000edb98: 0x1

hi,
When I use the bindings/go/samples/dcgm/deviceInfo file to get information about a process， I get the following error

runtime: bad pointer in frame main.main at 0xc0000edb98: 0x1  
fatal error: invalid pointer found on stack

but when I use the dcgmi stats --pid 22574 -v ,I can get the correct result.

I am using cuda8 and the GPU device information is as follows:

Driver Version         : 390.12
DCGMSupported          : Yes
UUID                   : GPU-fe260d42-a1b6-2e82-5e69-c1029b1bef56
Brand                  : Tesla
Model                  : Tesla M40 24GB

--display option bug while running docker image

i ran the command

docker run --runtime=nvidia --rm --name=nvidia-dcgm-exporter nvidia/dcgm-exporter

and i get this error. I think It still uses --display command

'''
docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused "process_linux.go:413: running prestart hook 1 caused \"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --compat32 --graphics --utility --video --display --pid=10963 /home1/docker/overlay2/b4d42f8cac3866227520ba83541a4e57ff52533bdfd21a6fc25194a2ae6380db/merged]\\nnvidia-container-cli configure: unrecognized option '--display'\\nTry nvidia-container-cli configure --help' or nvidia-container-cli configure\\n--usage' for more information.\\n\""": unknown.
'''

nvidia-container-cli version:
version: 1.0.0
build date: 2018-01-11T00:23+0000
build revision: 4a618459e8ba522d834bb2b4c665847fae8ce0ad
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-16)
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

nvidia-docker version

NVIDIA Docker: 2.0.3
Client:
Version: 18.09.7
API version: 1.39
Go version: go1.10.8
Git commit: 2d0083d
Built: Thu Jun 27 17:56:06 2019
OS/Arch: linux/amd64
Experimental: false

Server: Docker Engine - Community
Engine:
Version: 18.09.7
API version: 1.39 (minimum version 1.12)
Go version: go1.10.8
Git commit: 2d0083d
Built: Thu Jun 27 17:26:28 2019
OS/Arch: linux/amd64
Experimental: false

Change dcgm-exporter to expose metrics through prometheus web server

dcgm-exporter use shell scripts to wrote metrics to a static file ('/run/prometheus/dcgm.prom') which expose statistics on local disk. Node-exporter read this file via --collector.textfile.directory.

The disadvantage of this pattern is dcgm has dependency on node exporter to collect GPU metrics. A lots of users may already has node exporter setup for their non accelerator workloads. I would suggest to host a prom server and using dcgm bindings to get metrics separately.

I can help on the work

does pod-devices-exporter work in kubernetes version=1.10 ?

I deployed node-device-exporter-daemonset.yaml, but it did not find anything about pod info with gpu. I found that i did not set KubeletPodResources, but i did not find this path /etc/default/kubelet in the deployed node, should i create one and then add the KUBELET_EXTRA_ARGS=--feature-gates=KubeletPodResources=true? Or is it releated with the kubernetes version? my kubernetes version is v1.10.
thank you!