aliyuncontainerservice / gpushare-device-plugin
GPU Sharing Device Plugin for Kubernetes Cluster
License: Apache License 2.0
Any chance to have the device plugin working on containerd without nvidia-docker2?
I have rebuilt my cluster with containerd, and on my worker nodes the following are installed:
libnvidia-container
nvidia-container-toolkit
nvidia-container-runtime
but the device plugin raises the error:
I0425 10:34:29.375414 1 main.go:18] Start gpushare device plugin
I0425 10:34:29.382160 1 gpumanager.go:28] Loading NVML
I0425 10:34:29.382601 1 gpumanager.go:31] Failed to initialize NVML: could not load NVML library.
I0425 10:34:29.382616 1 gpumanager.go:32] If this is a GPU node, did you set the docker default runtime to nvidia?
The default runtime has been set up to nvidia-container-runtime:
[plugins."io.containerd.runtime.v1.linux"]
no_shim = false
runtime = "nvidia-container-runtime"
runtime_root = ""
shim = "containerd-shim"
shim_debug = false
Has anyone found a workaround?
Any plan to replace nvidia-docker2 with nvidia-container-runtime?
Thanks
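For reference, a sketch of a containerd-side workaround, under the assumption that you are on containerd's CRI v2 config (`/etc/containerd/config.toml`) and that the runtime binary lives at `/usr/bin/nvidia-container-runtime`: make nvidia the default runtime for the CRI plugin rather than only the legacy `io.containerd.runtime.v1.linux` plugin shown above, so the device plugin pod itself is started by the nvidia runtime and can load NVML.

```toml
# /etc/containerd/config.toml -- paths and section layout assume CRI v2; adjust to your install
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
```

After editing, restart containerd (`systemctl restart containerd`) and recreate the device plugin pod.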
This program does not work on Kubernetes 1.25.
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
Hello,
Is it possible for several pods to request a GPU share on any card, as long as it is the same card for all of them? E.g., if you have a StatefulSet consisting of an Xserver container and an application container, those two need to share the same GPU card. I could request, say, 1 GiB of memory for each of the containers; however, if I have more than one GPU per node, I have no guarantee they use the same device, right?
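One way to get that guarantee, sketched below on the assumption that gpu-mem is assigned per pod (the extender writes a single pod-level ALIYUN_COM_GPU_MEM_IDX annotation, as seen in the logs elsewhere in this tracker), is to put both containers into one pod, so they are bound to the same device. Names and images here are placeholders for illustration:

```yaml
# Hypothetical sketch: two containers in one pod sharing one card.
apiVersion: v1
kind: Pod
metadata:
  name: xserver-with-app   # placeholder name
spec:
  containers:
  - name: xserver
    image: my-xserver:latest        # placeholder image
    resources:
      limits:
        aliyun.com/gpu-mem: 1
  - name: app
    image: my-app:latest            # placeholder image
    resources:
      limits:
        aliyun.com/gpu-mem: 1
```

Separate pods of a StatefulSet, by contrast, are scheduled independently, so there is no same-card guarantee across pods.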
When running with 450.XX or 460.XX drivers, the logs of the pod are:
gpumanager.go:28] Loading NVML
gpumanager.go:31] Failed to initialize NVML: could not load NVML library.
gpumanager.go:32] If this is a GPU node, did you set the docker default runtime to `nvidia`?
The NVIDIA driver is running correctly on the machine, as nvidia-smi shows the GPU.
We are currently trying to update the dependencies of the project and rebuild the device plugin, but have not solved the issue.
In this repo, I cannot find where aliyun.com/gpu-mem is written to the node status; I can only find aliyun.com/gpu-count being updated in the NewNvidiaDevicePlugin function.
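If I read the design correctly (this is my interpretation, not confirmed by the maintainers), aliyun.com/gpu-mem is never patched into the node status directly: the plugin advertises one fake device per memory unit through the device plugin API's ListAndWatch, and kubelet turns the device count into the extended-resource capacity. A minimal sketch of that idea, with an illustrative ID format rather than the plugin's real one:

```go
package main

import "fmt"

// buildFakeDevices sketches how a sharing plugin can expose N memory units
// as N fake devices: kubelet then reports capacity aliyun.com/gpu-mem = N
// on the node, with no explicit node-status update in the plugin code.
func buildFakeDevices(gpuUUID string, memUnits int) []string {
	devs := make([]string, 0, memUnits)
	for i := 0; i < memUnits; i++ {
		devs = append(devs, fmt.Sprintf("%s-_-%d", gpuUUID, i))
	}
	return devs
}

func main() {
	devs := buildFakeDevices("GPU-0", 11) // e.g. an 11 GiB card with --memory-unit=GiB
	fmt.Println(len(devs))                // capacity kubelet derives: 11
}
```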
```bash
cd /usr/bin/
wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
chmod u+x /usr/bin/kubectl-inspect-gpushare
./kubectl-inspect-gpushare
```
zsh: exec format error: ./kubectl-inspect-gpushare
kubectl inspect
Error: unknown command "inspect" for "kubectl"
Run 'kubectl --help' for usage.
After I restarted the GPU node and then deployed a few more services, I found that the GPU memory on some cards was over-committed, as shown below:
[root@jenkins app-deploy-platform]# kubectl-inspect-gpushare
NAME IPADDRESS GPU0(Allocated/Total) GPU1(Allocated/Total) GPU2(Allocated/Total) GPU3(Allocated/Total) GPU4(Allocated/Total) GPU5(Allocated/Total) GPU6(Allocated/Total) GPU7(Allocated/Total) GPU Memory(GiB)
192.168.3.4 192.168.3.4 18/11 8/11 9/11 11/11 17/11 8/11 8/11 4/11 83/88
192.168.68.4 192.168.68.4 14/10 10/10 6/10 14/10 10/10 10/10 9/10 0/10 73/80
192.168.68.68 192.168.68.68 9/10 8/10 4/10 0/10 0/10 0/10 0/10 0/10 21/80
---------------------------------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
177/248 (71%)
I think the plugin itself has some bugs.
Hi! I've installed all the software from the docs: https://github.com/AliyunContainerService/gpushare-scheduler-extender/blob/master/docs/install.md
I've configured all the docker/k8s components, but the scheduler still can't assign the pod to a node:
Warning FailedScheduling 4m25s (x23 over 20m) default-scheduler 0/72 nodes are available: 72 Insufficient aliyun.com/gpu-mem.
Everything seems to be running correctly on my nodes:
gpushare-device-plugin-ds-5wpdx 1/1 Running 0 5m50s 10.48.171.12 node-gpu13 <none> <none>
gpushare-device-plugin-ds-5xdfm 1/1 Running 0 5m50s 10.48.171.35 node-gpu03 <none> <none>
gpushare-device-plugin-ds-7hw6d 1/1 Running 0 5m50s 10.48.171.17 node-gpu04 <none> <none>
gpushare-device-plugin-ds-7zwd9 1/1 Running 0 5m50s 10.48.167.16 node-gpu09 <none> <none>
gpushare-device-plugin-ds-9zdvn 1/1 Running 0 5m50s 10.48.171.13 node-gpu12 <none> <none>
gpushare-device-plugin-ds-fztlx 1/1 Running 0 5m50s 10.48.171.18 node-gpu02 <none> <none>
gpushare-device-plugin-ds-g975b 1/1 Running 0 5m49s 10.48.163.19 node-gpu14 <none> <none>
gpushare-device-plugin-ds-grfnf 1/1 Running 0 5m50s 10.48.171.14 node-gpu11 <none> <none>
gpushare-device-plugin-ds-jjjzj 1/1 Running 0 5m50s 10.48.163.20 node-gpu08 <none> <none>
gpushare-device-plugin-ds-k4kbl 1/1 Running 0 5m50s 10.48.167.17 node-gpu10 <none> <none>
gpushare-device-plugin-ds-m29s9 1/1 Running 0 5m50s 10.48.163.22 node-gpu07 <none> <none>
gpushare-device-plugin-ds-p65cq 1/1 Running 0 5m50s 10.48.163.23 node-gpu06 <none> <none>
gpushare-device-plugin-ds-rf5x5 1/1 Running 0 5m50s 10.48.167.18 node-gpu01 <none> <none>
gpushare-device-plugin-ds-xxqxh 1/1 Running 0 5m50s 10.48.163.24 node-gpu05 <none> <none>
gpushare-schd-extender-68dfcdb465-m2m6z 1/1 Running 0 37m 10.48.204.105 master01 <none> <none>
master01:~# kubectl inspect gpushare
NAME IPADDRESS GPU0(Allocated/Total) GPU Memory(GiB)
node-gpu03 10.48.171.35 0/31 0/31
node-gpu09 10.48.167.16 0/31 0/31
node-gpu11 10.48.171.14 0/31 0/31
node-gpu14 10.48.163.19 0/31 0/31
node-gpu01 10.48.167.18 0/31 0/31
node-gpu04 10.48.171.17 0/31 0/31
node-gpu07 10.48.163.22 0/31 0/31
node-gpu05 10.48.163.24 0/31 0/31
node-gpu08 10.48.163.20 0/31 0/31
node-gpu10 10.48.167.17 0/31 0/31
node-gpu13 10.48.171.12 0/31 0/31
node-gpu02 10.48.171.18 0/31 0/31
node-gpu06 10.48.163.23 0/31 0/31
node-gpu12 10.48.171.13 0/31 0/31
scheduler output:
Aug 04 17:57:28 master01 kube-scheduler[17483]: I0804 17:57:28.955978 17483 factory.go:341] Creating scheduler from configuration: {{ } [] [] [{http://127.0.0.1:32766/gpushare-scheduler filter 0 bind false <nil> 0s true [{aliyun.com/gpu-mem false}] false}] 0 false}
...
Aug 04 18:38:44 master01 kube-scheduler[53986]: I0804 18:38:44.654499 53986 factory.go:382] Creating extender with config {URLPrefix:http://127.0.0.1:32766/gpushare-scheduler FilterVerb:filter PreemptVerb: PrioritizeVerb: Weight:0 BindVerb:bind EnableHTTPS:false TLSConfig:<nil> HTTPTimeout:0s NodeCacheCapable:true ManagedResources:[{Name:aliyun.com/gpu-mem IgnoredByScheduler:false}] Ignorable:false}
my typical gpu-node outputs:
Hostname: node-gpu01
Capacity:
aliyun.com/gpu-count: 1
aliyun.com/gpu-mem: 31
...
Allocatable:
aliyun.com/gpu-count: 1
aliyun.com/gpu-mem: 31
node-gpu01 kubelet[69306]: I0804 17:53:28.207639 69306 setters.go:283] Update capacity for aliyun.com/gpu-mem to 31
node-gpu01:~# docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
Unable to find image 'nvidia/cuda:10.0-base' locally
10.0-base: Pulling from nvidia/cuda
7ddbc47eeb70: Pull complete
c1bbdc448b72: Pull complete
8c3b70e39044: Pull complete
45d437916d57: Pull complete
d8f1569ddae6: Pull complete
de5a2c57c41d: Pull complete
ea6f04a00543: Pull complete
Digest: sha256:e6e1001f286d084f8a3aea991afbcfe92cd389ad1f4883491d43631f152f175e
Status: Downloaded newer image for nvidia/cuda:10.0-base
Tue Aug 4 14:08:26 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00 Driver Version: 440.64.00 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:3B:00.0 Off | 0 |
| N/A 32C P0 25W / 250W | 12MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
So, here is a Pod gpu-player with the exact same image from the demo video, which can't be scheduled due to insufficient aliyun.com/gpu-mem resources:
kubectl -n gpu-test describe pod gpu-player-f576f5dd4-njhrs
Name: gpu-player-f576f5dd4-njhrs
Namespace: gpu-test
Priority: 100
PriorityClassName: default-priority
Node: <none>
Labels: app=gpu-player
pod-template-hash=f576f5dd4
Annotations: <none>
Status: Pending
IP:
Controlled By: ReplicaSet/gpu-player-f576f5dd4
Containers:
gpu-player:
Image: cheyang/gpu-player
Port: <none>
Host Port: <none>
Limits:
aliyun.com/gpu-mem: 512
Requests:
aliyun.com/gpu-mem: 512
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-mjdsm (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-mjdsm:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-mjdsm
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
pool=automated-moderation:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3m15s (x895 over 17h) default-scheduler 0/72 nodes are available: 72 Insufficient aliyun.com/gpu-mem.
Looks like my k8s scheduler doesn't know about the custom aliyun.com/gpu-mem resource. What's wrong?
I didn't find any errors in the logs, but I'm ready to post any logs or versions if necessary.
github.com/AliyunContainerService/gpushare-device-plugin/pkg/gpu/nvidia/allocate.go:79

```go
// podReqGPU = uint(0)
for _, req := range reqs.ContainerRequests {
	podReqGPU += uint(len(req.DevicesIDs))
}
...
if getGPUMemoryFromPodResource(pod) == podReqGPU {
	...
}
```

getGPUMemoryFromPodResource() returns the pod's GPU memory request, but podReqGPU is the pod's requested GPU device count.
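For what it's worth, this comparison may be intentional rather than a bug: since every fake device stands for one memory unit, the number of device IDs kubelet asks for equals the requested GPU memory in those units. A small sketch of that equivalence (type and function names are mine, not the repo's):

```go
package main

import "fmt"

// containerRequest mimics one entry of an AllocateRequest: a container asking
// for M units of aliyun.com/gpu-mem receives M fake device IDs.
type containerRequest struct{ DevicesIDs []string }

// podReqGPU sums device-ID counts over containers; because each fake device
// represents one memory unit, the sum reproduces the memory request.
func podReqGPU(reqs []containerRequest) uint {
	var total uint
	for _, req := range reqs {
		total += uint(len(req.DevicesIDs))
	}
	return total
}

func main() {
	// Containers requesting 2 and 4 memory units respectively.
	reqs := []containerRequest{
		{DevicesIDs: []string{"d0", "d1"}},
		{DevicesIDs: []string{"d2", "d3", "d4", "d5"}},
	}
	fmt.Println(podReqGPU(reqs)) // 6, matching the total memory request in units
}
```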
I use Rancher 2.5.9 to build my cluster. I think the installation steps are correct, since they worked on another cluster that uses A100 40G cards; however, they fail on this cluster, which uses A100 80G cards.
nvidia-smi gives the correct result:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... Off | 00000000:00:08.0 Off | 0 |
| N/A 39C P0 60W / 300W | 0MiB / 80994MiB | 14% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
But no GPU resources appear in the cluster. kubectl describe node shows:
Allocatable:
cpu: 2
ephemeral-storage: 48294789041
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 3777904Ki
pods: 110
Trying to find the reason, I checked the log of the plugin's Pod:
[root@data1 ~]# docker logs 6e8823f03d54
I0114 15:37:53.669065 1 main.go:18] Start gpushare device plugin
I0114 15:37:53.669146 1 gpumanager.go:28] Loading NVML
I0114 15:37:53.743358 1 gpumanager.go:37] Fetching devices.
I0114 15:37:53.743407 1 gpumanager.go:39] No devices found. Waiting indefinitely.
[root@data1 ~]#
Any idea how this happens? Is it possible the plugin does not support the A100 80G?
IT IS A TORTURE TO COMPILE
PLEASE JUST MAKE A PROPER RELEASE INCLUDING A BINARY FILE
What happened:
Warning Failed 12m kubelet, ser-330 Error: failed to start container "k8s-deploy-ubhqko-1592387682017": Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=no-gpu-has-8MiB-to-run --compute --compat32 --graphics --utility --video --display --require=cuda>=9.0 --pid=16101 /data/docker_rt/overlay2/b647088d3759dc873fe4f60ba3b9d9de7eb85578fe17c2b2af177bb49d048450/merged]\\\\nnvidia-container-cli: device error: unknown device id: no-gpu-has-8MiB-to-run\\\\n\\\"\"": unknown
Environment:
kubectl version:
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:16:51Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14-20200217", GitCommit:"883cfa7a769459affa307774b12c9b3e99f4130b", GitTreeState:"clean", BuildDate:"2020-02-17T14:06:28Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
BareMetal User Provided Infrastructure
cat /etc/os-release:
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
uname -a:
Linux ser-330 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ kubectl -n k8s-common-ns get pods k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl -o json | jq '.metadata.annotations'
{
"ALIYUN_COM_GPU_MEM_ASSIGNED": "true",
"ALIYUN_COM_GPU_MEM_ASSUME_TIME": "1592388290278113475",
"ALIYUN_COM_GPU_MEM_DEV": "24",
"ALIYUN_COM_GPU_MEM_IDX": "1",
"ALIYUN_COM_GPU_MEM_POD": "8"
}
$ kubectl -n k8s-common-ns get pods k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl -o json | jq '.status.containerStatuses[].lastState'
{
"terminated": {
"containerID": "docker://307060463dcf85c135d89abeb50edaa493b5042f47a4d5d74eccc30b71edf245",
"exitCode": 128,
"finishedAt": "2020-06-17T10:20:49Z",
"message": "OCI runtime create failed: container_linux.go:344: starting container process caused \"process_linux.go:424: container init caused \\\"process_linux.go:407: running prestart hook 0 caused \\\\\\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=no-gpu-has-8MiB-to-run --compute --compat32 --graphics --utility --video --display --require=cuda>=9.0 --pid=5008 /data/docker_rt/overlay2/02cda4031418bb8cdf08e94213adb066981257069e48d8369cb3b9ab3e37f274/merged]\\\\\\\\nnvidia-container-cli: device error: unknown device id: no-gpu-has-8MiB-to-run\\\\\\\\n\\\\\\\"\\\"\": unknown",
"reason": "ContainerCannotRun",
"startedAt": "2020-06-17T10:20:49Z"
}
}
[ debug ] 2020/06/17 09:54:43 gpushare-predicate.go:17: check if the pod name k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl can be scheduled on node ser-330
[ debug ] 2020/06/17 09:54:43 gpushare-predicate.go:31: The pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in the namespace k8s-common-ns can be scheduled on ser-330
[ debug ] 2020/06/17 09:54:43 routes.go:121: gpusharingBind ExtenderArgs ={k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl k8s-common-ns 90fddd7e-b080-11ea-9b44-0cc47ab32cea ser-330}
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:143: Allocate() ----Begin to allocate GPU for gpu mem for pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns----
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:220: reqGPU for pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns: 8
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:239: Find candidate dev id 1 for pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns successfully.
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:147: Allocate() 1. Allocate GPU ID 1 to pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns.----
[ info ] 2020/06/17 09:54:43 controller.go:286: Need to update pod name k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns and old status is Pending, new status is Pending; its old annotation map[] and new annotation map[ALIYUN_COM_GPU_MEM_IDX:1 ALIYUN_COM_GPU_MEM_POD:8 ALIYUN_COM_GPU_MEM_ASSIGNED:false ALIYUN_COM_GPU_MEM_ASSUME_TIME:1592387683318737367 ALIYUN_COM_GPU_MEM_DEV:24]
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:179: Allocate() 2. Try to bind pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in k8s-common-ns namespace to node with &Binding{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl,GenerateName:,Namespace:,SelfLink:,UID:90fddd7e-b080-11ea-9b44-0cc47ab32cea,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,},Target:ObjectReference{Kind:Node,Namespace:,Name:ser-330,UID:,APIVersion:,ResourceVersion:,FieldPath:,},}
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:193: Allocate() 3. Try to add pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns to dev 1
[ debug ] 2020/06/17 09:54:43 deviceinfo.go:57: dev.addPod() Pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns with the GPU ID 1 will be added to device map
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:204: Allocate() ----End to allocate GPU for gpu mem for pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns----
I0617 10:04:50.278017 1 podmanager.go:123] list pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns in node ser-330 and status is Pending
I0617 10:04:50.278039 1 podutils.go:91] Found GPUSharedAssumed assumed pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in namespace k8s-common-ns.
I0617 10:04:50.278046 1 podmanager.go:157] candidate pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns with timestamp 1592387683318737367 is found.
I0617 10:04:50.278056 1 allocate.go:70] Pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns request GPU Memory 8 with timestamp 1592387683318737367
I0617 10:04:50.278064 1 allocate.go:80] Found Assumed GPU shared Pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns with GPU Memory 8
I0617 10:04:50.354408 1 podmanager.go:123] list pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns in node ser-330 and status is Pending
I0617 10:04:50.354423 1 podutils.go:96] GPU assigned Flag for pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl exists in namespace k8s-common-ns and its assigned status is true, so it's not GPUSharedAssumed assumed pod.
Does this plugin support vgl? Blender running in a container with VNC crashes on my cluster.
This is strange, because it works perfectly with the official NVIDIA plugin, but I wanted to share my GPU across multiple pod instances.
Is there any solution for my case?
In podmanager there is an operation that lists all pods in the cluster. If the cluster has too many pods (more than 20k) and you scale up a large number of GPU-consuming pods (tested with 0-1000), this triggers more than 10 QPS of list calls against the apiserver and causes a cluster-wide avalanche.
When the unit is MiB, a device with 124 GB of GPU memory yields 12400+ fake device IDs. Testing found that kubelet's ListAndWatch gRPC call then returns an error; shortening the string concatenation used for the device names mitigates it.
Jun 24 18:55:09 10-12-3-162 kubelet[350652]: E0624 18:55:09.869624 350652 endpoint.go:106] listAndWatch ended unexpectedly for device plugin aliyun.com/gpu-mem with error rpc error: code = ResourceExhausted desc = grpc: received message larger than max (7880680 vs. 4194304)
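The arithmetic behind that error can be sketched as follows. The per-device ID length below is a rough assumption of mine (not measured from the plugin), chosen only to show how MiB units blow past kubelet's 4 MiB gRPC limit and land in the ballpark of the 7880680-byte message in the log above:

```go
package main

import "fmt"

// approxPayload gives a crude lower bound on a ListAndWatch response size:
// each fake device contributes roughly its ID length in bytes
// (protobuf framing and other fields are ignored).
func approxPayload(idLen, deviceCount int) int {
	return idLen * deviceCount
}

func main() {
	const grpcMax = 4 * 1024 * 1024 // gRPC default max receive message size

	gib := approxPayload(62, 124)      // 124 fake devices with --memory-unit=GiB
	mib := approxPayload(62, 124*1024) // ~127k fake devices with --memory-unit=MiB

	fmt.Println(gib, mib, mib > grpcMax) // MiB units exceed the limit; GiB units do not
}
```

Shortening each device-ID string shrinks the payload proportionally, which is why the naming change mitigates the error.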
Annotations: ALIYUN_COM_GPU_MEM_ASSIGNED: true
ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1692105746106628538
ALIYUN_COM_GPU_MEM_DEV: 11
ALIYUN_COM_GPU_MEM_IDX: 4
ALIYUN_COM_GPU_MEM_POD: 2
Env:
ALIYUN_COM_GPU_MEM_DEV=11
ALIYUN_COM_GPU_MEM_IDX=3
ALIYUN_COM_GPU_MEM_POD=2
ALIYUN_COM_GPU_MEM_CONTAINER=2
The device:
NVIDIA_VISIBLE_DEVICES=GPU-280dd117-09e1-2e8c-25e3-52fdfac9527f
is indeed the 3rd device, so the annotation is wrong and the environment variable is correct.
Hi @cheyang,
I am currently studying the code and have a question about pods that have multiple containers.
I traced the code in kubelet and found that it calls the allocate function once per container:
```go
for _, container := range pod.Spec.Containers {
	if err := m.allocateContainerResources(pod, &container, devicesToReuse); err != nil {
		return err
	}
	m.podDevices.removeContainerAllocatedResources(string(pod.UID), container.Name, devicesToReuse)
}
```

```go
devs := allocDevices.UnsortedList()
// TODO: refactor this part of code to just append a ContainerAllocationRequest
// in a passed in AllocateRequest pointer, and issues a single Allocate call per pod.
klog.V(3).Infof("Making allocation request for devices %v for device plugin %s", devs, resource)
resp, err := eI.e.allocate(devs)
metrics.DevicePluginAllocationDuration.WithLabelValues(resource).Observe(metrics.SinceInSeconds(startRPCTime))
metrics.DeprecatedDevicePluginAllocationLatency.WithLabelValues(resource).Observe(metrics.SinceInMicroseconds(startRPCTime))
```
This case may break the pod-finding logic in the device plugin's Allocate function. Have you met this issue?
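To make the concern concrete, here is a sketch (my own simplification, not the repo's code) of why matching a pod by its total memory request can fail when kubelet issues one Allocate call per container:

```go
package main

import "fmt"

// podTotal mimics a getGPUMemoryFromPodResource-style sum over all containers.
func podTotal(containerReqs []int) int {
	total := 0
	for _, r := range containerReqs {
		total += r
	}
	return total
}

func main() {
	containers := []int{3, 5} // a pod whose two containers request 3 and 5 units

	// kubelet calls Allocate once per container, so each call carries only
	// that container's device IDs; a plugin that compares the call's device
	// count against the pod total never sees a match for this pod.
	for _, devCount := range containers {
		fmt.Printf("allocate(%d devices): equals pod total %d? %v\n",
			devCount, podTotal(containers), podTotal(containers) == devCount)
	}
}
```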
I have 2 types of GPU in my cluster: 2 nodes each with RTX 2080 Ti and P2200 cards. When I used --memory-unit=GiB, everything worked fine.
But when I changed to --memory-unit=MiB to deploy more Pods on an RTX card, the 2 RTX 2080 Ti nodes no longer showed up in kubectl inspect gpushare and were not schedulable.
ERROR: logging before flag.Parse: F0311 14:21:02.271695 238971 podinfo.go:40] Failed due to invalid configuration: no server found for cluster "local"
goroutine 1 [running, locked to thread]:
github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog.stacks(0xc42000e000, 0xc420346000, 0x76, 0xc8)
/go/src/github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog/glog.go:769 +0xcf
github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog.(*loggingT).output(0x1825a40, 0xc400000003, 0xc420118790, 0x17b61a5, 0xa, 0x28, 0x0)
/go/src/github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog/glog.go:720 +0x32d
github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog.(*loggingT).printf(0x1825a40, 0xc400000003, 0x104ecdf, 0x10, 0xc4200dfee8, 0x1, 0x1)
/go/src/github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog/glog.go:655 +0x14b
github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog.Fatalf(0x104ecdf, 0x10, 0xc4200dfee8, 0x1, 0x1)
/go/src/github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog/glog.go:1148 +0x67
main.kubeInit()
/go/src/github.com/AliyunContainerService/gpushare-device-plugin/cmd/inspect/podinfo.go:40 +0x1ec
main.init.0()
/go/src/github.com/AliyunContainerService/gpushare-device-plugin/cmd/inspect/main.go:26 +0x20
kubectl inspect gpushare
NAME IPADDRESS GPU0(Allocated/Total) GPU1(Allocated/Total) GPU Memory(GiB)
k8s-demo-slave2 192.168.2.140 0/1 0/1 0/2
--------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
0/2 (0%)
This host actually has two graphics cards. The card count is wrong, isn't it? Can the GTX 1080 Ti not be used?
```bash
nvidia-smi
Thu Oct 10 15:03:38 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50 Driver Version: 430.50 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 960 Off | 00000000:17:00.0 Off | N/A |
| 36% 29C P8 7W / 120W | 0MiB / 2002MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:66:00.0 Off | N/A |
| 14% 37C P8 25W / 270W | 0MiB / 11175MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
```
hello, my GPU server has 4 GPU cards (each with 7611 MiB).
Now three containers run on card gpu0; in total they use 7601 MiB.
Then I run a new container; as expected, this new container should run on gpu1, gpu2, or gpu3.
But it does not run on gpu1/gpu2/gpu3 at all! Actually it fails to run (CrashLoopBackOff)!
root@server:~# kubectl get po
NAME                         READY   STATUS             RESTARTS   AGE
binpack-1-5cb847f945-7dp5g   1/1     Running            0          3h33m
binpack-2-7fb6b969f-s2fmh    1/1     Running            0          64m
binpack-3-84d8979f89-d6929   1/1     Running            0          59m
binpack-4-669844dd5f-q9wvm   0/1     CrashLoopBackOff   15         56m
ngx-dep1-69c964c4b5-9d7cp    1/1     Running            0          102m
my gpu server info:
```bash
root@server:~# nvidia-smi
Wed May 20 18:18:17 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P4 Off | 00000000:18:00.0 Off | 0 |
| N/A 65C P0 25W / 75W | 7601MiB / 7611MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P4 Off | 00000000:3B:00.0 Off | 0 |
| N/A 35C P8 6W / 75W | 0MiB / 7611MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P4 Off | 00000000:5E:00.0 Off | 0 |
| N/A 32C P8 6W / 75W | 0MiB / 7611MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P4 Off | 00000000:86:00.0 Off | 0 |
| N/A 38C P8 7W / 75W | 0MiB / 7611MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 24689 C python 7227MiB |
| 0 45236 C python 151MiB |
| 0 47646 C python 213MiB |
+-----------------------------------------------------------------------------+
```
and my binpack-4.yaml info is below:
root@server:/home/guobin/gpu-repo# cat binpack-4.yaml
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-4
  labels:
    app: binpack-4
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-4
  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-4
    spec:
      containers:
      - name: binpack-4
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # MiB
            aliyun.com/gpu-mem: 200
```
As you can see, the aliyun.com/gpu-mem request is 200 MiB.
OK, that is all the important info. Why can this plugin not automatically allocate a GPU card?
Or is there something I need to modify?
Thanks for your help!
Hello, I'm trying to use gpushare device plugin only for exposing gpu_mem resource from k8s gpu node in MiB. I have all the NVIDIA things like drivers, nvidia-container-runtime etc. installed and everything works fine except one thing. For example, there is a pod YAML
```yaml
apiVersion: v1
kind: Pod
metadata:
  namespace: text-detector
  name: gpu-test-bald
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "registry.k.mycompany.com/experimental/cuda-vector-add:v0.1"
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        aliyun.com/gpu-mem: "151"
      limits:
        aliyun.com/gpu-mem: "151"
  nodeName: gpu-node10
  tolerations:
  - operator: "Exists"
```
The node gpu-node10 reports:
...
Capacity:
aliyun.com/gpu_count: 1
aliyun.com/gpu-mem: 32768
...
root@gpu-node10:~# nvidia-smi
Tue Nov 8 11:32:19 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:3B:00.0 Off | 0 |
| N/A 31C P0 32W / 250W | 24237MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 9268 C python3 1799MiB |
| 0 N/A N/A 12821 C python3 1883MiB |
| 0 N/A N/A 14311 C python3 2105MiB |
| 0 N/A N/A 16938 C python3 1401MiB |
| 0 N/A N/A 16939 C python3 1401MiB |
| 0 N/A N/A 29183 C python3 2215MiB |
| 0 N/A N/A 43383 C python3 1203MiB |
| 0 N/A N/A 52358 C python3 1939MiB |
| 0 N/A N/A 54439 C python3 1143MiB |
| 0 N/A N/A 54788 C python3 2123MiB |
| 0 N/A N/A 56272 C python3 1143MiB |
| 0 N/A N/A 56750 C python3 2089MiB |
| 0 N/A N/A 61595 C python3 2089MiB |
| 0 N/A N/A 71269 C python3 1694MiB |
+-----------------------------------------------------------------------------+
I've noticed that NVIDIA_VISIBLE_DEVICES somehow gets a different value, which causes an error during container creation:
Containers:
cuda-vector-add:
Container ID: docker://9eae154ebc7e662985e37777354e439d47eb0e7abb45d346be200101d64a3273
Image: registry.k.mycompany.com/experimental/cuda-vector-add:v0.1
Image ID: docker-pullable://registry.k.mycompany.com/experimental/cuda-vector-add@sha256:b09d5bc4243887012cc95be04f17e997bd73f52a16cae30ade28dd01bffa5e01
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: ContainerCannotRun
Message: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: no-gpu-has-151MiB-to-run: unknown device: unknown
This exact error:
OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: no-gpu-has-151MiB-to-run: unknown device: unknown
appears because the environment variable NVIDIA_VISIBLE_DEVICES gets the unacceptable value:
"NVIDIA_VISIBLE_DEVICES=no-gpu-has-151MiB-to-run"
I tracked it down in the container's OCI spec:
{
"ociVersion": "1.0.1-dev",
"process": {
"user": {
"uid": 0,
"gid": 0
},
"args": [
"/bin/sh",
"-c",
"./vectorAdd"
],
"env": [
"PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"HOSTNAME=gpu-test-bald",
"NVIDIA_VISIBLE_DEVICES=no-gpu-has-151MiB-to-run", < ------ Here it is
"ALIYUN_COM_GPU_MEM_IDX=-1",
"ALIYUN_COM_GPU_MEM_POD=151",
"ALIYUN_COM_GPU_MEM_CONTAINER=151",
"ALIYUN_COM_GPU_MEM_DEV=32768",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP_PORT=8890",
"TEXT_DETECTOR_STAGING_SERVICE_HOST=10.62.55.112",
"TEXT_DETECTOR_STAGING_SERVICE_PORT=8890",
"TEXT_DETECTOR_STAGING_PORT=tcp://10.62.55.112:8890",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP=tcp://10.62.55.112:8890",
"KUBERNETES_SERVICE_HOST=10.62.0.1",
"KUBERNETES_PORT_443_TCP=tcp://10.62.0.1:443",
"KUBERNETES_PORT_443_TCP_PORT=443",
"TEXT_DETECTOR_STAGING_SERVICE_PORT_HTTP=8890",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP_PROTO=tcp",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP_ADDR=10.62.55.112",
"KUBERNETES_PORT_443_TCP_ADDR=10.62.0.1",
"KUBERNETES_SERVICE_PORT=443",
"KUBERNETES_SERVICE_PORT_HTTPS=443",
"KUBERNETES_PORT=tcp://10.62.0.1:443",
"KUBERNETES_PORT_443_TCP_PROTO=tcp",
"CUDA_VERSION=8.0.61",
"CUDA_PKG_VERSION=8-0=8.0.61-1",
"LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64",
"LIBRARY_PATH=/usr/local/cuda/lib64/stubs:"
],
"cwd": "/usr/local/cuda/samples/0_Simple/vectorAdd",
"capabilities": {
"bounding": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"effective": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"inheritable": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"permitted": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
]
},
"oomScoreAdj": 1000
},
"root": {
"path": "/var/lib/docker/overlay2/5b9782752b5d79f2d3646b92e41511a3b959f3d2e7ed1c57c4e299dfb8cd6965/merged"
},
"hostname": "gpu-test-bald",
"mounts": [
{
"destination": "/proc",
"type": "proc",
"source": "proc",
"options": [
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/dev",
"type": "tmpfs",
"source": "tmpfs",
"options": [
"nosuid",
"strictatime",
"mode=755",
"size=65536k"
]
},
{
"destination": "/dev/pts",
"type": "devpts",
"source": "devpts",
"options": [
"nosuid",
"noexec",
"newinstance",
"ptmxmode=0666",
"mode=0620",
"gid=5"
]
},
{
"destination": "/sys",
"type": "sysfs",
"source": "sysfs",
"options": [
"nosuid",
"noexec",
"nodev",
"ro"
]
},
{
"destination": "/sys/fs/cgroup",
"type": "cgroup",
"source": "cgroup",
"options": [
"ro",
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/dev/mqueue",
"type": "mqueue",
"source": "mqueue",
"options": [
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/dev/termination-log",
"type": "bind",
"source": "/var/lib/kubelet/pods/685974b9-5eb0-11ed-bada-001eb9697543/containers/cuda-vector-add/8473aa30",
"options": [
"rbind",
"rprivate"
]
},
{
"destination": "/etc/resolv.conf",
"type": "bind",
"source": "/var/lib/docker/containers/a9b9ee7c563781578218738165e6089442e0d24bdb28ed8c320c40817680f9f7/resolv.conf",
"options": [
"rbind",
"rprivate"
]
},
{
"destination": "/etc/hostname",
"type": "bind",
"source": "/var/lib/docker/containers/a9b9ee7c563781578218738165e6089442e0d24bdb28ed8c320c40817680f9f7/hostname",
"options": [
"rbind",
"rprivate"
]
},
{
"destination": "/etc/hosts",
"type": "bind",
"source": "/var/lib/kubelet/pods/685974b9-5eb0-11ed-bada-001eb9697543/etc-hosts",
"options": [
"rbind",
"rprivate"
]
},
{
"destination": "/dev/shm",
"type": "bind",
"source": "/var/lib/docker/containers/a9b9ee7c563781578218738165e6089442e0d24bdb28ed8c320c40817680f9f7/mounts/shm",
"options": [
"rbind",
"rprivate"
]
},
{
"destination": "/var/run/secrets/kubernetes.io/serviceaccount",
"type": "bind",
"source": "/var/lib/kubelet/pods/685974b9-5eb0-11ed-bada-001eb9697543/volumes/kubernetes.io~secret/default-token-thv9d",
"options": [
"rbind",
"ro",
"rprivate"
]
}
],
"hooks": {
"prestart": [
{
"path": "/usr/bin/nvidia-container-runtime-hook",
"args": [
"/usr/bin/nvidia-container-runtime-hook",
"prestart"
]
}
]
},
"linux": {
"resources": {
"devices": [
{
"allow": false,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 5,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 3,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 9,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 8,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 5,
"minor": 0,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 5,
"minor": 1,
"access": "rwm"
},
{
"allow": false,
"type": "c",
"major": 10,
"minor": 229,
"access": "rwm"
}
],
"memory": {
"disableOOMKiller": false
},
"cpu": {
"shares": 2,
"period": 100000
},
"blockIO": {
"weight": 0
}
},
"cgroupsPath": "kubepods-besteffort-pod685974b9_5eb0_11ed_bada_001eb9697543.slice:docker:664e21c310b62b2e1c3537388127812c7e2f482cb5cf40fa52280e3b62cf2646",
"namespaces": [
{
"type": "mount"
},
{
"type": "network",
"path": "/proc/27057/ns/net"
},
{
"type": "uts"
},
{
"type": "pid"
},
{
"type": "ipc",
"path": "/proc/27057/ns/ipc"
}
],
"maskedPaths": [
"/proc/acpi",
"/proc/kcore",
"/proc/keys",
"/proc/latency_stats",
"/proc/timer_list",
"/proc/timer_stats",
"/proc/sched_debug",
"/proc/scsi",
"/sys/firmware"
],
"readonlyPaths": [
"/proc/asound",
"/proc/bus",
"/proc/fs",
"/proc/irq",
"/proc/sys",
"/proc/sysrq-trigger"
]
}
}
Adding NVIDIA_VISIBLE_DEVICES=all to the Pod YAML fixes it, as described here:
apiVersion: v1
kind: Pod
metadata:
namespace: text-detector
name: gpu-test-bald
spec:
restartPolicy: OnFailure
containers:
- name: cuda-vector-add
image: "registry.k.mycompany.com/experimental/cuda-vector-add:v0.1"
imagePullPolicy: IfNotPresent
env:
- name: NVIDIA_VISIBLE_DEVICES
value: "all"
resources:
requests:
aliyun.com/gpu-mem: "153"
limits:
aliyun.com/gpu-mem: "153"
nodeName: gpu-node10
tolerations:
- operator: "Exists"
The resulting OCI spec:
{
"ociVersion": "1.0.1-dev",
"process": {
"user": {
"uid": 0,
"gid": 0
},
"args": [
"/bin/sh",
"-c",
"./vectorAdd"
],
"env": [
"PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"HOSTNAME=gpu-test-bald",
"ALIYUN_COM_GPU_MEM_DEV=32768",
"NVIDIA_VISIBLE_DEVICES=no-gpu-has-153MiB-to-run", <----------Here it is
"ALIYUN_COM_GPU_MEM_IDX=-1",
"ALIYUN_COM_GPU_MEM_POD=153",
"ALIYUN_COM_GPU_MEM_CONTAINER=153",
"NVIDIA_VISIBLE_DEVICES=all", <-------------------Here it is
"TEXT_DETECTOR_STAGING_PORT_8890_TCP=tcp://10.62.55.112:8890",
"KUBERNETES_SERVICE_PORT_HTTPS=443",
"KUBERNETES_PORT=tcp://10.62.0.1:443",
"TEXT_DETECTOR_STAGING_SERVICE_HOST=10.62.55.112",
"TEXT_DETECTOR_STAGING_SERVICE_PORT=8890",
"KUBERNETES_SERVICE_PORT=443",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP_ADDR=10.62.55.112",
"KUBERNETES_SERVICE_HOST=10.62.0.1",
"KUBERNETES_PORT_443_TCP=tcp://10.62.0.1:443",
"TEXT_DETECTOR_STAGING_PORT=tcp://10.62.55.112:8890",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP_PORT=8890",
"KUBERNETES_PORT_443_TCP_PROTO=tcp",
"KUBERNETES_PORT_443_TCP_PORT=443",
"KUBERNETES_PORT_443_TCP_ADDR=10.62.0.1",
"TEXT_DETECTOR_STAGING_SERVICE_PORT_HTTP=8890",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP_PROTO=tcp",
"CUDA_VERSION=8.0.61",
"CUDA_PKG_VERSION=8-0=8.0.61-1",
"LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64",
"LIBRARY_PATH=/usr/local/cuda/lib64/stubs:"
],
...
Now the same Pod is created successfully and runs to completion:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
gpu-test-bald 0/1 Completed 0 3m40s 10.62.97.59 gpu-node10 <none> <none>
$ kubectl logs gpu-test-bald
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
So could you explain whether this behaviour of the NVIDIA_VISIBLE_DEVICES environment variable is correct? It seems like it is not.
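Judging from the observed error strings, the plugin's fallback appears to work roughly like the sketch below (this is inferred from the logs above, not the actual gpushare-device-plugin code): when no allocated GPU index is found (ALIYUN_COM_GPU_MEM_IDX=-1), the plugin emits a deliberately invalid device name so the container fails fast instead of silently seeing all GPUs.

```go
package main

import "fmt"

// visibleDevices returns the value the plugin appears to assign to
// NVIDIA_VISIBLE_DEVICES: the allocated GPU index when allocation
// succeeded, or a deliberately invalid marker when it did not.
// Sketch inferred from the observed error strings, not the real code.
func visibleDevices(gpuIdx int, podGPUMemMiB int) string {
	if gpuIdx < 0 {
		// No GPU satisfied the request: an invalid device name makes
		// nvidia-container-cli reject the container at creation time.
		return fmt.Sprintf("no-gpu-has-%dMiB-to-run", podGPUMemMiB)
	}
	return fmt.Sprintf("%d", gpuIdx)
}

func main() {
	fmt.Println(visibleDevices(-1, 151)) // the invalid value seen above
	fmt.Println(visibleDevices(0, 153))  // a normal allocation
}
```

This would explain why setting NVIDIA_VISIBLE_DEVICES=all in the Pod YAML masks the symptom: the later duplicate entry in the env list overrides the injected marker, but the underlying allocation mismatch remains.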
Kubernetes already reschedules pending pods on its own, so what is the purpose of getCandidatePods?
Is there any OOM kill or signal when a pod uses more GPU memory than requested? Since physical GPU memory is limited, overusing it may affect other users' processes.
For a GPU like the NVIDIA A100 PCIe 80GB, it's not possible to advertise the extended resource in MiB due to this error:
ResourceExhausted desc = grpc: received message larger than max (4986010 vs. 4194304)
The device plugin can't update the node status, which leaves the GPU node with zero gpu_memory capacity:
Capacity:
aliyun.com/gpu_memory: 0
Nov 17 15:09:51 node02 kubelet[11218]: I1117 15:09:51.797475 11218 manager.go:440] Mark all resources Unhealthy for resource aliyun.com/gpu_memory
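The limit is easy to hit because the plugin advertises one fake device per MiB of GPU memory (visible in the log excerpts elsewhere in this thread, e.g. device IDs running from `--0` to `--32509` on a 32G card). A rough back-of-the-envelope estimate for an A100 80GB, with the per-device byte count being a guess at ID length plus protobuf framing:

```go
package main

import "fmt"

func main() {
	const gpuMemMiB = 81920 // A100 80GB advertised in MiB units
	// Fake device IDs look like "GPU-<uuid>--<index>" (~45-50 bytes),
	// plus per-device protobuf framing; ~60 bytes each is a rough estimate.
	const bytesPerDevice = 60
	const grpcDefaultMax = 4 * 1024 * 1024 // 4194304, as in the error

	total := gpuMemMiB * bytesPerDevice
	fmt.Printf("~%d fake devices, ~%d bytes in the ListAndWatch response\n",
		gpuMemMiB, total)
	fmt.Println("exceeds gRPC default max message:", total > grpcDefaultMax)
}
```

81920 devices at ~60 bytes each is about 4.9 MB, which lines up with the reported "4986010 vs. 4194304". Advertising the resource in coarser units (e.g. GiB) keeps the device list small enough to register.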
We have a K8s cluster running an NVIDIA-customized version of Kubernetes for DGX, based on 1.10.8. We tried to check whether the gpushare device plugin works on it. Checking the docker logs, we found that the device plugin fails to register with the kubelet registration service through /var/lib/kubelet/device-plugins/kubelet.socket. I also checked this Unix socket; it is opened by kubelet and in the LISTENING state. What might cause the "unknown service v1beta1.Registration" error? Thanks.
I0306 08:43:57.132717 1 main.go:18] Start gpushare device plugin
I0306 08:43:57.132779 1 gpumanager.go:28] Loading NVML
I0306 08:43:57.134391 1 gpumanager.go:37] Fetching devices.
I0306 08:43:57.134409 1 gpumanager.go:43] Starting FS watcher.
I0306 08:43:57.134475 1 gpumanager.go:51] Starting OS watcher.
I0306 08:43:57.141623 1 nvidia.go:64] Deivce GPU-95061e03-5740-5360-4968-f9c567395f4a's Path is /dev/nvidia0
I0306 08:43:57.141650 1 nvidia.go:69] # device Memory: 8116
I0306 08:43:57.141655 1 nvidia.go:40] set gpu memory: 7
I0306 08:43:57.141660 1 nvidia.go:76] # Add first device ID: GPU-95061e03-5740-5360-4968-f9c567395f4a-_-0
I0306 08:43:57.141665 1 nvidia.go:79] # Add last device ID: GPU-95061e03-5740-5360-4968-f9c567395f4a-_-6
I0306 08:43:57.141669 1 server.go:43] Device Map: map[GPU-95061e03-5740-5360-4968-f9c567395f4a:0]
I0306 08:43:57.141679 1 server.go:44] Device List: [GPU-95061e03-5740-5360-4968-f9c567395f4a]
I0306 08:43:57.159087 1 podmanager.go:68] No need to update Capacity aliyun.com/gpu-count
I0306 08:43:57.159595 1 server.go:222] Starting to serve on /var/lib/kubelet/device-plugins/aliyungpushare.sock
I0306 08:43:57.160404 1 server.go:226] Could not register device plugin: rpc error: code = Unimplemented desc = unknown service v1beta1.Registration
W0306 08:43:57.160522 1 gpumanager.go:66] Failed to start device plugin due to rpc error: code = Unimplemented desc = unknown service v1beta1.Registration
I0306 08:43:57.161182 1 nvidia.go:64] Deivce GPU-95061e03-5740-5360-4968-f9c567395f4a's Path is /dev/nvidia0
First of all thanks for your work!
The kubectl-inspect-gpushare plugin fails with a FATAL error when kubectl is configured with OpenID (OIDC) as the auth provider:
$ kubectl inspect gpushare
ERROR: logging before flag.Parse: F0804 17:04:14.378041 17870 podinfo.go:44] Failed due to No Auth Provider found for name "oidc"
goroutine 1 [running, locked to thread]:
github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog.stacks(0xc0000d6000, 0xc0003ee000, 0x62, 0xb4)
/Users/vkd/go/src/github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog/glog.go:769 +0xb1
github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog.(*loggingT).output(0x2886b20, 0xc000000003, 0xc0003c7260, 0x2811fbb, 0xa, 0x2c, 0x0)
/Users/vkd/go/src/github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog/glog.go:720 +0x2f6
github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog.(*loggingT).printf(0x2886b20, 0x3, 0x1ced327, 0x10, 0xc00010df08, 0x1, 0x1)
/Users/vkd/go/src/github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog/glog.go:655 +0x14e
github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog.Fatalf(...)
/Users/vkd/go/src/github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog/glog.go:1148
main.kubeInit()
/Users/vkd/go/src/github.com/AliyunContainerService/gpushare-device-plugin/cmd/inspect/podinfo.go:44 +0x1ca
main.init.0()
/Users/vkd/go/src/github.com/AliyunContainerService/gpushare-device-plugin/cmd/inspect/main.go:26 +0x20
Looking at the Kubelet implementation that calls allocate:
resp, err := eI.e.allocate(devs)
....
m.podDevices.insert(podUID, contName, resource, allocDevices, resp.ContainerResponses[0])
The device plugin's allocate picks a pod itself, but by the time Kubelet calls deviceplug.allocate, the podUID has already been determined. Could the two differ?
Hi @cheyang,
I have a question about the pod-picking logic in the Allocate function.
As I understand it, the Allocate parameters only pass device IDs to the device plugin; there is nothing about the container or pod. Why does Allocate pick a pod and set its ALIYUN_COM_GPU_MEM_ASSIGNED to true? Can it guarantee that this pod runs immediately after Allocate? How is that achieved?
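Since Allocate only receives device IDs, the plugin has to infer which pod the request belongs to. A minimal sketch of that idea (the field names and matching conditions here are illustrative assumptions, not the plugin's actual implementation): find the pending pod on this node that the scheduler extender has "assumed", has not yet been assigned, and whose gpu-mem request matches the allocation.

```go
package main

import "fmt"

// pod is a minimal stand-in for the fields the matching logic needs.
type pod struct {
	name        string
	gpuMemReq   int   // requested aliyun.com/gpu-mem
	assumedTime int64 // scheduler-extender assume timestamp; 0 = not assumed
	assigned    bool  // ALIYUN_COM_GPU_MEM_ASSIGNED already set
}

// pickCandidate sketches the approach: match the pending, assumed,
// not-yet-assigned pod whose request equals the Allocate request size.
// Illustrative only, not the actual gpushare-device-plugin code.
func pickCandidate(pods []pod, requested int) *pod {
	for i := range pods {
		p := &pods[i]
		if p.assumedTime > 0 && !p.assigned && p.gpuMemReq == requested {
			return p
		}
	}
	return nil
}

func main() {
	pods := []pod{
		{name: "a", gpuMemReq: 151, assumedTime: 0},    // not assumed yet
		{name: "b", gpuMemReq: 153, assumedTime: 1000}, // assumed by extender
	}
	if p := pickCandidate(pods, 153); p != nil {
		fmt.Println("picked", p.name)
	}
}
```

The matching is heuristic rather than guaranteed, which is presumably why the question about two pods with identical requests on the same node is worth asking.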
Hi, I have been able to make it work on a kubespray k8s 1.13.5 cluster with a single-node, single-GPU worker.
But I hit a bug on a k8s 1.15.3 single-node, dual-GPU setup.
Can you help?
k describe pod tf-jupyter-67b475bf4d-4v2nf
...
Warning Failed (x2 over ) kubelet, node-2gpu Error: failed to start container "tensorflow": Error response from daemon: OCI runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: unknown device id: no-gpu-has-8129MiB-to-run\\n\""": unknown
[root@node-2gpu ~]# docker logs -f 575171f1ff33
I1112 23:06:01.678676 1 allocate.go:46] ----Allocating GPU for gpu mem is started----
I1112 23:06:01.678717 1 allocate.go:57] RequestPodGPUs: 8129
I1112 23:06:01.678733 1 allocate.go:61] checking...
I1112 23:06:01.705009 1 podmanager.go:112] all pod list [{{ } {tf-jupyter-67b475bf4d-4v2nf tf-jupyter-67b475bf4d- jhub /api/v1/namespaces/jhub/pods/tf-jupyter-67b475bf4d-4v2nf a66921cd-bded-460b-bf4d-beb35c17229a 16993630 0 2019-11-12 17:22:48 +0000 UTC map[app:tf-jupyter pod-template-hash:67b475bf4d] map[] [{apps/v1 ReplicaSet tf-jupyter-67b475bf4d 74a14098-b83d-419f-a8eb-d9bb6fe0ea93 0xc4204a65a7 0xc4204a65a8}] nil [] } {[{bin {&HostPathVolumeSource{Path:/usr/bin,Type:,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}} {lib {&HostPathVolumeSource{Path:/usr/lib,Type:,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}} {default-token-kjd8r {nil nil nil nil nil &SecretVolumeSource{SecretName:default-token-kjd8r,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}] [] [{tensorflow tensorflow/tensorflow:1.12.0-gpu [] [] [{ 0 8888 TCP }] [] [] {map[aliyun.com/gpu-mem:{{8129 0} {} 8129 DecimalSI}] map[aliyun.com/gpu-mem:{{8129 0} {} 8129 DecimalSI}]} [{bin false /usr/local/nvidia/bin } {lib false /usr/local/nvidia/lib } {default-token-kjd8r true /var/run/secrets/kubernetes.io/serviceaccount }] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false}] Always 0xc4204a6850 ClusterFirst map[accelerator:nvidia-tesla-m6] default default node-2gpu false false false &PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],} [] nil default-scheduler [{node.kubernetes.io/not-ready Exists NoExecute 0xc4204a6960} {node.kubernetes.io/unreachable Exists NoExecute 0xc4204a6980}] [] 0xc4204a6990 nil []} {Pending [{PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2019-11-12 17:22:48 +0000 UTC }] [] [] BestEffort}}]
I1112 23:06:01.705505 1 podmanager.go:123] list pod tf-jupyter-67b475bf4d-4v2nf in ns jhub in node node-2gpu and status is Pending
I1112 23:06:01.705555 1 podutils.go:81] No assume timestamp for pod tf-jupyter-67b475bf4d-4v2nf in namespace jhub, so it's not GPUSharedAssumed assumed pod.
W1112 23:06:01.705573 1 allocate.go:152] invalid allocation requst: request GPU memory 8129 can't be satisfied.
What happened:
A trivy image scan lists critical and high vulnerabilities against the latest image k8s-gpushare-plugin:v2-1.11-aff8a23.
What you expected to happen:
No critical or high vulnerability issues.
How to reproduce it:
trivy image --ignore-unfixed --severity HIGH,CRITICAL --format template --template "@/usr/local/share/trivy/templates/html.tpl" -o report.html k8s-gpushare-plugin:v2-1.11-aff8a23
Does it work on Windows nodes, with Windows containers?
Version info:
k8s: 1.17
gpushare-device-plugin: v2-1.11-aff8a23
nvidia-smi: 440.36
kubectl describe pod <pod-name> -n zhaogaolong
Pod error events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned zhaogaolong/gpu-demo-gpushare-659fd6cbb7-6fc8v to gpu-node
Normal Pulling 32s (x4 over 70s) kubelet, gpu-node Pulling image "hub.xxxx.com/zhaogaolong/gpu-demo.build.build:bccfcbe43f43280d-1584070500-dac37f2c12024544a6cc2871440dc94a577a7ff3"
Normal Pulled 32s (x4 over 70s) kubelet, gpu-node Successfully pulled image "hub.xxx.com/zhaogaolong/gpu-demo.build.build:bccfcbe43f43280d-1584070500-dac37f2c12024544a6cc2871440dc94a577a7ff3"
Normal Created 31s (x4 over 70s) kubelet, gpu-node Created container gpu
Warning Failed 31s (x4 over 70s) kubelet, gpu-node Error: failed to start container "gpu": Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:413: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: unknown device id: no-gpu-has-256MiB-to-run\\\\n\\\"\"": unknown
Warning BackOff 10s (x5 over 68s) kubelet, ggpu-node Back-off restarting failed container
Same issue:
This block of code causes a node with multiple GPU models (and different memory sizes) to use the first detected GPU's memory for all of them. For example, on a node with a 12G GPU and a 16G GPU, both are recognized as 12G, for a total of 24G.
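The reported behaviour can be illustrated like this (a sketch of the bug's effect, not the plugin's actual code): assuming every GPU has the first device's memory miscounts capacity on mixed nodes, while summing per-device values does not.

```go
package main

import "fmt"

// totalAssumingFirst mimics the reported bug: every GPU is assumed to
// have the same memory as the first one detected.
func totalAssumingFirst(perDeviceMiB []int) int {
	if len(perDeviceMiB) == 0 {
		return 0
	}
	return len(perDeviceMiB) * perDeviceMiB[0]
}

// totalPerDevice sums each GPU's own memory, which is what a node with
// mixed GPU models needs.
func totalPerDevice(perDeviceMiB []int) int {
	sum := 0
	for _, m := range perDeviceMiB {
		sum += m
	}
	return sum
}

func main() {
	node := []int{12288, 16384} // the 12G + 16G node from the report
	fmt.Println(totalAssumingFirst(node)) // 24576: both treated as 12G
	fmt.Println(totalPerDevice(node))     // 28672: correct total
}
```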
Capacity:
aliyun.com/gpu-count: 8
aliyun.com/gpu-mem: 0
GPU: Tesla V100
The logs are as follows:
[root@localhost ~]# kubectl logs -f -n kube-system gpushare-device-plugin-ds-qjltc
I1012 05:08:46.374978 1 main.go:18] Start gpushare device plugin
I1012 05:08:46.375045 1 gpumanager.go:28] Loading NVML
I1012 05:08:46.379478 1 gpumanager.go:37] Fetching devices.
I1012 05:08:46.379497 1 gpumanager.go:43] Starting FS watcher.
I1012 05:08:46.379930 1 gpumanager.go:51] Starting OS watcher.
I1012 05:08:46.389438 1 nvidia.go:64] Deivce GPU-60805828-8ab0-6124-67c4-9baff56d087b's Path is /dev/nvidia0
I1012 05:08:46.389549 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.389564 1 nvidia.go:40] set gpu memory: 32510
I1012 05:08:46.389577 1 nvidia.go:76] # Add first device ID: GPU-60805828-8ab0-6124-67c4-9baff56d087b--0
I1012 05:08:46.453844 1 nvidia.go:79] # Add last device ID: GPU-60805828-8ab0-6124-67c4-9baff56d087b--32509
I1012 05:08:46.461774 1 nvidia.go:64] Deivce GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01's Path is /dev/nvidia1
I1012 05:08:46.461816 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.461827 1 nvidia.go:76] # Add first device ID: GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01--0
I1012 05:08:46.559867 1 nvidia.go:79] # Add last device ID: GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01--32509
I1012 05:08:46.567541 1 nvidia.go:64] Deivce GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a's Path is /dev/nvidia2
I1012 05:08:46.567574 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.567583 1 nvidia.go:76] # Add first device ID: GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a--0
I1012 05:08:46.658328 1 nvidia.go:79] # Add last device ID: GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a--32509
I1012 05:08:46.666367 1 nvidia.go:64] Deivce GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5's Path is /dev/nvidia3
I1012 05:08:46.666393 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.666399 1 nvidia.go:76] # Add first device ID: GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5--0
I1012 05:08:46.676851 1 nvidia.go:79] # Add last device ID: GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5--32509
I1012 05:08:46.683786 1 nvidia.go:64] Deivce GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991's Path is /dev/nvidia4
I1012 05:08:46.683802 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.683809 1 nvidia.go:76] # Add first device ID: GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991--0
I1012 05:08:46.948055 1 nvidia.go:79] # Add last device ID: GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991--32509
I1012 05:08:46.956435 1 nvidia.go:64] Deivce GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf's Path is /dev/nvidia5
I1012 05:08:46.956486 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.956504 1 nvidia.go:76] # Add first device ID: GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf--0
I1012 05:08:46.972438 1 nvidia.go:79] # Add last device ID: GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf--32509
I1012 05:08:46.980775 1 nvidia.go:64] Deivce GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415's Path is /dev/nvidia6
I1012 05:08:46.980797 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.980805 1 nvidia.go:76] # Add first device ID: GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415--0
I1012 05:08:46.990545 1 nvidia.go:79] # Add last device ID: GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415--32509
I1012 05:08:46.997877 1 nvidia.go:64] Deivce GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2's Path is /dev/nvidia7
I1012 05:08:46.997891 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.997895 1 nvidia.go:76] # Add first device ID: GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2--0
I1012 05:08:47.249585 1 nvidia.go:79] # Add last device ID: GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2--32509
I1012 05:08:47.249606 1 server.go:43] Device Map: map[GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415:6 GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2:7 GPU-60805828-8ab0-6124-67c4-9baff56d087b:0 GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01:1 GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a:2 GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5:3 GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991:4 GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf:5]
I1012 05:08:47.249644 1 server.go:44] Device List: [GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5 GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991 GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415 GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2 GPU-60805828-8ab0-6124-67c4-9baff56d087b GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01 GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a]
I1012 05:08:47.265532 1 podmanager.go:68] No need to update Capacity aliyun.com/gpu-count
I1012 05:08:47.266863 1 server.go:222] Starting to serve on /var/lib/kubelet/device-plugins/aliyungpushare.sock
I1012 05:08:47.267431 1 server.go:230] Registered device plugin with Kubelet
Has anyone run into this on k8s 1.16.3 with nvidia-runtime 1.1-dev?
When I try to pull your docker image I get:
Error response from daemon: Get https://registry.cn-hangzhou.aliyuncs.com/v2/: authentication required
[attempt #1] Fail to pull registry.cn-hangzhou.aliyuncs.com/acs/k8s-gpushare-plugin:v2-1.11-aff8a23. Retry in 2 seconds
Any idea what I'm doing wrong?
Many thanks.