aliyuncontainerservice / gpushare-device-plugin
GPU Sharing Device Plugin for Kubernetes Cluster
License: Apache License 2.0
Any chance to have the device plugin working on containerd without nvidia-docker2?
I have rebuilt my cluster with containerd, and on my worker nodes the following are installed:
libnvidia-container
nvidia-container-toolkit
nvidia-container-runtime
but the device plugin raises the error:
I0425 10:34:29.375414 1 main.go:18] Start gpushare device plugin
I0425 10:34:29.382160 1 gpumanager.go:28] Loading NVML
I0425 10:34:29.382601 1 gpumanager.go:31] Failed to initialize NVML: could not load NVML library.
I0425 10:34:29.382616 1 gpumanager.go:32] If this is a GPU node, did you set the docker default runtime to nvidia?
The default runtime has been set up to nvidia-container-runtime:
[plugins."io.containerd.runtime.v1.linux"]
no_shim = false
runtime = "nvidia-container-runtime"
runtime_root = ""
shim = "containerd-shim"
shim_debug = false
Has anyone found a workaround?
Any plan to replace nvidia-docker2 with nvidia-container-runtime?
Thanks
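For reference, a sketch of a containerd-side workaround, under the assumption that you are on containerd's CRI v2 config (`/etc/containerd/config.toml`) and that the runtime binary lives at `/usr/bin/nvidia-container-runtime`: make nvidia the default runtime for the CRI plugin rather than only the legacy `io.containerd.runtime.v1.linux` plugin shown above, so the device plugin pod itself is started by the nvidia runtime and can load NVML.

```toml
# /etc/containerd/config.toml -- paths and section layout assume CRI v2; adjust to your install
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
```

After editing, restart containerd (`systemctl restart containerd`) and recreate the device plugin pod.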
This program does not work on Kubernetes 1.25.
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
Hello,
Is it possible for several pods to request a GPU share on any card, as long as it is the same card for all of them? E.g., if you have a StatefulSet consisting of an Xserver container and an application container, those two need to share the same GPU card. I could request, say, 1 GiB of memory for each of the containers; however, if I have more than one GPU per node, I have no guarantee they use the same device, right?
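One way to get that guarantee, sketched below on the assumption that gpu-mem is assigned per pod (the extender writes a single pod-level ALIYUN_COM_GPU_MEM_IDX annotation, as seen in the logs elsewhere in this tracker), is to put both containers into one pod, so they are bound to the same device. Names and images here are placeholders for illustration:

```yaml
# Hypothetical sketch: two containers in one pod sharing one card.
apiVersion: v1
kind: Pod
metadata:
  name: xserver-with-app   # placeholder name
spec:
  containers:
  - name: xserver
    image: my-xserver:latest        # placeholder image
    resources:
      limits:
        aliyun.com/gpu-mem: 1
  - name: app
    image: my-app:latest            # placeholder image
    resources:
      limits:
        aliyun.com/gpu-mem: 1
```

Separate pods of a StatefulSet, by contrast, are scheduled independently, so there is no same-card guarantee across pods.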
When running with 450.XX or 460.XX drivers, the logs of the pod are:
gpumanager.go:28] Loading NVML
gpumanager.go:31] Failed to initialize NVML: could not load NVML library.
gpumanager.go:32] If this is a GPU node, did you set the docker default runtime to `nvidia`?
The NVIDIA driver is running correctly on the machine, as nvidia-smi shows the GPU.
We are currently trying to update the dependencies of the project and rebuild the device plugin, but have not solved the issue.
In this repo, I cannot find where aliyun.com/gpu-mem is written to the node status; I can only find aliyun.com/gpu-count being updated in the NewNvidiaDevicePlugin function.
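If I read the design correctly (this is my interpretation, not confirmed by the maintainers), aliyun.com/gpu-mem is never patched into the node status directly: the plugin advertises one fake device per memory unit through the device plugin API's ListAndWatch, and kubelet turns the device count into the extended-resource capacity. A minimal sketch of that idea, with an illustrative ID format rather than the plugin's real one:

```go
package main

import "fmt"

// buildFakeDevices sketches how a sharing plugin can expose N memory units
// as N fake devices: kubelet then reports capacity aliyun.com/gpu-mem = N
// on the node, with no explicit node-status update in the plugin code.
func buildFakeDevices(gpuUUID string, memUnits int) []string {
	devs := make([]string, 0, memUnits)
	for i := 0; i < memUnits; i++ {
		devs = append(devs, fmt.Sprintf("%s-_-%d", gpuUUID, i))
	}
	return devs
}

func main() {
	devs := buildFakeDevices("GPU-0", 11) // e.g. an 11 GiB card with --memory-unit=GiB
	fmt.Println(len(devs))                // capacity kubelet derives: 11
}
```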
```bash
cd /usr/bin/
wget https://github.com/AliyunContainerService/gpushare-device-plugin/releases/download/v0.3.0/kubectl-inspect-gpushare
chmod u+x /usr/bin/kubectl-inspect-gpushare
./kubectl-inspect-gpushare
```
zsh: exec format error: ./kubectl-inspect-gpushare
kubectl inspect
Error: unknown command "inspect" for "kubectl"
Run 'kubectl --help' for usage.
After I restarted the GPU node and then deployed a few more services, I found that the GPU memory on some cards was over-committed, as shown below:
[root@jenkins app-deploy-platform]# kubectl-inspect-gpushare
NAME IPADDRESS GPU0(Allocated/Total) GPU1(Allocated/Total) GPU2(Allocated/Total) GPU3(Allocated/Total) GPU4(Allocated/Total) GPU5(Allocated/Total) GPU6(Allocated/Total) GPU7(Allocated/Total) GPU Memory(GiB)
192.168.3.4 192.168.3.4 18/11 8/11 9/11 11/11 17/11 8/11 8/11 4/11 83/88
192.168.68.4 192.168.68.4 14/10 10/10 6/10 14/10 10/10 10/10 9/10 0/10 73/80
192.168.68.68 192.168.68.68 9/10 8/10 4/10 0/10 0/10 0/10 0/10 0/10 21/80
---------------------------------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
177/248 (71%)
I think the plugin itself has some bugs.
Hi! I've installed all the software from the docs: https://github.com/AliyunContainerService/gpushare-scheduler-extender/blob/master/docs/install.md
I've configured all the docker/k8s components, but the scheduler still can't assign the pod to a node:
Warning FailedScheduling 4m25s (x23 over 20m) default-scheduler 0/72 nodes are available: 72 Insufficient aliyun.com/gpu-mem.
Everything seems to be running correctly on my nodes:
gpushare-device-plugin-ds-5wpdx 1/1 Running 0 5m50s 10.48.171.12 node-gpu13 <none> <none>
gpushare-device-plugin-ds-5xdfm 1/1 Running 0 5m50s 10.48.171.35 node-gpu03 <none> <none>
gpushare-device-plugin-ds-7hw6d 1/1 Running 0 5m50s 10.48.171.17 node-gpu04 <none> <none>
gpushare-device-plugin-ds-7zwd9 1/1 Running 0 5m50s 10.48.167.16 node-gpu09 <none> <none>
gpushare-device-plugin-ds-9zdvn 1/1 Running 0 5m50s 10.48.171.13 node-gpu12 <none> <none>
gpushare-device-plugin-ds-fztlx 1/1 Running 0 5m50s 10.48.171.18 node-gpu02 <none> <none>
gpushare-device-plugin-ds-g975b 1/1 Running 0 5m49s 10.48.163.19 node-gpu14 <none> <none>
gpushare-device-plugin-ds-grfnf 1/1 Running 0 5m50s 10.48.171.14 node-gpu11 <none> <none>
gpushare-device-plugin-ds-jjjzj 1/1 Running 0 5m50s 10.48.163.20 node-gpu08 <none> <none>
gpushare-device-plugin-ds-k4kbl 1/1 Running 0 5m50s 10.48.167.17 node-gpu10 <none> <none>
gpushare-device-plugin-ds-m29s9 1/1 Running 0 5m50s 10.48.163.22 node-gpu07 <none> <none>
gpushare-device-plugin-ds-p65cq 1/1 Running 0 5m50s 10.48.163.23 node-gpu06 <none> <none>
gpushare-device-plugin-ds-rf5x5 1/1 Running 0 5m50s 10.48.167.18 node-gpu01 <none> <none>
gpushare-device-plugin-ds-xxqxh 1/1 Running 0 5m50s 10.48.163.24 node-gpu05 <none> <none>
gpushare-schd-extender-68dfcdb465-m2m6z 1/1 Running 0 37m 10.48.204.105 master01 <none> <none>
master01:~# kubectl inspect gpushare
NAME IPADDRESS GPU0(Allocated/Total) GPU Memory(GiB)
node-gpu03 10.48.171.35 0/31 0/31
node-gpu09 10.48.167.16 0/31 0/31
node-gpu11 10.48.171.14 0/31 0/31
node-gpu14 10.48.163.19 0/31 0/31
node-gpu01 10.48.167.18 0/31 0/31
node-gpu04 10.48.171.17 0/31 0/31
node-gpu07 10.48.163.22 0/31 0/31
node-gpu05 10.48.163.24 0/31 0/31
node-gpu08 10.48.163.20 0/31 0/31
node-gpu10 10.48.167.17 0/31 0/31
node-gpu13 10.48.171.12 0/31 0/31
node-gpu02 10.48.171.18 0/31 0/31
node-gpu06 10.48.163.23 0/31 0/31
node-gpu12 10.48.171.13 0/31 0/31
scheduler output:
Aug 04 17:57:28 master01 kube-scheduler[17483]: I0804 17:57:28.955978 17483 factory.go:341] Creating scheduler from configuration: {{ } [] [] [{http://127.0.0.1:32766/gpushare-scheduler filter 0 bind false <nil> 0s true [{aliyun.com/gpu-mem false}] false}] 0 false}
...
Aug 04 18:38:44 master01 kube-scheduler[53986]: I0804 18:38:44.654499 53986 factory.go:382] Creating extender with config {URLPrefix:http://127.0.0.1:32766/gpushare-scheduler FilterVerb:filter PreemptVerb: PrioritizeVerb: Weight:0 BindVerb:bind EnableHTTPS:false TLSConfig:<nil> HTTPTimeout:0s NodeCacheCapable:true ManagedResources:[{Name:aliyun.com/gpu-mem IgnoredByScheduler:false}] Ignorable:false}
my typical gpu-node outputs:
Hostname: node-gpu01
Capacity:
aliyun.com/gpu-count: 1
aliyun.com/gpu-mem: 31
...
Allocatable:
aliyun.com/gpu-count: 1
aliyun.com/gpu-mem: 31
node-gpu01 kubelet[69306]: I0804 17:53:28.207639 69306 setters.go:283] Update capacity for aliyun.com/gpu-mem to 31
node-gpu01:~# docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
Unable to find image 'nvidia/cuda:10.0-base' locally
10.0-base: Pulling from nvidia/cuda
7ddbc47eeb70: Pull complete
c1bbdc448b72: Pull complete
8c3b70e39044: Pull complete
45d437916d57: Pull complete
d8f1569ddae6: Pull complete
de5a2c57c41d: Pull complete
ea6f04a00543: Pull complete
Digest: sha256:e6e1001f286d084f8a3aea991afbcfe92cd389ad1f4883491d43631f152f175e
Status: Downloaded newer image for nvidia/cuda:10.0-base
Tue Aug 4 14:08:26 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00 Driver Version: 440.64.00 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:3B:00.0 Off | 0 |
| N/A 32C P0 25W / 250W | 12MiB / 32510MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
So, here is a Pod gpu-player with the exact same image from the demo video, which can't be scheduled due to insufficient aliyun.com/gpu-mem resources:
kubectl -n gpu-test describe pod gpu-player-f576f5dd4-njhrs
Name: gpu-player-f576f5dd4-njhrs
Namespace: gpu-test
Priority: 100
PriorityClassName: default-priority
Node: <none>
Labels: app=gpu-player
pod-template-hash=f576f5dd4
Annotations: <none>
Status: Pending
IP:
Controlled By: ReplicaSet/gpu-player-f576f5dd4
Containers:
gpu-player:
Image: cheyang/gpu-player
Port: <none>
Host Port: <none>
Limits:
aliyun.com/gpu-mem: 512
Requests:
aliyun.com/gpu-mem: 512
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-mjdsm (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-mjdsm:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-mjdsm
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
pool=automated-moderation:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3m15s (x895 over 17h) default-scheduler 0/72 nodes are available: 72 Insufficient aliyun.com/gpu-mem.
Looks like my k8s scheduler doesn't know about the custom aliyun.com/gpu-mem resource. What's wrong?
I didn't find any errors in the logs, but I'm ready to post any logs or versions if necessary.
github.com/AliyunContainerService/gpushare-device-plugin/pkg/gpu/nvidia/allocate.go:79

```go
// podReqGPU = uint(0)
for _, req := range reqs.ContainerRequests {
	podReqGPU += uint(len(req.DevicesIDs))
}
...
if getGPUMemoryFromPodResource(pod) == podReqGPU {
	...
}
```

getGPUMemoryFromPodResource() returns the pod's GPU memory request, but podReqGPU is the pod's requested GPU device count.
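For what it's worth, this comparison may be intentional rather than a bug: since every fake device stands for one memory unit, the number of device IDs kubelet asks for equals the requested GPU memory in those units. A small sketch of that equivalence (type and function names are mine, not the repo's):

```go
package main

import "fmt"

// containerRequest mimics one entry of an AllocateRequest: a container asking
// for M units of aliyun.com/gpu-mem receives M fake device IDs.
type containerRequest struct{ DevicesIDs []string }

// podReqGPU sums device-ID counts over containers; because each fake device
// represents one memory unit, the sum reproduces the memory request.
func podReqGPU(reqs []containerRequest) uint {
	var total uint
	for _, req := range reqs {
		total += uint(len(req.DevicesIDs))
	}
	return total
}

func main() {
	// Containers requesting 2 and 4 memory units respectively.
	reqs := []containerRequest{
		{DevicesIDs: []string{"d0", "d1"}},
		{DevicesIDs: []string{"d2", "d3", "d4", "d5"}},
	}
	fmt.Println(podReqGPU(reqs)) // 6, matching the total memory request in units
}
```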
I use Rancher 2.5.9 to build my cluster. I think the installation steps are correct, since they worked on another cluster that uses A100 40G cards; however, they fail on this cluster, which uses A100 80G cards.
nvidia-smi gives the correct result:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... Off | 00000000:00:08.0 Off | 0 |
| N/A 39C P0 60W / 300W | 0MiB / 80994MiB | 14% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
But no GPU resources appear in the cluster. kubectl describe node shows:
Allocatable:
cpu: 2
ephemeral-storage: 48294789041
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 3777904Ki
pods: 110
Trying to find the reason, I checked the log of the plugin's Pod:
[root@data1 ~]# docker logs 6e8823f03d54
I0114 15:37:53.669065 1 main.go:18] Start gpushare device plugin
I0114 15:37:53.669146 1 gpumanager.go:28] Loading NVML
I0114 15:37:53.743358 1 gpumanager.go:37] Fetching devices.
I0114 15:37:53.743407 1 gpumanager.go:39] No devices found. Waiting indefinitely.
[root@data1 ~]#
Any idea how this happens? Is it possible the plugin does not support the A100 80G?
IT IS A TORTURE TO COMPILE
PLEASE JUST MAKE A PROPER RELEASE INCLUDING A BINARY FILE
What happened:
Warning Failed 12m kubelet, ser-330 Error: failed to start container "k8s-deploy-ubhqko-1592387682017": Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=no-gpu-has-8MiB-to-run --compute --compat32 --graphics --utility --video --display --require=cuda>=9.0 --pid=16101 /data/docker_rt/overlay2/b647088d3759dc873fe4f60ba3b9d9de7eb85578fe17c2b2af177bb49d048450/merged]\\\\nnvidia-container-cli: device error: unknown device id: no-gpu-has-8MiB-to-run\\\\n\\\"\"": unknown
Environment:
kubectl version:
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:16:51Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14-20200217", GitCommit:"883cfa7a769459affa307774b12c9b3e99f4130b", GitTreeState:"clean", BuildDate:"2020-02-17T14:06:28Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
BareMetal User Provided Infrastructure
cat /etc/os-release:
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
uname -a:
Linux ser-330 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ kubectl -n k8s-common-ns get pods k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl -o json | jq '.metadata.annotations'
{
"ALIYUN_COM_GPU_MEM_ASSIGNED": "true",
"ALIYUN_COM_GPU_MEM_ASSUME_TIME": "1592388290278113475",
"ALIYUN_COM_GPU_MEM_DEV": "24",
"ALIYUN_COM_GPU_MEM_IDX": "1",
"ALIYUN_COM_GPU_MEM_POD": "8"
}
$ kubectl -n k8s-common-ns get pods k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl -o json | jq '.status.containerStatuses[].lastState'
{
"terminated": {
"containerID": "docker://307060463dcf85c135d89abeb50edaa493b5042f47a4d5d74eccc30b71edf245",
"exitCode": 128,
"finishedAt": "2020-06-17T10:20:49Z",
"message": "OCI runtime create failed: container_linux.go:344: starting container process caused \"process_linux.go:424: container init caused \\\"process_linux.go:407: running prestart hook 0 caused \\\\\\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=no-gpu-has-8MiB-to-run --compute --compat32 --graphics --utility --video --display --require=cuda>=9.0 --pid=5008 /data/docker_rt/overlay2/02cda4031418bb8cdf08e94213adb066981257069e48d8369cb3b9ab3e37f274/merged]\\\\\\\\nnvidia-container-cli: device error: unknown device id: no-gpu-has-8MiB-to-run\\\\\\\\n\\\\\\\"\\\"\": unknown",
"reason": "ContainerCannotRun",
"startedAt": "2020-06-17T10:20:49Z"
}
}
[ debug ] 2020/06/17 09:54:43 gpushare-predicate.go:17: check if the pod name k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl can be scheduled on node ser-330
[ debug ] 2020/06/17 09:54:43 gpushare-predicate.go:31: The pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in the namespace k8s-common-ns can be scheduled on ser-330
[ debug ] 2020/06/17 09:54:43 routes.go:121: gpusharingBind ExtenderArgs ={k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl k8s-common-ns 90fddd7e-b080-11ea-9b44-0cc47ab32cea ser-330}
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:143: Allocate() ----Begin to allocate GPU for gpu mem for pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns----
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:220: reqGPU for pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns: 8
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:239: Find candidate dev id 1 for pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns successfully.
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:147: Allocate() 1. Allocate GPU ID 1 to pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns.----
[ info ] 2020/06/17 09:54:43 controller.go:286: Need to update pod name k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns and old status is Pending, new status is Pending; its old annotation map[] and new annotation map[ALIYUN_COM_GPU_MEM_IDX:1 ALIYUN_COM_GPU_MEM_POD:8 ALIYUN_COM_GPU_MEM_ASSIGNED:false ALIYUN_COM_GPU_MEM_ASSUME_TIME:1592387683318737367 ALIYUN_COM_GPU_MEM_DEV:24]
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:179: Allocate() 2. Try to bind pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in k8s-common-ns namespace to node with &Binding{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl,GenerateName:,Namespace:,SelfLink:,UID:90fddd7e-b080-11ea-9b44-0cc47ab32cea,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,},Target:ObjectReference{Kind:Node,Namespace:,Name:ser-330,UID:,APIVersion:,ResourceVersion:,FieldPath:,},}
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:193: Allocate() 3. Try to add pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns to dev 1
[ debug ] 2020/06/17 09:54:43 deviceinfo.go:57: dev.addPod() Pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns with the GPU ID 1 will be added to device map
[ debug ] 2020/06/17 09:54:43 nodeinfo.go:204: Allocate() ----End to allocate GPU for gpu mem for pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns----
I0617 10:04:50.278017 1 podmanager.go:123] list pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns in node ser-330 and status is Pending
I0617 10:04:50.278039 1 podutils.go:91] Found GPUSharedAssumed assumed pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in namespace k8s-common-ns.
I0617 10:04:50.278046 1 podmanager.go:157] candidate pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns with timestamp 1592387683318737367 is found.
I0617 10:04:50.278056 1 allocate.go:70] Pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns request GPU Memory 8 with timestamp 1592387683318737367
I0617 10:04:50.278064 1 allocate.go:80] Found Assumed GPU shared Pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns with GPU Memory 8
I0617 10:04:50.354408 1 podmanager.go:123] list pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl in ns k8s-common-ns in node ser-330 and status is Pending
I0617 10:04:50.354423 1 podutils.go:96] GPU assigned Flag for pod k8s-deploy-ubhqko-1592387682017-7875f9fc5c-b6pxl exists in namespace k8s-common-ns and its assigned status is true, so it's not GPUSharedAssumed assumed pod.
Does this plugin support vgl? Blender running in a container with VNC crashes on my cluster.
This is strange, because it works perfectly with the official NVIDIA plugin, but I wanted to share my GPU across multiple pod instances.
Is there any solution for my case?
In podmanager there is an operation that lists all pods in the cluster. If the cluster has too many pods (more than 20k) and you scale up a large number of GPU-consuming pods (tested with 0-1000), this triggers more than 10 QPS of list calls against the apiserver and causes a cluster-wide avalanche.
When the unit is MiB, a device with 124 GB of GPU memory yields 12400+ fake device IDs. Testing found that kubelet's ListAndWatch gRPC call then returns an error; shortening the string concatenation used for the device names mitigates it.
Jun 24 18:55:09 10-12-3-162 kubelet[350652]: E0624 18:55:09.869624 350652 endpoint.go:106] listAndWatch ended unexpectedly for device plugin aliyun.com/gpu-mem with error rpc error: code = ResourceExhausted desc = grpc: received message larger than max (7880680 vs. 4194304)
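The arithmetic behind that error can be sketched as follows. The per-device ID length below is a rough assumption of mine (not measured from the plugin), chosen only to show how MiB units blow past kubelet's 4 MiB gRPC limit and land in the ballpark of the 7880680-byte message in the log above:

```go
package main

import "fmt"

// approxPayload gives a crude lower bound on a ListAndWatch response size:
// each fake device contributes roughly its ID length in bytes
// (protobuf framing and other fields are ignored).
func approxPayload(idLen, deviceCount int) int {
	return idLen * deviceCount
}

func main() {
	const grpcMax = 4 * 1024 * 1024 // gRPC default max receive message size

	gib := approxPayload(62, 124)      // 124 fake devices with --memory-unit=GiB
	mib := approxPayload(62, 124*1024) // ~127k fake devices with --memory-unit=MiB

	fmt.Println(gib, mib, mib > grpcMax) // MiB units exceed the limit; GiB units do not
}
```

Shortening each device-ID string shrinks the payload proportionally, which is why the naming change mitigates the error.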
Annotations: ALIYUN_COM_GPU_MEM_ASSIGNED: true
ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1692105746106628538
ALIYUN_COM_GPU_MEM_DEV: 11
ALIYUN_COM_GPU_MEM_IDX: 4
ALIYUN_COM_GPU_MEM_POD: 2
Env:
ALIYUN_COM_GPU_MEM_DEV=11
ALIYUN_COM_GPU_MEM_IDX=3
ALIYUN_COM_GPU_MEM_POD=2
ALIYUN_COM_GPU_MEM_CONTAINER=2
The device:
NVIDIA_VISIBLE_DEVICES=GPU-280dd117-09e1-2e8c-25e3-52fdfac9527f
is indeed the 3rd device, so the annotation is wrong and the environment variable is correct.
Hi @cheyang,
I am currently studying the code and have a question about pods that have multiple containers.
I traced the code in kubelet and found that it calls the allocate function once per container:
```go
for _, container := range pod.Spec.Containers {
	if err := m.allocateContainerResources(pod, &container, devicesToReuse); err != nil {
		return err
	}
	m.podDevices.removeContainerAllocatedResources(string(pod.UID), container.Name, devicesToReuse)
}
```

```go
devs := allocDevices.UnsortedList()
// TODO: refactor this part of code to just append a ContainerAllocationRequest
// in a passed in AllocateRequest pointer, and issues a single Allocate call per pod.
klog.V(3).Infof("Making allocation request for devices %v for device plugin %s", devs, resource)
resp, err := eI.e.allocate(devs)
metrics.DevicePluginAllocationDuration.WithLabelValues(resource).Observe(metrics.SinceInSeconds(startRPCTime))
metrics.DeprecatedDevicePluginAllocationLatency.WithLabelValues(resource).Observe(metrics.SinceInMicroseconds(startRPCTime))
```
This case may break the pod-finding logic in the device plugin's Allocate function. Have you met this issue?
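To make the concern concrete, here is a sketch (my own simplification, not the repo's code) of why matching a pod by its total memory request can fail when kubelet issues one Allocate call per container:

```go
package main

import "fmt"

// podTotal mimics a getGPUMemoryFromPodResource-style sum over all containers.
func podTotal(containerReqs []int) int {
	total := 0
	for _, r := range containerReqs {
		total += r
	}
	return total
}

func main() {
	containers := []int{3, 5} // a pod whose two containers request 3 and 5 units

	// kubelet calls Allocate once per container, so each call carries only
	// that container's device IDs; a plugin that compares the call's device
	// count against the pod total never sees a match for this pod.
	for _, devCount := range containers {
		fmt.Printf("allocate(%d devices): equals pod total %d? %v\n",
			devCount, podTotal(containers), podTotal(containers) == devCount)
	}
}
```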
I have 2 types of GPU in my cluster: 2 nodes each with RTX 2080 Ti and P2200 cards. When I used --memory-unit=GiB, everything worked fine.
But when I changed to --memory-unit=MiB to deploy more Pods on an RTX card, the 2 RTX 2080 Ti nodes no longer showed up in kubectl inspect gpushare and were not schedulable.
ERROR: logging before flag.Parse: F0311 14:21:02.271695 238971 podinfo.go:40] Failed due to invalid configuration: no server found for cluster "local"
goroutine 1 [running, locked to thread]:
github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog.stacks(0xc42000e000, 0xc420346000, 0x76, 0xc8)
/go/src/github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog/glog.go:769 +0xcf
github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog.(*loggingT).output(0x1825a40, 0xc400000003, 0xc420118790, 0x17b61a5, 0xa, 0x28, 0x0)
/go/src/github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog/glog.go:720 +0x32d
github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog.(*loggingT).printf(0x1825a40, 0xc400000003, 0x104ecdf, 0x10, 0xc4200dfee8, 0x1, 0x1)
/go/src/github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog/glog.go:655 +0x14b
github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog.Fatalf(0x104ecdf, 0x10, 0xc4200dfee8, 0x1, 0x1)
/go/src/github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog/glog.go:1148 +0x67
main.kubeInit()
/go/src/github.com/AliyunContainerService/gpushare-device-plugin/cmd/inspect/podinfo.go:40 +0x1ec
main.init.0()
/go/src/github.com/AliyunContainerService/gpushare-device-plugin/cmd/inspect/main.go:26 +0x20
kubectl inspect gpushare
NAME IPADDRESS GPU0(Allocated/Total) GPU1(Allocated/Total) GPU Memory(GiB)
k8s-demo-slave2 192.168.2.140 0/1 0/1 0/2
--------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
0/2 (0%)
This host actually has two graphics cards. The card count is wrong, isn't it? Can the GTX 1080 Ti not be used?
```bash
nvidia-smi
Thu Oct 10 15:03:38 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50 Driver Version: 430.50 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 960 Off | 00000000:17:00.0 Off | N/A |
| 36% 29C P8 7W / 120W | 0MiB / 2002MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:66:00.0 Off | N/A |
| 14% 37C P8 25W / 270W | 0MiB / 11175MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
```
hello, my GPU server has 4 GPU cards (each with 7611 MiB).
Now three containers run on card gpu0; in total they use 7601 MiB.
Then I run a new container; as expected, this new container should run on gpu1, gpu2, or gpu3.
But it does not run on gpu1/gpu2/gpu3 at all! Actually it fails to run (CrashLoopBackOff)!
root@server:~# kubectl get po
NAME                         READY   STATUS             RESTARTS   AGE
binpack-1-5cb847f945-7dp5g   1/1     Running            0          3h33m
binpack-2-7fb6b969f-s2fmh    1/1     Running            0          64m
binpack-3-84d8979f89-d6929   1/1     Running            0          59m
binpack-4-669844dd5f-q9wvm   0/1     CrashLoopBackOff   15         56m
ngx-dep1-69c964c4b5-9d7cp    1/1     Running            0          102m
my gpu server info:
```bash
root@server:~# nvidia-smi
Wed May 20 18:18:17 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P4 Off | 00000000:18:00.0 Off | 0 |
| N/A 65C P0 25W / 75W | 7601MiB / 7611MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P4 Off | 00000000:3B:00.0 Off | 0 |
| N/A 35C P8 6W / 75W | 0MiB / 7611MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P4 Off | 00000000:5E:00.0 Off | 0 |
| N/A 32C P8 6W / 75W | 0MiB / 7611MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P4 Off | 00000000:86:00.0 Off | 0 |
| N/A 38C P8 7W / 75W | 0MiB / 7611MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 24689 C python 7227MiB |
| 0 45236 C python 151MiB |
| 0 47646 C python 213MiB |
+-----------------------------------------------------------------------------+
```
and my binpack-4.yaml info is below:
root@server:/home/guobin/gpu-repo# cat binpack-4.yaml
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-4
  labels:
    app: binpack-4
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-4
  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-4
    spec:
      containers:
      - name: binpack-4
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # MiB
            aliyun.com/gpu-mem: 200
```
As you can see, the aliyun.com/gpu-mem request is 200 MiB.
OK, that is all the important info. Why can this plugin not automatically allocate a GPU card?
Or is there something I need to modify?
Thanks for your help!
Hello, I'm trying to use gpushare device plugin only for exposing gpu_mem resource from k8s gpu node in MiB. I have all the NVIDIA things like drivers, nvidia-container-runtime etc. installed and everything works fine except one thing. For example, there is a pod YAML
```yaml
apiVersion: v1
kind: Pod
metadata:
  namespace: text-detector
  name: gpu-test-bald
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "registry.k.mycompany.com/experimental/cuda-vector-add:v0.1"
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        aliyun.com/gpu-mem: "151"
      limits:
        aliyun.com/gpu-mem: "151"
  nodeName: gpu-node10
  tolerations:
  - operator: "Exists"
```
The node gpu-node10 reports:
...
Capacity:
aliyun.com/gpu_count: 1
aliyun.com/gpu-mem: 32768
...
root@gpu-node10:~# nvidia-smi
Tue Nov 8 11:32:19 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... Off | 00000000:3B:00.0 Off | 0 |
| N/A 31C P0 32W / 250W | 24237MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 9268 C python3 1799MiB |
| 0 N/A N/A 12821 C python3 1883MiB |
| 0 N/A N/A 14311 C python3 2105MiB |
| 0 N/A N/A 16938 C python3 1401MiB |
| 0 N/A N/A 16939 C python3 1401MiB |
| 0 N/A N/A 29183 C python3 2215MiB |
| 0 N/A N/A 43383 C python3 1203MiB |
| 0 N/A N/A 52358 C python3 1939MiB |
| 0 N/A N/A 54439 C python3 1143MiB |
| 0 N/A N/A 54788 C python3 2123MiB |
| 0 N/A N/A 56272 C python3 1143MiB |
| 0 N/A N/A 56750 C python3 2089MiB |
| 0 N/A N/A 61595 C python3 2089MiB |
| 0 N/A N/A 71269 C python3 1694MiB |
+-----------------------------------------------------------------------------+
I've noticed that NVIDIA_VISIBLE_DEVICES somehow gets a different value, which causes an error during container creation:
Containers:
cuda-vector-add:
Container ID: docker://9eae154ebc7e662985e37777354e439d47eb0e7abb45d346be200101d64a3273
Image: registry.k.mycompany.com/experimental/cuda-vector-add:v0.1
Image ID: docker-pullable://registry.k.mycompany.com/experimental/cuda-vector-add@sha256:b09d5bc4243887012cc95be04f17e997bd73f52a16cae30ade28dd01bffa5e01
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: ContainerCannotRun
Message: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: no-gpu-has-151MiB-to-run: unknown device: unknown
This exact error:
OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: no-gpu-has-151MiB-to-run: unknown device: unknown
appears because the environment variable NVIDIA_VISIBLE_DEVICES gets the unacceptable value:
"NVIDIA_VISIBLE_DEVICES=no-gpu-has-151MiB-to-run"
I tracked it down in the container's OCI spec:
{
"ociVersion": "1.0.1-dev",
"process": {
"user": {
"uid": 0,
"gid": 0
},
"args": [
"/bin/sh",
"-c",
"./vectorAdd"
],
"env": [
"PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"HOSTNAME=gpu-test-bald",
"NVIDIA_VISIBLE_DEVICES=no-gpu-has-151MiB-to-run", < ------ Here it is
"ALIYUN_COM_GPU_MEM_IDX=-1",
"ALIYUN_COM_GPU_MEM_POD=151",
"ALIYUN_COM_GPU_MEM_CONTAINER=151",
"ALIYUN_COM_GPU_MEM_DEV=32768",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP_PORT=8890",
"TEXT_DETECTOR_STAGING_SERVICE_HOST=10.62.55.112",
"TEXT_DETECTOR_STAGING_SERVICE_PORT=8890",
"TEXT_DETECTOR_STAGING_PORT=tcp://10.62.55.112:8890",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP=tcp://10.62.55.112:8890",
"KUBERNETES_SERVICE_HOST=10.62.0.1",
"KUBERNETES_PORT_443_TCP=tcp://10.62.0.1:443",
"KUBERNETES_PORT_443_TCP_PORT=443",
"TEXT_DETECTOR_STAGING_SERVICE_PORT_HTTP=8890",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP_PROTO=tcp",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP_ADDR=10.62.55.112",
"KUBERNETES_PORT_443_TCP_ADDR=10.62.0.1",
"KUBERNETES_SERVICE_PORT=443",
"KUBERNETES_SERVICE_PORT_HTTPS=443",
"KUBERNETES_PORT=tcp://10.62.0.1:443",
"KUBERNETES_PORT_443_TCP_PROTO=tcp",
"CUDA_VERSION=8.0.61",
"CUDA_PKG_VERSION=8-0=8.0.61-1",
"LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64",
"LIBRARY_PATH=/usr/local/cuda/lib64/stubs:"
],
"cwd": "/usr/local/cuda/samples/0_Simple/vectorAdd",
"capabilities": {
"bounding": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"effective": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"inheritable": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"permitted": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
]
},
"oomScoreAdj": 1000
},
"root": {
"path": "/var/lib/docker/overlay2/5b9782752b5d79f2d3646b92e41511a3b959f3d2e7ed1c57c4e299dfb8cd6965/merged"
},
"hostname": "gpu-test-bald",
"mounts": [
{
"destination": "/proc",
"type": "proc",
"source": "proc",
"options": [
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/dev",
"type": "tmpfs",
"source": "tmpfs",
"options": [
"nosuid",
"strictatime",
"mode=755",
"size=65536k"
]
},
{
"destination": "/dev/pts",
"type": "devpts",
"source": "devpts",
"options": [
"nosuid",
"noexec",
"newinstance",
"ptmxmode=0666",
"mode=0620",
"gid=5"
]
},
{
"destination": "/sys",
"type": "sysfs",
"source": "sysfs",
"options": [
"nosuid",
"noexec",
"nodev",
"ro"
]
},
{
"destination": "/sys/fs/cgroup",
"type": "cgroup",
"source": "cgroup",
"options": [
"ro",
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/dev/mqueue",
"type": "mqueue",
"source": "mqueue",
"options": [
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/dev/termination-log",
"type": "bind",
"source": "/var/lib/kubelet/pods/685974b9-5eb0-11ed-bada-001eb9697543/containers/cuda-vector-add/8473aa30",
"options": [
"rbind",
"rprivate"
]
},
{
"destination": "/etc/resolv.conf",
"type": "bind",
"source": "/var/lib/docker/containers/a9b9ee7c563781578218738165e6089442e0d24bdb28ed8c320c40817680f9f7/resolv.conf",
"options": [
"rbind",
"rprivate"
]
},
{
"destination": "/etc/hostname",
"type": "bind",
"source": "/var/lib/docker/containers/a9b9ee7c563781578218738165e6089442e0d24bdb28ed8c320c40817680f9f7/hostname",
"options": [
"rbind",
"rprivate"
]
},
{
"destination": "/etc/hosts",
"type": "bind",
"source": "/var/lib/kubelet/pods/685974b9-5eb0-11ed-bada-001eb9697543/etc-hosts",
"options": [
"rbind",
"rprivate"
]
},
{
"destination": "/dev/shm",
"type": "bind",
"source": "/var/lib/docker/containers/a9b9ee7c563781578218738165e6089442e0d24bdb28ed8c320c40817680f9f7/mounts/shm",
"options": [
"rbind",
"rprivate"
]
},
{
"destination": "/var/run/secrets/kubernetes.io/serviceaccount",
"type": "bind",
"source": "/var/lib/kubelet/pods/685974b9-5eb0-11ed-bada-001eb9697543/volumes/kubernetes.io~secret/default-token-thv9d",
"options": [
"rbind",
"ro",
"rprivate"
]
}
],
"hooks": {
"prestart": [
{
"path": "/usr/bin/nvidia-container-runtime-hook",
"args": [
"/usr/bin/nvidia-container-runtime-hook",
"prestart"
]
}
]
},
"linux": {
"resources": {
"devices": [
{
"allow": false,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 5,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 3,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 9,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 8,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 5,
"minor": 0,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 5,
"minor": 1,
"access": "rwm"
},
{
"allow": false,
"type": "c",
"major": 10,
"minor": 229,
"access": "rwm"
}
],
"memory": {
"disableOOMKiller": false
},
"cpu": {
"shares": 2,
"period": 100000
},
"blockIO": {
"weight": 0
}
},
"cgroupsPath": "kubepods-besteffort-pod685974b9_5eb0_11ed_bada_001eb9697543.slice:docker:664e21c310b62b2e1c3537388127812c7e2f482cb5cf40fa52280e3b62cf2646",
"namespaces": [
{
"type": "mount"
},
{
"type": "network",
"path": "/proc/27057/ns/net"
},
{
"type": "uts"
},
{
"type": "pid"
},
{
"type": "ipc",
"path": "/proc/27057/ns/ipc"
}
],
"maskedPaths": [
"/proc/acpi",
"/proc/kcore",
"/proc/keys",
"/proc/latency_stats",
"/proc/timer_list",
"/proc/timer_stats",
"/proc/sched_debug",
"/proc/scsi",
"/sys/firmware"
],
"readonlyPaths": [
"/proc/asound",
"/proc/bus",
"/proc/fs",
"/proc/irq",
"/proc/sys",
"/proc/sysrq-trigger"
]
}
}
Adding NVIDIA_VISIBLE_DEVICES=all to the Pod YAML fixes it, as described here:
apiVersion: v1
kind: Pod
metadata:
namespace: text-detector
name: gpu-test-bald
spec:
restartPolicy: OnFailure
containers:
- name: cuda-vector-add
image: "registry.k.mycompany.com/experimental/cuda-vector-add:v0.1"
imagePullPolicy: IfNotPresent
env:
- name: NVIDIA_VISIBLE_DEVICES
value: "all"
resources:
requests:
aliyun.com/gpu-mem: "153"
limits:
aliyun.com/gpu-mem: "153"
nodeName: gpu-node10
tolerations:
- operator: "Exists"
The resulting OCI spec:
{
"ociVersion": "1.0.1-dev",
"process": {
"user": {
"uid": 0,
"gid": 0
},
"args": [
"/bin/sh",
"-c",
"./vectorAdd"
],
"env": [
"PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"HOSTNAME=gpu-test-bald",
"ALIYUN_COM_GPU_MEM_DEV=32768",
"NVIDIA_VISIBLE_DEVICES=no-gpu-has-153MiB-to-run", <----------Here it is
"ALIYUN_COM_GPU_MEM_IDX=-1",
"ALIYUN_COM_GPU_MEM_POD=153",
"ALIYUN_COM_GPU_MEM_CONTAINER=153",
"NVIDIA_VISIBLE_DEVICES=all", <-------------------Here it is
"TEXT_DETECTOR_STAGING_PORT_8890_TCP=tcp://10.62.55.112:8890",
"KUBERNETES_SERVICE_PORT_HTTPS=443",
"KUBERNETES_PORT=tcp://10.62.0.1:443",
"TEXT_DETECTOR_STAGING_SERVICE_HOST=10.62.55.112",
"TEXT_DETECTOR_STAGING_SERVICE_PORT=8890",
"KUBERNETES_SERVICE_PORT=443",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP_ADDR=10.62.55.112",
"KUBERNETES_SERVICE_HOST=10.62.0.1",
"KUBERNETES_PORT_443_TCP=tcp://10.62.0.1:443",
"TEXT_DETECTOR_STAGING_PORT=tcp://10.62.55.112:8890",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP_PORT=8890",
"KUBERNETES_PORT_443_TCP_PROTO=tcp",
"KUBERNETES_PORT_443_TCP_PORT=443",
"KUBERNETES_PORT_443_TCP_ADDR=10.62.0.1",
"TEXT_DETECTOR_STAGING_SERVICE_PORT_HTTP=8890",
"TEXT_DETECTOR_STAGING_PORT_8890_TCP_PROTO=tcp",
"CUDA_VERSION=8.0.61",
"CUDA_PKG_VERSION=8-0=8.0.61-1",
"LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64",
"LIBRARY_PATH=/usr/local/cuda/lib64/stubs:"
],
...
Now the same Pod is created successfully and runs to completion:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
gpu-test-bald 0/1 Completed 0 3m40s 10.62.97.59 gpu-node10 <none> <none>
$ kubectl logs gpu-test-bald
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
So could you explain whether this behaviour of the NVIDIA_VISIBLE_DEVICES environment variable is correct? It seems like it is not.
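Judging from the observed error strings, the plugin's fallback appears to work roughly like the sketch below (this is inferred from the logs above, not the actual gpushare-device-plugin code): when no allocated GPU index is found (ALIYUN_COM_GPU_MEM_IDX=-1), the plugin emits a deliberately invalid device name so the container fails fast instead of silently seeing all GPUs.

```go
package main

import "fmt"

// visibleDevices returns the value the plugin appears to assign to
// NVIDIA_VISIBLE_DEVICES: the allocated GPU index when allocation
// succeeded, or a deliberately invalid marker when it did not.
// Sketch inferred from the observed error strings, not the real code.
func visibleDevices(gpuIdx int, podGPUMemMiB int) string {
	if gpuIdx < 0 {
		// No GPU satisfied the request: an invalid device name makes
		// nvidia-container-cli reject the container at creation time.
		return fmt.Sprintf("no-gpu-has-%dMiB-to-run", podGPUMemMiB)
	}
	return fmt.Sprintf("%d", gpuIdx)
}

func main() {
	fmt.Println(visibleDevices(-1, 151)) // the invalid value seen above
	fmt.Println(visibleDevices(0, 153))  // a normal allocation
}
```

This would explain why setting NVIDIA_VISIBLE_DEVICES=all in the Pod YAML masks the symptom: the later duplicate entry in the env list overrides the injected marker, but the underlying allocation mismatch remains.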
Kubernetes already reschedules pending pods on its own, so what is the purpose of getCandidatePods?
Is there any OOM kill or signal when a pod uses more GPU memory than requested? Since physical GPU memory is limited, overusing it may affect other users' processes.
For a GPU like the NVIDIA A100 PCIe 80GB, it's not possible to advertise the extended resource in MiB due to this error:
ResourceExhausted desc = grpc: received message larger than max (4986010 vs. 4194304)
The device plugin can't update the node status, which leaves the GPU node with zero gpu_memory capacity:
Capacity:
aliyun.com/gpu_memory: 0
Nov 17 15:09:51 node02 kubelet[11218]: I1117 15:09:51.797475 11218 manager.go:440] Mark all resources Unhealthy for resource aliyun.com/gpu_memory
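The limit is easy to hit because the plugin advertises one fake device per MiB of GPU memory (visible in the log excerpts elsewhere in this thread, e.g. device IDs running from `--0` to `--32509` on a 32G card). A rough back-of-the-envelope estimate for an A100 80GB, with the per-device byte count being a guess at ID length plus protobuf framing:

```go
package main

import "fmt"

func main() {
	const gpuMemMiB = 81920 // A100 80GB advertised in MiB units
	// Fake device IDs look like "GPU-<uuid>--<index>" (~45-50 bytes),
	// plus per-device protobuf framing; ~60 bytes each is a rough estimate.
	const bytesPerDevice = 60
	const grpcDefaultMax = 4 * 1024 * 1024 // 4194304, as in the error

	total := gpuMemMiB * bytesPerDevice
	fmt.Printf("~%d fake devices, ~%d bytes in the ListAndWatch response\n",
		gpuMemMiB, total)
	fmt.Println("exceeds gRPC default max message:", total > grpcDefaultMax)
}
```

81920 devices at ~60 bytes each is about 4.9 MB, which lines up with the reported "4986010 vs. 4194304". Advertising the resource in coarser units (e.g. GiB) keeps the device list small enough to register.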
We have a K8s cluster running an NVIDIA-customized version of Kubernetes for DGX, based on 1.10.8. We tried to check whether the gpushare device plugin works on it. Checking the docker logs, we found that the device plugin fails to register with the kubelet registration service through /var/lib/kubelet/device-plugins/kubelet.socket. I also checked this Unix socket; it is opened by kubelet and in the LISTENING state. What might cause the "unknown service v1beta1.Registration" error? Thanks.
I0306 08:43:57.132717 1 main.go:18] Start gpushare device plugin
I0306 08:43:57.132779 1 gpumanager.go:28] Loading NVML
I0306 08:43:57.134391 1 gpumanager.go:37] Fetching devices.
I0306 08:43:57.134409 1 gpumanager.go:43] Starting FS watcher.
I0306 08:43:57.134475 1 gpumanager.go:51] Starting OS watcher.
I0306 08:43:57.141623 1 nvidia.go:64] Deivce GPU-95061e03-5740-5360-4968-f9c567395f4a's Path is /dev/nvidia0
I0306 08:43:57.141650 1 nvidia.go:69] # device Memory: 8116
I0306 08:43:57.141655 1 nvidia.go:40] set gpu memory: 7
I0306 08:43:57.141660 1 nvidia.go:76] # Add first device ID: GPU-95061e03-5740-5360-4968-f9c567395f4a-_-0
I0306 08:43:57.141665 1 nvidia.go:79] # Add last device ID: GPU-95061e03-5740-5360-4968-f9c567395f4a-_-6
I0306 08:43:57.141669 1 server.go:43] Device Map: map[GPU-95061e03-5740-5360-4968-f9c567395f4a:0]
I0306 08:43:57.141679 1 server.go:44] Device List: [GPU-95061e03-5740-5360-4968-f9c567395f4a]
I0306 08:43:57.159087 1 podmanager.go:68] No need to update Capacity aliyun.com/gpu-count
I0306 08:43:57.159595 1 server.go:222] Starting to serve on /var/lib/kubelet/device-plugins/aliyungpushare.sock
I0306 08:43:57.160404 1 server.go:226] Could not register device plugin: rpc error: code = Unimplemented desc = unknown service v1beta1.Registration
W0306 08:43:57.160522 1 gpumanager.go:66] Failed to start device plugin due to rpc error: code = Unimplemented desc = unknown service v1beta1.Registration
I0306 08:43:57.161182 1 nvidia.go:64] Deivce GPU-95061e03-5740-5360-4968-f9c567395f4a's Path is /dev/nvidia0
First of all thanks for your work!
The kubectl-inspect-gpushare plugin fails with a FATAL error when kubectl is configured with OpenID (OIDC) as the auth provider:
$ kubectl inspect gpushare
ERROR: logging before flag.Parse: F0804 17:04:14.378041 17870 podinfo.go:44] Failed due to No Auth Provider found for name "oidc"
goroutine 1 [running, locked to thread]:
github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog.stacks(0xc0000d6000, 0xc0003ee000, 0x62, 0xb4)
/Users/vkd/go/src/github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog/glog.go:769 +0xb1
github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog.(*loggingT).output(0x2886b20, 0xc000000003, 0xc0003c7260, 0x2811fbb, 0xa, 0x2c, 0x0)
/Users/vkd/go/src/github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog/glog.go:720 +0x2f6
github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog.(*loggingT).printf(0x2886b20, 0x3, 0x1ced327, 0x10, 0xc00010df08, 0x1, 0x1)
/Users/vkd/go/src/github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog/glog.go:655 +0x14e
github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog.Fatalf(...)
/Users/vkd/go/src/github.com/AliyunContainerService/gpushare-device-plugin/vendor/github.com/golang/glog/glog.go:1148
main.kubeInit()
/Users/vkd/go/src/github.com/AliyunContainerService/gpushare-device-plugin/cmd/inspect/podinfo.go:44 +0x1ca
main.init.0()
/Users/vkd/go/src/github.com/AliyunContainerService/gpushare-device-plugin/cmd/inspect/main.go:26 +0x20
Looking at the Kubelet implementation that calls allocate:
resp, err := eI.e.allocate(devs)
....
m.podDevices.insert(podUID, contName, resource, allocDevices, resp.ContainerResponses[0])
The device plugin's allocate picks a pod itself, but by the time Kubelet calls deviceplug.allocate, the podUID has already been determined. Could the two differ?
Hi @cheyang,
I have a question about the pod-picking logic in the Allocate function.
As I understand it, the Allocate parameters only pass device IDs to the device plugin; there is nothing about the container or pod. Why does Allocate pick a pod and set its ALIYUN_COM_GPU_MEM_ASSIGNED to true? Can it guarantee that this pod runs immediately after Allocate? How is that achieved?
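Since Allocate only receives device IDs, the plugin has to infer which pod the request belongs to. A minimal sketch of that idea (the field names and matching conditions here are illustrative assumptions, not the plugin's actual implementation): find the pending pod on this node that the scheduler extender has "assumed", has not yet been assigned, and whose gpu-mem request matches the allocation.

```go
package main

import "fmt"

// pod is a minimal stand-in for the fields the matching logic needs.
type pod struct {
	name        string
	gpuMemReq   int   // requested aliyun.com/gpu-mem
	assumedTime int64 // scheduler-extender assume timestamp; 0 = not assumed
	assigned    bool  // ALIYUN_COM_GPU_MEM_ASSIGNED already set
}

// pickCandidate sketches the approach: match the pending, assumed,
// not-yet-assigned pod whose request equals the Allocate request size.
// Illustrative only, not the actual gpushare-device-plugin code.
func pickCandidate(pods []pod, requested int) *pod {
	for i := range pods {
		p := &pods[i]
		if p.assumedTime > 0 && !p.assigned && p.gpuMemReq == requested {
			return p
		}
	}
	return nil
}

func main() {
	pods := []pod{
		{name: "a", gpuMemReq: 151, assumedTime: 0},    // not assumed yet
		{name: "b", gpuMemReq: 153, assumedTime: 1000}, // assumed by extender
	}
	if p := pickCandidate(pods, 153); p != nil {
		fmt.Println("picked", p.name)
	}
}
```

The matching is heuristic rather than guaranteed, which is presumably why the question about two pods with identical requests on the same node is worth asking.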
Hi, I have been able to make it work on a kubespray k8s 1.13.5 cluster with a single-node, single-GPU worker.
But I hit a bug on a k8s 1.15.3 single-node, dual-GPU setup.
Can you help?
k describe pod tf-jupyter-67b475bf4d-4v2nf
...
Warning Failed (x2 over ) kubelet, node-2gpu Error: failed to start container "tensorflow": Error response from daemon: OCI runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: unknown device id: no-gpu-has-8129MiB-to-run\\n\""": unknown
[root@node-2gpu ~]# docker logs -f 575171f1ff33
I1112 23:06:01.678676 1 allocate.go:46] ----Allocating GPU for gpu mem is started----
I1112 23:06:01.678717 1 allocate.go:57] RequestPodGPUs: 8129
I1112 23:06:01.678733 1 allocate.go:61] checking...
I1112 23:06:01.705009 1 podmanager.go:112] all pod list [{{ } {tf-jupyter-67b475bf4d-4v2nf tf-jupyter-67b475bf4d- jhub /api/v1/namespaces/jhub/pods/tf-jupyter-67b475bf4d-4v2nf a66921cd-bded-460b-bf4d-beb35c17229a 16993630 0 2019-11-12 17:22:48 +0000 UTC map[app:tf-jupyter pod-template-hash:67b475bf4d] map[] [{apps/v1 ReplicaSet tf-jupyter-67b475bf4d 74a14098-b83d-419f-a8eb-d9bb6fe0ea93 0xc4204a65a7 0xc4204a65a8}] nil [] } {[{bin {&HostPathVolumeSource{Path:/usr/bin,Type:,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}} {lib {&HostPathVolumeSource{Path:/usr/lib,Type:,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}} {default-token-kjd8r {nil nil nil nil nil &SecretVolumeSource{SecretName:default-token-kjd8r,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}] [] [{tensorflow tensorflow/tensorflow:1.12.0-gpu [] [] [{ 0 8888 TCP }] [] [] {map[aliyun.com/gpu-mem:{{8129 0} {} 8129 DecimalSI}] map[aliyun.com/gpu-mem:{{8129 0} {} 8129 DecimalSI}]} [{bin false /usr/local/nvidia/bin } {lib false /usr/local/nvidia/lib } {default-token-kjd8r true /var/run/secrets/kubernetes.io/serviceaccount }] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false}] Always 0xc4204a6850 ClusterFirst map[accelerator:nvidia-tesla-m6] default default node-2gpu false false false &PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],} [] nil default-scheduler [{node.kubernetes.io/not-ready Exists NoExecute 0xc4204a6960} {node.kubernetes.io/unreachable Exists NoExecute 0xc4204a6980}] [] 0xc4204a6990 nil []} {Pending [{PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2019-11-12 17:22:48 +0000 UTC }] [] [] BestEffort}}]
I1112 23:06:01.705505 1 podmanager.go:123] list pod tf-jupyter-67b475bf4d-4v2nf in ns jhub in node node-2gpu and status is Pending
I1112 23:06:01.705555 1 podutils.go:81] No assume timestamp for pod tf-jupyter-67b475bf4d-4v2nf in namespace jhub, so it's not GPUSharedAssumed assumed pod.
W1112 23:06:01.705573 1 allocate.go:152] invalid allocation requst: request GPU memory 8129 can't be satisfied.
What happened:
A trivy image scan lists critical and high vulnerabilities against the latest image k8s-gpushare-plugin:v2-1.11-aff8a23.
What you expected to happen:
No critical or high vulnerability issues.
How to reproduce it:
trivy image --ignore-unfixed --severity HIGH,CRITICAL --format template --template "@/usr/local/share/trivy/templates/html.tpl" -o report.html k8s-gpushare-plugin:v2-1.11-aff8a23
Does it work on Windows nodes, with Windows containers?
Version info:
k8s: 1.17
gpushare-device-plugin: v2-1.11-aff8a23
nvidia-smi: 440.36
kubectl describe pod <pod-name> -n zhaogaolong
Pod error events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned zhaogaolong/gpu-demo-gpushare-659fd6cbb7-6fc8v to gpu-node
Normal Pulling 32s (x4 over 70s) kubelet, gpu-node Pulling image "hub.xxxx.com/zhaogaolong/gpu-demo.build.build:bccfcbe43f43280d-1584070500-dac37f2c12024544a6cc2871440dc94a577a7ff3"
Normal Pulled 32s (x4 over 70s) kubelet, gpu-node Successfully pulled image "hub.xxx.com/zhaogaolong/gpu-demo.build.build:bccfcbe43f43280d-1584070500-dac37f2c12024544a6cc2871440dc94a577a7ff3"
Normal Created 31s (x4 over 70s) kubelet, gpu-node Created container gpu
Warning Failed 31s (x4 over 70s) kubelet, gpu-node Error: failed to start container "gpu": Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:413: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: unknown device id: no-gpu-has-256MiB-to-run\\\\n\\\"\"": unknown
Warning BackOff 10s (x5 over 68s) kubelet, ggpu-node Back-off restarting failed container
Same issue:
This block of code causes a node with multiple GPU models (and different memory sizes) to use the first detected GPU's memory for all of them. For example, on a node with a 12G GPU and a 16G GPU, both are recognized as 12G, for a total of 24G.
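The reported behaviour can be illustrated like this (a sketch of the bug's effect, not the plugin's actual code): assuming every GPU has the first device's memory miscounts capacity on mixed nodes, while summing per-device values does not.

```go
package main

import "fmt"

// totalAssumingFirst mimics the reported bug: every GPU is assumed to
// have the same memory as the first one detected.
func totalAssumingFirst(perDeviceMiB []int) int {
	if len(perDeviceMiB) == 0 {
		return 0
	}
	return len(perDeviceMiB) * perDeviceMiB[0]
}

// totalPerDevice sums each GPU's own memory, which is what a node with
// mixed GPU models needs.
func totalPerDevice(perDeviceMiB []int) int {
	sum := 0
	for _, m := range perDeviceMiB {
		sum += m
	}
	return sum
}

func main() {
	node := []int{12288, 16384} // the 12G + 16G node from the report
	fmt.Println(totalAssumingFirst(node)) // 24576: both treated as 12G
	fmt.Println(totalPerDevice(node))     // 28672: correct total
}
```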
Capacity:
aliyun.com/gpu-count: 8
aliyun.com/gpu-mem: 0
GPU: Tesla V100
The logs are as follows:
[root@localhost ~]# kubectl logs -f -n kube-system gpushare-device-plugin-ds-qjltc
I1012 05:08:46.374978 1 main.go:18] Start gpushare device plugin
I1012 05:08:46.375045 1 gpumanager.go:28] Loading NVML
I1012 05:08:46.379478 1 gpumanager.go:37] Fetching devices.
I1012 05:08:46.379497 1 gpumanager.go:43] Starting FS watcher.
I1012 05:08:46.379930 1 gpumanager.go:51] Starting OS watcher.
I1012 05:08:46.389438 1 nvidia.go:64] Deivce GPU-60805828-8ab0-6124-67c4-9baff56d087b's Path is /dev/nvidia0
I1012 05:08:46.389549 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.389564 1 nvidia.go:40] set gpu memory: 32510
I1012 05:08:46.389577 1 nvidia.go:76] # Add first device ID: GPU-60805828-8ab0-6124-67c4-9baff56d087b--0
I1012 05:08:46.453844 1 nvidia.go:79] # Add last device ID: GPU-60805828-8ab0-6124-67c4-9baff56d087b--32509
I1012 05:08:46.461774 1 nvidia.go:64] Deivce GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01's Path is /dev/nvidia1
I1012 05:08:46.461816 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.461827 1 nvidia.go:76] # Add first device ID: GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01--0
I1012 05:08:46.559867 1 nvidia.go:79] # Add last device ID: GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01--32509
I1012 05:08:46.567541 1 nvidia.go:64] Deivce GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a's Path is /dev/nvidia2
I1012 05:08:46.567574 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.567583 1 nvidia.go:76] # Add first device ID: GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a--0
I1012 05:08:46.658328 1 nvidia.go:79] # Add last device ID: GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a--32509
I1012 05:08:46.666367 1 nvidia.go:64] Deivce GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5's Path is /dev/nvidia3
I1012 05:08:46.666393 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.666399 1 nvidia.go:76] # Add first device ID: GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5--0
I1012 05:08:46.676851 1 nvidia.go:79] # Add last device ID: GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5--32509
I1012 05:08:46.683786 1 nvidia.go:64] Deivce GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991's Path is /dev/nvidia4
I1012 05:08:46.683802 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.683809 1 nvidia.go:76] # Add first device ID: GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991--0
I1012 05:08:46.948055 1 nvidia.go:79] # Add last device ID: GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991--32509
I1012 05:08:46.956435 1 nvidia.go:64] Deivce GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf's Path is /dev/nvidia5
I1012 05:08:46.956486 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.956504 1 nvidia.go:76] # Add first device ID: GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf--0
I1012 05:08:46.972438 1 nvidia.go:79] # Add last device ID: GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf--32509
I1012 05:08:46.980775 1 nvidia.go:64] Deivce GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415's Path is /dev/nvidia6
I1012 05:08:46.980797 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.980805 1 nvidia.go:76] # Add first device ID: GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415--0
I1012 05:08:46.990545 1 nvidia.go:79] # Add last device ID: GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415--32509
I1012 05:08:46.997877 1 nvidia.go:64] Deivce GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2's Path is /dev/nvidia7
I1012 05:08:46.997891 1 nvidia.go:69] # device Memory: 32510
I1012 05:08:46.997895 1 nvidia.go:76] # Add first device ID: GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2--0
I1012 05:08:47.249585 1 nvidia.go:79] # Add last device ID: GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2--32509
I1012 05:08:47.249606 1 server.go:43] Device Map: map[GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415:6 GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2:7 GPU-60805828-8ab0-6124-67c4-9baff56d087b:0 GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01:1 GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a:2 GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5:3 GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991:4 GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf:5]
I1012 05:08:47.249644 1 server.go:44] Device List: [GPU-c854bc81-34e3-0ecd-7371-e095b70b03e5 GPU-7770845b-ed41-a3cd-7ca1-92cfeffa3991 GPU-e94907ae-1d00-7b23-c45d-840b7c9daeaf GPU-fa56285a-16dc-ba8d-22bc-4da78fa1e415 GPU-4e75e7aa-bf09-9acd-0ba1-b415b61f03f2 GPU-60805828-8ab0-6124-67c4-9baff56d087b GPU-41e647db-0c4c-7817-219d-e1cd7bb8ed01 GPU-7e19808b-d7da-307c-5cbf-3d3699c82d7a]
I1012 05:08:47.265532 1 podmanager.go:68] No need to update Capacity aliyun.com/gpu-count
I1012 05:08:47.266863 1 server.go:222] Starting to serve on /var/lib/kubelet/device-plugins/aliyungpushare.sock
I1012 05:08:47.267431 1 server.go:230] Registered device plugin with Kubelet
Has anyone run into this on k8s 1.16.3 with nvidia-runtime 1.1-dev?
When I try to pull your docker image I get:
Error response from daemon: Get https://registry.cn-hangzhou.aliyuncs.com/v2/: authentication required
[attempt #1] Fail to pull registry.cn-hangzhou.aliyuncs.com/acs/k8s-gpushare-plugin:v2-1.11-aff8a23. Retry in 2 seconds
Any idea what I'm doing wrong?
Many thanks.