Comments (4)
A pod that requests gpu-count can be scheduled, but the device plugin won't give it GPU capabilities.
It's just an extended resource that records the GPU count. See the gpu count definition.
I think I can disable gpu-count at scheduling time, so that aliyun.com/gpu-count cannot be used for scheduling and only serves to record the number of GPUs. What do you think?
from gpushare-scheduler-extender.
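The proposal above could be sketched as a simple admission-style check: a container may request aliyun.com/gpu-mem, but a request for aliyun.com/gpu-count is refused because that resource only records how many GPUs a node has. This is a minimal Python sketch, not code from gpushare-scheduler-extender; the function name `check_gpu_request` and the dict shape (mirroring a container's `resources.limits`) are assumptions for illustration.

```python
GPU_MEM = "aliyun.com/gpu-mem"      # schedulable: shared GPU memory
GPU_COUNT = "aliyun.com/gpu-count"  # informational: number of GPUs on the node


def check_gpu_request(limits):
    """Return (allowed, reason) for a container's resource limits.

    `limits` is a plain dict shaped like `resources.limits` in a pod spec.
    Hypothetical helper: it only illustrates the rule proposed in the thread.
    """
    if int(limits.get(GPU_COUNT, 0)) > 0:
        return False, GPU_COUNT + " only records the GPU count and cannot be requested"
    if int(limits.get(GPU_MEM, 0)) > 0:
        return True, "shared GPU memory request"
    return True, "no GPU requested"


if __name__ == "__main__":
    print(check_gpu_request({GPU_MEM: 3}))
    print(check_gpu_request({GPU_COUNT: 1}))
```

Under this rule a pod asking for whole GPUs through gpu-count would be rejected up front instead of being scheduled without GPU capabilities.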
Oh, interesting. What I'm seeing is that I get access to the GPUs regardless of what I select for the GPU request (gpu-mem, gpu-count, or no GPU at all). However, I think my teammate discovered that using the cuda-vector-add example from the k8s documentation causes the nvidia-docker2 runtime to automatically add all the GPUs to the container.
In terms of workflow, I'd like to be able to use this plugin to manage both shared GPUs and whole GPUs.
If that's not possible, then I'd definitely prefer to see gpu-count become an unschedulable resource as you describe.
from gpushare-scheduler-extender.
I think the reason is that you are using NVIDIA's CUDA Docker base image, which sets the environment variable NVIDIA_VISIBLE_DEVICES=all (see https://github.com/NVIDIA/nvidia-container-runtime#nvidia_visible_devices). That causes nvidia-docker2 to load the NVIDIA runtime for the container. I think you can set NVIDIA_VISIBLE_DEVICES=void when building your Docker image.
from gpushare-scheduler-extender.
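To make the behavior described above concrete, here is a small Python sketch of how the values of NVIDIA_VISIBLE_DEVICES are interpreted, following the nvidia-container-runtime README linked in the comment. The function `classify_visible_devices` is hypothetical and only mirrors the documented cases; it does not call the runtime.

```python
def classify_visible_devices(value):
    """Describe how a NVIDIA_VISIBLE_DEVICES value is treated.

    `value` is the variable's content, or None when it is unset.
    """
    if value is None or value in ("", "void"):
        # Runtime behaves like plain runc: no GPU is injected.
        return "runc"
    if value == "none":
        # No GPU is accessible, but driver capabilities are enabled.
        return "no-gpus"
    if value == "all":
        # Every GPU on the host is exposed (default in NVIDIA's CUDA images).
        return "all-gpus"
    # Anything else is read as a comma-separated list of GPU indexes or UUIDs.
    return "devices:" + ",".join(v.strip() for v in value.split(","))
```

With the base image's default of all, the container sees every GPU whatever the pod requested, which matches the observation earlier in the thread that GPUs were visible regardless of the request.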
Hello, when I run the example I get this error: nvidia-container-cli: device error: unknown device id: no-gpu-has-1024MiB-to-run. How can I solve this problem? Is it related to the GPU driver?
from gpushare-scheduler-extender.