nthu-lsalab / kubeshare

Share GPU between Pods in Kubernetes

License: Apache License 2.0


kubeshare's Introduction

KubeShare

🎉🎉 KubeShare 2.0 is now available; version 1.0 is deprecated.

A topology- and heterogeneous-resource-aware scheduler for fractional GPU allocation in Kubernetes clusters.
KubeShare 2.0 is built on the Kubernetes scheduling framework.

Note that KubeShare 1.0 is deprecated. Refer to the KubeShare 1.0 branch for the old version.

Features

  • Supports fractional GPU allocation (<= 1) and integer GPU allocation (> 1)
  • Supports GPU heterogeneity and topology awareness
  • Supports coscheduling

Prerequisite & Limitation

  • A Kubernetes cluster with garbage collection and DNS enabled, and with nvidia-container-runtime installed.
  • Only supports Kubernetes clusters that use the environment variable NVIDIA_VISIBLE_DEVICES to control which GPUs are accessible inside a container.
  • Prometheus must be installed; KubeShare pulls GPU metrics from it.
  • Not compatible with other schedulers managing GPU resources.
  • Go version >= v1.16
  • Only tested with Kubernetes v1.18.10

Deployment

  1. Deploy the components (see the sketch below).
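A minimal sketch of this step, assuming the install manifests under deploy/ (see Directories & Files below) can be applied as-is:

kubectl apply -f deploy/
kubectl get pods -A | grep kubeshare   # scheduler, collector, aggregator and config components should be running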

Workloads

Label description

Because Kubernetes forbids floating-point values in custom device requests, GPU resource usage is defined through Pod labels instead.

  • sharedgpu/gpu_request (required if allocating GPU): guaranteed GPU usage of the Pod, gpu_request <= "1.0".
  • sharedgpu/gpu_limit (required if allocating GPU): maximum extra usage if the GPU still has free resources, gpu_request <= gpu_limit <= "1.0".
  • sharedgpu/gpu_mem (optional): maximum GPU memory usage of the Pod, in bytes. The default value depends on gpu_request.
  • sharedgpu/priority (optional): Pod priority, 0~100. The default value is 0.
    • A priority of 0 marks an opportunistic Pod, used for defragmentation.
    • A priority greater than 0 marks a guaranteed Pod, which is scheduled with locality in mind to optimize performance.
  • sharedgpu/pod_group_name (optional): the name of the Pod group for coscheduling.
  • sharedgpu/group_headcount (optional): the total number of Pods in the same group.
  • sharedgpu/group_threshold (optional): the minimum proportion of Pods in the group that must be scheduled together.

Pod specification

apiVersion: v1
kind: Pod
metadata:
  name: mnist
  labels:
    "sharedgpu/gpu_request": "0.5"
    "sharedgpu/gpu_limit": "1.0"
    "sharedgpu/gpu_model": "NVIDIA-GeForce-GTX-1080"
spec:
  schedulerName: kubeshare-scheduler
  restartPolicy: Never
  containers:
    - name: pytorch
      image:  riyazhu/mnist:test
      command: ["sh", "-c", "sleep infinity"]
      imagePullPolicy: Always #IfNotPresent
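To submit the Pod and confirm it was handled by the KubeShare scheduler, standard kubectl commands are sufficient (the file name mnist-pod.yaml is just an example):

kubectl apply -f mnist-pod.yaml
kubectl get pod mnist -o wide                        # shows the node the Pod landed on
kubectl get pod mnist -o yaml | grep schedulerName   # should print kubeshare-scheduler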

Job specification

apiVersion: batch/v1
kind: Job
metadata:
  name: lstm-g
  labels:
    app: lstm-g
spec:
  completions: 5
  parallelism: 5
  template:
    metadata:
      name: lstm-o
      labels:
        "sharedgpu/gpu_request": "0.5"
        "sharedgpu/gpu_limit": "1.0"
        "sharedgpu/group_name": "a"
        "sharedgpu/group_headcount": "5"
        "sharedgpu/group_threshold": "0.2"
        "sharedgpu/priority": "100"
    spec:
      schedulerName: kubeshare-scheduler
      restartPolicy: Never
      containers:
        - name: pytorch
          image:  riyazhu/lstm-wiki2:test
          # command: ["sh", "-c", "sleep infinity"]
          imagePullPolicy: IfNotPresent
          volumeMounts:
          - name: datasets
            mountPath: "/datasets/"
      volumes:
        - name: datasets
          hostPath:
            path: "/home/riya/experiment/datasets/"

Build

Compiling

git clone https://github.com/NTHU-LSALAB/KubeShare.git
cd KubeShare
make

  • bin/kubeshare-scheduler: schedules pending Pods to a node and device, i.e. <nodeName, GPU UUID>.
  • bin/kubeshare-collector: collects the GPU specifications.
  • bin/kubeshare-aggregator (GPU register): registers Pod GPU requirements.
  • bin/kubeshare-config: updates the config file for Gemini.
  • bin/kubeshare-query-ip: injects the current node IP for Gemini.

Build & Push images

make build-image
make push-image

  • Change the variables CONTAINER_PREFIX, CONTAINER_NAME, and CONTAINER_VERSION to match your registry, as sketched below.
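Since these are ordinary Makefile variables, they can presumably be overridden on the command line as well as edited in place (an assumption about the Makefile, not something documented here):

make build-image CONTAINER_PREFIX=registry.example.com/myteam CONTAINER_VERSION=dev
make push-image CONTAINER_PREFIX=registry.example.com/myteam CONTAINER_VERSION=dev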

Directories & Files

  • cmd/: the main functions of the KubeShare binaries.
  • docker/: materials for building all Docker images.
  • pkg/: the KubeShare 2.0 core components.
  • deploy/: the installation YAML files.
  • go.mod: KubeShare dependencies.

GPU Isolation Library

Please refer to Gemini.

TODO

  • Optimize the locality function.
  • Replace Prometheus with etcd.
  • Automatically detect GPU topology.

Issues

For any issues, please open a GitHub issue. Thanks.

kubeshare's People

Contributors

fatglecheng, justin0u0, kerwenwwer, ncy9371, starcoral


kubeshare's Issues

Scheduler issue

When the scheduler computes the total amount of resources used by Pods, we have run into cases where the available resources become negative.
The current calculation seems to also count the resources used by Completed Pods.

Having some problem while testing

I've followed the steps in https://asciinema.org/a/302094,
but after I create the SharePod and run kubectl get sharepod pod1 -o yaml | egrep -m2 'GPUID|nodeName',
it doesn't show anything. Why?

I don't know whether it is because my CUDA version is 11.2 or something else.

BTW, do I need to run the Makefile? I'm not sure whether that is necessary.

How to mount a directory and enter the container to run a Python script?

Hi, I want to mount a directory and enter the container to run a Python script.
Below is the YAML of pod3:

apiVersion: kubeshare.nthu/v1
kind: SharePod
metadata:
  name: pod3
  annotations:
    "kubeshare/gpu_request": "0.4"
    "kubeshare/gpu_limit": "1.0"
    "kubeshare/gpu_mem": "3145728000"
    "kubeshare/GPUID": "abcde"
spec:
  terminationGracePeriodSeconds: 0
  containers:
  - name: tf
    image: 10.166.15.29:5000/tensorflow/tensorflow:1.15.2-gpu-py3
    volumeMounts:
    - name: workspace
      mountPath: /benchmarks
    # command: ["sh", "-c", "python3 /root/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py"]
    # command: ["sh", "-c", "curl -s https://lsalab.cs.nthu.edu.tw/~ericyeh/KubeShare/demo/mnist.py | python3 -"]
  nodeName: k8s-gpu
  # restartPolicy: OnFailure
  # restartPolicy: Always
  restartPolicy: Never
  volumes:
    - name: workspace
      hostPath:
        path: "/home/hezhiming/benchmarks/"

I run:
kubectl create -f pod3.yaml
then run:
kubectl get pods
The status of pod3 is Completed.
I run:
kubectl exec -it pod3 --container tf bash
get this error:
error: cannot exec into a container in a completed pod; current phase is Succeeded

How can I keep pod3 running so that I can enter the container and run the Python script?
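One simple way to keep such a Pod alive (the same trick used in the Pod specification example above) is to override the container command with a long-running sleep and then exec into it; a sketch, assuming the rest of pod3.yaml stays unchanged:

  containers:
  - name: tf
    image: 10.166.15.29:5000/tensorflow/tensorflow:1.15.2-gpu-py3
    command: ["sh", "-c", "sleep infinity"]   # keep the container running instead of exiting

kubectl exec -it pod3 --container tf -- bash   # then run the Python script interactively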

Gemini E/ attempt 1: Connection refused

Hi, I run:
kubectl logs pod3 -f

get this error:

2020-06-22 09:15:16.220809 Gemini E/ attempt 1: Connection refused
2020-06-22 09:15:26.221053 Gemini E/ attempt 2: Connection refused
2020-06-22 09:15:36.221295 Gemini E/ attempt 3: Connection refused
2020-06-22 09:15:46.221547 Gemini E/ attempt 4: Connection refused
2020-06-22 09:15:56.221820 Gemini E/ attempt 5: Connection refused
2020-06-22 09:16:06.221937 Gemini E/ Connection error: Connection refused

Can you help me? @ncy9371

kubeshare-node-daemon pod on the GPU node crashes

Hi, I followed the document to install KubeShare, but the daemon pod on the GPU node cannot run.
My Docker version is:

nvidia-docker  version
NVIDIA Docker: 2.2.2
Client: Docker Engine - Community
 Version:           19.03.4
 API version:       1.40
 Go version:        go1.12.10
 Git commit:        9013bf583a
 Built:             Fri Oct 18 15:53:51 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.4
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.10
  Git commit:       9013bf583a
  Built:            Fri Oct 18 15:52:23 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.10
  GitCommit:        b34a5c8af56e510852c35414db4c1f4fa6172339
 nvidia:
  Version:          1.0.0-rc8+dev
  GitCommit:        3e425f80a8c931f88e6d94a8c831b9d5aa481657
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Is my Docker version too high? Which version works?

Some problems of Gemini

Hello, I have a few questions I would like to ask.
(1) What do you mean by window, measured burst, and estimated full burst in Gemini? How do you measure and predict them?
(2) In Gemini, if gpu_limit=0.4 and window=10s, does it mean the client gets a 4s quota all at once, instead of getting a 100ms quota multiple times that accumulates to 4s?

HPDC Slide

Where can I find your presentation slide please? Thanks!

Docker Version > 19 Support

Hi all,

thanks again for the very interesting work!

Why is the support limited to Docker versions below 19? Is there any roadmap to support Docker > 19?

Best,
Samed

Pod CrashLoopBackOff when running sample

I followed the instructions for running the KubeShare sample.

When applying the sharepod1.yaml and sharepod2.yaml files, I see Init:CrashLoopBackOff errors, but no errors in the logs.

kubectl create -f .

sharepod.kubeshare.nthu/sharepod1 created
sharepod.kubeshare.nthu/sharepod2 created

When I check the logs of the pods, there are no errors in the following results.

kubectl logs sharepod1
GPU 0: Tesla T4 (UUID: GPU-984b0041-8fa4-82e9-6111-5c8b7c351158)

kubectl logs sharepod2
GPU 0: Tesla T4 (UUID: GPU-984b0041-8fa4-82e9-6111-5c8b7c351158)

Below is the description of the pod:
kubectl describe pod sharepod1

Name:           sharepod1
Namespace:      default
Priority:       0
Node:           k8s-gpu/10.166.15.26
Start Time:     Mon, 13 Apr 2020 15:45:44 +0800
Labels:         <none>
Annotations:    cni.projectcalico.org/podIP: 192.168.134.208/32
                kubeshare/GPUID: abcde
                kubeshare/gpu_limit: 1.0
                kubeshare/gpu_mem: 1073741824
                kubeshare/gpu_request: 0.5
Status:         Running
IP:             192.168.134.208
Controlled By:  SharePod/sharepod1
Containers:
  cuda:
    Container ID:  docker://f96ace2735ad6e3b0adb87a207b540d8faf10de7a8f31b60ac62dea188f391f8
    Image:         nvidia/cuda:9.0-base
    Image ID:      docker-pullable://10.166.15.29:5000/nvidia/cuda@sha256:56bfa4e0b6d923bf47a71c91b4e00b62ea251a04425598d371a5807d6ac471cb
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-smi
      -L
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 13 Apr 2020 15:48:31 +0800
      Finished:     Mon, 13 Apr 2020 15:48:31 +0800
    Ready:          False
    Restart Count:  5
    Limits:
      cpu:     1
      memory:  500Mi
    Requests:
      cpu:     1
      memory:  500Mi
    Environment:
      NVIDIA_VISIBLE_DEVICES:      GPU-984b0041-8fa4-82e9-6111-5c8b7c351158
      NVIDIA_DRIVER_CAPABILITIES:  compute,utility
      LD_PRELOAD:                  /kubeshare/library/libgemhook.so.1
      POD_MANAGER_IP:              192.168.134.192
      POD_MANAGER_PORT:            50059
      POD_NAME:                    default/sharepod1
    Mounts:
      /kubeshare/library from kubeshare-lib (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-l54xv (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kubeshare-lib:
    Type:          HostPath (bare host directory volume)
    Path:          /kubeshare/library
    HostPathType:
  default-token-l54xv:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-l54xv
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason   Age                    From              Message
  ----     ------   ----                   ----              -------
  Normal   Pulled   3m41s (x5 over 5m5s)   kubelet, k8s-gpu  Container image "nvidia/cuda:9.0-base" already present on machine
  Normal   Created  3m41s (x5 over 5m5s)   kubelet, k8s-gpu  Created container cuda
  Normal   Started  3m40s (x5 over 5m5s)   kubelet, k8s-gpu  Started container cuda
  Warning  BackOff  3m14s (x10 over 5m3s)  kubelet, k8s-gpu  Back-off restarting failed container

How can I solve this problem? @ncy9371
Thanks ~

Use KubeShare functionality with native Kubernetes resources

Thanks for this solution; I successfully tested it in my cluster.

Is there any way to use the solution without the SharePod custom resources?

E.g. create a Job or Pod directly using annotations / resource requests?

I'm looking at the following scenarios:

  • Integration into frameworks, such as Kubeflow (Creates plain pod resources in a pipeline)
  • Running stateless deployments or jobs
  • Using features of the original scheduler, such as PriorityClasses

Thanks

Does KubeShare work well on CUDA 11.0+?

Hello,
I run KubeShare on CUDA 11.6 with an NVIDIA GeForce RTX 3090 Ti (24GiB). Creating a SharePod seems to work fine, but some of the limits do not take effect. For example, I set gpu_limit to 0.5, but after running I found the actual GPU core usage exceeds the limit, maybe around 0.7.

How to test KubeShare throughput improvement?

I deployed KubeShare on my cluster and used a DL inference example to test its throughput improvement. I created two jobs, each with half of a GPU card's cores and memory, and each job processes 5000 images. For comparison, I created the same job in plain Kubernetes using an entire GPU card to process 10000 images in total. As a result, the former is much slower, and the GPU utilization shows no obvious difference.
Is there anything wrong with my test? Could I have your evaluation examples to test in my cluster?

About how to write the gpu topology

Hello, when I deploy KubeShare and test it, my Pod is always Pending. After checking the logs, kubeshare-scheduler.log shows "No corresponding gpu NVIDIA GeForce GTX 3090 in the node master".
My k8s cluster has only one node with two GPU devices. This is my kubeshare-config.yaml:

cellTypes:
  GTX3090-NODE:
    childCellType: "NVIDIA GeForce GTX 3090"
    childCellNumber: 1
    childCellPriority: 100
    isNodeLevel: true

cells:
- cellType: GTX3090-NODE
  cellChildren:
  - cellId: apple

I need your help. Please @ncy9371 @justin0u0

Scheduling before resource updated

Hi, I appreciate your efforts in sharing GPUs in Kubernetes.
We tried to run KubeShare scheduler in our cluster, and we found an issue.

Some SharePods are waiting because there are not enough GPU resources.
When KubeShare schedules SharePods, it needs to synchronize the current resources of the nodes in the cluster.
However, instead of waiting for the scheduled SharePod to be updated, the next SharePod is scheduled immediately.

So it may be necessary to add code that waits for the scheduled SharePod to be updated.
We solved the issue by adding the code below to the 'syncHandler' function in 'KubeShare/pkg/scheduler/controller.go'.

// Busy-wait until the scheduler's decision (the node name and GPUID annotation) is reflected on the SharePod object.
for sharepod.Spec.NodeName != schedNode && sharepod.ObjectMeta.Annotations[kubesharev1.KubeShareResourceGPUID] != schedGPUID {
	sharepod, err = c.sharepodsLister.SharePods(namespace).Get(name)
	if err != nil {
		if errors.IsNotFound(err) {
			utilruntime.HandleError(fmt.Errorf("SharePod '%s' in work queue no longer exists", key))
			return nil
		}
		return err
	}
}

Thank you for your great work!

symbol lookup error: /kubeshare/library/libgemhook.so.1

Hello!

I have installed KubeShare in my K8s cluster and tested it with a simple Pod specification.

apiVersion: v1
kind: Pod
metadata:
  name: test-kubeshare
  namespace: core
  labels:
    "sharedgpu/gpu_request": "0.1"
    "sharedgpu/gpu_limit": "0.2"
spec:
  schedulerName: kubeshare-scheduler
  containers:
    - name: pytorch
      image:  riyazhu/mnist:20220420
      imagePullPolicy: IfNotPresent

The result is positive.

GPU 0: NVIDIA GeForce RTX 3050 Ti Laptop GPU (UUID: GPU-xxx)

However, when I tried to deploy my own Docker image, the following error occurred:

/bin/bash: symbol lookup error: /kubeshare/library/libgemhook.so.1: undefined symbol: __libc_dlopen_mode, version GLIBC_PRIVATE

Not sure if this information helps, but my base image is nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04 and I am compiling OpenCV for object detection using the GPU. The image works fine when using the default-scheduler.

Any suggestions on this issue? Thank you!

Some problems while testing resource isolation

Hi, I met some problems while testing resource isolation. KubeShare seems to be running normally, but the isolation specified by the annotations fails to achieve the expected effect.

My Environment

  • GPU: NVIDIA GeForce RTX 3090 Ti (24GiB)
  • CPU: Intel(R) Core(TM) i9-10900K CPU @ 3.70GHz
  • docker version: 20.10.12
  • nvidia-docker2 version: 2.11.0 (default runtime)
  • nvidia driver version: 510.73 host cuda driver version: 11.6
  • kubernetes client: v1.20.0 server: v1.20.15 (single node with gpu)

Resource isolation test

SharePod file:

apiVersion: kubeshare.nthu/v1
kind: SharePod
metadata:
  name: sharepod1
  annotations:
    "kubeshare/gpu_request": "0.5"
    "kubeshare/gpu_limit": "0.6"
    "kubeshare/gpu_mem": "10485760000"
spec:
  terminationGracePeriodSeconds: 0
  containers:
  - name: tensorflow-benchmark
    image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:benchmark-tensorflow-2.4.0
    command:
    - bash
    - run.sh
    - --num_batches=50000
    - --batch_size=8
    workingDir: /root

kubectl get pod -A

NAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE
default       sharepod1                                  1/1     Running   0          3m39s
kube-system   calico-kube-controllers-7854b85cf7-sd5fw   1/1     Running   0          2d10h
kube-system   calico-node-ccdcp                          1/1     Running   0          2d10h
kube-system   coredns-54d67798b7-f5fv8                   1/1     Running   0          2d10h
kube-system   coredns-54d67798b7-rlvhg                   1/1     Running   0          2d10h
kube-system   etcd-k8s-master                            1/1     Running   0          2d10h
kube-system   kube-apiserver-k8s-master                  1/1     Running   0          2d10h
kube-system   kube-controller-manager-k8s-master         1/1     Running   0          2d10h
kube-system   kube-proxy-lz6jn                           1/1     Running   0          2d10h
kube-system   kube-scheduler-k8s-master                  1/1     Running   0          2d10h
kube-system   kubeshare-device-manager                   1/1     Running   0          2d10h
kube-system   kubeshare-node-daemon-f58tc                2/2     Running   0          2d10h
kube-system   kubeshare-scheduler                        1/1     Running   0          2d10h
kube-system   kubeshare-vgpu-k8s-master-gzwvx            1/1     Running   0          3m40s
kube-system   nvidia-device-plugin-daemonset-twghw       1/1     Running   0          2d10h

kubectl logs sharepod1 seems to be working

INFO:tensorflow:Running local_init_op.
I1010 01:02:41.913408 140019771205440 session_manager.py:505] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I1010 01:02:41.943301 140019771205440 session_manager.py:508] Done running local_init_op.
2022-10-10 01:02:42.579839: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2022-10-10 01:04:04.418663: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2022-10-10 01:17:41.610889: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: ptxas exited with non-zero error code 65280, output: ptxas fatal   : Value 'sm_86' is not defined for option 'gpu-name'

Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.
TensorFlow:  2.2
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  8 global
             8 per device
Num batches: 50000
Num epochs:  0.31
Devices:     ['/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Time    Step    Img/sec total_loss
2022-10-10 01:18        1       images/sec: 354.7 +/- 0.0 (jitter = 0.0)        nan
2022-10-10 01:18        10      images/sec: 355.0 +/- 0.4 (jitter = 0.8)        nan
2022-10-10 01:18        20      images/sec: 354.8 +/- 0.3 (jitter = 1.3)        nan
2022-10-10 01:18        30      images/sec: 354.7 +/- 0.2 (jitter = 1.2)        nan
2022-10-10 01:18        40      images/sec: 354.7 +/- 0.2 (jitter = 1.2)        nan
2022-10-10 01:18        50      images/sec: 66.9 +/- 7.0 (jitter = 1.4) nan
2022-10-10 01:18        60      images/sec: 77.3 +/- 5.8 (jitter = 1.4) nan
2022-10-10 01:18        70      images/sec: 87.0 +/- 5.0 (jitter = 1.4) nan
2022-10-10 01:18        80      images/sec: 96.0 +/- 4.4 (jitter = 1.3) nan
2022-10-10 01:18        90      images/sec: 104.5 +/- 3.9 (jitter = 1.4)        nan
2022-10-10 01:18        100     images/sec: 112.4 +/- 3.5 (jitter = 1.4)        nan
2022-10-10 01:18        110     images/sec: 119.8 +/- 3.2 (jitter = 1.5)        nan
2022-10-10 01:18        120     images/sec: 126.8 +/- 2.9 (jitter = 1.3)        nan
2022-10-10 01:18        130     images/sec: 133.4 +/- 2.7 (jitter = 1.4)        nan
2022-10-10 01:19        140     images/sec: 139.6 +/- 2.5 (jitter = 1.4)        nan
2022-10-10 01:19        150     images/sec: 145.5 +/- 2.4 (jitter = 1.5)        nan
2022-10-10 01:19        160     images/sec: 151.0 +/- 2.2 (jitter = 1.5)        nan
2022-10-10 01:19        170     images/sec: 156.3 +/- 2.1 (jitter = 1.5)        nan

However, the resource isolation specified by the annotations does not seem to take effect.

nvidia-smi on host

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  Off |
| 52%   81C    P2   328W / 450W |   8409MiB / 24564MiB |     96%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     83363      C   python                           8407MiB |
+-----------------------------------------------------------------------------+

nvidia-smi in the sharepod1

root@sharepod1:~# nvidia-smi
Mon Oct 10 01:43:31 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  Off |
| 88%   83C    P2   332W / 450W |   8409MiB / 24564MiB |     95%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

I want to ask what the problem is

nvidia-k8s-device-plugin cannot find GPU cards

Hi, I re-installed Docker and nvidia-docker, and now Kubernetes cannot find the GPU cards. It worked before, when I had the lower Docker version installed.
My nvidia-k8s-device-plugin version is 1.11.
My Docker version is 18.06.3-ce.
My nvidia-docker version is 2.2.2.
Could you show me your versions? @ncy9371

Some questions about KubeShare 2.0

Hello!
I am installing KubeShare 2.0. I have finished the preparation; here is the output of kubectl describe node:

Capacity:
  cpu:                16
  ephemeral-storage:  29352956Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16248988Ki
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                16
  ephemeral-storage:  27051684205
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16146588Ki
  nvidia.com/gpu:     1
  pods:               110

When I follow deploy.md, I have some questions:

  1. Where should the kubeshare-config.yaml file be placed? Could you please tell me its absolute path?
  2. How do I "Make sure the endpoint of kubeshare-aggregator & kubeshare-collector of prometheus is up"?
  3. Also, could you please show me the Prometheus config files used when monitoring Kubernetes?

I am a beginner in this field, so I would be very grateful if you could provide more details on building the KubeShare 2.0 system.
Looking forward to your reply! Thanks a lot!

How to configure kubeshare-config.yaml?

Hello. I recently converted from KubeShare 1.0 to KubeShare 2.0, and it seems many major things have changed in the 2.0 version.
I try to use 2.0, but the Pod always fails to be scheduled. I guess the cause is a misconfiguration of the kubeshare-config.yaml file.
When I run 'kubectl describe pod', I get an error message like this:

Warning  FailedScheduling  9s    kubeshare-scheduler  0/3 nodes are available: 1 [Filter] Node gpu01 doesn't meet the gpu request of pod default/pod1(2c2d8706-4b74-4063-a3b4-4542ee8e7089) in Filter, totalCPUs: 0, totalMemory: 0, totalFits: 0, 1 [Filter] Node gpu02 doesn't meet the gpu request of pod default/pod1(2c2d8706-4b74-4063-a3b4-4542ee8e7089) in Filter, totalCPUs: 0, totalMemory: 0, totalFits: 0, 1 [Filter] Node mgmt01 doesn't meet the gpu request of pod default/pod1(2c2d8706-4b74-4063-a3b4-4542ee8e7089) in Filter, totalCPUs: 0, totalMemory: 0, totalFits: 0.
Warning  FailedScheduling  9s    kubeshare-scheduler  0/3 nodes are available: 1 [Filter] Node gpu01 doesn't meet the gpu request of pod default/pod1(2c2d8706-4b74-4063-a3b4-4542ee8e7089) in Filter, totalCPUs: 0, totalMemory: 0, totalFits: 0, 1 [Filter] Node gpu02 doesn't meet the gpu request of pod default/pod1(2c2d8706-4b74-4063-a3b4-4542ee8e7089) in Filter, totalCPUs: 0, totalMemory: 0, totalFits: 0, 1 [Filter] Node mgmt01 doesn't meet the gpu request of pod default/pod1(2c2d8706-4b74-4063-a3b4-4542ee8e7089) in Filter, totalCPUs: 0, totalMemory: 0, totalFits: 0.

I have a GPU cluster and here is the physical structure of the cluster.

And here is the config file I wrote. What is wrong with it?

cellTypes:
  T4-NODE:
    childCellType: "Tesla-T4"
    childCellNumber: 2

cells:
- cellType: T4-NODE
  cellChildren:
  - cellId: gpu01
  - cellId: gpu02
