This project forked from nvidia/k8s-dra-driver

Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes

License: Apache License 2.0

Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes

This DRA resource driver is under active development and is not yet designed for production use. We will continue to force-push over main until we have something more stable. Use at your own risk.

A document and demo of the DRA support for GPUs provided by this repo can be found below:

Document: Dynamic Resource Allocation (DRA) for GPUs in Kubernetes
Demo: Demo of Dynamic Resource Allocation (DRA) for GPUs in Kubernetes

Demo

This section describes using kind to demo the functionality of the NVIDIA GPU DRA Driver.

Because we will launch kind with GPU support, first ensure that the following prerequisites are met:

  1. kind is installed. See the official documentation here.

  2. Ensure that the NVIDIA Container Toolkit is installed on your system. This can be done by following the instructions here.

  3. Configure the NVIDIA Container Runtime as the default Docker runtime:

    sudo nvidia-ctk runtime configure --runtime=docker --set-as-default

  4. Restart Docker to apply the changes:

    sudo systemctl restart docker

  5. Set the accept-nvidia-visible-devices-as-volume-mounts option to true in the /etc/nvidia-container-runtime/config.toml file, so that the NVIDIA Container Runtime uses volume mounts to select the devices to inject into a container:

    sudo nvidia-ctk config --set accept-nvidia-visible-devices-as-volume-mounts=true --in-place

  6. Show the current set of GPUs on the machine:

    nvidia-smi -L
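
Before going further, it can help to confirm that the tools named in the prerequisites are actually on your PATH. The following is a minimal sketch, not part of this repository: check_tools is a hypothetical helper name, and it only tests that each command exists, not that it is configured correctly.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo): report which of the given
# command-line tools are missing from PATH.
# Prints one "missing: <name>" line per absent tool and returns non-zero
# if at least one tool is absent.
check_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return "$rc"
}
```

For this demo, a call such as `check_tools kind docker nvidia-ctk nvidia-smi kubectl` covers the commands used on this page.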

Start by cloning this repository and changing into its directory. All of the scripts and example Pod specs used in this demo are in the demo subdirectory, so take a moment to browse through the various files and see what's available:

git clone https://github.com/NVIDIA/k8s-dra-driver.git
cd k8s-dra-driver

Setting up the infrastructure

Here's a demo showing how to install and configure DRA, and run a pod in a kind cluster on a Linux workstation.

Below are the detailed, step-by-step instructions.

First, create a kind cluster to run the demo:

./demo/clusters/kind/create-cluster.sh

From here we will build the image for the example resource driver:

./demo/clusters/kind/build-dra-driver.sh

This also makes the built images available to the kind cluster.

We now install the NVIDIA GPU DRA driver:

./demo/clusters/kind/install-dra-driver.sh
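
The three setup scripts above are run one after another, and each step depends on the previous one succeeding. As a rough sketch under that assumption, the sequence could be wrapped so that it stops at the first failure. run_setup is a hypothetical helper name, and the default directory is simply the path used in the commands above.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of this repo): run the three kind setup
# scripts in order and stop at the first one that fails.
# $1 is the directory holding the scripts; defaults to ./demo/clusters/kind.
run_setup() {
    dir="${1:-./demo/clusters/kind}"
    for script in create-cluster.sh build-dra-driver.sh install-dra-driver.sh; do
        echo "running $dir/$script"
        "$dir/$script" || { echo "failed: $script" >&2; return 1; }
    done
}
```

Calling `run_setup` from the repository root runs the same three scripts as the individual commands.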

After installation, two pods should be running in the nvidia-dra-driver namespace:

kubectl get pods -n nvidia-dra-driver
NAMESPACE           NAME                                       READY   STATUS    RESTARTS   AGE
nvidia-dra-driver   nvidia-dra-controller-6bdf8f88cc-psb4r     1/1     Running   0          34s
nvidia-dra-driver   nvidia-dra-plugin-lt7qh                    1/1     Running   0          32s
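
The pods may take a short while to reach the Running state. Instead of polling by hand, kubectl wait can block until they are ready. This is a minimal sketch, assuming kubectl is on your PATH and pointed at the kind cluster; wait_for_pods is a hypothetical helper name and the 120s default timeout is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo): block until every pod in a
# namespace reports the Ready condition, or give up after a timeout.
# $1 = namespace, $2 = timeout (optional, defaults to 120s).
wait_for_pods() {
    namespace="$1"
    timeout="${2:-120s}"
    kubectl wait --for=condition=Ready pods --all \
        --namespace="$namespace" --timeout="$timeout"
}
```

For example, `wait_for_pods nvidia-dra-driver` returns once both driver pods are ready, or fails when the timeout expires.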

Run the examples by following the steps in the demo script

Finally, you can run the various examples contained in the demo/specs/quickstart folder. The README in that directory shows the full script of the demo you can walk through.

cat demo/specs/quickstart/README.md

Deploy the example pods in the demo directory:

kubectl apply --filename=demo/specs/quickstart/gpu-test{1,2,3}.yaml
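
The command above relies on brace expansion, which bash and zsh perform but plain POSIX sh does not. The shell expands the single word into three separate --filename=... arguments, one per spec. If your shell lacks brace expansion, an explicit loop does the same job. The sketch below assumes kubectl is on your PATH; quickstart_specs and apply_quickstart are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helpers (not part of this repo).

# Print the three spec paths that gpu-test{1,2,3}.yaml expands to.
quickstart_specs() {
    for n in 1 2 3; do
        echo "demo/specs/quickstart/gpu-test${n}.yaml"
    done
}

# Apply each spec in turn, stopping at the first failure.
apply_quickstart() {
    for spec in $(quickstart_specs); do
        kubectl apply --filename="$spec" || return 1
    done
}
```

Running `quickstart_specs` on its own is a quick way to see which files will be applied before calling `apply_quickstart`.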

Get the pods' statuses. Depending on which GPUs are available, running the first three examples will produce output similar to the following...

Note: there is a known issue with kind. When tailing the logs of a running pod in the kind cluster, you may see the error "failed to create fsnotify watcher: too many open files". Increasing the value of fs.inotify.max_user_watches may resolve it.
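
If you hit this error, one way to check and raise the limit on a Linux host is sketched below. The helper names are hypothetical, and the target value of 524288 is an assumption rather than a value taken from this repository; pick whatever suits your system.

```shell
#!/bin/sh
# Hypothetical helpers (not part of this repo).

# Succeed (exit 0) when the current limit is below the desired one.
limit_too_low() {
    current="$1"
    desired="$2"
    [ "$current" -lt "$desired" ]
}

# Print the current value of an inotify limit, e.g. max_user_watches.
# Prints nothing if /proc does not expose it (non-Linux hosts).
read_inotify_limit() {
    cat "/proc/sys/fs/inotify/$1" 2>/dev/null
}

# Example (left commented out because it needs root):
# current=$(read_inotify_limit max_user_watches)
# if [ -n "$current" ] && limit_too_low "$current" 524288; then
#     sudo sysctl fs.inotify.max_user_watches=524288   # assumed target value
# fi
```

A value set with sysctl in this way lasts only until the next reboot.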

kubectl get pod -A -l app=pod
NAMESPACE           NAME                                       READY   STATUS    RESTARTS   AGE
gpu-test1           pod1                                       1/1     Running   0          34s
gpu-test1           pod2                                       1/1     Running   0          34s
gpu-test2           pod                                        2/2     Running   0          34s
gpu-test3           pod1                                       1/1     Running   0          34s
gpu-test3           pod2                                       1/1     Running   0          34s

kubectl logs -n gpu-test1 -l app=pod
GPU 0: A100-SXM4-40GB (UUID: GPU-662077db-fa3f-0d8f-9502-21ab0ef058a2)
GPU 0: A100-SXM4-40GB (UUID: GPU-4cf8db2d-06c0-7d70-1a51-e59b25b2c16c)

kubectl logs -n gpu-test2 pod --all-containers
GPU 0: A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)
GPU 0: A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)

kubectl logs -n gpu-test3 -l app=pod
GPU 0: A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
GPU 0: A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
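
In the sample output above, the gpu-test1 logs list two different UUIDs, while the gpu-test2 and gpu-test3 logs each list the same UUID twice, which suggests that those workloads see the same GPU. A small filter makes that comparison easier than reading the UUIDs by eye. This is a sketch only: count_unique_gpus is a hypothetical helper name, and it assumes log lines in the nvidia-smi -L style shown above and a grep that supports -o.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo): count the distinct GPU UUIDs
# found in `nvidia-smi -L` style lines read from standard input.
count_unique_gpus() {
    grep -o 'GPU-[0-9a-f-]*' | sort -u | wc -l | tr -d ' '
}
```

For example, `kubectl logs -n gpu-test2 pod --all-containers | count_unique_gpus` counts the distinct GPUs visible across the containers of that pod.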

Cleaning up the environment

Remove the cluster created in the preceding steps:

./demo/clusters/kind/delete-cluster.sh
