bartzi / labshare

Django Tool that helps everyone to get their fair share of GPU time

License: GNU General Public License v2.0

Python 82.34% HTML 11.96% JavaScript 5.70%
django deep-learning organizer

labshare's Introduction

LabShare

Django Tool that helps everyone to get their fair share of GPU time.

Installation

  1. Clone the repository
  2. Make sure that OpenLDAP and SASL are installed (under Ubuntu they can be installed with: apt-get install libldap2-dev libsasl2-dev)
  3. Install the requirements with pip install -r requirements.txt (make sure to use Python 3, version 3.6 or later)
  4. Start a Redis server instance (you can use a Docker container and start it with: docker run -p 6379:6379 -d redis)
  5. Create the database by running python manage.py migrate
  6. Install Yarn
  7. Go into the folder static and run yarn or yarn install to install all front-end libraries
  8. Run the test server with python manage.py runserver (in the root directory)

Usage

  1. Create a superuser by running python manage.py createsuperuser
  2. If you want more users, create them in the admin web interface (/admin)
  3. Create a new Device in the Django admin for every device you want to monitor
  4. Deploy the device_query script on every machine with a GPU that should be monitored
  5. Copy the example.ini file and rename the copy to config.ini
    • change device_name to the name of the device that was created in the admin interface
    • change server_url to the address where the Django server is running
    • on the Django machine, run python manage.py tokens or python manage.py token [device_name] to get the authentication token of the registered device, and paste it into the config file
  6. Run the device_query script
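
Before starting the script it can help to check that config.ini is complete. The following is a minimal sketch of such a check, not the code device_query actually uses. The section name [main], the key name token, and all example values are assumptions; only device_name and server_url are named above.

```python
import configparser

# Hypothetical config.ini contents. The section name and the "token" key are
# assumptions; only device_name and server_url are named in the README.
EXAMPLE_CONFIG = """
[main]
device_name = gpu-server-01
server_url = http://localhost:8000
token = paste-token-here
"""

REQUIRED_KEYS = ("device_name", "server_url", "token")


def load_device_config(text, section="main"):
    """Parse config text and return the required settings as a dict."""
    # Interpolation is disabled so that a '%' inside a token is kept verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section(section):
        raise ValueError(f"missing section [{section}]")
    missing = [
        key for key in REQUIRED_KEYS
        if not parser.get(section, key, fallback="").strip()
    ]
    if missing:
        raise ValueError(f"missing config values: {', '.join(missing)}")
    return {key: parser.get(section, key).strip() for key in REQUIRED_KEYS}


config = load_device_config(EXAMPLE_CONFIG)
print(config["device_name"], config["server_url"])
```

To check a real file, read it with open("config.ini").read() and pass the text to load_device_config.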

Configuration

For users to see the devices and their GPUs, you need to give each user permission to do so. You can do this in one of the following ways:

  1. Add the use_device permission to a group of your choice (for instance the default Staff group) and add users to this group. This global permission allows every user in that group to use all GPUs in LabShare, which makes it easy to grant access to many users at once.
  2. For fine-grained control over who can use which device, add the use_device permission to individual users or groups in the permission admin of each device.
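
The two options combine as "global permission or per-device permission". The sketch below models that rule in plain Python to make the behavior concrete. It is not the Django permission API and not LabShare's implementation; the Device class, the function name, and the user and group names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str
    # Users and groups that were granted use_device on this device only.
    allowed_users: set = field(default_factory=set)
    allowed_groups: set = field(default_factory=set)


def can_use_device(user, user_groups, device, global_groups):
    """Return True if the user may use the device via either option."""
    # Option 1: the user is in a group that holds the global permission.
    if user_groups & global_groups:
        return True
    # Option 2: the permission was granted on this specific device.
    return user in device.allowed_users or bool(user_groups & device.allowed_groups)


global_groups = {"Staff"}
server = Device("gpu-server-01", allowed_users={"carol"})

print(can_use_device("alice", {"Staff"}, server, global_groups))  # True
print(can_use_device("carol", set(), server, global_groups))      # True
print(can_use_device("dave", set(), server, global_groups))       # False
```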

labshare's People

Contributors

bartzi, hendraet, hrantzsch, jopyth

labshare's Issues

API

It would be nice to have an API for (a subset of):

  • Reserved GPUs on a machine
  • Reservation of a GPU
  • Finishing/Cancelling a reservation

Add indicator for low GPU utilization

Sometimes, a GPU utilization close to zero means that the code is not working correctly. It would be nice to add an indicator to the UI that is shown when the utilization falls below a certain threshold.
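
One possible shape for the check is sketched below. It flags a GPU only when several consecutive samples are low, so that a single idle moment between batches does not trigger the indicator. The threshold of 5% and the window of three samples are assumptions, not values from the project.

```python
def is_low_utilization(samples, threshold=5.0, min_samples=3):
    """Return True when the most recent samples all fall below the threshold.

    samples: GPU utilization percentages, oldest first.
    """
    recent = samples[-min_samples:]
    # Require a full window so that a freshly started job is not flagged.
    return len(recent) == min_samples and all(value < threshold for value in recent)


print(is_low_utilization([90, 2, 1, 0]))  # True
print(is_low_utilization([0, 1, 50]))     # False
```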

CPU Usage

Hey @Bartzi

Good job man! Thank you for sharing on github!
I am wondering how to show CPU and memory usage in the grid, as well as GPU usage.

Thanks in advance!
M.R

Read RAM Usage of Servers

device_query should also read the RAM usage of the servers and report the process with the highest RAM usage. We'll need the following things:

  • get RAM usage
  • find process that uses most RAM
  • issue a warning if RAM is nearly full (around 95%)
  • post a message via Slack
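
A minimal sketch of the first three points, assuming a Linux host where the figures come from /proc/meminfo. The sample text, the process list, and all function names are hypothetical. Collecting per-process memory and posting the message to Slack are left out.

```python
# Hypothetical sample in the format of /proc/meminfo on Linux.
SAMPLE_MEMINFO = """\
MemTotal:       65536000 kB
MemFree:          512000 kB
MemAvailable:    2621440 kB
"""


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def ram_usage_percent(meminfo):
    """Used RAM as a percentage, based on MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    return 100.0 * (total - meminfo["MemAvailable"]) / total


def ram_warning(meminfo, processes, threshold=95.0):
    """Return a warning message, or None while usage is below the threshold.

    processes: iterable of (pid, name, rss_kb) tuples.
    """
    usage = ram_usage_percent(meminfo)
    if usage < threshold:
        return None
    pid, name, rss_kb = max(processes, key=lambda process: process[2])
    return f"RAM usage at {usage:.1f}%; largest process: {name} (pid {pid}, {rss_kb} kB)"


processes = [(101, "python", 41943040), (202, "redis-server", 102400)]
print(ram_warning(parse_meminfo(SAMPLE_MEMINFO), processes))
```

The returned string is what would be sent to Slack. On a real machine the input would be open("/proc/meminfo").read().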

Do not persist GPU Processes anymore

Since we now have continuous updates and never query the stored data, there is no need to persist GPU state information anymore.
It would be good to get rid of all this unnecessary saving and push the updated GPU info directly to the clients.

Add a summary of the resources allocated by slurm

Similar to the sinfo command for GresUsed we could try to integrate the console output:

GRES_USED                                    NODELIST            
gpu:1080ti:2(IDX:0-1)                        fb10dl03            
gpu:1080ti:3(IDX:0-2)                        fb10dl06            
gpu:1080ti:4(IDX:0-3)                        fb10dl07            
gpu:2080ti:1(IDX:0)                          fb10dl08            
gpu:3090:3(IDX:0-2)                          fb10dl09            
gpu:1080ti:1(IDX:0),gpu:980gtx:1(IDX:1)      resterampe          
gpu:titanx:0(IDX:N/A)                        fb10dl[04-05]      
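
To integrate this output, it first has to be turned into structured data. The following sketch parses text in the format shown above; the function name and the dictionary keys are hypothetical, and the regular expression only covers the gpu:TYPE:COUNT(IDX:...) form that appears in this sample.

```python
import re

# Matches entries such as gpu:1080ti:2(IDX:0-1)
GRES_PATTERN = re.compile(r"gpu:(?P<type>[^:,()]+):(?P<count>\d+)\(IDX:(?P<idx>[^)]*)\)")

SAMPLE_OUTPUT = """\
GRES_USED                                    NODELIST
gpu:1080ti:2(IDX:0-1)                        fb10dl03
gpu:1080ti:1(IDX:0),gpu:980gtx:1(IDX:1)      resterampe
gpu:titanx:0(IDX:N/A)                        fb10dl[04-05]
"""


def parse_gres_used(output):
    """Return one dict per GPU type and node list found in the output."""
    rows = []
    for line in output.splitlines():
        line = line.strip()
        if not line or line.startswith("GRES_USED"):
            continue  # skip blank lines and the header
        parts = line.rsplit(None, 1)  # last column is the node list
        if len(parts) != 2:
            continue
        gres, nodelist = parts
        for match in GRES_PATTERN.finditer(gres):
            rows.append({
                "nodes": nodelist,
                "gpu_type": match["type"],
                "used": int(match["count"]),
                "indices": match["idx"],
            })
    return rows


for row in parse_gres_used(SAMPLE_OUTPUT):
    print(row["nodes"], row["gpu_type"], row["used"])
```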

Add Nvidia-Smi Viewer

In light of recent cluster restructurings, we will not need the main functionality of LabShare anymore.
Instead, we might need a functionality that allows us to do system monitoring.
One of the most important parts is to monitor the output of nvidia-smi.

How it should work

  • The user opens LabShare and sees a list of all compute nodes. The compute nodes show name and number of GPUs. The list is shaded based on usage status. If not in use, it is plain white, when in use it should be something like yellow (maybe same color as currently in LabShare).
  • The user may click on a machine to see further information about that compute node. For now, all GPUs and their current utilization shall be listed. The utilization is automatically updated via websockets. Each compute node pushes its newest nvidia-smi data to the server.

Technical Info

Our device_query script needs to be rewritten so that it is no longer a server, but rather a tool that pushes to the server automatically. Furthermore, LabShare needs a new POST endpoint that takes nvidia-smi data (this will need some kind of authentication, maybe an API key?). We then need push logic that sends new GPU data to the clients via websockets (similar to this code).
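
The client side of this could look roughly like the sketch below. It parses CSV text as produced by nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits and builds an authenticated POST request without sending it. The endpoint path /gpu/update, the payload layout, and the "Token" authorization scheme are assumptions; the issue leaves all of them open.

```python
import csv
import io
import json
import urllib.request

# Hypothetical sample of the CSV output described in the lead-in.
SAMPLE_CSV = """\
0, NVIDIA GeForce GTX 1080 Ti, 37, 4021, 11178
1, NVIDIA GeForce GTX 1080 Ti, 0, 12, 11178
"""


def parse_nvidia_smi_csv(output):
    """Turn the CSV text into a list of dicts (assumes numeric fields)."""
    gpus = []
    for row in csv.reader(io.StringIO(output), skipinitialspace=True):
        if len(row) != 5:
            continue  # skip blank or malformed lines
        index, name, utilization, used, total = (cell.strip() for cell in row)
        gpus.append({
            "index": int(index),
            "name": name,
            "utilization_gpu": int(utilization),
            "memory_used": int(used),
            "memory_total": int(total),
        })
    return gpus


def build_update_request(server_url, device_name, token, gpus):
    """Build (but do not send) the POST request carrying the GPU data."""
    payload = json.dumps({"device_name": device_name, "gpus": gpus}).encode("utf-8")
    return urllib.request.Request(
        url=server_url.rstrip("/") + "/gpu/update",  # hypothetical endpoint
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Token {token}",  # assumed scheme
        },
    )


gpus = parse_nvidia_smi_csv(SAMPLE_CSV)
request = build_update_request("http://localhost:8000/", "gpu-server-01", "abc", gpus)
print(request.get_method(), request.full_url)
```

Sending the request would be a call to urllib.request.urlopen(request) inside a loop that sleeps between updates.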

We'll also need a user interface for this.

fix mail handling for LDAP authentication

Right now, when authenticating a user via mail, we only check whether the number of mail addresses changed. Instead, we should check whether any mail address changed at all.
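
The difference between the two checks can be shown with a set comparison. This is a sketch of the idea only, not the code in LabShare; the function names are hypothetical, and treating addresses as case-insensitive is an assumption.

```python
def normalize(addresses):
    """Lower-case and strip addresses so that formatting differences are ignored."""
    return {address.strip().lower() for address in addresses}


def count_changed(stored, from_ldap):
    """The current check: only notices when the number of addresses differs."""
    return len(stored) != len(from_ldap)


def mail_addresses_changed(stored, from_ldap):
    """The proposed check: notices any added, removed, or replaced address."""
    return normalize(stored) != normalize(from_ldap)


stored = ["alice@example.org", "a.smith@example.org"]
from_ldap = ["alice@example.org", "alice.smith@example.org"]

print(count_changed(stored, from_ldap))           # False: the replacement is missed
print(mail_addresses_changed(stored, from_ldap))  # True
```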

Change Reservation behavior

If a user reserves a spot on a GPU, they keep this spot forever. This leads to problems, because users often do not actually use the GPU while other people have to wait and lose precious time for their experiments.

We should change the reservation behavior in such a way that a reservation is based on a predefined time slot that can also be adjusted at the time of reservation by the user. Once the time slot is over the GPU will be freed and the next user is invited to perform his/her experiments. This should not include the forceful shutdown of any trainings, but rather make sure that people are not blocking a GPU.

Furthermore, we should compute statistics for the reservation period and show them to the admins. With this we might encourage users to really use their GPU time instead of just idling around.
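
A minimal sketch of the time-slot idea, using an in-memory list as the queue. The class, the function, and the default slot length of four hours are assumptions for illustration. In line with the proposal, expiry only hands the reservation to the next user and does not stop any running process.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

DEFAULT_SLOT = timedelta(hours=4)  # assumed default, adjustable per reservation


@dataclass
class Reservation:
    user: str
    duration: timedelta = DEFAULT_SLOT
    started_at: Optional[datetime] = None  # set when the reservation becomes active

    def is_expired(self, now):
        return self.started_at is not None and now >= self.started_at + self.duration


def advance_queue(queue, now):
    """Free the GPU if the active slot is over and return the current user."""
    if queue and queue[0].is_expired(now):
        queue.pop(0)  # the slot is over; no process is killed
    if queue and queue[0].started_at is None:
        queue[0].started_at = now  # the next user's slot starts now
    return queue[0].user if queue else None


start = datetime(2024, 1, 1, 9, 0)
queue = [Reservation("alice", timedelta(hours=2)), Reservation("bob")]

print(advance_queue(queue, start))                       # alice
print(advance_queue(queue, start + timedelta(hours=2)))  # bob
```

Storing started_at and duration for each reservation would also provide the raw data for the usage statistics mentioned above.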

Improve waiting queue

It would be great if I could see who is queued before me. Currently I can only see the active user and the next one, but not who comes after them. This feature only makes sense if there are 3+ people queued up.
Perhaps there could be a little tooltip when I hover over the queue/next row?
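
The text for such a tooltip could be derived from the ordered queue as sketched below. The function names and the wording are hypothetical; the queue is assumed to be a list of user names whose first entry is the active user.

```python
def users_ahead(queue, user):
    """Return everyone queued before `user`, in order (active user first)."""
    if user not in queue:
        return []
    return queue[: queue.index(user)]


def tooltip_text(queue, user):
    """Build the tooltip shown when hovering over the queue row."""
    ahead = users_ahead(queue, user)
    if not ahead:
        return "Nobody is ahead of you"
    return "Ahead of you: " + ", ".join(ahead)


queue = ["alice", "bob", "carol", "dave"]
print(tooltip_text(queue, "dave"))  # Ahead of you: alice, bob, carol
```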

timing problem

Fix the timing problem in test_template_tag_position_in_queue by adding time.sleep(0.1) or something like that.
