bartzi / labshare


Django Tool that helps everyone to get their fair share of GPU time

License: GNU General Public License v2.0

Python 82.34% HTML 11.96% JavaScript 5.70%

django deep-learning organizer

labshare's Introduction

LabShare

Django Tool that helps everyone to get their fair share of GPU time.

Installation

Clone the repository.
Make sure that OpenLDAP and SASL are installed (under Ubuntu they can be installed with: apt-get install libldap2-dev libsasl2-dev).
Install the requirements with pip install -r requirements.txt (make sure to use Python 3, version 3.6 or newer).
Start a Redis server instance (for example in a Docker container: docker run -p 6379:6379 -d redis).
Create the database by running python manage.py migrate.
Install Yarn.
Go into the folder static and run yarn or yarn install to install all front-end libraries.
Run the test server with python manage.py runserver (in the root directory).

Usage

Create a superuser by running python manage.py createsuperuser.
If you want more users, create them in the admin web interface (/admin).
Create a new Device in the Django admin for every device you want to monitor.
Deploy the device_query script on every machine with a GPU that should be monitored.
Copy the example.ini file and rename the copy to config.ini:
- change device_name to the name of the device that was created in the admin interface
- change server_url to the address where the Django server is running
- on the Django machine, run python manage.py tokens or python manage.py token [device_name] to get the authentication token of the registered device and paste it into the config file
Run the device_query script.
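
As a rough illustration of the config step above, the following sketch reads such a file with Python's standard configparser module. The section name [main] and the key name token are assumptions made for this example (only device_name and server_url are named in this README), so check example.ini for the actual layout.

```python
import configparser

# Hypothetical file contents. The [main] section and the `token` key are
# assumptions; example.ini in the repository defines the real names.
EXAMPLE_CONFIG = """
[main]
device_name = gpu-server-1
server_url = http://localhost:8000
token = paste-the-token-here
"""


def load_settings(text):
    """Parse the config text and return the values a client script would need."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["main"]
    return {
        "device_name": section["device_name"],
        # Strip a trailing slash so endpoint paths can be appended safely.
        "server_url": section["server_url"].rstrip("/"),
        "token": section["token"],
    }


settings = load_settings(EXAMPLE_CONFIG)
print(settings["device_name"], settings["server_url"])
```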

Configuration

For users to see the devices and their GPUs, you need to give each user the permission to do so. You can do this in one of the following ways:

Add the use_device permission to a group of your choice (for instance the default Staff group) and add users to this group. This global permission allows each user in that group to use all GPUs in LabShare, which makes it easy to grant the necessary permission to many users at once.
For fine-grained control over who can use which device, add the use_device permission to individual users or groups in the permission admin of each device.

labshare's People

Forkers

dikorsch manisoftwartist mxrcx hendraet

labshare's Issues

Get notified when there's a problem with GPUs

Hey,

it would be great to be notified by email when the exclamation mark appears on the server.

Best,
Goncalo

change usage indicator

Usage of a GPU should be determined based on the telemetry gained from the clients

add contribute on github banner

API

It would be nice to have an API for (a subset of):

Reserved GPUs on a machine
Reservation of a GPU
Finishing/Cancelling a reservation

dot missing in last update date

The date in the last update field (sometimes?) is missing a dot at the end, e.g. "2 p.m. 07.1"

Show hint that user might not be logged in when there is no data available

We should add a hint that a user might need to log in if there is no GPU data available.

store time lengths as constants in settings

Variable time lengths should probably be constants stored in settings.py.

indicate in overview that a device did not update in the last 30 minutes
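
A minimal sketch of such a check, which also follows the idea from the previous issue of keeping the time length in a named constant. The constant name and the function are hypothetical and not taken from the LabShare code.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical constant; in the project this would live in settings.py.
DEVICE_STALE_AFTER = timedelta(minutes=30)


def is_stale(last_update, now=None):
    """Return True if the device has not reported within DEVICE_STALE_AFTER."""
    now = now or datetime.now(timezone.utc)
    return now - last_update > DEVICE_STALE_AFTER


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_stale(now - timedelta(minutes=45), now))  # True
print(is_stale(now - timedelta(minutes=5), now))   # False
```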

add Selenium tests for frontend

add tests for LDAP authentication

We need tests for LDAP authentication. This will require heavy use of the mock library.

add functionality for admins to kick peoples ass

We should have a feature that lets admins send a notification to users who are sitting on a GPU and doing nothing.

Add indicator for low GPU utilization

Sometimes, a GPU utilization close to zero means that the code is not working correctly. It would be nice to add an indicator to the UI that is shown when the utilization falls below a certain threshold.
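
One possible shape for this check, sketched under assumptions: the threshold value is made up (the issue does not name one), and averaging over a few recent samples is a design choice of this example so that a short dip between training steps does not trigger the indicator.

```python
# Hypothetical threshold in percent; the issue does not name a value.
LOW_UTILIZATION_THRESHOLD = 5.0


def is_underutilized(samples, threshold=LOW_UTILIZATION_THRESHOLD):
    """Return True if the mean of recent utilization samples is below the threshold."""
    if not samples:
        return False  # no telemetry yet, so there is nothing to flag
    return sum(samples) / len(samples) < threshold


print(is_underutilized([0, 1, 0, 2]))    # True: nearly idle over the whole window
print(is_underutilized([0, 95, 90, 0]))  # False: short dips do not trigger the hint
```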

How to create new account on GPU?

add coveralls support

automatically add groups to new LDAP user

A new LDAP user should be assigned to the correct group, based on their group membership in the LDAP directory.

New Usage indicator also indicates usage of gpu even if no one is using it

CPU Usage

Hey @Bartzi

Good job man! Thank you for sharing on github!
I am wondering how to show CPU and memory usage in the grid as well as GPU usage!

Thanks in advance!
M.R

add information to readme

Add information on how to set up bower components to readme

Read RAM Usage of Servers

device_query should also read the RAM usage of the servers and report the process with the highest RAM usage. We'll need the following things:

get RAM usage
find process that uses most RAM
issue a warning if RAM is nearly full (around 95%)
post a message via Slack
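
A minimal sketch of the first and third item, assuming a Linux host where /proc/meminfo is available. It works on the file's text so it can be tried without a server; the function name and the sample values are made up for illustration. Finding the top process and posting to Slack are not covered here.

```python
RAM_WARNING_THRESHOLD = 0.95  # from the issue: warn at around 95 %


def ram_usage_fraction(meminfo_text):
    """Compute used RAM as a fraction from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])  # sizes are reported in kB
    total = values["MemTotal"]
    available = values["MemAvailable"]
    return (total - available) / total


# Made-up sample; on a real machine, read the text from /proc/meminfo instead.
SAMPLE = """MemTotal:       16000000 kB
MemFree:          200000 kB
MemAvailable:     400000 kB
"""

usage = ram_usage_fraction(SAMPLE)
if usage >= RAM_WARNING_THRESHOLD:
    print(f"RAM nearly full: {usage:.1%} used")
```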

Do not persist GPU Processes anymore

Since we have continuous updates and never query the persisted GPU state information, there is no need to persist it anymore.
It would be good to get rid of all this unnecessary saving and push the updated GPU info directly to the clients.

change not in queue to something like - or /

Add manage.py endpoint that allows creation of Device

Right now, it is quite tedious to add a new Device.
You need to go into the admin, create a new Device and a new user.

We could use a script that does both, by just saying manage.py create_device <devicename>

Parsing error in specific sinfo outputs

sinfo outputs like the following do not work yet: fb10dl09 gpu:3090:3(IDX:0-1,3)
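
The difficulty in this output is the index list 0-1,3, which mixes a range with a single index. A sketch of how such an entry could be parsed is shown below; the regular expression and the function name are written for this example and are not taken from the LabShare parser.

```python
import re

# Matches one entry such as gpu:3090:3(IDX:0-1,3)
GRES_PATTERN = re.compile(r"gpu:(?P<type>[^:]+):(?P<count>\d+)\(IDX:(?P<idx>[^)]*)\)")


def expand_indices(spec):
    """Expand an index spec such as '0-1,3' into [0, 1, 3]."""
    if spec == "N/A":
        return []
    indices = []
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-")
            indices.extend(range(int(start), int(end) + 1))
        else:
            indices.append(int(part))
    return indices


match = GRES_PATTERN.search("fb10dl09 gpu:3090:3(IDX:0-1,3)")
print(match["type"], int(match["count"]), expand_indices(match["idx"]))
# 3090 3 [0, 1, 3]
```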

Add a summary of the resources allocated by slurm

Similar to the sinfo command for GresUsed, we could try to integrate the console output:

GRES_USED                                    NODELIST            
gpu:1080ti:2(IDX:0-1)                        fb10dl03            
gpu:1080ti:3(IDX:0-2)                        fb10dl06            
gpu:1080ti:4(IDX:0-3)                        fb10dl07            
gpu:2080ti:1(IDX:0)                          fb10dl08            
gpu:3090:3(IDX:0-2)                          fb10dl09            
gpu:1080ti:1(IDX:0),gpu:980gtx:1(IDX:1)      resterampe          
gpu:titanx:0(IDX:N/A)                        fb10dl[04-05]

Bug in sinfo Parsing for Machines with different GPU types

On machines with multiple different GPU types the sinfo output seems to be parsed incorrectly:

Both GPUs should be shown as allocated, not just one!
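
One way this could be handled, sketched with the mixed-type line from the sinfo output quoted in the previous issue: splitting the field on commas is unsafe because index lists may contain commas too, so this example collects every gpu:type:count(IDX:...) entry with a regular expression instead. The pattern and function name are assumptions of this sketch, not the project's code.

```python
import re

# One match per GPU type; the IDX part may itself contain commas.
ENTRY = re.compile(r"gpu:([^:,]+):(\d+)\(IDX:([^)]*)\)")


def used_gpus_by_type(gres_used):
    """Map each GPU type in a GresUsed string to its allocated count."""
    return {gpu_type: int(count) for gpu_type, count, _ in ENTRY.findall(gres_used)}


usage = used_gpus_by_type("gpu:1080ti:1(IDX:0),gpu:980gtx:1(IDX:1)")
print(usage, sum(usage.values()))
```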

have more than one email address per user

This might be accomplished by allowing more than one E-Mail per user object, or by using Groups

Add Nvidia-Smi Viewer

In light of recent cluster restructurings, we will not need the main functionality of LabShare anymore.
Instead, we might need a functionality that allows us to do system monitoring.
One of the most important parts is to monitor the output of nvidia-smi.

How it should work

The user opens LabShare and sees a list of all compute nodes. Each compute node shows its name and number of GPUs. The list is shaded based on usage status: a node that is not in use is plain white, and a node that is in use should be something like yellow (maybe the same color as currently in LabShare).
The user may click on a machine to see further information about the compute node. For now, all GPUs and their current utilization shall be listed. The utilization is automatically updated via websockets. Each compute node pushes its newest nvidia-smi data to the server.

Technical Info

Our device_query script needs to be rewritten so that it is no longer a server, but rather a tool that pushes to the server automatically. Furthermore, LabShare needs a new POST endpoint that takes nvidia-smi data (this will need some kind of authentication, maybe an API key?). We then need push logic to forward new GPU data via websockets (similar to this code).
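
A minimal sketch of the client side of this idea, using only the Python standard library. It assumes the GPU data comes from nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits. The payload layout and the Authorization header scheme are assumptions of this example, and the /gpu/update path is borrowed from the example.ini issue further down; none of this is confirmed as LabShare's actual API.

```python
import csv
import io
import json
import urllib.request

# Column order matches the --query-gpu field list given above.
FIELDS = ["index", "name", "utilization.gpu", "memory.used", "memory.total"]


def parse_smi_csv(text):
    """Turn nvidia-smi CSV output (no header, no units) into a list of dicts."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(FIELDS, row)) for row in reader if row]


def build_request(server_url, token, device_name, gpus):
    """Build (but do not send) a POST request carrying the GPU data."""
    body = json.dumps({"device_name": device_name, "gpus": gpus}).encode("utf-8")
    return urllib.request.Request(
        server_url.rstrip("/") + "/gpu/update",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Token " + token,  # header scheme is an assumption
        },
        method="POST",
    )


# Made-up sample output for two GPUs.
sample = "0, GeForce GTX 1080 Ti, 87, 9500, 11178\n1, GeForce GTX 1080 Ti, 0, 10, 11178\n"
request = build_request("http://localhost:8000", "secret", "gpu-server-1", parse_smi_csv(sample))
print(request.get_method(), request.full_url)
```

Sending the request would then be a call to urllib.request.urlopen(request), run periodically on each compute node.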

We'll also need a user interface for this.

Could you please

Menu Link does not work

e-mail address is not required

create possibility to reserve the next available GPU on a machine

display fan speed and temperature in nvidia-smi viewer

We should also display fan speed and GPU temperature in our new nvidia-smi viewer.

add information on how to use device query script

Rename permission "user/group is not allowed to use that device"

This should actually be user/group is allowed to use that device.

add possibility to send message to all users

fix mail handling for LDAP authentication

Right now we only check whether the number of mail addresses changed when authenticating a user via mail, but we should instead check whether any mail address changed at all!
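
The difference between the two checks can be sketched as follows. The function name is hypothetical, and the case-insensitive comparison is an assumption of this example rather than something the issue asks for.

```python
def mail_addresses_changed(stored, from_ldap):
    """Return True if any address differs, not only when the count differs."""
    def normalize(addresses):
        return {address.strip().lower() for address in addresses}

    return normalize(stored) != normalize(from_ldap)


# Same number of addresses, but one was replaced: a length check misses this.
print(mail_addresses_changed(["a@example.org", "b@example.org"],
                             ["a@example.org", "c@example.org"]))  # True
```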

add ldap authentication

add ldap authentication, maybe using this lib: https://django-auth-ldap.readthedocs.io/en/latest/

reserve button should only be visible for logged in users

remove user in columns current user and next user

/gpu/update seems to be missing in example.ini

@hendraet It seems to me that the correct endpoint is not configured in the example.ini file for device_query. Is this intended? If not, we either need to add this /gpu/update to the file example.ini or add it to the README

Change Reservation behavior

If a user reserves a spot on a GPU, they keep this spot forever. This leads to problems, as users often do not use the GPU while other people have to wait and lose precious time for their experiments.

We should change the reservation behavior so that a reservation is based on a predefined time slot, which the user can adjust at the time of reservation. Once the time slot is over, the GPU is freed and the next user is invited to run their experiments. This should not include the forceful shutdown of any training runs, but rather make sure that people are not blocking a GPU.
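
A rough sketch of what a time-slot based reservation could look like. The class, the default slot length and the queue handling are all hypothetical and only illustrate the proposal; they do not describe LabShare's existing models.

```python
from datetime import datetime, timedelta

DEFAULT_SLOT = timedelta(hours=12)  # hypothetical default slot length


class Reservation:
    """A reservation that ends after a fixed, user-adjustable time slot."""

    def __init__(self, user, start, duration=DEFAULT_SLOT):
        self.user = user
        self.start = start
        self.end = start + duration

    def is_expired(self, now):
        return now >= self.end


def active_reservation(queue, now):
    """Return the first reservation in the queue that has not expired, or None."""
    for reservation in queue:
        if not reservation.is_expired(now):
            return reservation
    return None


start = datetime(2024, 1, 1, 8, 0)
queue = [
    Reservation("alice", start, timedelta(hours=2)),  # user-chosen 2 h slot
    Reservation("bob", start + timedelta(hours=2)),   # default slot, starts after alice
]
print(active_reservation(queue, datetime(2024, 1, 1, 11, 0)).user)  # bob
```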

Furthermore, we should compute statistics for the reservation period and show them to the admins. With this we might encourage users to really use their GPU time instead of just idling around.

Improve waiting queue

It would be great if I could see who is queued before me. Currently I can only see the current active user and the next one, but not who comes after them. This feature only makes sense if there are 3+ people queued up.
Perhaps there could be a little tooltip when I hover over the queue/next row?