
gpuservers's Introduction

iBanks NVIDIA powered GPU cluster

Nodes and Access

The head nodes login-[1,2,3].hep.caltech.edu are only to be used for submitting batch jobs.

Worker nodes are as follows, in chronological order of creation:

  • culture-plate-sm.hep.caltech.edu is a Supermicro server with 2 TB of local SSD, hosting 8 NVIDIA GeForce GTX 1080 GPUs
  • imperium-sm.hep.caltech.edu is a Supermicro server with 2 TB of local SSD, hosting 8 NVIDIA GeForce GTX 1080 GPUs
  • flere-imsaho-sm.hep.caltech.edu is a Supermicro server with 2 TB of local SSD, hosting 6 NVIDIA Titan Xp GPUs
  • mawhrin-skel-sm.hep.caltech.edu is a Supermicro server with 2 TB of local NVMe, hosting 2 NVIDIA GeForce GTX Titan X GPUs

All servers have a public (regular network) and a private (10 Gb/s) IP. SSH keys are the only authentication method, combined with two-factor authentication. Please contact the admins (t2admin AT hep.caltech.edu) in case of issues.
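For convenience, you can add an entry for a head node to your SSH configuration. A minimal sketch (the host alias, username placeholder, and key file name are hypothetical; adapt them to your own setup):

# ~/.ssh/config - hypothetical entry for a head node
Host ibanks
    HostName login-1.hep.caltech.edu
    User <your username>
    IdentityFile ~/.ssh/id_ed25519

You can then connect with ssh ibanks.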

Credits

If you are a user of the cluster, we are happy to help you make progress. When producing a publication or public presentation, please be so kind as to notify Prof. Spiropulu and Dr. Vlimant, for accounting purposes. Please include the following LaTeX acknowledgement of support:

Part of this work was conducted at ``\textit{iBanks}'', the AI GPU cluster at Caltech. We acknowledge NVIDIA, SuperMicro and the Kavli Foundation for their support of ``\textit{iBanks}''.

Credentials

Follow the instructions for Tier 2 access to get access to the iBanks nodes.

Data Storage

The home directory /storage/af/user/<user name> should be used for software. Although there is room, please refrain from putting too much data in your home directory.

The /data/ volume is mounted on some nodes, not always on SSD. This is the preferred temporary location for data needing intensive I/O. There is a 60-days-since-last-access retention policy on this directory.

The /imdata/ volume is a 40 GB ramdisk with very high throughput, but it uses the RAM of the machine. Please use it only when very high I/O is needed, and clean up the space promptly, since it consumes node memory. There is a 2-days-since-last-access retention policy on this directory.

The /mnt/hadoop/ path provides read-only access to the full Caltech Tier 2 storage.

The /storage/af/group/gpu/shared path is a 120 TB CEPH volume that can be used similarly to bigdata. The old content of /bigdata/shared is now available under /storage/af/group/gpu/bigdata.
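Before writing large files, it is worth checking the available space and your own usage. A minimal sketch using standard coreutils (paths as above):

df -h /data /imdata /storage/af    # free space on each volume
du -sh /storage/af/user/$USER      # size of your home directory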

CERNbox

You can synchronize your CERNbox with the local directory /storage/af/user/$USER/cernbox by launching

/storage/af/group/gpu/software/gpuservers/scripts/sync-cernbox.sh

at least once interactively to set up the password. It can then be run in the background, on one node only, using

screen -S cernbox -d -m /storage/af/group/gpu/software/gpuservers/scripts/sync-cernbox.sh
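To check on or reattach to the background session later (standard screen usage):

screen -ls           # list your screen sessions
screen -r cernbox    # reattach to the sync session (detach again with Ctrl-a d)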

Setup

It is important to note that I/O on the NFS-mounted volumes is not as efficient as on local disk, so please use them with care and monitor the performance of your applications.

For IPython, the following directory has to be local:

mkdir -p /tmp/$USER/ipython
ln -s /tmp/$USER/ipython ~/.ipython

For CUDA, the same applies to:

mkdir -p /tmp/$USER/cuda/
export CUDA_CACHE_PATH=/tmp/$USER/cuda/

It is recommended to put the export CUDA_CACHE_PATH line in your login file.
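For example, assuming bash is your login shell:

echo 'export CUDA_CACHE_PATH=/tmp/$USER/cuda/' >> ~/.bashrc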

To use only a selected GPU, run nvidia-smi or gpustat to see GPU utilization, then set export CUDA_VISIBLE_DEVICES=n, where n is the index of the GPU you want to use. In Python one can either set the environment variable or use import setGPU (which picks one available device automatically).
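A minimal sketch of the environment-variable approach (train.py is a hypothetical script standing in for your application):

gpustat                        # or: nvidia-smi
export CUDA_VISIBLE_DEVICES=3  # restrict the job to GPU index 3
python3 train.py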

Software

CVMFS

CVMFS is mounted on the nodes and can be used as usual.
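For example, to set up the CMS software environment from the cms.cern.ch repository (a typical use of CVMFS on this cluster; adapt to the repository you need):

ls /cvmfs/cms.cern.ch/                       # browse the repository
source /cvmfs/cms.cern.ch/cmsset_default.sh  # set up the CMS environment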

Singularity

All the software is provided via Singularity images located in /storage/af/group/gpu/software/singularity/ibanks/. The configuration of the images is located at https://github.com/cmscaltech/gpuservers/tree/master/singularity and in /storage/af/group/gpu/software/singularity/.

image        description
legacy.simg  A fixed image with the software already installed on the iBanks nodes
edge.simg    An image with many of the useful libraries, at their latest versions

Let the admins know of any missing library that could be put in the image. A build service will be set up later.

To start a shell in the cutting-edge image:

/storage/af/group/gpu/software/gpuservers/singularity/run.sh

or, to start with a given image:

/storage/af/group/gpu/software/gpuservers/singularity/run.sh /storage/af/group/gpu/software/singularity/ibanks/legacy.simg 

Building an image

To build an image, first make sure that there is no existing image that is usable or extendable for your purpose. There are examples of image specifications under the singularity directory to help you create your specification.singularity file. To build the image myimage.simg from the spec:

/storage/af/group/gpu/software/gpuservers/singularity/build myimage.simg specification.singularity

If you make changes to an existing image, please provide your suggestion via a pull request modifying the specification file.

If you are building on top of an existing image, you can use that image as the base, which greatly reduces the build time. See the example below of building on top of the edge image.
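A minimal sketch of such a recipe (the Bootstrap/From header is standard Apptainer/Singularity recipe syntax; the %post contents are a hypothetical example of adding an extra library):

# myimage.singularity - hypothetical recipe extending the edge image
Bootstrap: localimage
From: /storage/af/group/gpu/software/singularity/ibanks/edge.simg

%post
    pip3 install tables    # hypothetical extra package to add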

Sandbox

It might be practical to try a Singularity recipe live instead of iterating through build/test/fail cycles. In this case, create a sandbox, run it in writable mode, and execute the build commands, which you can later copy back into the recipe:

apptainer build --sandbox SANDBOXTOBECREATED/ existing.simg
apptainer shell --writable SANDBOXTOBECREATED/

A Singularity image can be created back from the sandbox:

apptainer build final.simg SANDBOXTOBECREATED/

Tensorflow

TensorFlow is greedy in using GPUs, so it is mandatory to use export CUDA_VISIBLE_DEVICES=n (where n is the index of a device, or a comma-separated list of indices) to use only selected devices, if this is not explicitly controlled within the application. In Python, please use import setGPU, which automatically selects the next available GPU.
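A minimal sketch of both approaches (train.py is a hypothetical script; the one-liner assumes a TensorFlow 2.x image and just prints which devices are visible):

# restrict TensorFlow to GPUs 0 and 2 before starting Python
export CUDA_VISIBLE_DEVICES=0,2
python3 train.py

# or let setGPU pick a free GPU from within Python
python3 -c "import setGPU; import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"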

Jupyter Hub

Setting this up properly on the cluster is a work in progress.

Jupyter Notebook

Users can start a Jupyter notebook server on each machine using either

/storage/af/group/gpu/software/gpuservers/jupyter/start_S.sh

to start a notebook with the latest Singularity image, or

/storage/af/group/gpu/software/gpuservers/jupyter/start_S.sh /storage/af/group/gpu/software/singularity/ibanks/legacy.simg

to start it in a given image.

To start the notebook in screen directly:

screen -S jupyter -d -m /storage/af/group/gpu/software/gpuservers/jupyter/start_S.sh

This will provide a URL to connect to, including an authentication token that changes each time you restart the Jupyter server. You should keep this token private, though you can share it momentarily to let other people edit your notebooks; beware that anyone with the token is "you".

To list the Jupyter notebooks already running on the machine, and the URLs to be used, one can run

/storage/af/group/gpu/software/singularity/run.sh "jupyter notebook list"

The port assigned to you is your user ID; it should be opened automatically. Let an admin know if that is not the case.
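To find your numeric user ID, and hence your port (standard coreutils):

id -u    # prints your numeric user ID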

MPI

MPI is available within nodes and across nodes (as long as you have passwordless public-key SSH between the nodes). To run a program with MPI:

mpirun --prefix /opt/openmpi-3.1.0 -np 3 nvidia-smi

To run a program using singularity with mpi

mpirun --prefix /opt/openmpi-3.1.0 -np 3 singularity exec -B /storage --nv /storage/af/group/gpu/software/singularity/ibanks/edge.simg python3 /storage/af/group/gpu/software/gpuservers/mpi/mpi4py-examples/03-scatter-gather

To run across nodes, first copy /storage/af/group/gpu/software/gpuservers/mpi/mca-params.conf into the $HOME/.openmpi/ directory:
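mkdir -p $HOME/.openmpi
cp /storage/af/group/gpu/software/gpuservers/mpi/mca-params.conf $HOME/.openmpi/

Then launch across nodes with the hostfile, for example: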

mpirun  --prefix /opt/openmpi-3.1.0 --hostfile /storage/af/group/gpu/software/gpuservers/mpi/hostfile -np 10 singularity exec -B /storage --nv /storage/af/group/gpu/software/singularity/ibanks/edge.simg python3 /storage/af/group/gpu/software/gpuservers/mpi/mpi4py-examples/03-scatter-gather

gpuservers's People

Contributors

dkcira, jpata, juztas, p234a137, thongonary, vlimant


gpuservers's Issues

Update Jupyter to secure release

Due to several vulnerabilities, Jupyter inside the edge image must be updated to at least:
Jupyter Notebook 6.4.1 or above (or 5.7.11 or above on the 5.x series).
JupyterLab 3.1.4, 3.0.17, 2.3.2, 2.2.10, or 1.2.21 or above (depending on the series).

Can we have a test machine with sudo access to build singularity?

I need to build a Singularity image, but the only Linux machine I have root/sudo access to is pccitX at CERN, which uses an old SLC version (SLC 6.5) and cannot build the image due to an OS incompatibility (sylabs/singularity#473).

Can we have a test machine at the Caltech cluster on which we have sudo access and can build the image? Otherwise users have to rely on the admins to update the Singularity image, which is very suboptimal.

CMSSW doesn't work with edge singularity image

I was trying to run CMSSW with the edge image, but it gives the following error:

-bash-4.2$ singularity exec --home /storage/user/jbalcas/:/srv --bind /mnt/hadoop --bind /storage --bind /cvmfs --pwd /srv --contain --ipc --pid /storage/group/gpu/software/singularity/ibanks/edge.simg bash
Singularity>
Singularity> source /cvmfs/cms.cern.ch/cmsset_default.sh
Singularity> cd /storage/cc^C
Singularity> cd tmp/
Singularity> ls
Singularity> cmsrel CMSSW_10_2_8
Can't locate Data/Dumper.pm in @INC (@INC contains: /cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V2_2_9_pre11 /cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V2_2_9_pre11/src /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V2_2_9_pre11/src/Cache/CacheUtilities.pm line 3.
BEGIN failed--compilation aborted at /cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V2_2_9_pre11/src/Cache/CacheUtilities.pm line 3.
Compilation failed in require at /cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V2_2_9_pre11/src/SCRAM/CMD.pm line 500.
BEGIN failed--compilation aborted at /cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V2_2_9_pre11/src/SCRAM/CMD.pm line 500.
Compilation failed in require at /cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V2_2_9_pre11/src/SCRAM/SCRAM.pm line 35.
BEGIN failed--compilation aborted at /cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V2_2_9_pre11/src/SCRAM/SCRAM.pm line 35.
Compilation failed in require at /cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V2_2_9_pre11/bin/scram line 19.
BEGIN failed--compilation aborted at /cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V2_2_9_pre11/bin/scram line 19.

Ports to be opened based on UID

It would be nice if the ports assigned to users depended on their UID. In that case you would not need to manually edit a file and add a port for each user.

e.g. I am unable to use the Jupyter notebook, because I am not on that list.
