
gpuservers's Introduction

iBanks NVIDIA powered GPU cluster

Nodes and Access

The head nodes login-[1,2,3].hep.caltech.edu are only to be used for submitting batch jobs.

Worker nodes are as follows, in chronological order of creation:

  • culture-plate-sm.hep.caltech.edu is a Supermicro server with 2 TB of local SSD, hosting 8 NVIDIA GeForce GTX 1080 GPUs
  • imperium-sm.hep.caltech.edu is a Supermicro server with 2 TB of local SSD, hosting 8 NVIDIA GeForce GTX 1080 GPUs
  • flere-imsaho-sm.hep.caltech.edu is a Supermicro server with 2 TB of local SSD, hosting 6 NVIDIA Titan Xp GPUs
  • mawhrin-skel-sm.hep.caltech.edu is a Supermicro server with 2 TB of local NVMe, hosting 2 NVIDIA GeForce GTX Titan X GPUs

All servers have a public (regular network) and a private (10 Gb/s) IP. SSH keys are the only authentication method, combined with two-factor authentication. Please contact the admins (t2admin AT hep.caltech.edu) in case of issues.
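For convenience, you can add an entry for a head node to your SSH configuration. A minimal sketch (the host alias, username placeholder, and key file name are hypothetical; adapt them to your own setup):

# ~/.ssh/config - hypothetical entry for a head node
Host ibanks
    HostName login-1.hep.caltech.edu
    User <your username>
    IdentityFile ~/.ssh/id_ed25519

You can then connect with ssh ibanks.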

Credits

If you are a user of the cluster, we are happy to help you make progress. When producing a publication or public presentation, please be so kind as to notify Prof. Spiropulu and Dr. Vlimant, for accounting purposes. Please include the following LaTeX acknowledgement of support:

Part of this work was conducted at ``\textit{iBanks}'', the AI GPU cluster at Caltech. We acknowledge NVIDIA, SuperMicro and the Kavli Foundation for their support of ``\textit{iBanks}''.

Credentials

Follow the instructions for Tier 2 access to get access to the iBanks nodes.

Data Storage

The home directory /storage/af/user/<user name> should be used for software. Although there is room, please refrain from putting too much data in your home directory.

The /data/ volume is mounted on some nodes, not always on SSD. This is the preferred temporary location for data needing intensive I/O. There is a 60-days-since-last-access retention policy on this directory.

The /imdata/ volume is a 40 GB ramdisk with very high throughput, but it uses the RAM of the machine. Please use it only when very high I/O is needed, and clean up the space promptly, since it consumes node memory. There is a 2-days-since-last-access retention policy on this directory.

The /mnt/hadoop/ path provides read-only access to the full Caltech Tier 2 storage.

The /storage/af/group/gpu/shared path is a 120 TB CEPH volume that can be used similarly to bigdata. The old content of /bigdata/shared is now available under /storage/af/group/gpu/bigdata.
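Before writing large files, it is worth checking the available space and your own usage. A minimal sketch using standard coreutils (paths as above):

df -h /data /imdata /storage/af    # free space on each volume
du -sh /storage/af/user/$USER      # size of your home directory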

CERNbox

You can synchronize your CERNbox with the local directory /storage/af/user/$USER/cernbox by launching

/storage/af/group/gpu/software/gpuservers/scripts/sync-cernbox.sh

at least once interactively to set up the password. It can then be run in the background, on one node only, using

screen -S cernbox -d -m /storage/af/group/gpu/software/gpuservers/scripts/sync-cernbox.sh
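To check on or reattach to the background session later (standard screen usage):

screen -ls           # list your screen sessions
screen -r cernbox    # reattach to the sync session (detach again with Ctrl-a d)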

Setup

It is important to note that I/O on the NFS-mounted volumes is not as efficient as on local disk, so please use them with care and monitor the performance of your applications.

For IPython, the following directory has to be local:

mkdir -p /tmp/$USER/ipython
ln -s /tmp/$USER/ipython ~/.ipython

For CUDA, the same applies to:

mkdir -p /tmp/$USER/cuda/
export CUDA_CACHE_PATH=/tmp/$USER/cuda/

It is recommended to put the export CUDA_CACHE_PATH line in your login file.
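For example, assuming bash is your login shell:

echo 'export CUDA_CACHE_PATH=/tmp/$USER/cuda/' >> ~/.bashrc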

To use only a selected GPU, run nvidia-smi or gpustat to see GPU utilization, then set export CUDA_VISIBLE_DEVICES=n, where n is the index of the GPU you want to use. In Python one can either set the environment variable or use import setGPU (which picks one available device automatically).
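A minimal sketch of the environment-variable approach (train.py is a hypothetical script standing in for your application):

gpustat                        # or: nvidia-smi
export CUDA_VISIBLE_DEVICES=3  # restrict the job to GPU index 3
python3 train.py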

Software

CVMFS

CVMFS is mounted on the nodes and can be used as usual.
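For example, to set up the CMS software environment from the cms.cern.ch repository (a typical use of CVMFS on this cluster; adapt to the repository you need):

ls /cvmfs/cms.cern.ch/                       # browse the repository
source /cvmfs/cms.cern.ch/cmsset_default.sh  # set up the CMS environment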

Singularity

All the software is provided via Singularity images located in /storage/af/group/gpu/software/singularity/ibanks/. The configuration of the images is located at https://github.com/cmscaltech/gpuservers/tree/master/singularity and in /storage/af/group/gpu/software/singularity/.

image        description
legacy.simg  A fixed image with the software already installed on the iBanks nodes
edge.simg    An image with many of the useful libraries, at their latest versions

Let the admins know of any missing library that could be put in the image. A build service will be set up later.

To start a shell in the cutting-edge image:

/storage/af/group/gpu/software/gpuservers/singularity/run.sh

or, to start with a given image:

/storage/af/group/gpu/software/gpuservers/singularity/run.sh /storage/af/group/gpu/software/singularity/ibanks/legacy.simg 

Building an image

To build an image, first make sure that there is no existing image that is usable or extendable for your purpose. There are examples of image specifications under the singularity directory to help you create your specification.singularity file. To build the image myimage.simg from the spec:

/storage/af/group/gpu/software/gpuservers/singularity/build myimage.simg specification.singularity

If you make changes to an existing image, please provide your suggestion via a pull request modifying the specification file.

If you are building on top of an existing image, you can use that image as the base, which greatly reduces the build time. See the example below of building on top of the edge image.
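A minimal sketch of such a recipe (the Bootstrap/From header is standard Apptainer/Singularity recipe syntax; the %post contents are a hypothetical example of adding an extra library):

# myimage.singularity - hypothetical recipe extending the edge image
Bootstrap: localimage
From: /storage/af/group/gpu/software/singularity/ibanks/edge.simg

%post
    pip3 install tables    # hypothetical extra package to add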

Sandbox

It might be practical to try a Singularity recipe live instead of iterating through build/test/fail cycles. In this case, create a sandbox, run it in writable mode, and execute the build commands, which you can later copy back into the recipe:

apptainer build --sandbox SANDBOXTOBECREATED/ existing.simg
apptainer shell --writable SANDBOXTOBECREATED/

A Singularity image can be created back from the sandbox:

apptainer build final.simg SANDBOXTOBECREATED/

Tensorflow

TensorFlow is greedy in using GPUs, so it is mandatory to use export CUDA_VISIBLE_DEVICES=n (where n is the index of a device, or a comma-separated list of indices) to use only selected devices, if this is not explicitly controlled within the application. In Python, please use import setGPU, which automatically selects the next available GPU.
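A minimal sketch of both approaches (train.py is a hypothetical script; the one-liner assumes a TensorFlow 2.x image and just prints which devices are visible):

# restrict TensorFlow to GPUs 0 and 2 before starting Python
export CUDA_VISIBLE_DEVICES=0,2
python3 train.py

# or let setGPU pick a free GPU from within Python
python3 -c "import setGPU; import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"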

Jupyter Hub

Setting this up properly on the cluster is a work in progress.

Jupyter Notebook

Users can start a Jupyter notebook server on each machine using either

/storage/af/group/gpu/software/gpuservers/jupyter/start_S.sh

to start a notebook with the latest Singularity image, or

/storage/af/group/gpu/software/gpuservers/jupyter/start_S.sh /storage/af/group/gpu/software/singularity/ibanks/legacy.simg

to start it in a given image.

To start the notebook in screen directly:

screen -S jupyter -d -m /storage/af/group/gpu/software/gpuservers/jupyter/start_S.sh

This will provide a URL to connect to, including an authentication token that changes each time you restart the Jupyter server. You should keep this token private, though you can share it momentarily to let other people edit your notebooks; beware that anyone with the token is "you".

To list the Jupyter notebooks already running on the machine, and the URLs to be used, one can run

/storage/af/group/gpu/software/singularity/run.sh "jupyter notebook list"

The port assigned to you is your user ID; it should be opened automatically. Let an admin know if that is not the case.
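To find your numeric user ID, and hence your port (standard coreutils):

id -u    # prints your numeric user ID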

MPI

MPI is available within nodes and across nodes (as long as you have passwordless public-key SSH between the nodes). To run a program with MPI:

mpirun --prefix /opt/openmpi-3.1.0 -np 3 nvidia-smi

To run a program using singularity with mpi

mpirun --prefix /opt/openmpi-3.1.0 -np 3 singularity exec -B /storage --nv /storage/af/group/gpu/software/singularity/ibanks/edge.simg python3 /storage/af/group/gpu/software/gpuservers/mpi/mpi4py-examples/03-scatter-gather

To run across nodes, first copy /storage/af/group/gpu/software/gpuservers/mpi/mca-params.conf into the $HOME/.openmpi/ directory:
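mkdir -p $HOME/.openmpi
cp /storage/af/group/gpu/software/gpuservers/mpi/mca-params.conf $HOME/.openmpi/

Then launch across nodes with the hostfile, for example: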

mpirun  --prefix /opt/openmpi-3.1.0 --hostfile /storage/af/group/gpu/software/gpuservers/mpi/hostfile -np 10 singularity exec -B /storage --nv /storage/af/group/gpu/software/singularity/ibanks/edge.simg python3 /storage/af/group/gpu/software/gpuservers/mpi/mpi4py-examples/03-scatter-gather

gpuservers's People

Contributors

dkcira, jpata, juztas, p234a137, thongonary, vlimant


gpuservers's Issues

Update Jupyter to secure release

Due to several vulnerabilities, Jupyter inside the edge image must be updated to at least:
Jupyter Notebook 6.4.1 or above (or 5.7.11 or above on the 5.x series).
JupyterLab 3.1.4, 3.0.17, 2.3.2, 2.2.10, or 1.2.21 or above (depending on the series).

Can we have a test machine with sudo access to build singularity?

I need to build a Singularity image, but the only Linux machine I have root/sudo access to is pccitX at CERN, which uses an old SLC version (SLC 6.5) and cannot build the image due to an OS incompatibility (sylabs/singularity#473).

Can we have a test machine at the Caltech cluster on which we have sudo access and can build the image? Otherwise users have to rely on the admins to update the Singularity image, which is very suboptimal.

CMSSW doesn't work with edge singularity image

I was trying to run CMSSW with the edge image, but it gives the following error:

-bash-4.2$ singularity exec --home /storage/user/jbalcas/:/srv --bind /mnt/hadoop --bind /storage --bind /cvmfs --pwd /srv --contain --ipc --pid /storage/group/gpu/software/singularity/ibanks/edge.simg bash
Singularity>
Singularity> source /cvmfs/cms.cern.ch/cmsset_default.sh
Singularity> cd /storage/cc^C
Singularity> cd tmp/
Singularity> ls
Singularity> cmsrel CMSSW_10_2_8
Can't locate Data/Dumper.pm in @INC (@INC contains: /cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V2_2_9_pre11 /cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V2_2_9_pre11/src /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V2_2_9_pre11/src/Cache/CacheUtilities.pm line 3.
BEGIN failed--compilation aborted at /cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V2_2_9_pre11/src/Cache/CacheUtilities.pm line 3.
Compilation failed in require at /cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V2_2_9_pre11/src/SCRAM/CMD.pm line 500.
BEGIN failed--compilation aborted at /cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V2_2_9_pre11/src/SCRAM/CMD.pm line 500.
Compilation failed in require at /cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V2_2_9_pre11/src/SCRAM/SCRAM.pm line 35.
BEGIN failed--compilation aborted at /cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V2_2_9_pre11/src/SCRAM/SCRAM.pm line 35.
Compilation failed in require at /cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V2_2_9_pre11/bin/scram line 19.
BEGIN failed--compilation aborted at /cvmfs/cms.cern.ch/share/lcg/SCRAMV1/V2_2_9_pre11/bin/scram line 19.

Ports to be opened based on UID

It would be nice if the ports assigned to users depended on their UID. In that case you would not need to manually edit a file and add a port for each user.

e.g. I am unable to use the Jupyter notebook, because I am not on that list.
