The head node ibanks.hep.caltech.edu
is running the nfs home and data server.
It has a public (regular network) and a private (10G) IP.
Worker nodes are a follows, in chronological order of creation
titans.hep.caltech.edu
is an MSI desktop with 2TB of local disk, and runs 2 NVidia GeForce GTX Titan Xpassed-pawn-klmx.hep.caltech.edu
is a cocolink server with 200G local disk, and runs 8 NVidia Titan X (Pascal)culture-plate-sm.hep.caltech.edu
is a Supermicro server with 2T of local SSD, and runs 8 NVidia GeForce GTX 1080imperium-sm.hep.caltech.edu
is a Supermicro server with 2T of local SSD, and runs 8 NVidia GeForce GTX 1080flere-imsaho-sm.hep.caltech.edu
is a Supermicro server with 2T of local SSD, and runs 6 NVidia Titan Xp (Pasca)
All server have a public (regular network) and a private (10G) IP. SSH key is the prefered authentication. Please let the admins know if you need help setting this up.
The linux password is not centralized yet, and you have to change it locally to secure access to jupyter notebooks.
If you are a user of the cluster, we are happy to help make progress. When producing publication or public presentation, please be so kind as to warn Prof. Spiropulu and Dr. Vlimant, just for accounting purpose. Please include the following latex acknowledgement for support
Part of this work was conducted at "\textit{iBanks}", the AI GPU cluster at Caltech. We acknowledge NVIDIA, SuperMicro and the Kavli Foundation for their support of "\textit{iBanks}".
The home directory should be used for software and although there is room, please prevent from putting too much data within your home directory.
The /bigdata/
volume is mounted on all nodes. It is a 20TB raid array mounted over nfs. Please use the /bigdata/shared/
directory and contact the admins if in the need for private directory.
The /data/
volume is mounted on some nodes, not all on SSD. This is the prefered temporary location for data needed for intensive I/O.
The
It is important to note that I/O on the nfs mounted volume is not as efficient as with local disk, so please use care and monitor performance of your applications.
For ipython, the following directory better be local
mkdir /tmp/$USER/ipython -p ln -s /tmp/$USER/ipython .ipython
For cuda, the same applies to
mkdir -p /tmp/$USER/cuda/ export CUDA_CACHE_PATH=/tmp/$USER/cuda/
It is recommended to have export CUDA_CACHE_PATH
in your login file.
To use only a selected GPU, run nvidia-smi
or gpustat
to see GPU utilization, then set export CUDA_VISIBLE_DEVICES=n
to a the index of the GPU you want to use.
In python one can either set the environment variable or use import setGPU
(gets one device automatically).
The compilation directory might be better off being a local directory, like in /tmp
mkdir -p /tmp/$USER/theano_compile rm -r ~/.theano ln -s /tmp/$USER/theano_compile ~/.theano
The theano rc file should look like
[nvcc] [global] device=cuda
One can add base_compiledir=/tmp/$USER/theano_compile/
directly in the global section if need be.
Tensorflow is greedy in using GPUs and it is mandatory to use export CUDA_VISIBLE_DEVICES=n
(where n is the index of a device, or coma separated index) to use only a selected device, if not explicitly controlled within the application.
We are building some docker images that are posted on dockerhub: https://hub.docker.com/u/caltechcms/
The users can start a jupyter notebook server on each machine using the command
source /bigdata/shared/Software/jupyter/restart.sh
This will provide back a url to which you can connect.
The password is the linux password.
If you are connecting using ssh key, as recommended, please contact an admin to get a password.
The port that is assigned to you is defined in /bigdata/shared/Software/jupyter/ports
if you are not in there, please contact an admin.
mpi is installead on the cluster and will allow to run jobs across the cluster. For this, the main requirement is to have passwd-less access on all machines. Setup your ssh key for authentication to the host
mpi-ibanks mpi-culture-plate-sm mpi-imperium-sm mpi-passed-pawn-klmx mpi-titans mpi-flere-imsaho-sm
which run over the private IP.
Create the .openmpi directory and link the file /bigdata/shared/Software/mpi/mca-params.conf
in it.
Run an mpi test with the command
mpirun --hostfile /bigdata/shared/Software/mpi/hostfile -n 18 /bigdata/shared/Software/mpi/mpi4py-examples/03-scatter-gather mpirun --hostfile /bigdata/shared/Software/mpi/hostfile -n 18 /bigdata/shared/Software/mpi/mpi4py-examples/08-matrix-matrix-product mpirun --hostfile /bigdata/shared/Software/mpi/hostfile -n 10 /bigdata/shared/Software/mpi/keras_mnist.py
which should both run. Contact the admin if it does not (the debugging options are --mca odls_base_verbose 100 --mca btl_base_verbose 100
).
The default number of slots per node is the number of GPU (as this is the primary usage). One can override this limitation by running
mpirun --map-by node --hostfile /bigdata/shared/Software/mpi/hostfile -n 100 /bigdata/shared/Software/mpi/mpi4py-examples/08-matrix-matrix-product