README - GPU Job Submission on CSIL

Overview

This document provides instructions for compiling and running GPU jobs from the Computer Science Instructional Laboratory (CSIL) machines using the Slurm Workload Manager, and for accessing the data those jobs generate.

Requirements

CSIL account with appropriate permissions
Code compatible with the available GPU architecture

Preparing the Environment

Before submitting jobs, load your Conda environment and verify the availability of CUDA with the following steps:

Load Conda Environment:

source /opt/conda/etc/profile.d/conda.sh
conda activate <your_environment>

Test CUDA with PyTorch:

python -c "import torch; print(torch.cuda.is_available())"

Check NVCC and Python Versions (optional):

nvcc --version
which python

You can also check here that any other dependencies are installed at the expected versions.

Slurm Script Example

Below is an example .slurm script for a GPU job:

#!/bin/bash -il
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --partition=a100
#SBATCH --gpus-per-node=1
#SBATCH --mem=30G
#SBATCH --time=10
#SBATCH --export=ALL

# Activate the Conda environment
source /opt/conda/etc/profile.d/conda.sh
conda activate simenv

# Display the environment and test CUDA availability
which conda
conda env list
python -c "import torch; print(torch.cuda.is_available())" || echo "PyTorch cannot access CUDA."

# Display versions and paths
nvcc --version
which python
echo $CUDA_HOME
echo $LD_LIBRARY_PATH

# Install dependencies here
# Example: pip install torch==2.2.0 transformers==4.37.2

# Replace with your own command to execute the code
# Example: python train_model.py

# Output the NVIDIA-smi to a logfile
time nvidia-smi >& logfile

Job Submission

To submit your job to the GPU queue, run sbatch from your current working directory:

sbatch <your_script>.slurm
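
If you want to reuse the job ID in later commands, one option is to capture it from the confirmation line that sbatch normally prints ("Submitted batch job <id>"). The following is a minimal sketch, not part of the original instructions: extract_job_id is a hypothetical helper, and my_job.slurm is a placeholder script name.

```shell
#!/bin/bash
# Hypothetical helper: pull the job ID out of sbatch's usual
# "Submitted batch job <id>" confirmation message.
extract_job_id() {
    # The job ID is the last whitespace-separated field of the message.
    printf '%s\n' "$1" | awk '{print $NF}'
}

# Submit only where sbatch exists and the placeholder script is present.
if command -v sbatch >/dev/null 2>&1 && [ -f my_job.slurm ]; then
    msg=$(sbatch my_job.slurm)
    job_id=$(extract_job_id "$msg")
    echo "Submitted job ${job_id}"
fi
```

Alternatively, sbatch --parsable prints only the job ID (followed by a semicolon and the cluster name on multi-cluster setups), which avoids parsing the message.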

Job Status Control

You can check the status of your jobs with squeue. Run without arguments, squeue shows every job on the system; pass -u to see only your own. It can be run from any directory:

squeue -u <your_username>

You can look at details with:

scontrol show job JOBID

To cancel a job, use scancel (the -i flag asks for confirmation before cancelling):

scancel -i JOBID
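
The status commands above can also be combined into a small polling loop that waits until a job has left the queue. This is a sketch under assumptions: job_in_queue and wait_for_job are hypothetical helper names, and the 30-second default interval is an arbitrary choice.

```shell
#!/bin/bash
# Hypothetical helper: succeeds while squeue still lists the given job.
job_in_queue() {
    # -h suppresses the header line, -j restricts the listing to one job ID.
    # An empty result (or an error for an unknown ID) means the job is gone.
    [ -n "$(squeue -h -j "$1" 2>/dev/null)" ]
}

# Hypothetical helper: block until the job is no longer pending or running.
wait_for_job() {
    job_id=$1
    interval=${2:-30}   # seconds between checks
    while job_in_queue "$job_id"; do
        sleep "$interval"
    done
    echo "Job ${job_id} is no longer in the queue."
}
```

For example, wait_for_job JOBID 60 checks once a minute. Leaving the queue only means the job is no longer pending or running; it does not tell you whether it succeeded, so check the output file afterwards.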

Accessing Data

Once the job completes, its output file (slurm-xxx.out, where xxx is the job ID) and the logfile will be available in the directory from which the job was submitted. Ensure you have appropriate read/write permissions for that directory.

For more detailed outputs or specific file paths, customize the script to direct the outputs to your desired location.
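
As one way to do this, Slurm's --output and --error directives accept filename patterns. The fragment below is an illustrative sketch to add to the #SBATCH header of your script; the job name gpu_test and the logs/ directory are assumptions, not part of the original setup.

```shell
#SBATCH --job-name=gpu_test        # hypothetical job name; %x below expands to it
#SBATCH --output=logs/%x-%j.out    # standard output; %j expands to the job ID
#SBATCH --error=logs/%x-%j.err     # standard error, kept in a separate file
```

Relative paths are resolved against the directory the job was submitted from. Create the logs/ directory before submitting, since Slurm is not expected to create it for you.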

Replace placeholders like <your_environment> and <your_script> with your actual environment name and script filename. Adjust the paths and commands to fit your specific setup and requirements.

For more explanation, see also the How to use Slurm guide.
