GithubHelp home page GithubHelp logo

pixiesunky / cpu-free-model Goto Github PK

View Code? Open in Web Editor NEW

This project forked from parcorelab/cpu-free-model

0.0 0.0 0.0 18.3 MB

https://dl.acm.org/doi/10.1145/3577193.3593713

Shell 3.67% C++ 5.40% Python 3.58% C 0.98% Cuda 85.96% Makefile 0.40%

cpu-free-model's Introduction

Multi-GPU Communication Schemes for Iterative Solvers: When CPUs are Not in Charge


This paper proposes a fully autonomous execution model for multi-GPU applications that completely excludes the involvement of the CPU beyond the initial kernel launch. In a typical multi-GPU application, the host serves as the orchestrator of the execution by directly launching kernels, issuing communication calls, and acting as a synchronizer for devices. We argue that this orchestration, or control flow path, causes undue overhead, and can be delegated completely to devices to improve performance in applications that require communication among peers. For the proposed CPU-free execution model, we leverage existing techniques such as persistent kernels, thread block specialization, device-side barriers and device-initiated communication routines to write fully autonomous multi-GPU code and achieve significantly reduced communication overheads. We demonstrate our proposed model on two broadly used iterative solvers, 2D/3D Jacobian stencil and Conjugate Gradient(CG). Compared to the CPU-controlled baselines, CPU-free model can improve 3D stencil communication overhead by 58.8% and provide 1.63x speedup for CG on 8 Nvidia A100 GPUs.

Communication Overhead Overlap Percentage
drawing drawing

Dependencies

  • C++ Compiler with C++-17 support
  • CUDA 11.6 or above
  • NVSHMEM 2.7.0 or above
    • GDRCopy 2.0 or above*
    • InfiniBand GPUDirect Async support*
  • OpenMPI 4.1.4 or above

*multi-node support

Setup

  1. Sample spack installation
$ spack install nvshmem +gpu_initiated_suppot +cuda +gdrcopy +ucx +mpi ^openmpi +cuda fabrics=ucx ^ucx +cuda +gdrcopy +dm +thread_multiple
  1. Sample environment variables for Makefile
# spack load nvshmem

export NVSHMEM_HOME=$SPACK_HOME/.../nvshmem
export MPI_HOME=$SPACK_HOME/.../openmpi
export CUDA_HOME=$SPACK_ROOT/.../cuda
export UCX_HOME=$SPACK_ROOT/.../ucx
export LD_LIBRARY_PATH="$NVSHMEM_HOME/lib:$UCX_HOME/lib:$CUDA_HOME/lib:$MPI_HOME/lib:$LD_LIBRARY_PATH"

Compilation

We provide Direct P2P and NVSHMEM versions of Conjugate Gradient and 2D/3D Jacobian Stencil. Below are the provided targets:

  • make all: Compile all available versions
    • make cg: Conjugate Gradient NVSHMEM implementations
    • make stencil: Compiles all Stencil implementations
      • make jacobi2D: Jacobi 2D Stencil P2P version
      • make jacobi3D: Jacobi 3D Stencil P2P version
      • make jacobi2D_nvshmem: Jacobi 2D Stencil NVSHMEM version*
      • make jacobi3D_nvshmem: Jacobi 3D Stencil NVSHMEM version*
  • make format: Autoformat with .clang-format

Binaries are placed in bin/

Versions marked with * have multi-node support.


Versions

Each binary produced is a composite of multiple versions selected at runtime such as different CPU-Free implementations and Baselines. For example:

$ ./bin/cg
Versions (select with -v):
0:	Baseline Discrete Standard
1:	Baseline Discrete Pipelined
2:	Single Stream Standard
3:	Single Stream Pipelined
4:	Single Stream Pipelined Multi Overlap
5:	Single Stream Standard Saxpy Overlap
6:	Profiling Discrete Standard
7:	Profiling Discrete Pipelined
8:	Single GPU Discrete Standard

Running Baseline Discrete Standard

Execution time:  24.8550 s

Select the desired version with the -v <version num> flag. CG versions additionally support multiple versions delimited by ,, for example, ./bin/cg -v 0,1,2.

Stencil versions:

  • Baseline Copy/Overlap/P2P/NVSHMEM: Baseline versions taken from NVIDIA multi-gpu-programming-models.
  • Design 1: CPU-Free P2P version with two streams.
  • Design 2: CPU-Free P2P version with one stream.
  • Design X NVSHMEM: (Available in NVSHMEM binaries) CPU-Free version with NVSHMEM communication.
  • Design X Partitioned: CPU-Free version with better thread block partitioning.
  • PERKS: Optimized caching implementations adapted from PERKS.

The best performing versions reported in the paper are Design 1 NVSHMEM and its Partitioned counterpart, and PERKS.

CG Versions:

  • Discrete: ~
  • Pipelined: ~
  • Profiling: ~

Flags

  • -s: Suppress version information output. Prints execution time only.
  • -v <version num>: Version selection. Default: 0.
  • -v <version num>[,version num...] (CG only): Version selection
  • -nx <dim>: X dimension size. Default: 16384 for 2D, 512 for 3D.
  • -ny <dim>: Y dimension size.
  • -nz <dim>: Z dimension size (if 3D).
  • -niter <num>: Number of iterations. Default: 1000.
  • -compare: Accuracy comparison with 1 GPU.

Example:

$ ./bin/jacobi2D -s -v 3 -nx 1024 -ny 1024 -niter 10000

NVSHMEM Launch

Launching CG and Stencil NVSHMEM versions requires mpirun. Example:

$ mpirun -np 8 ./bin/jacobi3D_nvshmem -s -v 1 -nx 512 -ny 512 -nz 512 -niter 100

Multi-Node Launch (NVHSMEM only)

No special parameters need to be set for multi-node launch.


Results

2d5p Jacobi Stencil

3d7p Jacobi Stencil

Conjugate Gradient


Citing

@inproceedings{CPU-Free-Model,
  author = {Ismayilov, Ismayil and Baydamirli, Javid and Sa\u{g}bili, Do\u{g}an and Wahib, Mohamed and Unat, Didem},
  title = {Multi-GPU Communication Schemes for Iterative Solvers: When CPUs Are Not in Charge},
  year = {2023},
  url = {https://doi.org/10.1145/3577193.3593713},
  doi = {10.1145/3577193.3593713},
  booktitle = {Proceedings of the 37th International Conference on Supercomputing},
  series = {ICS '23}
}

Acknowledgement

CPU-Free Stencil versions adapted from NVIDIA multi-gpu-programming-models repo.

PERKS code adapted from neozhang307/PERKS.

cpu-free-model's People

Contributors

dogansagbili avatar kylosus avatar readleyj avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.