
AdapCC: Making Collective Communication in Distributed Machine Learning Adaptive

AdapCC is a communication library that dynamically adapts to resource heterogeneity and network variability to optimize training performance. Its main features are:

  • Detection: adapts to varying resource allocations by inferring the physical configuration within each server.

  • Profiling: adapts to dynamic network changes by coordinating workers to profile on the fly.

  • Relay Control: adapts to computation stragglers by allowing an arbitrary subset of workers to perform a collective; non-active GPUs serve as relays for data transfers.

  • Fault Tolerance: communication continues without hanging on straggler or faulty workers.

Prerequisites

Software Dependencies

PyTorch

  • Python>=3.8
  • PyTorch==1.13.0
  • CUDA>=10.2
  • GCC 9.4

Download and compile the following libraries if you have not installed them:

UCX==1.13.0

wget https://github.com/openucx/ucx/releases/download/v1.13.0-rc1/ucx-1.13.0.tar.gz
tar xzf ucx-1.13.0.tar.gz
cd ucx-1.13.0
mkdir build && cd build
../contrib/configure-release --prefix=[BUILD_PATH] --with-cuda=[CUDA_PATH]
make -j8
make install

OpenMPI==4.1.1

wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.gz
gunzip -c openmpi-4.1.1.tar.gz | tar xf -
cd openmpi-4.1.1
./configure --prefix=[BUILD_PATH] --with-cuda=[CUDA_PATH] --with-ucx=[UCX_PATH] --enable-mca-no-build=btl-uct
make all install

Add the built bin, lib, and man directories to the PATH, LD_LIBRARY_PATH, and MANPATH environment variables, respectively (e.g., export PATH=[BUILD_PATH]/bin:$PATH).

System Hardware

Our testbed environment includes:

  • Ubuntu 20.04 LTS
  • NVIDIA A100 SXM4 40G GPUs with NVLink
  • NVIDIA V100 SXM2 32G GPUs with NVLink
  • Mellanox NIC 100Gbps / 50Gbps
  • EPYC-7H12 CPU, PCIe 4.0 / Intel 6230 CPU, PCIe 3.0

Install

Download the repo:

git clone https://github.com/JoeyYoung/adapcc.git

cd adapcc and, in the Makefile, set OMPI_LIB_DIR and OMPI_INC_DIR to your own OpenMPI library and include paths.

Run make. The build outputs a shared library, communicator.so.

Usage

Start with Launcher

Processes are managed by MPI. Use the script launch_script.sh to launch.

The arguments include:

  • num_process:   the number of worker processes to start.
  • ips:   host IPs of the workers, following the host format of mpirun.
  • master:   the IP of world rank 0, i.e., the master worker.
  • mpi_path:   path to the mpirun executable.
  • net_device:   NIC interface(s) used for network communication.
  • exec_file:   the entry script to execute. You are required to provide this main file.
  • socket_port:   ports used for inter-process communication.
  • entry_point:   set this value to enable the detection or profiling modules (if needed).
  • logical_graph:   dump path of the logical graph. You can supply your own server configuration as the graph input, on which profiling is performed. See examples in './topology'.
  • strategy_file:   dump path of the communication topology strategies, in XML format. You can supply your own strategy as input, which the data transfers then follow. See examples in './strategy'.
  • parallel_degree:   the number of concurrent parallel transmissions within one communication context.
  • profile_freq:   the frequency of profiling and graph construction, if enabled.

e.g., to train main.py on 4 compute nodes, each with 4 GPUs:

python launcher.py \
    --num-process 16 \
    --ips node1:4,node2:4,node3:4,node4:4 \
    --master node1 \
    --mpi-path ~/openmpi/bin/mpirun \
    --net-device mlx5_0 \
    --exec-file main.py \
    --socket_port 5000 \
    --entry_point 6 \
    --logical_graph ./topology/logical_graph_4n.xml \
    --strategy_file ./strategy/strategy_4n.xml \
    --parallel_degree 4 \
    --profile_freq 500  

We assume the dependencies are configured on each node.
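The num_process value must equal the total number of GPU slots listed in ips. A minimal sketch of that relationship, with a hypothetical parse_ips helper (not part of AdapCC, written only to illustrate the host format used above):

```python
# Hypothetical helper illustrating the ips argument format, e.g. "node1:4,node2:4".
# Not part of AdapCC; only demonstrates the host:slots convention used by mpirun.
def parse_ips(ips: str):
    """Split a host spec like "node1:4,node2:4" into (host, slots) pairs."""
    hosts = []
    for entry in ips.split(","):
        host, slots = entry.split(":")
        hosts.append((host, int(slots)))
    return hosts

def total_processes(ips: str) -> int:
    """num_process should equal the sum of slots across all hosts."""
    return sum(slots for _, slots in parse_ips(ips))

print(total_processes("node1:4,node2:4,node3:4,node4:4"))  # 16
```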

Primitive Example

The steps for running a communication operator (refer to adapcc.py):

  1. Import the library: import adapcc.
  2. Initialize: AdapCC.init(args, LOCAL_RANK, WORLD_RANK, WORLD_SIZE). This generates a communication strategy based on detection and profiling; a specific strategy can also be supplied by the user, as illustrated in the launcher.
  3. Set up the transmission context: AdapCC.setup(ALLREDUCE). This creates communication resources and work queues for processing operators.
  4. Call primitives: AdapCC.communicator.all_reduce(tensor, size, chunk_bytes).
  5. When finished, reclaim resources: AdapCC.clear(ALLREDUCE).
import adapcc
AdapCC.init(args, LOCAL_RANK, WORLD_RANK, WORLD_SIZE)
AdapCC.setup(ALLREDUCE)
...
AdapCC.communicator.all_reduce(tensor, size)
...
AdapCC.clear(ALLREDUCE)

You will obtain an output similar to this.
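The chunk_bytes argument controls how a tensor is split into pipeline chunks. A rough sketch of the chunking arithmetic (illustrative only, not AdapCC's implementation; assumes 4-byte float32 elements):

```python
# Illustrative chunking arithmetic for all_reduce(tensor, size, chunk_bytes).
# Not AdapCC's implementation; assumes 4-byte (float32) elements.
import math

def num_chunks(size_elems: int, chunk_bytes: int, elem_bytes: int = 4) -> int:
    """Number of pipeline chunks a tensor of size_elems elements splits into."""
    total_bytes = size_elems * elem_bytes
    return math.ceil(total_bytes / chunk_bytes)

print(num_chunks(1 << 20, 262144))  # a 4 MB tensor with 256 KB chunks -> 16
```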

Note

  1. The first operation is always slow due to cache warm-up.
  2. The adaptive relay functionality is supported only in a training context.

Training Example

train_ddp.py provides a training template; use it as a starting point for your own models.

Compared with the previous example, we additionally register the hook AdapCC.communicator.cuda_allreduce_hook and call AdapCC.reconstruct_topology at a defined frequency to rebuild the communication topology.

Relay control is enabled by default and the output should be similar to this.
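The reconstruction cadence can be sketched as follows; the AdapCC calls are placeholders (see train_ddp.py for the actual usage), and only the step-counting logic is real:

```python
# Sketch: drive periodic topology reconstruction from the training step count.
# AdapCC calls are placeholders; only the scheduling arithmetic is shown.
profile_freq = 500  # matches the --profile_freq launcher argument

def reconstruct_points(total_steps: int, freq: int):
    """Steps at which AdapCC.reconstruct_topology would be invoked."""
    return [step for step in range(1, total_steps + 1) if step % freq == 0]

for step in range(1, 1001):
    # loss.backward()  # gradients reduced via cuda_allreduce_hook
    if step % profile_freq == 0:
        pass  # AdapCC.reconstruct_topology(...) would run here

print(reconstruct_points(1000, profile_freq))  # [500, 1000]
```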

Contact

Raise issues or email [email protected] for any questions.

License

© Contributors. Licensed under the Apache-2.0 License.
