
AdapCC: Making Collective Communication in Distributed Machine Learning Adaptive

AdapCC is a communication library that dynamically adapts to resource heterogeneity and network variability to optimize training performance. Its main features are:

  • Detection: adapts to varying resource allocations by inferring the physical configuration within each server.

  • Profiling: adapts to dynamic network changes by coordinating workers to profile on the fly.

  • Relay Control: adapts to computation stragglers by allowing an arbitrary subset of workers to perform a collective; non-active GPUs serve as relays for data transfers.

  • Fault Tolerance: communication continues without hanging on straggler or faulty workers.

Prerequisites

Software Dependencies

PyTorch

  • Python>=3.8
  • PyTorch==1.13.0
  • CUDA>=10.2
  • GCC 9.4

Download and compile the following libraries if you have not installed them:

UCX==1.13.0

wget https://github.com/openucx/ucx/releases/download/v1.13.0-rc1/ucx-1.13.0.tar.gz
tar xzf ucx-1.13.0.tar.gz
cd ucx-1.13.0
mkdir build && cd build
../contrib/configure-release --prefix=[BUILD_PATH] --with-cuda=[CUDA_PATH]
make -j8
make install

OpenMPI==4.1.1

wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.gz
gunzip -c openmpi-4.1.1.tar.gz | tar xf -
cd openmpi-4.1.1
./configure --prefix=[BUILD_PATH] --with-cuda=[CUDA_PATH] --with-ucx=[UCX_PATH] --enable-mca-no-build=btl-uct
make all install

Add the built bin, lib, and man directories to the PATH, LD_LIBRARY_PATH, and MANPATH environment variables, respectively (e.g., export PATH=[BUILD_PATH]/bin:$PATH).

System Hardware

Our testbed environment includes:

  • Ubuntu 20.04 LTS
  • NVIDIA A100 SXM4 40G GPUs with NVLink
  • NVIDIA V100 SXM2 32G GPUs with NVLink
  • Mellanox NIC 100Gbps / 50Gbps
  • EPYC-7H12 CPU, PCIe 4.0 / Intel 6230 CPU, PCIe 3.0

Install

Download the repo:

git clone https://github.com/JoeyYoung/adapcc.git

cd adapcc and, in the Makefile, set OMPI_LIB_DIR and OMPI_INC_DIR to your own OpenMPI library and include paths.

Run make. The build outputs a shared library, communicator.so.

Usage

Start with Launcher

Processes are managed by MPI. Use the script launch_script.sh to launch.

The arguments include:

  • num_process:   the number of worker processes to start.
  • ips:   host IPs of the workers, following the host format of mpirun.
  • master:   the IP of world rank 0, i.e., the master worker.
  • mpi_path:   path to the mpirun executable.
  • net_device:   NIC interface(s) used for network communication.
  • exec_file:   the entry script to execute. You are required to provide this main file.
  • socket_port:   ports used for inter-process communication.
  • entry_point:   set this value to enable the detection or profiling modules (if needed).
  • logical_graph:   dump path of the logical graph. You can supply your own server configuration as the graph input, on which profiling is performed. See examples in './topology'.
  • strategy_file:   dump path of the communication topology strategies, in XML format. You can supply your own strategy as input, which the data transfers then follow. See examples in './strategy'.
  • parallel_degree:   the number of concurrent parallel transmissions within one communication context.
  • profile_freq:   the frequency of profiling and graph construction, if enabled.

e.g., to train main.py on 4 compute nodes, each with 4 GPUs:

python launcher.py \
    --num-process 16 \
    --ips node1:4,node2:4,node3:4,node4:4 \
    --master node1 \
    --mpi-path ~/openmpi/bin/mpirun \
    --net-device mlx5_0 \
    --exec-file main.py \
    --socket_port 5000 \
    --entry_point 6 \
    --logical_graph ./topology/logical_graph_4n.xml \
    --strategy_file ./strategy/strategy_4n.xml \
    --parallel_degree 4 \
    --profile_freq 500  

We assume the dependencies are configured on each node.
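The num_process value must equal the total number of GPU slots listed in ips. A minimal sketch of that relationship, with a hypothetical parse_ips helper (not part of AdapCC, written only to illustrate the host format used above):

```python
# Hypothetical helper illustrating the ips argument format, e.g. "node1:4,node2:4".
# Not part of AdapCC; only demonstrates the host:slots convention used by mpirun.
def parse_ips(ips: str):
    """Split a host spec like "node1:4,node2:4" into (host, slots) pairs."""
    hosts = []
    for entry in ips.split(","):
        host, slots = entry.split(":")
        hosts.append((host, int(slots)))
    return hosts

def total_processes(ips: str) -> int:
    """num_process should equal the sum of slots across all hosts."""
    return sum(slots for _, slots in parse_ips(ips))

print(total_processes("node1:4,node2:4,node3:4,node4:4"))  # 16
```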

Primitive Example

The steps for running a communication operator (refer to adapcc.py):

  1. Import the library: import adapcc.
  2. Initialize: AdapCC.init(args, LOCAL_RANK, WORLD_RANK, WORLD_SIZE). This generates a communication strategy based on detection and profiling; a specific strategy can also be supplied by the user, as illustrated in the launcher.
  3. Set up the transmission context: AdapCC.setup(ALLREDUCE). This creates communication resources and work queues for processing operators.
  4. Call primitives: AdapCC.communicator.all_reduce(tensor, size, chunk_bytes).
  5. When finished, reclaim resources: AdapCC.clear(ALLREDUCE).
import adapcc
AdapCC.init(args, LOCAL_RANK, WORLD_RANK, WORLD_SIZE)
AdapCC.setup(ALLREDUCE)
...
AdapCC.communicator.all_reduce(tensor, size)
...
AdapCC.clear(ALLREDUCE)

You will obtain an output similar to this.
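The chunk_bytes argument controls how a tensor is split into pipeline chunks. A rough sketch of the chunking arithmetic (illustrative only, not AdapCC's implementation; assumes 4-byte float32 elements):

```python
# Illustrative chunking arithmetic for all_reduce(tensor, size, chunk_bytes).
# Not AdapCC's implementation; assumes 4-byte (float32) elements.
import math

def num_chunks(size_elems: int, chunk_bytes: int, elem_bytes: int = 4) -> int:
    """Number of pipeline chunks a tensor of size_elems elements splits into."""
    total_bytes = size_elems * elem_bytes
    return math.ceil(total_bytes / chunk_bytes)

print(num_chunks(1 << 20, 262144))  # a 4 MB tensor with 256 KB chunks -> 16
```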

Note

  1. The first operation is always slow due to cache warm-up.
  2. The adaptive relay functionality is supported only in a training context.

Training Example

train_ddp.py provides a training template; use it as a starting point for your own models.

Compared with the previous example, we additionally register the hook AdapCC.communicator.cuda_allreduce_hook and call AdapCC.reconstruct_topology at a defined frequency to rebuild the communication topology.

Relay control is enabled by default and the output should be similar to this.
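The reconstruction cadence can be sketched as follows; the AdapCC calls are placeholders (see train_ddp.py for the actual usage), and only the step-counting logic is real:

```python
# Sketch: drive periodic topology reconstruction from the training step count.
# AdapCC calls are placeholders; only the scheduling arithmetic is shown.
profile_freq = 500  # matches the --profile_freq launcher argument

def reconstruct_points(total_steps: int, freq: int):
    """Steps at which AdapCC.reconstruct_topology would be invoked."""
    return [step for step in range(1, total_steps + 1) if step % freq == 0]

for step in range(1, 1001):
    # loss.backward()  # gradients reduced via cuda_allreduce_hook
    if step % profile_freq == 0:
        pass  # AdapCC.reconstruct_topology(...) would run here

print(reconstruct_points(1000, profile_freq))  # [500, 1000]
```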

Contact

Raise issues or email [email protected] for any questions.

License

© Contributors. Licensed under the Apache-2.0 License.
