
STRONGHOLD: Fast and Affordable Billion-scale Deep Learning Model Training

Deep neural networks (DNNs) with billion-scale parameters have demonstrated impressive performance in solving many tasks. Unfortunately, training a billion-scale DNN requires high-performance GPU servers that are too expensive to purchase and maintain. Existing solutions for enabling larger DNN training with limited resources are inadequate because they suffer from high training time overhead.

We present StrongHold, a better approach for enabling large DNN model training by dynamically offloading data to the CPU RAM and secondary storage (e.g., an SSD drive). It maintains a working window to carefully overlap GPU computation with CPU-GPU data movement and exploits the multi-core CPU for optimizer updates. Compared to the state-of-the-art offloading-based solutions, StrongHold improves the trainable model size by 1.9x∼6.5x on a 32GB V100 GPU, with a 1.2x∼3.7x improvement in training throughput.

Our experiment setup

Hardware

An NVIDIA Tesla V100 GPU server with 2x 24-core Intel Xeon Platinum 8163 CPUs at 2.50GHz and 755GB of DDR4 RAM. The NVIDIA GPU (arch=sm_70) has 32GB of device memory. It is possible to evaluate our approach on a different NVIDIA GPU platform, but the results may deviate from those reported in the paper.

Note that for the Artifact Evaluation, we have rented a VM with one V100 for reviewers.

Operating System

Ubuntu 20.04 with Linux kernel 4.19.91

Software

CUDA 11.6, PyTorch 1.10.2

How to reproduce the experiments in the submitted paper?

Choice-1: An interactive jupyter notebook.

To make the reviewing process more straightforward, we provide an interactive jupyter notebook running on a preconfigured server with a 32GB V100 GPU. The runtime environment has been set up, and all the scripts, together with their descriptions, have been prepared so that reviewers can reproduce the main results of the paper.
jupyter link: http://47.111.26.83:8888/notebooks/sc22ae.ipynb
password: see Stage 1: Artifact Description in the SC submission system

Choice-2: The docker image.

Reviewers can also deploy the runtime environment via this docker image if they have GPU resources available.
Please note that the script arguments may need to be changed and torch reinstalled if the CUDA driver version is lower than 11.4; this is also highlighted in the content below.
The following content describes how to use this docker image to reproduce the experiments.


StrongHold: Artifact Evaluation

This docker image helps reviewers reproduce the experimental results in the submitted paper for the AE committee. It already includes all the necessary datasets and baselines, so most figures in the paper can be reproduced by following the instructions.

We provide a general description of how to launch a docker container using this image and how to run corresponding scripts to evaluate the system performance.

Prerequisites

Before getting started, please make sure the Docker Engine is installed. We recommend installing the Docker Community Edition (CE).

Alternatively, a Linux user can install Docker by running the following commands in a terminal:

sudo apt-get update
sudo apt-get install -y docker.io

Running Docker commands requires root privileges. A user without root privileges can run Docker after being added to the docker group (log out and back in for the change to take effect):

sudo usermod -a -G docker $USER
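
To confirm that Docker now works without sudo, a quick smoke test (our suggestion, not part of the original setup) is to run the hello-world container after re-logging in:

docker run --rm hello-world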

Run the following commands to enable GPU support in the Docker environment with the NVIDIA Container Toolkit on the host machine. Details are at https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker.

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
  && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Then, install the nvidia-docker2 package after updating the package listing:

sudo apt-get update 
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

Next, test whether the GPU runtime environment is configured successfully. A working setup shows the GPU information when running a base CUDA container:

sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

If the above steps complete without errors, Docker and the host GPU are ready.

Step 1. Runtime Environment

1.1 Pull this docker image from the official Docker Hub website.

docker pull strongh/sc22-ae:latest

1.2 Create a new docker container based on this image.

docker run -it -P -w /home/sys/STRONGHOLD --name=aetesting --network=host --gpus=all --ipc=host strongh/sc22-ae:latest /bin/bash
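
If you exit the container or it is stopped later (e.g., after a host reboot), you can re-enter it with the standard Docker commands below, using the container name aetesting chosen above:

docker start aetesting
docker exec -it aetesting /bin/bash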

1.3 Check the runtime environment.

At this point, the docker container should have launched successfully, and the current terminal should be in the /home/sys/STRONGHOLD folder with the (py3.9.10) virtual Python environment active, shown as

(py3.9.10) root@??:/home/sys/STRONGHOLD#

If you are not in the STRONGHOLD folder, please use the cd /home/sys/STRONGHOLD command to change the current location.

If the (py3.9.10) virtual Python environment is not active, please use pyenv activate py3.9.10 to enter it.
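
For convenience, the two commands above can be combined:

cd /home/sys/STRONGHOLD && pyenv activate py3.9.10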

Note: if the host server's CUDA driver version is lower than 11.4, or errors about the torch version occur, please reinstall the torch, torchvision and apex libraries using the following commands.

pip uninstall -y torch torchvision apex

# Rebuild torch, torchvision and apex from source against the local CUDA driver.
STAGE_DIR=/home/sys
cd ${STAGE_DIR}/pytorch && \
    pip install -r requirements.txt && \
    python setup.py clean && \
    python setup.py develop --cmake && \
    python setup.py install && \
cd ${STAGE_DIR}/vision && \
    python setup.py clean && \
    python setup.py install && \
cd ${STAGE_DIR}/apex && \
    pip install -r ./requirements_dev.txt && \
    pip install -v --no-cache-dir --global-option='--cpp_ext' --global-option='--cuda_ext' . && \
cd ${STAGE_DIR}/STRONGHOLD
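
After the rebuild, a quick sanity check (our suggestion, not part of the original workflow) confirms that the reinstalled torch can see the GPU:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"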

1.4 Experiment workflow

The evaluation script is located in the examples folder and should be executed from the parent directory. It provides a number of configurable arguments for changing the DNN setup. The script itself documents these parameters, which are summarized as follows:

  • number of layers, (default=16)
  • hidden size, (default=2048)
  • number of heads, (default=16)
  • sequence length, (default=1024)
  • batch size, (default=4)
  • window size, (default=4) (only for StrongHold)

The script can be executed by:

./examples/run.sh -m [METHOD] -l [NUM_LAYERS] -h [HIDDEN_SIZE] -b [BATCH_SIZE] -w [WINDOW_SIZE]

where the [METHOD] argument takes one of the following values: megatron-lm, l2l, zero-offload, zero-infinity, stronghold and all (use all to evaluate all approaches automatically); [NUM_LAYERS], [HIDDEN_SIZE], [BATCH_SIZE] and [WINDOW_SIZE] take integers, indicating the number of transformer layers, the hidden size of each layer, the training batch size and the offloading window size, respectively.
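
For example, the following invocation evaluates STRONGHOLD with the default configuration (16 layers, hidden size 2048, batch size 4, window size 4):

./examples/run.sh -m "stronghold" -l 16 -h 2048 -b 4 -w 4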

Results:

Evaluation results are saved into the results folder, formatted as log_[METHOD]_l-[NUM_LAYERS]_h-[HIDDEN_SIZE]_bs-[BATCH_SIZE]_ws-[WINDOW_SIZE]_[TIMESTAMP].txt. The scripts ./examples/case*_extract.sh and ./examples/case*_draw.py produce auto-generated diagrams, where * is the test case id.

Logs:

Log files are stored in /home/sys/STRONGHOLD/results with the format log_[METHOD]_l-[NUM_LAYERS]_h-[HIDDEN_SIZE]_bs-[BATCH_SIZE]_ws-[WINDOW_SIZE]_[TIMESTAMP].txt. At the end of each execution, we print the core content of the log files via grep and awk for you.
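
To inspect the raw logs directly, list the results folder and open the newest file; the exact [TIMESTAMP] suffix will vary, so a wildcard helps (the file name below assumes the default stronghold configuration):

ls -lt /home/sys/STRONGHOLD/results/
less /home/sys/STRONGHOLD/results/log_stronghold_l-16_h-2048_bs-4_ws-4_*.txt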

The next step will give more details to help you reproduce the major results in our paper. Please follow the guidelines below to evaluate each case.

Step 2. Evaluation.

We recommend running the following cases one by one, so that existing log files produced by earlier cases can be reused; this reduces the total execution time to around 5 hours.

  • 2.1 CASE - The largest trainable model size (Figure 6a in Section VI.A)
  • 2.2 CASE - Throughput on the largest trainable model size supported by each baseline (Figure 7a in Section VI.B)
  • 2.3 CASE - Throughput on the largest trainable model size of Megatron-LM (Figure 8a in Section VI.B)
  • 2.4 CASE - Nearly linear scaling as model size increases (Figure 8b in Section VI.B)
  • 2.5 CASE - Impact of working window size (Figure 9 in Section VI.C)


The launch script ./examples/run.sh -m [method] -l [layers] -h [hidden size] -b [batch size] -w [window size] accepts five arguments, where [method] takes the values megatron-lm, l2l, zero-offload, zero-infinity, stronghold and all (use all to evaluate all approaches automatically). Default values for [layers], [hidden size], [batch size] and [window size] are 16, 2048, 4 and 4, respectively.

2.1 Evaluation example on a virtual machine with a single 32GB V100 GPU and a CPU (with 90GB RAM and 12 CPU cores).

!!! Please change the corresponding arguments if the hardware differs from ours. !!!
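
For instance, on a GPU with less memory than a 32GB V100, you might lower the layer count. The values below are purely illustrative; the right numbers depend on your hardware:

./examples/run.sh -m "stronghold" -l 48 -h 2048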

2.1.1 CASE - The largest trainable model size (Figure 6a in Section VI.A)

In this case, we use GPT-like models to find each method's largest trainable model size. The model size is changed by increasing/decreasing the number of transformer layers.

We evaluate Megatron-LM, L2L, ZeRO-Offload, ZeRO-Infinity and STRONGHOLD on a virtual machine with a single 32GB V100 GPU and a CPU (with 90GB RAM and 12 CPU cores) to find the largest trainable model size and the bottleneck of each. Throughout, we set Heads=16, Sequence Length=1024 and Batch Size=4 for all GPT-like models and training setups.

The measured largest model sizes are shown in the following table.

| Methods | Largest Trainable Size | Layers | Hidden Size | Heads | Sequence Length | Batch Size |
|---|---|---|---|---|---|---|
| Megatron-LM | 1.717 B | 32 | 2048 | 16 | 1024 | 4 |
| L2L | 4.033 B | 78 | 2048 | 16 | 1024 | 4 |
| ZeRO-Offload | 2.522 B | 48 | 2048 | 16 | 1024 | 4 |
| ZeRO-Infinity | 2.522 B | 48 | 2048 | 16 | 1024 | 4 |
| STRONGHOLD | 5.141 B | 100 | 2048 | 16 | 1024 | 4 |

To reproduce the results above, you can either turn to our provided notebook and run the corresponding cell (Cell Heading: 2.1 CASE - The largest trainable model size (Figure 6a in Section VI.A)) or run the following commands within our docker container.

PS: GPU/CPU out-of-memory errors might surface as other messages, such as 'can not create XXX'.

./examples/run.sh -m "megatron-lm" -l 32 -h 2048 && \
./examples/run.sh -m "l2l" -l 78 -h 2048 && \
./examples/run.sh -m "zero-offload" -l 48 -h 2048 && \
./examples/run.sh -m "zero-infinity" -l 48 -h 2048 && \
./examples/run.sh -m "stronghold" -l 100 -h 2048

./examples/case1_extract.sh and ./examples/case1_draw.sh help you analyze the log files and print only the relevant information.

2.1.2 CASE - Throughput on the largest trainable model size supported by each baseline (Figure 7a in Section VI.B)

In this case, we use GPT-like models to find the largest trainable model size supported by each baseline and compare each baseline against STRONGHOLD at that model size. The model size is changed by increasing/decreasing the number of transformer layers.

We evaluate Megatron-LM, L2L, ZeRO-Offload and ZeRO-Infinity vs. STRONGHOLD on a virtual machine with a single 32GB V100 GPU and a CPU (with 90GB RAM and 12 CPU cores). Throughout, we set Heads=16, Sequence Length=1024 and Batch Size=4 for all GPT-like models and training setups.

The measured throughput is shown in the following table.

| Methods | Throughput | Trainable Size | Layers | Hidden Size | Heads | Sequence Length | Batch Size |
|---|---|---|---|---|---|---|---|
| Megatron-LM | 0.7496 | 1.717 B | 32 | 2048 | 16 | 1024 | 4 |
| STRONGHOLD | 0.6647 | 1.717 B | 32 | 2048 | 16 | 1024 | 4 |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| L2L | 0.0529 | 4.033 B | 78 | 2048 | 16 | 1024 | 4 |
| STRONGHOLD | 0.2271 | 4.033 B | 78 | 2048 | 16 | 1024 | 4 |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| ZeRO-Offload | 0.2523 | 2.522 B | 48 | 2048 | 16 | 1024 | 4 |
| STRONGHOLD | 0.3999 | 2.522 B | 48 | 2048 | 16 | 1024 | 4 |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| ZeRO-Infinity | 0.2439 | 2.522 B | 48 | 2048 | 16 | 1024 | 4 |
| STRONGHOLD | 0.3999 | 2.522 B | 48 | 2048 | 16 | 1024 | 4 |

To reproduce the results above, you can either turn to our provided notebook and run the corresponding cell (Cell Heading: 2.2 CASE - Throughput on the largest trainable model size supported by each baseline (Figure 7a in Section VI.B)) or run the following commands within our docker container.

PS: Limitations on CPU cores and bandwidth in the virtual machine slightly hurt the performance of STRONGHOLD. The baseline results at these model sizes are reused from the logs produced in case 2.1.1.

./examples/run.sh -m "stronghold" -l 32 -h 2048 -w 15 && \
./examples/run.sh -m "stronghold" -l 48 -h 2048 -w 15 && \
./examples/run.sh -m "stronghold" -l 78 -h 2048 -w 15

./examples/case2_extract.sh and ./examples/case2_draw.sh help you analyze the log files and print only the relevant information.

2.1.3 CASE - Throughput on the largest trainable model size of Megatron-LM (Figure 8a in Section VI.B)

This case shows the throughput of running Megatron-LM, L2L, ZeRO-Offload, ZeRO-Infinity and STRONGHOLD, respectively, on a 1.717 B model, the largest trainable model size supported by Megatron-LM. The evaluation is conducted on a virtual machine with one 32GB V100 GPU, 90GB CPU RAM and 12 CPU cores.

The measured throughput results are shown in the following table.

| Methods | Throughput | Trainable Size | Layers | Hidden Size | Heads | Sequence Length | Batch Size |
|---|---|---|---|---|---|---|---|
| Megatron-LM | 0.7496 | 1.717 B | 32 | 2048 | 16 | 1024 | 4 |
| L2L | 0.1729 | 1.717 B | 32 | 2048 | 16 | 1024 | 4 |
| ZeRO-Offload | 0.3711 | 1.717 B | 32 | 2048 | 16 | 1024 | 4 |
| ZeRO-Infinity | 0.3587 | 1.717 B | 32 | 2048 | 16 | 1024 | 4 |
| STRONGHOLD | 0.6647 | 1.717 B | 32 | 2048 | 16 | 1024 | 4 |

To reproduce the results above, you can either turn to our provided notebook and run the corresponding cell (Cell Heading: 2.3 CASE - Throughput on the largest trainable model size of Megatron-LM (Figure 8a in Section VI.B)) or run the following commands within our docker container.

PS: Limitations on CPU cores and bandwidth in the virtual machine slightly hurt the performance of STRONGHOLD. Results for Megatron-LM and STRONGHOLD at 32 layers are reused from the earlier cases.

./examples/run.sh -m "l2l" -l 32 -h 2048 && \
./examples/run.sh -m "zero-offload" -l 32 -h 2048 && \
./examples/run.sh -m "zero-infinity" -l 32 -h 2048

./examples/case3_extract.sh and ./examples/case3_draw.sh help you analyze the log files and print only the relevant information.

2.1.4 CASE - Nearly linear scaling as model size increases (Figure 8b in Section VI.B)

In this case, we evaluate the performance (elapsed time per iteration, in ms) as the model size increases. As in previous cases, the model size is changed by increasing/decreasing the number of transformer layers.

You should see the elapsed time per iteration rise linearly with the number of transformer layers (a proxy for model size), demonstrating STRONGHOLD's scalability.

Similar to the previous cases, you can reproduce the results in our notebook (the corresponding Cell Heading: 2.4 CASE - Nearly linear scaling as model size increases (Figure 8b in Section VI.B)) or run the following commands in our docker container.

./examples/run.sh -m "stronghold" -l 92 -h 2048 -w 15 && \
./examples/run.sh -m "stronghold" -l 64 -h 2048 -w 15 && \
./examples/run.sh -m "stronghold" -l 56 -h 2048 -w 15 && \
./examples/run.sh -m "stronghold" -l 40 -h 2048 -w 15 && \
./examples/run.sh -m "stronghold" -l 24 -h 2048 -w 15 && \
./examples/run.sh -m "stronghold" -l 16 -h 2048 -w 15

./examples/case4_extract.sh and ./examples/case4_draw.sh help you analyze the log files and print only the relevant information.

2.1.5 CASE - Impact of working window size (Figure 9 in Section VI.C)

The working window size affects throughput. A larger window can better overlap GPU computation with data transfer, leading to higher training throughput; however, it also occupies more GPU memory.

This case evaluates the impact of the working window size for STRONGHOLD with a 1.7B model. You will see that a larger window size initially yields higher throughput, but beyond a certain point enlarging the window has no further effect, because the data transfer is already fully hidden by GPU computation.

To reproduce this, you can either turn to our notebook (the corresponding Cell Heading: 2.5 CASE - Impact of working window size (Figure 9 in Section VI.C)) or run the following commands in our docker container:

PS: The bandwidth restriction in the virtual machine might slightly hurt the performance of STRONGHOLD.

./examples/run.sh -m "stronghold" -l 32 -h 2048 -w 2 && \
./examples/run.sh -m "stronghold" -l 32 -h 2048 -w 4 && \
./examples/run.sh -m "stronghold" -l 32 -h 2048 -w 6 && \
./examples/run.sh -m "stronghold" -l 32 -h 2048 -w 8 && \
./examples/run.sh -m "stronghold" -l 32 -h 2048 -w 10 && \
./examples/run.sh -m "stronghold" -l 32 -h 2048 -w 12 && \
./examples/run.sh -m "stronghold" -l 32 -h 2048 -w 14

./examples/case5_extract.sh and ./examples/case5_draw.sh help you analyze the log files and print only the relevant information.
