
📄 Sequence Parallel

🔗 Table of Contents

  • 📌 Introduction
  • 🛠 Environment Setup
  • 🪅 Dataset Preparation
  • 🚀 Run the codebase

📌 Introduction

In this codebase, we implement BERT with sequence parallelism. Sequence parallelism splits the input tensor and the intermediate activations along the sequence dimension. This method achieves better memory efficiency and allows us to train with larger batch sizes and longer sequence lengths.

Paper: Sequence Parallelism: Long Sequence Training from System Perspective
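
To make the splitting scheme concrete, here is a minimal PyTorch sketch of partitioning an activation of shape (batch, seq_len, hidden) along the sequence dimension. This is illustrative only; the shapes and the world_size variable are assumptions, and the actual split in this codebase is handled by its sequence-parallel layers, not by this snippet.

import torch

# illustrative: split a (batch, seq_len, hidden) tensor into contiguous
# sequence chunks, one per sequence-parallel rank
world_size = 4                    # number of sequence-parallel GPUs (assumed)
x = torch.randn(8, 512, 1024)     # (batch, seq_len, hidden)

chunks = torch.chunk(x, world_size, dim=1)
for rank, local_x in enumerate(chunks):
    print(rank, local_x.shape)    # each rank holds a (8, 128, 1024) slice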

🛠 Environment Setup

To run this codebase, the following environment is required:

  • CUDA: 11.3
  • Python: > 3.7
  • PyTorch: 1.11.0

You can follow the script below to set up the environment for this codebase.

# create conda environment
conda create -n seq python=3.9
conda activate seq

# install PyTorch
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113

# install nvidia apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" git+https://github.com/NVIDIA/[email protected]

# install other dependencies
pip install -v -r requirements.txt
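
After installation, you can quickly sanity-check the environment from Python (an optional check, not a required step):

import torch

# expect 1.11.0+cu113, 11.3 and True if the installation above succeeded
print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.is_available())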

🪅 Dataset Preparation

For this codebase, we will use the Wikipedia dataset and process it with Megatron-LM's preprocessing script.

Step 1: Download and extract the Wikipedia Dataset

You can use the following script to download and extract the Wikipedia dataset. The extracted corpus will be stored in ./dataset/raw/corpus.json.

Execute the following scripts in the root directory of this codebase.

# go to the root directory
cd Sequence-Parallelism

# create dataset workspace
mkdir dataset && cd ./dataset

# download
mkdir raw && cd ./raw 
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

# install wiki extractor
pip install git+https://github.com/FrankLeeeee/[email protected]

# extract text
wikiextractor --json enwiki-latest-pages-articles.xml.bz2
cat text/*/* > ./corpus.json
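
To get a feel for the extracted data, you can peek at the first record of corpus.json. wikiextractor's --json mode writes one JSON object per line; the exact field names may vary across wikiextractor versions, so treat this as a rough check rather than part of the pipeline.

import json

# inspect the first document produced by wikiextractor --json
with open("./dataset/raw/corpus.json") as f:
    doc = json.loads(f.readline())
print(doc.keys())          # typically includes 'title' and 'text'
print(doc["text"][:200])   # first 200 characters of the article text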

Step 2: Process the Wikipedia dataset

The following commands process the Wikipedia dataset. Many thanks to Megatron-LM for providing the preprocessing script that converts the corpus file into the binary dataset format.

# go to the root directory
cd Sequence-Parallelism

# download vocab file
mkdir vocab && cd ./vocab
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt
cd ..

# clone the Megatron-LM repository
git clone https://github.com/NVIDIA/Megatron-LM.git
cd ./Megatron-LM
git checkout b037a69eb2622bd824d88cc230ada94077b3716f

# run processing
python tools/preprocess_data.py \
    --input ../dataset/raw/corpus.json \
    --output-prefix my-bert \
    --vocab ../vocab/bert-large-uncased-vocab.txt \
    --dataset-impl mmap \
    --tokenizer-type BertWordPieceLowerCase \
    --split-sentences \
    --workers 48

# move the processed data to the dataset directory
cd ..
mkdir -p dataset/processed
mv Megatron-LM/my-bert_text_sentence.* ./dataset/processed

After running these commands, you should see the following files in dataset/processed:

  1. my-bert_text_sentence.bin
  2. my-bert_text_sentence.idx
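
If you prefer to verify this programmatically, a short check like the following (paths taken from the steps above) confirms both files are in place:

import os

# the Megatron-LM preprocessing step should have produced these two files
for ext in ("bin", "idx"):
    path = f"./dataset/processed/my-bert_text_sentence.{ext}"
    print(path, "exists:", os.path.exists(path))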

🚀 Run the codebase

We provide train.py for you to run training. Before invoking the script, there are several steps to complete.

Step 1. Set data path and vocab path

At the top of config.py, you can see two global variables DATA_PATH and VOCAB_FILE_PATH.

DATA_PATH = './dataset/processed/my-bert_text_sentence'
VOCAB_FILE_PATH = './vocab/bert-large-uncased-vocab.txt'

DATA_PATH refers to the path to the data files generated by Megatron's script. For example, in the section above you should get two data files (my-bert_text_sentence.bin and my-bert_text_sentence.idx). You just need to set DATA_PATH to the path to the .bin file without the file extension.

The VOCAB_FILE_PATH refers to the path to the vocabulary downloaded when you prepare the dataset (e.g. bert-large-uncased-vocab.txt).

Step 2. Make Dataset Helper

Build the BERT dataset helper. This step requires CUDA, g++, pybind11 and make.

cd ./data/datasets
make

Step 3. Configure your parameters

The provided config.py defines a set of parameters, including the training scheme, the model, etc. You can also modify the ColossalAI parallel setting. For example, if you wish to parallelize over the sequence dimension on 8 GPUs, you can change size=4 to size=8. If you wish to use pipeline parallelism, you can set pipeline=<num_of_pipeline_stages>.
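
For illustration, a sequence-parallel setting in the Colossal-AI config style might look like the sketch below. The exact field names in this repo's config.py may differ, so treat them as assumptions and check the file itself.

# hypothetical excerpt in the style of a Colossal-AI parallel config;
# see config.py for the actual field names used in this repo
parallel = dict(
    pipeline=1,                           # set to <num_of_pipeline_stages> for pipeline parallelism
    tensor=dict(size=4, mode='sequence'), # change size=4 to size=8 to parallelize over 8 GPUs
)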

Step 4. Invoke parallel training

Lastly, you can start training with sequence parallelism using the following command. Replace <num-of-gpus> with the number of GPUs you want to use.

python -m torch.distributed.launch --nproc_per_node <num-of-gpus> --master_addr localhost --master_port 29500 train.py
