DeepNet: An Implementation based on Colossal-AI
This is a re-implementation of the DeepNet model from the paper DeepNet: Scaling Transformers to 1,000 Layers.
DeepNet scales transformer models to 1,000 layers by applying DeepNorm. This Colossal-AI based implementation supports data parallelism, pipeline parallelism, and 1D tensor parallelism for training.
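DeepNorm replaces the usual post-LN residual update x_{l+1} = LN(x_l + G(x_l)) with x_{l+1} = LN(α·x_l + G(x_l)), and down-scales the initialization of certain projections by β. A minimal plain-Python sketch of the idea, using the decoder-only constants α = (2N)^(1/4) and β = (8N)^(-1/4) from the paper (the helper names here are illustrative, not from this repo):

```python
import math

def deepnet_constants(num_layers):
    """Decoder-only DeepNet scaling constants (values from the DeepNet paper)."""
    alpha = (2 * num_layers) ** 0.25    # scales the residual branch
    beta = (8 * num_layers) ** -0.25    # scales init of ffn/value/output weights
    return alpha, beta

def layer_norm(x, eps=1e-5):
    """Minimal layer norm over a list of floats (no learned scale/bias)."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def deepnorm_residual(x, sublayer_out, alpha):
    """DeepNorm residual: x_{l+1} = LN(alpha * x_l + G(x_l))."""
    return layer_norm([alpha * a + b for a, b in zip(x, sublayer_out)])
```

For a 1,000-layer decoder this gives α ≈ 6.69, so the residual stream dominates each sublayer's output, which is what bounds the update magnitude and keeps very deep training stable.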
The decoder-only DeepNet model is modified from the GPT model. In this example, we use the WebText dataset for training; the dataset is prepared the same way as in the Colossal-AI based GPT example.
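The Colossal-AI GPT example stores WebText as a JSON-lines file in which each line is an object with a "text" field. Assuming that same layout (the format and loader below are a sketch, not code from this repo), loading the raw samples could look like:

```python
import json

def load_webtext(path):
    """Load a jsonl file where each non-empty line is {"text": "..."}."""
    samples = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                samples.append(json.loads(line)["text"])
    return samples
```

Tokenization and sequence packing then happen downstream, exactly as in the GPT example's data pipeline.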
```shell
#!/usr/bin/env sh
export DATA=/path/to/train_data.json

torchrun --standalone --nproc_per_node=<num_gpus> train_deepnet_decoder.py --config=decoder_configs/deepnet_pp1d.py --from_torch
```
Please replace `DATA` and `<num_gpus>` with the path to your dataset and the number of GPUs, respectively.
You can also modify the config file `decoder_configs/deepnet_pp1d.py` to further change the parallel settings, training hyperparameters, and model details.
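For orientation, a Colossal-AI config combining pipeline and 1D tensor parallelism typically looks like the fragment below. The field names follow Colossal-AI's config conventions, but the concrete values are illustrative and are not necessarily those shipped in `decoder_configs/deepnet_pp1d.py`:

```python
from colossalai.amp import AMP_TYPE

BATCH_SIZE = 8
NUM_EPOCHS = 10
SEQ_LEN = 1024
NUM_MICRO_BATCHES = 4

# 2 pipeline stages x 2-way 1D tensor parallelism = 4 GPUs per replica;
# any remaining GPUs are used for data parallelism automatically.
parallel = dict(
    pipeline=2,
    tensor=dict(size=2, mode='1d'),
)

# optional mixed-precision training
fp16 = dict(mode=AMP_TYPE.NAIVE)
```

Setting `pipeline=1` and `tensor=dict(size=1, mode=None)` would fall back to pure data parallelism.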
- Decoder-only DeepNet
- Encoder-Decoder DeepNet