Comments (8)

Sakura-gh commented on August 17, 2024

Hi, here are some additional details:

My oneflow version: version: 0.8.0+cu102
git_commit: a6d4cb80
cmake_build_type: Release
rdma: True
mlir: True
My libai commit: 622cff9

In addition, I reproduced the experiments with the script you provided, and the results are fairly close:

  1. libai:
  • mbs=2, gbs=16, script: bash tools/args_libai_gpt2.sh configs/gpt2_nl24_nah16_hs1024.py 1 8 0 127.0.0.1 1 1 true false 2 16; reproduced result: mb2_gb16: 8913 MiB, gpu_rate=54%, total_throughput: 16.55 samples/s
  • mbs=4, gbs=32, script: bash tools/args_libai_gpt2.sh configs/gpt2_nl24_nah16_hs1024.py 1 8 0 127.0.0.1 1 1 true false 4 32; reproduced result: mb4_gb32: 15375 MiB, gpu_rate=94%, throughput: 18.95 samples/s
  • Likely due to differences in my machine, I did not hit OOM at mbs=4, gbs=32, but the numbers are close to yours, so the gap is small
  2. megatron-lm: (see the attached logs below)

The experiments above show that I can roughly reproduce the behavior you described on my machine, so the problem is probably in my megatron-lm training script. However, I compared my script carefully against the one you provided: the network parameters, parallelism degrees, and other settings are essentially identical, yet the results differ substantially. Here are the results produced by my script:

To make the comparison easier, I have also attached the log files produced by both scripts.

Attachment: my megatron-lm script for training gpt2, pretrain_gpt_dp_mp_pp.sh

#! /bin/bash

# Runs the "345M" parameter model
GPUS_PER_NODE=${1:-8}
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=60075
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

DATA_PATH=/home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document
VOCAB_FILE_PATH=/home/gehao/dataset/gpt/hf-GPT2Data/vocab.json
MERGE_FILE_PATH=/home/gehao/dataset/gpt/hf-GPT2Data/merges.txt

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

# Parallelism degrees: M_P = tensor (model) parallel, P_P = pipeline parallel;
# the data-parallel degree is whatever remains of the world size.
M_P=${2:-1}
P_P=${3:-1}
D_P=$(($WORLD_SIZE/$M_P/$P_P))

MICRO_BATCH_SIZE=${4:-8}
GLOBAL_BATCH_SIZE=${5:-64}

TRAIN_ITERS=${6:-100}

CHECKPOINT_PATH=checkpoints/gpt2_gpus${GPUS_PER_NODE}_dp${D_P}_mp${M_P}_pp${P_P}_mbs${MICRO_BATCH_SIZE}_gbs${GLOBAL_BATCH_SIZE}_iters${TRAIN_ITERS}
LOGFILE=./log/megatron_lm_perf_gpt_pretrain_gpus${GPUS_PER_NODE}_dp${D_P}_mp${M_P}_pp${P_P}_mbs${MICRO_BATCH_SIZE}_gbs${GLOBAL_BATCH_SIZE}_iters${TRAIN_ITERS}.log

python -m torch.distributed.launch $DISTRIBUTED_ARGS \
       pretrain_gpt.py \
       --tensor-model-parallel-size $M_P \
       --pipeline-model-parallel-size $P_P \
       --num-layers 24 \
       --hidden-size 1024 \
       --num-attention-heads 16 \
       --micro-batch-size $MICRO_BATCH_SIZE \
       --global-batch-size $GLOBAL_BATCH_SIZE \
       --seq-length 1024 \
       --max-position-embeddings 1024 \
       --train-iters $TRAIN_ITERS \
       --lr-decay-iters 320000 \
       --save $CHECKPOINT_PATH \
       --load $CHECKPOINT_PATH \
       --data-path $DATA_PATH \
       --vocab-file $VOCAB_FILE_PATH \
       --merge-file $MERGE_FILE_PATH \
       --data-impl mmap \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr 0.00015 \
       --lr-decay-style cosine \
       --min-lr 1.0e-5 \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --lr-warmup-fraction .01 \
       --checkpoint-activations \
       --log-interval 1 \
       --save-interval 1000 \
       --eval-interval 100 \
       --eval-iters 10 \
       --fp16 2>&1 | tee ${LOGFILE}

echo "Writting log to ${LOGFILE}"     

The command I executed (mbs=8, gbs=64, gpu_rate=53%):

bash examples/pretrain_gpt_dp_mp_pp.sh 8 1 1 8 64 100
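
(Per the script's positional parameters, this sets GPUS_PER_NODE=8, M_P=1, P_P=1, MICRO_BATCH_SIZE=8, GLOBAL_BATCH_SIZE=64, TRAIN_ITERS=100.)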

Can we discuss what causes the difference between the results of these two megatron-lm scripts?

@xyn1201

xyn1201 commented on August 17, 2024

Hello, I ran the experiments above using your configuration.

oneflow version used: 0.8.0+cu102; libai version used: the latest commit

The oneflow and libai versions you are using may not match. Could you specify which libai commit you are on?
If you are on the latest libai, we recommend pairing it with a nightly oneflow build, e.g. python3 -m pip install --pre oneflow -f https://staging.oneflow.info/branch/master/cu102

My reproduction environment:

You can check your current oneflow version with python3 -m oneflow --doctor (the version/git_commit/cmake_build_type/rdma/mlir fields quoted at the top of this thread match this command's output).

Results reproduced with your configuration on 8x 16GB V100 GPUs:

  • LiBai
    • mb2_gb16: 10143 MiB
    • mb4_gb32: OOM
  • Megatron-LM
    • mb2_gb16: 13432 MiB
    • mb4_gb32: OOM

You can also reproduce the results published with our releases using our test scripts.

I could not reproduce your results using your configuration and the commit you mentioned. You can run the scripts below and see whether you get results similar to mine.
From https://github.com/Oneflow-Inc/OneAutoTest/tree/dev_display/libai:

  • Copy args_libai_gpt2.sh into the tools directory of the libai repo
  • Copy gpt2_nl24_nah16_hs1024.py into the configs directory of the libai repo
  • Run bash tools/args_libai_gpt2.sh configs/gpt2_nl24_nah16_hs1024.py 1 8 0 127.0.0.1 1 1 true false 2 16
  • This configuration changes input_placement_device from 'cuda' to 'cpu', which further reduces GPU memory usage (see the sketch after this list). Reproduced results:
    • mb2_gb16: 9942 MiB
    • mb4_gb32: OOM
  • To reproduce megatron, copy https://github.com/Oneflow-Inc/OneAutoTest/blob/dev_display/libai/megatron/megatron_args_pretrain_gpt2.sh into the examples directory of the megatron repo and run bash examples/megatron_args_pretrain_gpt2.sh 1 8 0 127.0.0.1 1 1 true false 2 16. Results:
    • mb2_gb16: 13432 MiB
    • mb4_gb32: OOM
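
A minimal sketch of the input_placement_device override mentioned above, using omegaconf as a stand-in for LiBai's config node; the exact field location is an assumption based on this thread, not a copy of LiBai's real config files:

from omegaconf import OmegaConf

# Stand-in for LiBai's `train` config node (the real defaults live in the
# libai repo); created here only so the override below is runnable.
train = OmegaConf.create({"input_placement_device": "cuda"})

# The memory optimization described above: stage input batches on the CPU
# and let the runtime move them to the GPU, rather than placing them on
# 'cuda' directly, which lowers peak GPU memory.
train.input_placement_device = "cpu"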

@Sakura-gh

xyn1201 commented on August 17, 2024

I checked the parameters printed at the top of these two logs; the difference is checkpoint_activations: yours is True, mine is False.
The tests you ran with us on libai before all used checkpoint_activations=false.
In libai it is switched on via train.activation_checkpoint.enabled=true.
Could you reproduce again with this in mind?
@Sakura-gh
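
For reference, a minimal sketch of the two equivalent switches, again with an omegaconf stand-in for LiBai's config node (train.activation_checkpoint.enabled is quoted from the comment above; the Megatron flag is the one in pretrain_gpt_dp_mp_pp.sh earlier in this thread):

from omegaconf import OmegaConf

# Stand-in for the relevant slice of LiBai's training config.
train = OmegaConf.create({"activation_checkpoint": {"enabled": False}})

# LiBai: enable activation checkpointing (the setting named above).
train.activation_checkpoint.enabled = True

# Megatron-LM: the equivalent switch is the --checkpoint-activations flag
# already present in pretrain_gpt_dp_mp_pp.sh; remove the flag to disable it.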

yuanms2 commented on August 17, 2024

Ah, that explains it: checkpointing trades extra compute time for lower GPU memory.
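
To illustrate the trade-off, here is a small self-contained PyTorch sketch of activation checkpointing; torch.utils.checkpoint is standard PyTorch, shown only as an analogy, not the internal implementation of either framework:

import torch
from torch.utils.checkpoint import checkpoint

# A residual MLP block standing in for one transformer layer.
class Block(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        return x + self.ff(x)

block = Block(1024)
x = torch.randn(2, 1024, requires_grad=True)

# Plain forward: the 4*dim intermediate activations stay alive until backward.
y = block(x)

# Checkpointed forward: intermediates are discarded and recomputed during
# backward, cutting activation memory at the cost of a second forward pass.
y_ckpt = checkpoint(block, x, use_reentrant=False)
y_ckpt.sum().backward()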

yuanms2 commented on August 17, 2024

That is, even with Megatron-LM, Sakura-gh's configuration can run a very large batch size, while our configuration can only run half that batch size.

Sakura-gh commented on August 17, 2024

I checked the parameters printed at the top of these two logs; the difference is checkpoint_activations: yours is True, mine is False. The tests you ran with us on libai before all used checkpoint_activations=false. In libai it is switched on via train.activation_checkpoint.enabled=true. Could you reproduce again with this in mind? @Sakura-gh

Thanks a lot! After enabling checkpointing, libai's performance is on par with megatron-lm, and libai still has an advantage in memory usage. Reproduced results:

  • libai: mbs=8, gbs=64, throughput=15.82 samples/s, gpu_rate=34%
  • megatron-lm: mbs=8, gbs=64, throughput=15.93 samples/s, gpu_rate=68%

yuanms2 commented on August 17, 2024

I see that with checkpointing on, libai is slightly lower than megatron. @xyn1201, is this consistent with the earlier test results?

xyn1201 commented on August 17, 2024

I see that with checkpointing on, libai is slightly lower than megatron. @xyn1201, is this consistent with the earlier test results?
