
Comments (13)

alvarobartt avatar alvarobartt commented on May 14, 2024

Hi @ChenDRAG, did you try running it using the multi_gpu.yaml configuration instead? Maybe the memory optimisations introduced by ZeRO are downgrading the performance of your GPU...

The command would look like the following:

CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --main_process_port 6000 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full.yaml

Other than that, I suggest you try LoRA if you're having issues with either SFT or DPO, as it uses less memory and requires fewer resources to run; with 40GB of VRAM you'll be good to go with LoRA.

CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --main_process_port 6000 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_lora.yaml
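For reference, here's a minimal sketch of what a LoRA setup does under the hood: only small adapter matrices are trained while the base model stays frozen, which is why it fits in much less memory. The hyperparameter values below are illustrative, not the handbook's exact config_lora.yaml:

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the frozen base model in bf16
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)

# Attach low-rank adapters to the attention projections
peft_config = LoraConfig(
    r=16,                     # rank of the adapter matrices (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # typically well under 1% of the 7B parameters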


ChenDRAG avatar ChenDRAG commented on May 14, 2024

@alvarobartt Thanks a lot for your kind help!
However, in the scripts, the instructions to reproduce the experiments are:

# Full training with ZeRO-3 on 8 GPUs
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_{task}.py recipes/{model_name}/{task}/config_full.yaml

# LoRA training on a single GPU
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_{task}.py recipes/{model_name}/{task}/config_lora.yaml

# QLoRA 4-bit training on a single GPU
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_{task}.py recipes/{model_name}/{task}/config_lora.yaml --load_in_4bit=true

# LoRA training with ZeRO-3 on two or more GPUs
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --num_processes={num_gpus} scripts/run_{task}.py recipes/{model_name}/{task}/config_lora.yaml

I notice that whenever multiple GPUs are used, it is suggested to use deepspeed_zero3 for acceleration, and I don't know why.

Could you explain the main difference between the deepspeed_zero3 and multi_gpu configurations? Is there any potential drawback if I use multi_gpu.yaml for distributed training?


ChenDRAG avatar ChenDRAG commented on May 14, 2024

p.s.
I tried
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --main_process_port 6000 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full.yaml
and it still reports an OOM error on 8 x 46GB cards.


edbeeching avatar edbeeching commented on May 14, 2024

DeepSpeed ZeRO-3 will shard the model over several GPUs, which should resolve the OOM issues you see. Note we tested on A100 80GB GPUs, so you may need to tweak the hyperparameters to match your use case.


alvarobartt avatar alvarobartt commented on May 14, 2024

Also using Flash Attention may decrease the VRAM consumption while training, right? cc @edbeeching
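For reference, a minimal sketch of how flash attention is typically enabled when loading the model with recent versions of transformers (requires the flash-attn package and an Ampere-or-newer GPU; the exact flag depends on your transformers version):

import torch
from transformers import AutoModelForCausalLM

# Flash Attention 2 mainly reduces activation memory for long sequences
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # older releases used use_flash_attention_2=True
)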


ChenDRAG avatar ChenDRAG commented on May 14, 2024

DeepSpeed ZeRO-3 will shard the model over several GPUs, which should resolve the OOM issues you see. Note we tested on A100 80GB GPUs, so you may need to tweak the hyperparameters to match your use case.

Thanks for your help!

I thought different GPUs merely lead to different upper limits on batch size. Can you tell me which specific hyperparameters I may need to alter, other than batch size and accumulation steps, to get things working on other GPUs?


tcapelle avatar tcapelle commented on May 14, 2024

I would also like more info about this. Do you use DeepSpeed to increase the batch size? A 7B model fits nicely on 80GB GPUs without any model parallelism.


edbeeching avatar edbeeching commented on May 14, 2024

Hi @alvarobartt, sorry for the delay. Yes, we are using flash attention.

@tcapelle if you have less GPU memory you can use LoRA (PEFT) to perform fine-tuning.


tcapelle avatar tcapelle commented on May 14, 2024

Thanks for the prompt response =). BTW, outstanding presentation at DL.ai, @edbeeching!

What I am curious about is why you use DeepSpeed ZeRO-3 when using 80GB GPUs: is it faster, or is it to increase the batch size? I have a node of 8x80GB GPUs.


edbeeching avatar edbeeching commented on May 14, 2024

Thanks @tcapelle. ZeRO-3 shards the optimizer state, gradients, and model weights across GPUs, so you should have more memory available. However, if you are tuning a 7B model you may not need to shard, as you will probably be running DDP across 8 GPUs.
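To make that concrete, here's a rough back-of-envelope estimate, assuming bf16 weights and gradients plus fp32 Adam states, and ignoring activations and DPO's reference model:

# Approximate training-state memory for full fine-tuning of a 7B model
params = 7e9
bytes_per_param = 2 + 2 + (4 + 4 + 4)  # bf16 weights + bf16 grads + fp32 master weights, Adam m and v
total_gb = params * bytes_per_param / 1e9
print(f"unsharded: ~{total_gb:.0f} GB")  # ~112 GB, more than one 80GB card

num_gpus = 8
print(f"ZeRO-3 over {num_gpus} GPUs: ~{total_gb / num_gpus:.0f} GB per GPU")  # ~14 GB plus activations

With plain DDP (the multi_gpu.yaml config) each GPU holds the full training state, so a full fine-tune of a 7B model can OOM even on 80GB cards, while ZeRO-3 spreads that state across the node.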


tcapelle avatar tcapelle commented on May 14, 2024

Yes, but in the Readme:

Full fine-tuning on a multi-GPU machine with DeepSpeed ZeRO-3 (tested on an 8 x A100 (80GB) node)

I am curious why you chose to shard when big GPUs are available; maybe I am missing something.


edbeeching avatar edbeeching commented on May 14, 2024

This is so the config is compatible with a larger model, e.g. Llama-2-70B.
I think that for a 7B model no sharding will take place.


tcapelle avatar tcapelle commented on May 14, 2024

The DPO recipe with a 7B model and config_full gets me OOM, so I was wondering what I should reduce to keep the recipe consistent.

I am on 8 x A100 80GB.
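For what it's worth, lowering per_device_train_batch_size and raising gradient_accumulation_steps by the same factor keeps the effective batch size, and therefore the recipe, unchanged; the numbers below are illustrative, not the exact recipe values:

# Effective (global) batch size = num_gpus * per-device batch size * accumulation steps
num_gpus = 8
per_device_batch_size = 8
grad_accum_steps = 2
print(num_gpus * per_device_batch_size * grad_accum_steps)  # 128

# Halving the per-device batch size and doubling accumulation keeps it at 128
per_device_batch_size = 4
grad_accum_steps = 4
print(num_gpus * per_device_batch_size * grad_accum_steps)  # 128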

