Comments (13)
Hi @ChenDRAG, did you try running it using the multi_gpu.yaml
configuration instead? Maybe the memory optimisations introduced by ZeRO are degrading the performance of your GPUs...
The command would look like the following:
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --main_process_port 6000 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full.yaml
Other than that, I suggest you try LoRA if you're having issues with either SFT or DPO, as it uses less memory and requires fewer resources to run; with 40GB of VRAM you'll be good to go with LoRA.
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --main_process_port 6000 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_lora.yaml
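For intuition on why LoRA is so much lighter: only small adapter matrices are trained while the base weights stay frozen. A minimal sketch with the peft library (the hyperparameter values below are placeholders, not the ones from config_lora.yaml):

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative LoRA setup: values here are placeholders, not the recipe's.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # only a fraction of a percent of the 7B parameters is trainable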
@alvarobartt Thanks a lot for your kind help!
However, in the scripts, the instructions to reproduce the experiments are:
# Full training with ZeRO-3 on 8 GPUs
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_{task}.py recipes/{model_name}/{task}/config_full.yaml
# LoRA training on a single GPU
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_{task}.py recipes/{model_name}/{task}/config_lora.yaml
# QLoRA 4-bit training on a single GPU
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_{task}.py recipes/{model_name}/{task}/config_lora.yaml --load_in_4bit=true
# LoRA training with ZeRO-3 on two or more GPUs
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --num_processes={num_gpus} scripts/run_{task}.py recipes/{model_name}/{task}/config_lora.yaml
I notice that whenever multiple GPUs are used, deepspeed_zero3 is suggested for acceleration, and I don't know why.
Could you explain the main difference between the deepspeed_zero3 and multi_gpu configurations? Is there any potential problem (drawback) if I use multi_gpu.yaml for distributed training?
p.s.
I tried
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --main_process_port 6000 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full.yaml
and it still reports an OOM error on 8x46GB cards.
DeepSpeed ZeRO-3 will shard the model over several GPUs; this should resolve the OOM issues you see. Note that we tested on A100 80GB GPUs, so you may need to tweak the hyperparameters to match your use case.
Also, using Flash Attention may decrease VRAM consumption while training, right? cc @edbeeching
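For reference, FlashAttention-2 is switched on at model load time in recent transformers versions; a minimal sketch, assuming the flash-attn package is installed (the model name is just an example):

import torch
from transformers import AutoModelForCausalLM

# Illustrative: requires the flash-attn package and an Ampere-or-newer GPU.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)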
DeepSpeed ZeRO-3 will shard the model over several GPUs; this should resolve the OOM issues you see. Note that we tested on A100 80GB GPUs, so you may need to tweak the hyperparameters to match your use case.

Thanks for your help!
I thought different GPUs merely lead to different upper limits on the batch size. Can you tell me which specific hyperparameters I may need to alter, other than the batch size and accumulation steps, to get things working on other GPUs?
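One constraint to keep in mind when porting a recipe: the effective (global) batch size should stay fixed, so a smaller per-device batch has to be compensated with more accumulation steps. A hypothetical helper to make that explicit (not from the handbook code):

# effective batch = per_device_batch * grad_accum_steps * num_gpus
def grad_accum_steps(effective_batch: int, per_device_batch: int, num_gpus: int) -> int:
    steps, rem = divmod(effective_batch, per_device_batch * num_gpus)
    assert rem == 0, "effective batch must divide evenly"
    return steps

# Example: a recipe tuned for an effective batch of 64 on 8 GPUs,
# reproduced with a per-device batch of only 2 on the same 8 GPUs:
print(grad_accum_steps(64, 2, 8))  # -> 4 accumulation steps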
I would also like more info about this. Do you use DeepSpeed to increase the batch size? A 7B model fits nicely on 80GB GPUs without any model parallelism.
Hi @alvarobartt, sorry for the delay. Yes, we are using flash attention.
@tcapelle, if you have lower GPU memory you can use LoRA (PEFT) to perform fine-tuning.
Thanks for the prompt response =). BTW, outstanding presentation at DL.ai, @edbeeching!
What I am curious about is why you use DeepSpeed ZeRO-3 when 80GB GPUs are available. Is it faster, or is it to increase the batch size? I have a node of 8x80GB.
Thanks @tcapelle. ZeRO-3 shards the optimizer state, gradients, and model weights across GPUs, so you should have more memory available. However, if you are tuning a 7B model you may not need to shard, as you will probably be running plain DDP across the 8 GPUs.
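To put rough numbers on that, a back-of-the-envelope estimate (my own arithmetic, assuming bf16 weights and gradients plus fp32 Adam state, and ignoring activations) of what each GPU holds with and without sharding:

# ~16 bytes per parameter: 2 (bf16 weights) + 2 (bf16 grads)
# + 12 (fp32 master weights, momentum, and variance for Adam)
params = 7e9
total_gb = params * 16 / 1e9

num_gpus = 8
print(f"DDP:    ~{total_gb:.0f} GB per GPU")             # full replica on every GPU
print(f"ZeRO-3: ~{total_gb / num_gpus:.0f} GB per GPU")  # states sharded 8 ways
# DDP:    ~112 GB -> why full fine-tuning can OOM even on 80GB cards
# ZeRO-3: ~14 GB  -> fits easily, leaving headroom for activations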
Yes, but the README says:
Full fine-tuning on a multi-GPU machine with DeepSpeed ZeRO-3 (tested on an 8 x A100 (80GB) node)
I am curious why you chose to shard when such big GPUs are available; maybe I am missing something.
This is so the config is compatible with a larger model, e.g. llama-2-70b.
I think that for a 7b model no sharding will take place.
The DPO recipe with a 7B model and config_full gets me OOM, so I was wondering what I should reduce to keep the recipe consistent.
I am on 8xA100 80GB.