Comments (13)
Hi @ChenDRAG, did you try running it using the multi_gpu.yaml
configuration instead? Maybe the memory optimisations introduced by ZeRO are degrading the performance of your GPUs...
The command would look like the following:
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --main_process_port 6000 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full.yaml
Other than that, I suggest you try LoRA if you're having issues with either SFT or DPO, as it uses less memory and requires fewer resources to run; with 40GB of VRAM you'll be good to go with LoRA.
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --main_process_port 6000 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_lora.yaml
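For intuition on why LoRA is so much lighter: only small adapter matrices are trained while the base weights stay frozen. A minimal sketch with the peft library (the hyperparameter values below are placeholders, not the ones from config_lora.yaml):

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative LoRA setup: values here are placeholders, not the recipe's.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # only a fraction of a percent of the 7B parameters is trainable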
@alvarobartt Thanks a lot for your kind help!
However, in the scripts, the instructions to reproduce the experiments are:
# Full training with ZeRO-3 on 8 GPUs
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_{task}.py recipes/{model_name}/{task}/config_full.yaml
# LoRA training on a single GPU
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_{task}.py recipes/{model_name}/{task}/config_lora.yaml
# QLoRA 4-bit training on a single GPU
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 scripts/run_{task}.py recipes/{model_name}/{task}/config_lora.yaml --load_in_4bit=true
# LoRA training with ZeRO-3 on two or more GPUs
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --num_processes={num_gpus} scripts/run_{task}.py recipes/{model_name}/{task}/config_lora.yaml
I notice that whenever multiple GPUs are used, deepspeed_zero3 is suggested for acceleration, and I don't know why.
Could you explain the main difference between the deepspeed_zero3 and multi_gpu configurations? Is there any potential problem (drawback) if I use multi_gpu.yaml for distributed training?
p.s.
I tried
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7,8,9 ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/multi_gpu.yaml --main_process_port 6000 scripts/run_dpo.py recipes/zephyr-7b-beta/dpo/config_full.yaml
and it still reports an OOM error on 8x46GB cards.
DeepSpeed ZeRO-3 will shard the model over several GPUs; this should resolve the OOM issues you see. Note that we tested on A100 80GB GPUs, so you may need to tweak the hyperparameters to match your use case.
Also, using Flash Attention may decrease VRAM consumption while training, right? cc @edbeeching
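For reference, FlashAttention-2 is switched on at model load time in recent transformers versions; a minimal sketch, assuming the flash-attn package is installed (the model name is just an example):

import torch
from transformers import AutoModelForCausalLM

# Illustrative: requires the flash-attn package and an Ampere-or-newer GPU.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)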
DeepSpeed ZeRO-3 will shard the model over several GPUs; this should resolve the OOM issues you see. Note that we tested on A100 80GB GPUs, so you may need to tweak the hyperparameters to match your use case.

Thanks for your help!
I thought different GPUs merely lead to different upper limits on the batch size. Can you tell me which specific hyperparameters I may need to alter, other than the batch size and accumulation steps, to get things working on other GPUs?
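One constraint to keep in mind when porting a recipe: the effective (global) batch size should stay fixed, so a smaller per-device batch has to be compensated with more accumulation steps. A hypothetical helper to make that explicit (not from the handbook code):

# effective batch = per_device_batch * grad_accum_steps * num_gpus
def grad_accum_steps(effective_batch: int, per_device_batch: int, num_gpus: int) -> int:
    steps, rem = divmod(effective_batch, per_device_batch * num_gpus)
    assert rem == 0, "effective batch must divide evenly"
    return steps

# Example: a recipe tuned for an effective batch of 64 on 8 GPUs,
# reproduced with a per-device batch of only 2 on the same 8 GPUs:
print(grad_accum_steps(64, 2, 8))  # -> 4 accumulation steps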
I would also like more info about this. Do you use DeepSpeed to increase the batch size? A 7B model fits nicely on 80GB GPUs without any model parallelism.
Hi @alvarobartt, sorry for the delay. Yes, we are using flash attention.
@tcapelle, if you have lower GPU memory you can use LoRA (PEFT) to perform fine-tuning.
Thanks for the prompt response =). BTW, outstanding presentation at DL.ai, @edbeeching!
What I am curious about is why you use DeepSpeed ZeRO-3 when 80GB GPUs are available. Is it faster, or is it to increase the batch size? I have a node of 8x80GB.
Thanks @tcapelle. ZeRO-3 shards the optimizer state, gradients, and model weights across GPUs, so you should have more memory available. However, if you are tuning a 7B model you may not need to shard, as you will probably be running plain DDP across the 8 GPUs.
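To put rough numbers on that, a back-of-the-envelope estimate (my own arithmetic, assuming bf16 weights and gradients plus fp32 Adam state, and ignoring activations) of what each GPU holds with and without sharding:

# ~16 bytes per parameter: 2 (bf16 weights) + 2 (bf16 grads)
# + 12 (fp32 master weights, momentum, and variance for Adam)
params = 7e9
total_gb = params * 16 / 1e9

num_gpus = 8
print(f"DDP:    ~{total_gb:.0f} GB per GPU")             # full replica on every GPU
print(f"ZeRO-3: ~{total_gb / num_gpus:.0f} GB per GPU")  # states sharded 8 ways
# DDP:    ~112 GB -> why full fine-tuning can OOM even on 80GB cards
# ZeRO-3: ~14 GB  -> fits easily, leaving headroom for activations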
Yes, but the README says:
Full fine-tuning on a multi-GPU machine with DeepSpeed ZeRO-3 (tested on an 8 x A100 (80GB) node)
I am curious why you chose to shard when such big GPUs are available; maybe I am missing something.
This is so the config is compatible with a larger model, e.g. llama-2-70b.
I think that for a 7b model no sharding will take place.
The DPO recipe with a 7B model and config_full gets me OOM, so I was wondering what I should reduce to keep the recipe consistent.
I am on 8xA100 80GB.