Comments (9)
@Coobiw, you can use the GatheredParameters context manager, which will automatically gather the parameters within the context and release them on exit. You can see a simple example of using it to compute a moving average of parameters here.
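For illustration, a minimal sketch of that pattern (the model and lm_head names here are just placeholders, not from your script):

import deepspeed

# Gather the ZeRO-3 partitioned parameters of one submodule; they are
# released (re-partitioned) automatically when the context exits.
with deepspeed.zero.GatheredParameters(list(model.lm_head.parameters())):
    print(model.lm_head.weight.shape)  # full, un-partitioned shape inside the context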
Hi, I've tried this before, but the program gets stuck. How can I debug this?
Also, I want to know whether this is because I use a 30B+ LLM and ZeRO-3 inference is simply very slow?
import deepspeed

if self.zero_stage == 3:
    # Only gather parameters that ZeRO-3 has partitioned away from this rank.
    params_to_fetch = [
        p for p in self.model.parameters()
        if hasattr(p, 'ds_id') and p.ds_status == deepspeed.zero.partition_parameters.ZeroParamStatus.NOT_AVAILABLE
    ]
    should_gather_param = len(params_to_fetch) > 0
    # Gather the full parameters for the duration of evaluation, then release on exit.
    with deepspeed.zero.GatheredParameters(params_to_fetch, enabled=should_gather_param):
        self.model.eval()
        evaluation()  # contains model.generate()
@Coobiw, can you share your full script to help us repro on our side?
Is this a dense or MoE model?
In terms of debugging, can you use prints to pinpoint where it hangs?
Also, can you try to repro on a single GPU so that you can use pdb for debugging? You can try two options for this:
- Enable CPU/NVMe offloading to fit the model (see the config sketch after this list), or
- Use a smaller model
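For the offloading option, a minimal sketch of a ZeRO-3 config with parameter offload to CPU (the values are illustrative, not taken from your setup):

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}
# model_engine, *_ = deepspeed.initialize(model=model, config=ds_config)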
Sorry, it is inconvenient to share the whole code, but I will try my best to provide more information. It is a dense model. I've also tried the script with a ~9B model on A100 80GB, and a similar hang appeared.
I think it may be a multi-GPU communication problem? There is no explicit error, only a warning in model.generate, which is related to NCCL.
/root/miniconda3/lib/python3.9/site-packages/transformers/generation/configuration_utils.py:497: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
warnings.warn(
t-20240517175036-k966t-worker-0:5136:5282 [7] ib_plugin.c:798 NCCL WARN NET/IB : req 0/1 tag 7 peer 172.25.40.117<36987> collective mismatch error, local size 897024 remote size 614400
t-20240517175036-k966t-worker-0:5136:5282 [7] NCCL INFO transport/net.cc:990 -> 5
t-20240517175036-k966t-worker-0:5136:5282 [7] NCCL INFO proxy.cc:679 -> 5
t-20240517175036-k966t-worker-0:5136:5282 [7] NCCL INFO proxy.cc:858 -> 5 [Proxy Thread]
I guess the collective mismatch error ("local size 897024 remote size 614400") causes the hang.
Additionally, my environment is as follows:
deepspeed == 0.14.0
CUDA: 11.8
The output of nvcc -V is:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
After double-checking, I found another error message on one worker, as follows (probably a timeout error):
[E ProcessGroupNCCL.cpp:475] [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96383, OpType=_ALLGATHER_BASE, NumelIn=88200, NumelOut=5644800, Timeout(ms)=7200000) ran for 7200520 milliseconds before timing out.
t-20240517230118-grg2t-worker-1:5123:5271 [0] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5125:5269 [2] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5127:5272 [4] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5129:5270 [6] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5130:5275 [7] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5130:5206 [0] NCCL INFO comm 0x738ea950 rank 15 nranks 64 cudaDev 7 busId e4000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 15] NCCL watchdog thread terminated with exception: [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96383, OpType=_ALLGATHER_BASE, NumelIn=88200, NumelOut=5644800, Timeout(ms)=7200000) ran for 7200520 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
Hi, I also tested this on one node (8 x A100) with a 9B model. The hang still appeared. TAT
Another cause of hanging like this is if the prompt length or generation length is different across the GPUs. This is because ZeRO-Inference is a data-parallel algorithm.
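If that is the cause here, one possible workaround (just a sketch; the tokenizer/model names and the lengths are placeholders) is to make every rank do the same amount of work per generate() call:

# Same prompt length on every rank (left padding for decoder-only models),
# and the same fixed number of generated tokens on every rank.
tokenizer.padding_side = "left"
inputs = tokenizer(prompts, return_tensors="pt", padding="max_length", max_length=512).to(model.device)
outputs = model.generate(**inputs, min_new_tokens=128, max_new_tokens=128)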
Oh, thanks, I get it. Do you have any suggestions about this? I think I've already done left-padding. How can I make sure the output length is the same?
@Coobiw, I think we need to first confirm that different prompt/generation lengths are responsible. Can you force all the ranks to process the exact same prompt?
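For example (a debugging sketch only; prompts, tokenizer and model are placeholders), rank 0 could pick one prompt and broadcast it so every rank runs generate() on identical input:

import torch.distributed as dist

# Broadcast a single prompt from rank 0 to all ranks.
obj = [prompts[0]] if dist.get_rank() == 0 else [None]
dist.broadcast_object_list(obj, src=0)
inputs = tokenizer(obj[0], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)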