Comments (9)
@Coobiw, you can use the GatheredParameters context manager, which will automatically gather the parameters within the context and release them on exit. You can see a simple example of using it to compute a moving average of parameters here.
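For illustration, a minimal sketch of that pattern (the model and lm_head names here are just placeholders, not from your script):

import deepspeed

# Gather the ZeRO-3 partitioned parameters of one submodule; they are
# released (re-partitioned) automatically when the context exits.
with deepspeed.zero.GatheredParameters(list(model.lm_head.parameters())):
    print(model.lm_head.weight.shape)  # full, un-partitioned shape inside the context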
Hi, I've tried this before, but the program gets stuck. How can I debug this?
Also, I want to know whether this is because I use a 30B+ LLM and ZeRO-3 inference is simply very slow?
import deepspeed

if self.zero_stage == 3:
    # Only gather parameters that ZeRO-3 has partitioned away from this rank.
    params_to_fetch = [
        p for p in self.model.parameters()
        if hasattr(p, 'ds_id') and p.ds_status == deepspeed.zero.partition_parameters.ZeroParamStatus.NOT_AVAILABLE
    ]
    should_gather_param = len(params_to_fetch) > 0
    # Gather the full parameters for the duration of evaluation, then release on exit.
    with deepspeed.zero.GatheredParameters(params_to_fetch, enabled=should_gather_param):
        self.model.eval()
        evaluation()  # contains model.generate()
@Coobiw, can you share your full script to help us repro on our side?
Is this a dense or MoE model?
In terms of debugging, can you use prints to pinpoint where it hangs?
Also, can you try to repro on a single GPU so that you can use pdb for debugging? You can try two options for this:
- Enable CPU/NVMe offloading to fit the model (see the config sketch after this list), or
- Use a smaller model
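For the offloading option, a minimal sketch of a ZeRO-3 config with parameter offload to CPU (the values are illustrative, not taken from your setup):

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}
# model_engine, *_ = deepspeed.initialize(model=model, config=ds_config)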
Sorry, it is inconvenient to share the whole code, but I will try my best to provide more information. It is a dense model. I've also tried the script with a ~9B model on A100 80GB, and a similar hang appeared.
I think it may be a multi-GPU communication problem? There is no explicit error, only a warning in model.generate, which is related to NCCL.
/root/miniconda3/lib/python3.9/site-packages/transformers/generation/configuration_utils.py:497: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
warnings.warn(
t-20240517175036-k966t-worker-0:5136:5282 [7] ib_plugin.c:798 NCCL WARN NET/IB : req 0/1 tag 7 peer 172.25.40.117<36987> collective mismatch error, local size 897024 remote size 614400
t-20240517175036-k966t-worker-0:5136:5282 [7] NCCL INFO transport/net.cc:990 -> 5
t-20240517175036-k966t-worker-0:5136:5282 [7] NCCL INFO proxy.cc:679 -> 5
t-20240517175036-k966t-worker-0:5136:5282 [7] NCCL INFO proxy.cc:858 -> 5 [Proxy Thread]
I guess the collective mismatch error ("local size 897024 remote size 614400") causes the hang.
Additionally, my environment is as follows:
deepspeed == 0.14.0
CUDA: 11.8
The output of nvcc -V is:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
After double-checking, I found another error message on one worker, as follows (probably a timeout error):
[E ProcessGroupNCCL.cpp:475] [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96383, OpType=_ALLGATHER_BASE, NumelIn=88200, NumelOut=5644800, Timeout(ms)=7200000) ran for 7200520 milliseconds before timing out.
t-20240517230118-grg2t-worker-1:5123:5271 [0] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5125:5269 [2] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5127:5272 [4] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5129:5270 [6] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5130:5275 [7] NCCL INFO [Service thread] Connection closed by localRank 7
t-20240517230118-grg2t-worker-1:5130:5206 [0] NCCL INFO comm 0x738ea950 rank 15 nranks 64 cudaDev 7 busId e4000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 15] NCCL watchdog thread terminated with exception: [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=96383, OpType=_ALLGATHER_BASE, NumelIn=88200, NumelOut=5644800, Timeout(ms)=7200000) ran for 7200520 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
Hi, I also tested this on one node (8 x A100) with a 9B model. The hang still appeared. TAT
Another cause of hanging like this is if the prompt length or generation length is different across the GPUs. This is because ZeRO-Inference is a data-parallel algorithm.
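If that is the cause here, one possible workaround (just a sketch; the tokenizer/model names and the lengths are placeholders) is to make every rank do the same amount of work per generate() call:

# Same prompt length on every rank (left padding for decoder-only models),
# and the same fixed number of generated tokens on every rank.
tokenizer.padding_side = "left"
inputs = tokenizer(prompts, return_tensors="pt", padding="max_length", max_length=512).to(model.device)
outputs = model.generate(**inputs, min_new_tokens=128, max_new_tokens=128)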
Oh, thanks, I get it. Do you have any suggestions about this? I think I've already done left-padding. How can I make sure the output length is the same?
@Coobiw, I think we need to first confirm that different prompt/generation lengths are responsible. Can you force all the ranks to process the exact same prompt?
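For example (a debugging sketch only; prompts, tokenizer and model are placeholders), rank 0 could pick one prompt and broadcast it so every rank runs generate() on identical input:

import torch.distributed as dist

# Broadcast a single prompt from rank 0 to all ranks.
obj = [prompts[0]] if dist.get_rank() == 0 else [None]
dist.broadcast_object_list(obj, src=0)
inputs = tokenizer(obj[0], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)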