Comments (13)
Thank you for this. I believe this was the issue. I have been using nn.DataParallel and should upgrade to the distributed method.
from deepspeed.
Thanks for reporting this. We recently changed to auto-initialize the distributed backend but forgot to update this tutorial. You should be able to get around this by setting dist_init_required=False like you mention.
I tried that, but there are parts that don't work, specifically the _initialize_parameter_parallel_groups step during initialization.
Are you running the cifar example with the deepspeed launcher, e.g., deepspeed cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json? I seem to be able to recreate your issue if I run with python cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json, but it works if I use deepspeed. Can you try that?
Since DeepSpeed and ZeRO are intended to run with >1 GPUs, a lot of our focus has been on those environments. However, we should probably support running in non-distributed mode without our deepspeed launcher for single-GPU debugging.
I am using DeepSpeed within Python with just an import, so not using the DeepSpeed launcher. My intent is to use it with multiple GPUs, but not on a distributed network; rather, on a single node.
Gotcha, you can still use the deepspeed launcher even if you are not running on multiple nodes. It will attempt to launch on all local GPUs (it will discover how many are available) by default in this case. You can also specify the number of GPUs you want to launch on your local node via --num_gpus.
That won't be the easiest option, as I'm trying to use it in a pre-existing modelling pipeline I have already developed.
Does your existing modelling pipeline handle launching processes across multiple GPUs? If so, you'll need to satisfy the requirements of torch.distributed launching to get this to work. We did this recently to support mpirun launching (instead of our deepspeed launcher); you can see the variables that are needed here: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/pt/deepspeed_light.py#L209-L213
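For reference, a torch.distributed-style launch typically expects a few environment variables to be set in each process before initialization. This is a minimal sketch only: the exact set used by DeepSpeed is the one in the linked code, and the MASTER_ADDR/MASTER_PORT values below are placeholders for a single-process, single-node run.

```python
import os

# Variables a torch.distributed-style launcher typically sets per process.
# Values here are placeholders for a one-process, single-node setup.
env = {
    "RANK": "0",                 # global rank of this process
    "LOCAL_RANK": "0",           # rank of this process on its node
    "WORLD_SIZE": "1",           # total number of processes
    "MASTER_ADDR": "127.0.0.1",  # rendezvous host (placeholder)
    "MASTER_PORT": "29500",      # rendezvous port (placeholder)
}
os.environ.update(env)

print(os.environ["WORLD_SIZE"])  # -> 1
```

A custom launcher would set these per spawned process (with RANK/LOCAL_RANK varying) before any distributed initialization runs.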
I thought that torch.distributed was meant more for running a model across multiple network-connected devices, rather than running the model on multiple cards in a single box. Going through the documentation, I see that this may be a misconception (is that correct?). I will play around and look at the different environment variables necessary to have torch.distributed work within a single machine. Maybe this was the problem I was having.
Edit: I'm still unclear why MPI would be better than the NCCL backend. Also, from the documentation, I thought that DeepSpeed should be able to work with a single GPU (i.e., someone wants the benefits of APEX or other tools in place). Are they still required to set up a distributed process, even for a single-GPU task?
We can support running 1-gpu without the DeepSpeed launcher, it's on our roadmap now. I'll be sure to update this thread once this support is added.
However, if you're going to want to run multi-GPU (single node), I highly recommend using torch.distributed. The old way of running multi-GPU single node was nn.DataParallel; however, we have found significant performance benefits from using torch.distributed instead. One reason is that torch.distributed uses separate processes per GPU instead of sharing a single process across GPUs.
DeepSpeed uses torch.distributed with an NCCL back-end for comm collectives. We have recently added support for using MPI simply for launching processes, but in this case it still uses the NCCL torch.distributed back-end for all communication during training.
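The process-per-GPU model described above can be illustrated with plain Python multiprocessing. This is only a hedged sketch of the launching pattern, not DeepSpeed's actual code: in a real torch.distributed job, each spawned process would call init_process_group and drive exactly one GPU; here each worker simply reports its rank. The "fork" start method is assumed (POSIX).

```python
import multiprocessing as mp

def worker(rank):
    # In a real torch.distributed job, each process would call
    # dist.init_process_group(backend="nccl", rank=rank, world_size=N)
    # and drive exactly one GPU; here the worker just reports its rank.
    return rank

def launch(world_size):
    # Mimics a launcher: one OS process per GPU rank ("fork" assumes POSIX).
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=world_size) as pool:
        return sorted(pool.map(worker, range(world_size)))

print(launch(2))  # -> [0, 1]
```

The key design point is isolation: each rank owns its process (and its GPU), so there is no Python-level contention of the kind nn.DataParallel suffers from when one process drives all GPUs.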
Feel free to re-open if needed. Otherwise I'll update this thread when we have 1-GPU support; it's probably more useful for testing in certain scenarios, though.
@jeffra, at the very least if 1 gpu is not supported, could you please bail with a user-friendly error saying that non-multi-gpu is not supported?
Currently it fails with:
AssertionError: DeepSpeed requires integer command line parameter --local_rank
which is not documented anywhere as a user-side parameter.
$ CUDA_VISIBLE_DEVICES=1 deepspeed ... -deepspeed --deepspeed_config ds_config.json
[...]
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/__init__.py", line 109, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 150, in __init__
self._do_args_sanity_check(args)
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 527, in _do_args_sanity_check
assert hasattr(args, 'local_rank') and type(args.local_rank) == int, \
AssertionError: DeepSpeed requires integer command line parameter --local_rank
My first card is an RTX 3090, which doesn't seem to work with deepspeed (it bails on an NCCL error), so I tried the second, older card only as a sanity check, and then had to hunt down why it was failing with this error.
Thank you!
Well, this proved to be unrelated to this issue. One needs to forward --local_rank to deepspeed.initialize's args; in the application I am trying to integrate deepspeed into, it was gobbled up by another consumer of argparse.
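One way to avoid that kind of collision is to pull out the DeepSpeed-relevant flags with parse_known_args, so another argparse consumer in the pipeline cannot swallow --local_rank. This is a minimal sketch using plain argparse; the helper name parse_deepspeed_args is hypothetical, not a DeepSpeed API.

```python
import argparse

def parse_deepspeed_args(argv):
    # parse_known_args extracts only the flags this parser knows about
    # and returns everything else untouched for the application's own parser.
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0,
                        help="set automatically by the deepspeed launcher")
    args, remaining = parser.parse_known_args(argv)
    return args, remaining

args, rest = parse_deepspeed_args(["--local_rank", "1", "--my_app_flag", "x"])
print(args.local_rank, rest)  # -> 1 ['--my_app_flag', 'x']
```

The returned args (with its integer local_rank) can then be handed to DeepSpeed initialization, while the remaining argv goes to the pipeline's existing parser.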