Comments (9)
Hi @daehuikim I use the following command and can see the CUDA op status. Note that I don't have the CUDA toolchain installed. If your environment has the CUDA toolchain, you should be able to see the desired result on your master node.
DS_ACCELERATOR=cuda DS_BUILD_FUSED_ADAM=1 pip install deepspeed
DS_ACCELERATOR=cuda ds_report
You may want to set DS_ACCELERATOR=cuda in your .bashrc
if you wish to build for CUDA by default on the master node. You don't need this env var on the compute nodes, but it will work there as well.
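To make that persistent, the suggestion above could be sketched as a couple of lines appended to ~/.bashrc (a minimal sketch; the exact set of DS_BUILD_* flags you want depends on which ops you need prebuilt):

```shell
# Force DeepSpeed to target CUDA even on a node where no GPU is
# auto-detected (e.g. a CPU-only Slurm master node).
export DS_ACCELERATOR=cuda
# Optionally prebuild the FusedAdam op at `pip install` time.
export DS_BUILD_FUSED_ADAM=1
```

With these exported, plain `pip install deepspeed` and `ds_report` behave like the prefixed commands above.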
(dscpu) 22:07:19|~/machine_learning/DeepSpeed$ DS_ACCELERATOR=cuda ds_report
[2024-05-27 22:07:36,330] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (override)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fp_quantizer ........... [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/akey/anaconda3/envs/dscpu/lib/python3.10/site-packages/torch']
torch version .................... 2.1.0+cu121
deepspeed install path ........... ['/home/akey/anaconda3/envs/dscpu/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.14.2, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... [FAIL] cannot find CUDA_HOME via torch.utils.cpp_extension.CUDA_HOME=None
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 31.18 GB
Hi @daehuikim - are you able to run pip install deepspeed
with no errors? And do you hit any errors when installing other ops?
It appears that your system is being detected as CPU-only, but you have installed torch+cuda. Can you tell us more about which accelerator you are trying to use?
[2024-05-21 11:34:41,285] [WARNING] [real_accelerator.py:162:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
Hello @loadams Thanks for your reply.
pip install deepspeed
works without any errors for me.
I am running my script on my master node, which is CPU-only, using the Slurm scheduler.
Specifically, I activate a conda virtual environment that has the packages, and Slurm propagates the work to the worker nodes, which have multiple GPUs.
That is why I am trying to install deepspeed with the ops prebuilt in my conda virtual environment.
I see - is there a reason that you need to precompile the ops? You should be able to run DeepSpeed on the GPU nodes, and it will detect the GPU and then JIT compile the ops (information here).
@loadams
There is no particular reason for doing this.
I was just following this tutorial about fine-tuning a t5 model.
I found a workaround for the fused adam issue: just adding
torch_adam=true
in the optimizer section of the deepspeed config.
I just wanted to let the contributors know that this (the pre-build installation failing in some environments) happens.
Thanks for replying!
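For readers hitting the same problem, the torch_adam workaround mentioned above could look like the sketch below. Setting "torch_adam": true tells DeepSpeed to use torch's native Adam optimizer instead of the prebuilt FusedAdam CUDA op, so no compiled op is needed. The file name and lr value are illustrative, not from the thread:

```shell
# Write a minimal DeepSpeed config (sketch) that avoids the FusedAdam op.
cat > ds_config_sketch.json <<'EOF'
{
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 1e-4,
      "torch_adam": true
    }
  }
}
EOF
```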
Thanks @daehuikim - that makes sense. Since DeepSpeed currently believes your master node is a CPU environment, it believes it can only run the ops that are installed for CPU. Can you try running with the DS_ACCELERATOR=cuda
env var added before your pip install command? (This may not work since you don't have CUDA installed on that node, but if you do, it lets you specify the type of DeepSpeed accelerator to build for.)
pip uninstall deepspeed
DS_ACCELERATOR=cuda pip install deepspeed
ds_report
produce a result like the one below:
[2024-05-24 09:17:18,762] [WARNING] [real_accelerator.py:162:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2024-05-24 09:17:18,763] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
deepspeed_not_implemented [NO] ....... [OKAY]
deepspeed_ccl_comm ..... [NO] ....... [OKAY]
deepspeed_shm_comm ..... [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['TORCH_INSTALL_PATH']
torch version .................... 2.1.2+cu121
deepspeed install path ........... ['DEEPSPEED_INSTALL_PATH']
deepspeed info ................... 0.14.2+cu118torch2.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 2.0
shared memory (/dev/shm) size .... 125.67 GB
@loadams I tried the recommended variable and got the same result.
DS_ACCELERATOR=cuda ds_report
@delock Your recommendation made everything work perfectly! Thanks for the nice advice!
I got the same results as you. Thanks again :)
Thanks for clarifying the env var use, @delock!