Comments (14)
The installation scripts perform sudo -H pip installs, which install system-wide. I replaced those with normal pip installs, and everything installed into the current environment without problems.
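Before dropping sudo, it can help to confirm which environment a plain pip install would land in. The following is a minimal sketch using only the standard library; the function name `describe_environment` is made up for illustration and is not part of DeepSpeed or its install scripts.

```python
import os
import sys
import sysconfig


def describe_environment():
    """Report which Python environment is active and where packages install."""
    return {
        # Interpreter that "python -m pip" would use.
        "executable": sys.executable,
        # Root of the active environment (the conda env or venv directory).
        "prefix": sys.prefix,
        # True inside a venv/virtualenv; conda envs are standalone installs,
        # so this is typically False for them.
        "in_venv": sys.prefix != sys.base_prefix,
        # Set by "conda activate"; None when no conda env is active.
        "conda_prefix": os.environ.get("CONDA_PREFIX"),
        # Default directory for pure-Python packages under this interpreter.
        "site_packages": sysconfig.get_paths()["purelib"],
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If the reported prefix is not the environment you expect, the pip on your PATH probably belongs to a different interpreter.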
from deepspeed.
We now have a conda package uploaded and we appreciate any feedback!
We have versions compiled for cudatoolkit versions 10.0 and 10.1. To install along with pytorch and other dependencies that are in the conda-forge channel:
conda install deepspeed cudatoolkit=10.1 -c deepspeed -c pytorch -c conda-forge
from deepspeed.
Hi there! We're in the process of rewriting our installation scripts (which were previously only used within Docker containers) and hope to also release a conda package shortly. These sorts of issues should be fixed at that point.
from deepspeed.
When I installed it, I made that change in the install script (for DeepSpeed, Apex, and the requirements). However, DeepSpeed still would not install to the right environment location. Looking at the installation a little more, this seemed more likely to be an issue with the wheel created for DeepSpeed in the install.sh file. I was able to get it working by forcing pip to install DeepSpeed into the correct location (the same location that Apex was correctly installed to).
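The comment does not say how pip was forced to the right location. One possible approach, sketched here as an assumption, is to invoke pip through the target environment's own interpreter and, if needed, pass pip's `--target` option. The helper name `pip_install_command` and the wheel path are hypothetical.

```python
import sys


def pip_install_command(package, target=None, python=sys.executable):
    """Build a pip command bound to a specific interpreter.

    Running pip as "<python> -m pip" installs into that interpreter's
    environment instead of whichever "pip" happens to be first on PATH.
    """
    command = [python, "-m", "pip", "install", package]
    if target is not None:
        # --target makes pip install into an explicit directory.
        command += ["--target", str(target)]
    return command


if __name__ == "__main__":
    # Hypothetical wheel path, for illustration only.
    print(" ".join(pip_install_command("path/to/deepspeed.whl")))
```

The resulting list can be passed to `subprocess.run` once the printed command looks right.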
from deepspeed.
The repo's install.sh should respect the environment by default now (sudo is opt-in). Please let me know if the issue persists.
from deepspeed.
Using the conda install, deepspeed shows up when I run conda list, but it is not available when trying to import it in python.
from deepspeed.
Hi @kleingeo, thanks for the report. I can see that on my end now as well. Not sure what happened... I'm looking into it.
Interestingly, the deepspeed entry point looks fine and is found in my $PATH after installation. And I can see the DeepSpeed library installed under ~/miniconda3/envs/test/lib/python3.7/site-packages/deepspeed/ (where test is my conda environment name), and also see the expected ~/miniconda3/lib/python3.7/site-packages in my sys.path... so I'm not sure why the deepspeed library is not importable.
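A quick way to narrow down this kind of problem is to ask the interpreter whether it can locate the module and whether a given directory is really on sys.path. This is a minimal sketch with the standard library only; `diagnose_import` is a made-up helper name, and it only handles top-level module names.

```python
import importlib.util
import sys
from pathlib import Path


def diagnose_import(module_name, expected_dir=None):
    """Report whether a top-level module is findable and whether a
    directory is on sys.path for the running interpreter."""
    spec = importlib.util.find_spec(module_name)
    report = {
        "found": spec is not None,
        # File the module would be loaded from, if any.
        "origin": spec.origin if spec is not None else None,
    }
    if expected_dir is not None:
        expected = Path(expected_dir).expanduser().resolve()
        # Compare normalized paths; skip empty sys.path entries.
        report["expected_dir_on_sys_path"] = any(
            Path(entry).expanduser().resolve() == expected
            for entry in sys.path
            if entry
        )
    return report


if __name__ == "__main__":
    # Directory taken from the comment above; adjust to your environment.
    print(diagnose_import(
        "deepspeed",
        "~/miniconda3/envs/test/lib/python3.7/site-packages",
    ))
```

Running it with the interpreter from the conda environment would show whether that environment's own site-packages directory is being searched at all.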
from deepspeed.
Yes, I remember having this problem a lot when trying to install deepspeed normally with the install.sh file. With a normal Python virtual env it works, but for some reason with Conda it consistently tries to install to another location. The only thing I found to work (when using conda) was to force pip to use the location where the install.sh file installs Apex.
from deepspeed.
@ShadenSmith, it is easier to install deepspeed via your conda command than via 'install.sh' (which is prone to fail). However, only an early version of deepspeed exists in the deepspeed channel:
conda search -f deepspeed -c deepspeed
Loading channels: done
deepspeed 0.1.0 py3.6_cuda10.0.130_0 deepspeed
deepspeed 0.1.0 py3.6_cuda10.1.243_0 deepspeed
deepspeed 0.1.0 py3.7_cuda10.0.130_0 deepspeed
deepspeed 0.1.0 py3.7_cuda10.1.243_0 deepspeed
When do you plan to release a new conda version of deepspeed with ZeRO-2?
Thanks
from deepspeed.
Hi @jdongca2003, I have some time to dedicate to DeepSpeed's conda infrastructure now that the v0.2 release is complete. I'm looking at improved packages (per the above bug report) and automating the package build process.
from deepspeed.
@ShadenSmith Thanks. I tested your conda deepspeed package on https://github.com/microsoft/DeepSpeedExamples/tree/master/cifar.
It failed on a Tesla K80 and I got the following error message:
THCudaCheck FAIL file=csrc/fused_adam_cuda_kernel.cu line=135 error=209 : no kernel image is available for execution on the device
Traceback (most recent call last):
File "cifar10_deepspeed.py", line 178, in
model_engine.step()
File "/home/dong/miniconda3/envs/deepspeed/lib/python3.7/site-packages/deepspeed/pt/deepspeed_light.py", line 692, in step
self.optimizer.step()
File "/home/dong/miniconda3/envs/deepspeed/lib/python3.7/site-packages/apex/optimizers/fused_adam.py", line 146, in step
group['weight_decay'])
RuntimeError: cuda runtime error (209) : no kernel image is available for execution on the device at csrc/fused_adam_cuda_kernel.cu:135
But it worked well on a Tesla P4. DeepSpeed probably does not support older GPU architectures.
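"No kernel image is available" generally means the compiled extension contains no code for the GPU's compute capability. The sketch below illustrates the CUDA compatibility rule in plain Python. The architecture list is hypothetical, since the thread does not say which architectures the conda package was built for; the function names are also made up for illustration.

```python
def parse_arch(arch):
    """Split a name such as "sm_70" or "compute_60" into (kind, major, minor)."""
    kind, _, digits = arch.partition("_")
    return kind, int(digits[:-1]), int(digits[-1])


def can_run(capability, compiled_archs):
    """Return True if a device with this (major, minor) capability can run
    a binary built for the given architectures."""
    major, minor = capability
    for arch in compiled_archs:
        kind, arch_major, arch_minor = parse_arch(arch)
        if kind == "sm":
            # Native code runs on the same major version with an
            # equal or higher minor version.
            if major == arch_major and minor >= arch_minor:
                return True
        elif kind == "compute":
            # PTX can be JIT-compiled for any device at or above
            # that capability.
            if (major, minor) >= (arch_major, arch_minor):
                return True
    return False


# Compute capabilities of the GPUs mentioned in this thread.
DEVICES = {"Tesla K80": (3, 7), "Tesla P4": (6, 1), "Tesla V100": (7, 0)}

# Hypothetical build targets; NOT taken from the actual package.
ASSUMED_ARCHS = ["sm_60", "sm_61", "sm_70"]

if __name__ == "__main__":
    for name, capability in DEVICES.items():
        print(name, can_run(capability, ASSUMED_ARCHS))
```

On a machine with PyTorch and a GPU, `torch.cuda.get_device_capability()` returns the (major, minor) tuple for the current device, which can be passed as `capability`.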
from deepspeed.
On a V100, the same THCudaCheck error happens!
from deepspeed.
Hi @jdongca2003,
I encountered the same problem you describe when using a Tesla K80, and I found that it works normally on a Tesla V100. Have you solved this problem?
@ShadenSmith Could you please explain why this happens? Does deepspeed not support the Tesla K80?
Thanks.
from deepspeed.
Hi, closing this issue as it is stale with respect to CUDA/Torch/DeepSpeed versions. However, we now provide an environment.yml, located at the root of our repo, for ease of building in conda!
from deepspeed.