Comments (5)
Following the method in #3929 (comment) solved it for me:
cd /home/asdf/.local/lib/python3.10/site-packages/torch/lib
ln -s /usr/local/cuda/lib64/libcurand.so .
There is no lib64 under /home/enwei/anaconda3/envs/llama, so I copied everything in lib to lib64, and that solved the problem for me.
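For anyone who would rather script the symlink workaround above, here is a minimal sketch. The helper names `find_libcurand` and `link_into` are hypothetical (they are not part of torch or deepspeed), and the directories passed in are assumptions about your own machine.

```python
from pathlib import Path


def find_libcurand(search_dirs):
    """Return the first libcurand.so* file found in search_dirs, or None."""
    for d in search_dirs:
        p = Path(d)
        if not p.is_dir():
            continue  # skip directories that do not exist on this machine
        matches = sorted(p.glob("libcurand.so*"))
        if matches:
            return matches[0]
    return None


def link_into(target_dir, lib_path, name="libcurand.so"):
    """Create target_dir/name as a symlink to lib_path unless it already exists."""
    link = Path(target_dir) / name
    if link.exists() or link.is_symlink():
        return link  # leave any existing file or link untouched
    link.symlink_to(lib_path)
    return link
```

A call such as `link_into(torch_lib_dir, find_libcurand(["/usr/local/cuda/lib64"]))` would mirror the `ln -s` command, where `torch_lib_dir` stands for your own `site-packages/torch/lib` directory. Check that `find_libcurand` did not return `None` before linking.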
from deepspeed.
Hi @hekaijie123,
Can you please share your environment variables, specifically LIBRARY_PATH and LD_LIBRARY_PATH?
We've found that explicitly setting LIBRARY_PATH to point to lib64 resolves this exact issue, e.g.:
export CUDA_HOME="/usr/local/cuda-12.5"
export LIBRARY_PATH="/usr/local/cuda-12.5/lib64:$LIBRARY_PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-12.5/lib64:$LD_LIBRARY_PATH"
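To check whether those variables actually point at a directory that contains libcurand, something like the following sketch may help. The function name `dirs_with_library` is hypothetical, and the check only looks for files whose names start with `libcurand.so`. It does not verify that the library matches your CUDA version.

```python
import os
from pathlib import Path


def dirs_with_library(env_var, lib_prefix="libcurand.so", environ=None):
    """List directories in a PATH-style variable that contain lib_prefix*."""
    environ = os.environ if environ is None else environ
    hits = []
    for entry in environ.get(env_var, "").split(os.pathsep):
        if not entry:
            continue  # empty entries are skipped in this sketch
        d = Path(entry)
        if d.is_dir() and any(d.glob(lib_prefix + "*")):
            hits.append(str(d))
    return hits


if __name__ == "__main__":
    for var in ("LIBRARY_PATH", "LD_LIBRARY_PATH"):
        found = dirs_with_library(var)
        print(var, "->", found or "no directory contains libcurand.so")
```

If LIBRARY_PATH comes back empty, the exports above (adjusted to your own CUDA install path) are the first thing to try.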
same problem.
Have you solved it?
Same problem! Why?
same problem
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/data/miniconda3/envs/env-novelai/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
FAILED: cpu_adam.so
c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/data/miniconda3/envs/env-novelai/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
/usr/bin/ld: cannot find -lcurand
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/data/miniconda3/envs/env-novelai/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2107, in _run_ninja_build
subprocess.run(
File "/data/miniconda3/envs/env-novelai/lib/python3.10/subprocess.py", line 524, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
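The link command in the log passes only one `-L` directory (the torch lib folder), so `-lcurand` has to be found either there or in the linker's other search locations such as LIBRARY_PATH. As a rough sketch for inspecting a failing command from your own log, the hypothetical helper `parse_link_command` below splits out the `-L` directories and `-l` libraries:

```python
import shlex


def parse_link_command(command):
    """Return (library_dirs, libraries) taken from the -L and -l flags."""
    tokens = shlex.split(command)
    lib_dirs = [t[2:] for t in tokens if t.startswith("-L") and len(t) > 2]
    libs = [t[2:] for t in tokens if t.startswith("-l") and len(t) > 2]
    return lib_dirs, libs


if __name__ == "__main__":
    # Command copied from the build log above.
    cmd = (
        "c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand "
        "-L/data/miniconda3/envs/env-novelai/lib/python3.10/site-packages/torch/lib "
        "-lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so"
    )
    dirs, libs = parse_link_command(cmd)
    print("search dirs:", dirs)
    print("libraries:", libs)
```

If none of the listed directories contains libcurand.so, the symlink or LIBRARY_PATH approaches from the earlier comments are the ones reported to work in this thread.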