Comments (2)
Although this doesn't solve the bug, if you would like to get things working and stop vllm from trying to use your integrated Radeon Graphics, you can set CUDA_VISIBLE_DEVICES=-1. I tried setting --device=cpu as well, and it is working correctly for me.
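For reference, here is a minimal sketch of that workaround through the Python API. The model name is just a placeholder, and I'm assuming the `device` keyword of the `LLM` constructor mirrors the `--device` CLI flag; check the docs for your version.

```python
import os

# Hide all CUDA devices (including the integrated Radeon) before vllm
# initializes. This is a workaround, not a fix for the underlying bug.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

from vllm import LLM

# Equivalent of passing --device=cpu on the command line (assumption:
# the LLM constructor forwards `device` to the engine arguments).
llm = LLM(model="facebook/opt-125m", device="cpu")
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```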
+1 to this issue. It seems the error occurs when you install vllm without the CPU build. Currently the attention backend is chosen based on whether the installed version of vllm has the cpu suffix or not (https://github.com/vllm-project/vllm/blob/main/vllm/attention/selector.py#L84 -> https://github.com/vllm-project/vllm/blob/main/vllm/utils.py#L131). This means that even when you specify the device to be cpu, vllm tries to load one of the other attention backends.
#4962 is a potential solution (effectively passing the CPU attention backend flag down from the worker).
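To make the failure mode concrete, here is a hedged paraphrase of the selection logic those two links point at. The function and backend names follow the linked files, but this is a simplified sketch; the exact code may differ by version.

```python
from importlib.metadata import PackageNotFoundError, version

def is_cpu() -> bool:
    # vllm/utils.py: the check looks at the *installed wheel's* version
    # string (e.g. "0.4.2+cpu"), not at the device the user requested.
    try:
        return "cpu" in version("vllm")
    except PackageNotFoundError:
        return False

def select_backend(device: str) -> str:
    # vllm/attention/selector.py (simplified): the CPU attention backend
    # is picked only when the wheel carries the +cpu suffix, so a regular
    # GPU install with --device=cpu still walks the GPU backend path.
    if is_cpu():
        return "TORCH_SDPA"
    return "FLASH_ATTN"  # a GPU backend, which then fails to load on CPU
```

On a default (non-cpu) wheel, `select_backend("cpu")` never returns the CPU backend, which matches the behavior described above.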
Related Issues (20)
- [Bug]: Failed to import from vllm._C with ImportError('/usr/local/lib/python3.8/dist-packages/vllm/_C.abi3.so: undefined symbol: _ZN5torch7LibraryC1ENS0_4KindESsSt8optionalIN3c1011DispatchKeyEEPKcj') HOT 9
- [Usage]: RAG system HOT 8
- [Bug]: ModuleNotFoundError: No module named 'bitsandbytes' HOT 4
- [Bug]: Illegal memory access in CUTLASS FP8 kernels HOT 1
- [Bug]: Docker image versions 0.5.0 and 0.4.3 don't work with 4090s HOT 2
- [Bug]: Error loading FP8 weights for `gpt_bigcode` model
- [Bug]: Excessive Memory Consumption of Cudagraph on A10G/L4 GPUs HOT 3
- [RFC]: Usage Data Enhancement for v0.5.* HOT 2
- [Bug]: Shutdown error when using multiproc_gpu_executor HOT 1
- [Bug]: Loading the qwen2 72b and glm-4-9b-chat-1m models in v0.5.0 is much slower than in v0.4.2. HOT 10
- [Bug]: MoE model with 2-GPU inference raises AssertionError("Invalid device id") HOT 2
- [Bug]: In vLLM v0.4.3 and later, calling list_loras() in a tensor parallelism situation causes the system to hang. HOT 2
- [Bug]: Very slow execution of from_lora_tensors() when using mp instead of ray as --distributed-executor-backend.
- [Usage]: how to use enable-chunked-prefill? HOT 2
- [Performance]: How to use vllm.attention.ops.triton_flash_attention to replace the flash_attn package HOT 1
- [Bug]: Performance : very slow inference for Mixtral 8x7B Instruct FP8 on H100 with 0.5.0 and 0.5.0.post1 HOT 2
- [Bug]: CUDA illegal memory access error when `enable_prefix_caching=True` HOT 4
- [Bug]: vLLM 0.3.0 produces weird output
- [Feature]: LoRA support for Mixtral GPTQ and AWQ HOT 1
- [Feature]: asymmetric tensor parallel