Comments (2)
Update: the exact same error happens when initializing meta-llama/Meta-Llama-3-8B-Instruct with only a single GPU:
- model is served with FastAPI + uvicorn
- sequential calls work fine, but parallel calls trigger an assertion:
/lib/python3.10/site-packages/vllm/core/scheduler.py", line 756, in _schedule_default
assert len(running_scheduled.prefill_seq_groups) == 0
OK, so it figures I wasn't using vLLM properly: loading the model with LLM doesn't support async requests (maybe the library should be clearer about that); it should be loaded with AsyncLLMEngine instead.
Related Issues (20)
- When calling the API without passing a system prompt, the output hangs and is all "!!!!!" HOT 1
- [Feature]: load/unload API to run multiple LLMs in a single GPU instance
- [Usage]: How do I get the FP8 scaling factors for KV cache? HOT 3
- [Bug]: ray not work when tp>=2 HOT 8
- [Bug]: Qwen/Qwen2-72B-Instruct 128k server down HOT 2
- [Bug]: RuntimeError: out must have shape (total_q, num_heads, head_size_og) HOT 3
- [Bug]: Failed to import from vllm._C with ImportError('/usr/local/lib/python3.8/dist-packages/vllm/_C.abi3.so: undefined symbol: _ZN5torch7LibraryC1ENS0_4KindESsSt8optionalIN3c1011DispatchKeyEEPKcj') HOT 9
- [Usage]: RAG system HOT 8
- [Bug]: ModuleNotFoundError: No module named 'bitsandbytes' HOT 4
- [Bug]: Illegal memory access in CUTLASS FP8 kernels HOT 1
- [Bug]: Docker image versions 0.5.0 and 0.4.3 don't work with 4090s HOT 2
- [Bug]: Error loading FP8 weights for `gpt_bigcode` model
- [Bug]: Excessive Memory Consumption of Cudagraph on A10G/L4 GPUs HOT 3
- [RFC]: Usage Data Enhancement for v0.5.* HOT 2
- [Bug]: Shutdown error when using multiproc_gpu_executor HOT 1
- [Bug]: The speed of loading the qwen2 72b model, glm-4-9b-chat-1m model in v0.5.0 is much lower than that in v0.4.2. HOT 10
- [Bug]: MoE model, 2-GPU inference, raises AssertionError("Invalid device id") HOT 2
- [Bug]: In vLLM v0.4.3 and later, calling list_loras() in a tensor parallelism situation causes the system to hang. HOT 3
- [Bug]: Very slow execution of from_lora_tensors() when using mp instead of ray as --distributed-executor-backend.
- [Usage]: how to use enable-chunked-prefill? HOT 3