Comments (2)
Update: the exact same error happens when initializing meta-llama/Meta-Llama-3-8B-Instruct with only a single GPU:
- model is served with FastAPI + uvicorn
- sequential calls work fine, but parallel calls trigger an assertion:
/lib/python3.10/site-packages/vllm/core/scheduler.py", line 756, in _schedule_default
assert len(running_scheduled.prefill_seq_groups) == 0
OK, so it figures I wasn't using vLLM properly: loading the model with LLM doesn't support async requests (maybe the library should be clearer about that); it should be loaded with AsyncLLMEngine instead.
Related Issues (20)
- When calling the API without passing a system prompt, the output hangs and is all "!!!!!" HOT 1
- [Feature]: load/unload API to run multiple LLMs in a single GPU instance
- [Usage]: How do I get the FP8 scaling factors for KV cache? HOT 3
- [Bug]: ray not work when tp>=2 HOT 8
- [Bug]: Qwen/Qwen2-72B-Instruct 128k server down HOT 2
- [Bug]: RuntimeError: out must have shape (total_q, num_heads, head_size_og) HOT 3
- [Bug]: Failed to import from vllm._C with ImportError('/usr/local/lib/python3.8/dist-packages/vllm/_C.abi3.so: undefined symbol: _ZN5torch7LibraryC1ENS0_4KindESsSt8optionalIN3c1011DispatchKeyEEPKcj') HOT 9
- [Usage]: RAG system HOT 8
- [Bug]: ModuleNotFoundError: No module named 'bitsandbytes' HOT 4
- [Bug]: Illegal memory access in CUTLASS FP8 kernels HOT 1
- [Bug]: Docker image versions 0.5.0 and 0.4.3 don't work with 4090s HOT 2
- [Bug]: Error loading FP8 weights for `gpt_bigcode` model
- [Bug]: Excessive Memory Consumption of Cudagraph on A10G/L4 GPUs HOT 3
- [RFC]: Usage Data Enhancement for v0.5.* HOT 2
- [Bug]: Shutdown error when using multiproc_gpu_executor HOT 1
- [Bug]: The speed of loading the qwen2 72b model, glm-4-9b-chat-1m model in v0.5.0 is much lower than that in v0.4.2. HOT 10
- [Bug]: MoE model, 2-GPU inference, raises AssertionError("Invalid device id") HOT 2
- [Bug]: In vLLM v0.4.3 and later, calling list_loras() in a tensor parallelism situation causes the system to hang. HOT 3
- [Bug]: Very slow execution of from_lora_tensors() when using mp instead of ray as --distributed-executor-backend.
- [Usage]: how to use enable-chunked-prefill? HOT 3