Comments (3)
Thanks for reporting. I managed to recreate the error.
What is happening is that this request generates so many tokens that we run out of memory in the KV cache!
When this happens, we swap the KVs to CPU RAM and wait until enough GPU memory frees up to accommodate the request. But since this single request fills the entire KV cache by itself, that will never happen. Thus, the request is "stuck" in the SWAPPED state permanently.
INFO 05-01 12:12:13 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 1 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 71.9%
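For intuition, here is a back-of-envelope estimate of why this particular request can never fit (a sketch, not from the engine: the Llama-3-8B shape numbers are public, but how much memory is left for the cache is an assumption):

# Back-of-envelope estimate (sketch): worst-case KV-cache demand of the
# failing request vs. what a 24 GiB card can hold.
# Meta-Llama-3-8B KV shape per token: 32 layers x 8 KV heads x head_dim 128,
# K and V, fp16 (2 bytes each).
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 32, 8, 128, 2

bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES  # 131072 B = 128 KiB

n, max_tokens = 23, 7168
worst_case_tokens = n * max_tokens  # 164,864; prompt KVs are shared across the n samples
demand_gib = worst_case_tokens * bytes_per_token / 2**30

print(f"worst-case KV demand: {demand_gib:.1f} GiB")  # ~20.1 GiB

# The fp16 weights alone take ~16 GiB, so the KV cache on a 24 GiB card is
# only a few GiB (the exact size depends on --gpu-memory-utilization). The
# request can never fit, so once swapped out it is never swapped back in.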
We will discuss how we want to handle this case.
I retried after terminating all containers that were using the GPU, which also ended the process with PID 18761.
❯ nvidia-smi
Wed May  1 09:10:23 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.76                 Driver Version: 550.76         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:01:00.0 Off |                  N/A |
|  0%   32C    P8             21W /  420W |   18452MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     41908      C   python3                                     18438MiB |
+-----------------------------------------------------------------------------------------+
The behavior remains the same.
~/Desktop/GitHub
❯ time curl -s http://mnemosyne.local:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct", "n": 22, "temperature": 0.3, "min_p": 0.1, "max_tokens": "7168",
"messages": [
{
"role": "user",
"content": "Tell me a long story about llamas. Then analyze the story, focusing on the stylistic devices."
}
]
}' > response-batch-22-sampling.json
curl -s http://mnemosyne.local:8000/v1/chat/completions -H -d > 0.01s user 0.01s system 0% cpu 36.651 total
While the request with "n": 22 works again, the following request with "n": 23 fails.
~/Desktop/GitHub 36s
❯ time curl -s http://mnemosyne.local:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct", "n": 23, "temperature": 0.3, "min_p": 0.1, "max_tokens": "7168",
"messages": [
{
"role": "user",
"content": "Tell me a long story about llamas. Then analyze the story, focusing on the stylistic devices."
}
]
}' > response-batch-23-sampling.json
From here, the request with "n": 23 hangs. The engine becomes unresponsive, does not log anything suspicious, and needs a restart. New requests are accepted, but no text is ever generated.
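Until the engine handles this case, one conceivable client-side workaround is to reject requests whose worst-case token footprint exceeds the KV cache. This is a hypothetical sketch (fits_kv_cache and the 40,000-token capacity are illustrative inventions; the real capacity should be read from the "# GPU blocks" line in the vLLM startup log):

# Hypothetical client-side guard (not part of vLLM): refuse requests whose
# worst-case KV footprint exceeds the GPU KV cache, so they cannot end up
# permanently SWAPPED.
KV_CACHE_TOKENS = 40_000  # assumption; read "# GPU blocks: N" from the vLLM
                          # startup log and multiply by the block size (16)

def fits_kv_cache(n: int, max_tokens: int, prompt_tokens: int) -> bool:
    """Worst case: the shared prompt plus n fully generated samples."""
    return prompt_tokens + n * max_tokens <= KV_CACHE_TOKENS

# Pessimistic by design: samples that stop early at EOS use fewer than
# max_tokens KV slots, which is why "n": 22 happened to succeed above.
assert not fits_kv_cache(n=23, max_tokens=7168, prompt_tokens=30)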
Thank you for your reproduction and feedback.