Comments (3)
Thanks for reporting. I managed to recreate the error.
What is happening is that this request generates so many tokens that we run out of memory in the KV cache!
When this happens, we swap the KVs to CPU RAM and wait until enough GPU memory frees up to accommodate the request. But since this single request fills the entire KV cache by itself, that will never happen. Thus, the request is "stuck" in the SWAPPED state permanently.
INFO 05-01 12:12:13 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 1 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 71.9%
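For intuition, here is a back-of-envelope estimate of why this particular request can never fit (a sketch, not from the engine: the Llama-3-8B shape numbers are public, but how much memory is left for the cache is an assumption):

# Back-of-envelope estimate (sketch): worst-case KV-cache demand of the
# failing request vs. what a 24 GiB card can hold.
# Meta-Llama-3-8B KV shape per token: 32 layers x 8 KV heads x head_dim 128,
# K and V, fp16 (2 bytes each).
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 32, 8, 128, 2

bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES  # 131072 B = 128 KiB

n, max_tokens = 23, 7168
worst_case_tokens = n * max_tokens  # 164,864; prompt KVs are shared across the n samples
demand_gib = worst_case_tokens * bytes_per_token / 2**30

print(f"worst-case KV demand: {demand_gib:.1f} GiB")  # ~20.1 GiB

# The fp16 weights alone take ~16 GiB, so the KV cache on a 24 GiB card is
# only a few GiB (the exact size depends on --gpu-memory-utilization). The
# request can never fit, so once swapped out it is never swapped back in.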
We will discuss how we want to handle this case.
I retried after terminating all containers that were using the GPU, which also ended the process with PID 18761.
❯ nvidia-smi
Wed May  1 09:10:23 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.76                 Driver Version: 550.76         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:01:00.0 Off |                  N/A |
|  0%   32C    P8             21W /  420W |   18452MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     41908      C   python3                                     18438MiB |
+-----------------------------------------------------------------------------------------+
The behavior remains the same.
~/Desktop/GitHub
❯ time curl -s http://mnemosyne.local:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct", "n": 22, "temperature": 0.3, "min_p": 0.1, "max_tokens": "7168",
"messages": [
{
"role": "user",
"content": "Tell me a long story about llamas. Then analyze the story, focusing on the stylistic devices."
}
]
}' > response-batch-22-sampling.json
curl -s http://mnemosyne.local:8000/v1/chat/completions -H -d > 0.01s user 0.01s system 0% cpu 36.651 total
While the request with "n": 22 works again, the following request with "n": 23 fails.
~/Desktop/GitHub 36s
❯ time curl -s http://mnemosyne.local:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct", "n": 23, "temperature": 0.3, "min_p": 0.1, "max_tokens": "7168",
"messages": [
{
"role": "user",
"content": "Tell me a long story about llamas. Then analyze the story, focusing on the stylistic devices."
}
]
}' > response-batch-23-sampling.json
From here, the request with "n": 23 hangs. The engine becomes unresponsive, does not log anything suspicious, and needs a restart. New requests are accepted, but no text is ever generated.
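Until the engine handles this case, one conceivable client-side workaround is to reject requests whose worst-case token footprint exceeds the KV cache. This is a hypothetical sketch (fits_kv_cache and the 40,000-token capacity are illustrative inventions; the real capacity should be read from the "# GPU blocks" line in the vLLM startup log):

# Hypothetical client-side guard (not part of vLLM): refuse requests whose
# worst-case KV footprint exceeds the GPU KV cache, so they cannot end up
# permanently SWAPPED.
KV_CACHE_TOKENS = 40_000  # assumption; read "# GPU blocks: N" from the vLLM
                          # startup log and multiply by the block size (16)

def fits_kv_cache(n: int, max_tokens: int, prompt_tokens: int) -> bool:
    """Worst case: the shared prompt plus n fully generated samples."""
    return prompt_tokens + n * max_tokens <= KV_CACHE_TOKENS

# Pessimistic by design: samples that stop early at EOS use fewer than
# max_tokens KV slots, which is why "n": 22 happened to succeed above.
assert not fits_kv_cache(n=23, max_tokens=7168, prompt_tokens=30)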
Thank you for your reproduction and feedback.