Comments (6)
Having the same error with Mixtral-8x7B-Instruct-v0.1-GPTQ and tensor_parallel_size=2
INFO 04-25 09:50:28 api_server.py:149] vLLM API server version 0.4.0.post1
INFO 04-25 09:50:28 api_server.py:150] args: Namespace(host=None, port=8001, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='auto', max_model_len=5000, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.8, forced_num_gpu_blocks=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization='gptq', enforce_eager=True, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 04-25 09:50:29 config.py:767] Casting torch.bfloat16 to torch.float16.
WARNING 04-25 09:50:29 config.py:211] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 04-25 09:51:01 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ', tokenizer='TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=5000, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 04-25 09:51:20 selector.py:40] Cannot use FlashAttention backend for Volta and Turing GPUs.
INFO 04-25 09:51:20 selector.py:25] Using XFormers backend.
(RayWorkerVllm pid=233183) INFO 04-25 09:51:21 selector.py:40] Cannot use FlashAttention backend for Volta and Turing GPUs.
(RayWorkerVllm pid=233183) INFO 04-25 09:51:21 selector.py:25] Using XFormers backend.
INFO 04-25 09:51:33 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=233183) INFO 04-25 09:51:33 pynccl_utils.py:45] vLLM is using nccl==2.18.1
INFO 04-25 09:52:18 custom_all_reduce.py:137] NVLink detection failed with message "Not Supported". This is normal if your machine has no NVLink equipped
(RayWorkerVllm pid=233183) INFO 04-25 09:52:18 custom_all_reduce.py:137] NVLink detection failed with message "Not Supported". This is normal if your machine has no NVLink equipped
INFO 04-25 09:52:23 weight_utils.py:177] Using model weights format ['*.safetensors']
(RayWorkerVllm pid=233183) INFO 04-25 09:52:24 weight_utils.py:177] Using model weights format ['*.safetensors']
INFO 04-25 09:53:12 model_runner.py:104] Loading model weights took 11.0906 GB
(RayWorkerVllm pid=233183) INFO 04-25 09:53:12 model_runner.py:104] Loading model weights took 11.0906 GB
INFO 04-25 09:54:42 ray_gpu_executor.py:240] # GPU blocks: 12524, # CPU blocks: 4096
Exception in callback functools.partial(<function _raise_exception_on_finish at 0x2aad73314ae0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x2aad836e63d0>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x2aad73314ae0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x2aad836e63d0>>)>
Traceback (most recent call last):
File "/shared/ucl/apps/python/3.11.3/gnu-4.9.2/lib/python3.11/asyncio/tasks.py", line 490, in wait_for
return fut.result()
^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 454, in engine_step
request_outputs = await self.engine.step_async()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 213, in step_async
output = await self.model_executor.execute_model_async(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 418, in execute_model_async
all_outputs = await self._run_workers_async(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 408, in _run_workers_async
all_outputs = await asyncio.gather(*coros)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
task.result()
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 480, in run_engine_loop
has_requests_in_progress = await asyncio.wait_for(
^^^^^^^^^^^^^^^^^^^^^^^
File "/shared/ucl/apps/python/3.11.3/gnu-4.9.2/lib/python3.11/asyncio/tasks.py", line 492, in wait_for
raise exceptions.TimeoutError() from exc
TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 45, in _raise_exception_on_finish
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/shared/ucl/apps/python/3.11.3/gnu-4.9.2/lib/python3.11/asyncio/tasks.py", line 490, in wait_for
return fut.result()
^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 454, in engine_step
request_outputs = await self.engine.step_async()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 213, in step_async
output = await self.model_executor.execute_model_async(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 418, in execute_model_async
all_outputs = await self._run_workers_async(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 408, in _run_workers_async
all_outputs = await asyncio.gather(*coros)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
result = await app( # type: ignore[func-returns-value]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
return await self.app(scope, receive, send)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/starlette/routing.py", line 72, in app
response = await func(request)
^^^^^^^^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 103, in create_completion
generator = await openai_serving_completion.create_completion(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/entrypoints/openai/serving_completion.py", line 178, in create_completion
async for i, res in result_generator:
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/entrypoints/openai/serving_completion.py", line 81, in consumer
raise item
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/entrypoints/openai/serving_completion.py", line 66, in producer
async for item in iterator:
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 644, in generate
raise e
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 638, in generate
async for request_output in stream:
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 77, in __anext__
raise result
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
task.result()
File "/lustre/scratch/scratch/rmhijpo/ctgov_rag/.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 480, in run_engine_loop
has_requests_in_progress = await asyncio.wait_for(
^^^^^^^^^^^^^^^^^^^^^^^
File "/shared/ucl/apps/python/3.11.3/gnu-4.9.2/lib/python3.11/asyncio/tasks.py", line 492, in wait_for
raise exceptions.TimeoutError() from exc
TimeoutError
@ericzhou571 @JPonsa @blackblue9 @supdizh Could you try --disable-custom-all-reduce when you launch the server and see if this issue persists?
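If it helps, the full launch would then look roughly like this (model and settings copied from the log above; adjust them to your own setup):

    python -m vllm.entrypoints.openai.api_server \
        --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ \
        --quantization gptq \
        --dtype half \
        --max-model-len 5000 \
        --tensor-parallel-size 2 \
        --gpu-memory-utilization 0.8 \
        --enforce-eager \
        --port 8001 \
        --disable-custom-all-reduce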
Looks like the same issue as #4135; it seems to have emerged after 0.4.0.
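If it really is a post-0.4.0 regression, one quick way to confirm on your side (assuming nothing else in your stack needs 0.4.x) is to pin the last pre-0.4.0 release and retry:

    pip install vllm==0.3.3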
Could you share the specifications of the GPUs you were using when you hit these issues? Were they also A800-80G?
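If it is easier, the GPU name and memory reported by nvidia-smi would be enough, for example:

    nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv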
@ywang96 The issue persists when launching the server with --disable-custom-all-reduce.
I encountered a similar issue in version 0.4.2.
Related Issues (20)
- [Usage]: how should I do data parallelism using vLLM?
- [Bug]: torch.cuda.OutOfMemoryError: CUDA out of memory when Handle inference requests
- [Misc]: Should inference with temperature 0 generate the same results for a lora adapter and equivalent merged model?
- [Bug] [spec decode] [flash_attn]: CUDA illegal memory access when calling flash_attn_cuda.fwd_kvcache
- [Bug]: The openai deployment model takes twice as long to deploy as fastapi's approach to offline inference.
- [Feature]: Linear adapter support for Mixtral
- [Feature]: VLLM suport for function calling in Mistral-7B-Instruct-v0.3
- [Bug]: Issue with Token Processing Efficiency and Key-Value Cache Utilization in AsyncLLMEngine
- [Bug]: WSL2(Including Docker) 2 GPU problem --tensor-parallel-size 2
- [Bug]: Unable to Use Prefix Caching in AsyncLLMEngine
- [Performance]: What can we learn from OctoAI
- [Bug]: Model Launch Hangs with 16+ Ranks in vLLM
- [Usage]: Prefix caching in VLLM
- [Bug]: Incorrect Example for the Inference with Prefix
- [Feature]: BERT models for embeddings
- [Bug]: The Offline Inference Embedding Example Fails
- [Bug]: Offline Inference with the OpenAI Batch file format yields unnecessary `asyncio.exceptions.CancelledError`
- [Feature]: MoE kernels (Mixtral-8x22B-Instruct-v0.1) are not yet supported on CPU only ?
- [Bug]: vLLM api_server.py when using with prompt_token_ids causes error.
- [Bug]: loading squeezellm model