Comments (6)
vllm-nccl-cu12==2.18.1.0.4.0 does not install to ./usr/local/lib/python3.11/site-packages/nvidia/nccl/lib/libnccl.so.2; it should end up in ~/.config/vllm/nccl/cu12/libnccl.so.2.18.1. The home directory depends on which user installed the package, so it is possible that your current user is different from the user who installed it.
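To check whether the library is actually visible to the current user, a quick probe like the following helps (the path layout comes from the description above; the exact filename is version-specific, so treat it as an assumption):
>>> import os
>>> expected = os.path.expanduser("~/.config/vllm/nccl/cu12/libnccl.so.2.18.1")
>>> os.path.exists(expected)  # False if a different user installed the package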
./usr/local/lib/python3.11/site-packages/nvidia/nccl/lib/libnccl.so.2 is the library installed by PyTorch, and it has the increased-memory-usage problem reported in NVIDIA/nccl#1234. That is why vLLM needs to change its NCCL dependency.
Unless either NVIDIA/nccl#1234 or pypi/support#3792 is resolved, we have no choice but to ship libnccl.so this way. Sorry for the trouble; this is not what we want either. We would also prefer to manage the dependency in the standard pip way.
from vllm.
Can you run with export NCCL_DEBUG=TRACE? This might be an NCCL problem.
from vllm.
After the command is executed, the complete log is as follows:
(vllm-test) $ export NCCL_DEBUG=TRACE
(vllm-test) $ CUDA_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.openai.api_server --model /data/zhaoxf4/pretrained/meta-llama/Meta-Llama-3-8B-Instruct --dtype half --tensor-parallel-size 2
INFO 04-24 09:39:20 api_server.py:151] vLLM API server version 0.4.1
INFO 04-24 09:39:20 api_server.py:152] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/data/zhaoxf4/pretrained/meta-llama/Meta-Llama-3-8B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='half', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, model_loader_extra_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 04-24 09:39:20 config.py:948] Casting torch.bfloat16 to torch.float16.
2024-04-24 09:39:22,770 INFO worker.py:1749 -- Started a local Ray instance.
INFO 04-24 09:39:23 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/data/zhaoxf4/pretrained/meta-llama/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='/data/zhaoxf4/pretrained/meta-llama/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-24 09:39:27 utils.py:598] Found nccl from environment variable VLLM_NCCL_SO_PATH=/usr/lib/x86_64-linux-gnu/libnccl.so.2
ERROR 04-24 09:39:27 pynccl.py:44] Failed to load NCCL library from /usr/lib/x86_64-linux-gnu/libnccl.so.2 .It is expected if you are not running on NVIDIA/AMD GPUs.Otherwise, the nccl library might not exist, be corrupted or it does not support the current platform Linux-4.15.0-136-generic-x86_64-with-glibc2.27.One solution is to download libnccl2 version 2.18 from https://developer.download.nvidia.com/compute/cuda/repos/ and extract the libnccl.so.2 file. If you already have the library, please set the environment variable VLLM_NCCL_SO_PATH to point to the correct nccl library path.
INFO 04-24 09:39:27 pynccl_utils.py:17] Failed to import NCCL library: Failed to load NCCL library from /usr/lib/x86_64-linux-gnu/libnccl.so.2 .
INFO 04-24 09:39:27 pynccl_utils.py:18] It is expected if you are not running on NVIDIA GPUs.
(RayWorkerWrapper pid=1177) INFO 04-24 09:39:27 utils.py:598] Found nccl from environment variable VLLM_NCCL_SO_PATH=/usr/lib/x86_64-linux-gnu/libnccl.so.2
(RayWorkerWrapper pid=1177) ERROR 04-24 09:39:27 pynccl.py:44] Failed to load NCCL library from /usr/lib/x86_64-linux-gnu/libnccl.so.2 .It is expected if you are not running on NVIDIA/AMD GPUs.Otherwise, the nccl library might not exist, be corrupted or it does not support the current platform Linux-4.15.0-136-generic-x86_64-with-glibc2.27.One solution is to download libnccl2 version 2.18 from https://developer.download.nvidia.com/compute/cuda/repos/ and extract the libnccl.so.2 file. If you already have the library, please set the environment variable VLLM_NCCL_SO_PATH to point to the correct nccl library path.
(RayWorkerWrapper pid=1177) INFO 04-24 09:39:27 pynccl_utils.py:17] Failed to import NCCL library: Failed to load NCCL library from /usr/lib/x86_64-linux-gnu/libnccl.so.2 .
(RayWorkerWrapper pid=1177) INFO 04-24 09:39:27 pynccl_utils.py:18] It is expected if you are not running on NVIDIA GPUs.
INFO 04-24 09:39:27 selector.py:65] Cannot use FlashAttention backend for Volta and Turing GPUs.
INFO 04-24 09:39:27 selector.py:33] Using XFormers backend.
(RayWorkerWrapper pid=1177) INFO 04-24 09:39:27 selector.py:65] Cannot use FlashAttention backend for Volta and Turing GPUs.
(RayWorkerWrapper pid=1177) INFO 04-24 09:39:27 selector.py:33] Using XFormers backend.
ERROR 04-24 09:39:29 worker_base.py:153] Error executing method init_device. This might cause deadlock in distributed execution.
ERROR 04-24 09:39:29 worker_base.py:153] Traceback (most recent call last):
ERROR 04-24 09:39:29 worker_base.py:153] File "/data/zhaoxf4/API/llama3/vllm/vllm/worker/worker_base.py", line 145, in execute_method
ERROR 04-24 09:39:29 worker_base.py:153] return executor(*args, **kwargs)
ERROR 04-24 09:39:29 worker_base.py:153] File "/data/zhaoxf4/API/llama3/vllm/vllm/worker/worker.py", line 110, in init_device
ERROR 04-24 09:39:29 worker_base.py:153] init_worker_distributed_environment(self.parallel_config, self.rank,
ERROR 04-24 09:39:29 worker_base.py:153] File "/data/zhaoxf4/API/llama3/vllm/vllm/worker/worker.py", line 301, in init_worker_distributed_environment
ERROR 04-24 09:39:29 worker_base.py:153] pynccl_utils.init_process_group(
ERROR 04-24 09:39:29 worker_base.py:153] File "/data/zhaoxf4/API/llama3/vllm/vllm/distributed/device_communicators/pynccl_utils.py", line 46, in init_process_group
ERROR 04-24 09:39:29 worker_base.py:153] logger.info(f"vLLM is using nccl=={ncclGetVersion()}")
ERROR 04-24 09:39:29 worker_base.py:153] NameError: name 'ncclGetVersion' is not defined
Traceback (most recent call last):
File "/home/zhaoxf4/miniconda3/envs/vllm-test/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/zhaoxf4/miniconda3/envs/vllm-test/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/data/zhaoxf4/API/llama3/vllm/vllm/entrypoints/openai/api_server.py", line 159, in <module>
engine = AsyncLLMEngine.from_engine_args(
File "/data/zhaoxf4/API/llama3/vllm/vllm/engine/async_llm_engine.py", line 361, in from_engine_args
engine = cls(
File "/data/zhaoxf4/API/llama3/vllm/vllm/engine/async_llm_engine.py", line 319, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/data/zhaoxf4/API/llama3/vllm/vllm/engine/async_llm_engine.py", line 437, in _init_engine
return engine_class(*args, **kwargs)
File "/data/zhaoxf4/API/llama3/vllm/vllm/engine/llm_engine.py", line 148, in __init__
self.model_executor = executor_class(
File "/data/zhaoxf4/API/llama3/vllm/vllm/executor/executor_base.py", line 41, in __init__
self._init_executor()
File "/data/zhaoxf4/API/llama3/vllm/vllm/executor/ray_gpu_executor.py", line 44, in _init_executor
self._init_workers_ray(placement_group)
File "/data/zhaoxf4/API/llama3/vllm/vllm/executor/ray_gpu_executor.py", line 181, in _init_workers_ray
self._run_workers("init_device")
File "/data/zhaoxf4/API/llama3/vllm/vllm/executor/ray_gpu_executor.py", line 323, in _run_workers
driver_worker_output = self.driver_worker.execute_method(
File "/data/zhaoxf4/API/llama3/vllm/vllm/worker/worker_base.py", line 154, in execute_method
raise e
File "/data/zhaoxf4/API/llama3/vllm/vllm/worker/worker_base.py", line 145, in execute_method
return executor(*args, **kwargs)
File "/data/zhaoxf4/API/llama3/vllm/vllm/worker/worker.py", line 110, in init_device
init_worker_distributed_environment(self.parallel_config, self.rank,
File "/data/zhaoxf4/API/llama3/vllm/vllm/worker/worker.py", line 301, in init_worker_distributed_environment
pynccl_utils.init_process_group(
File "/data/zhaoxf4/API/llama3/vllm/vllm/distributed/device_communicators/pynccl_utils.py", line 46, in init_process_group
logger.info(f"vLLM is using nccl=={ncclGetVersion()}")
NameError: name 'ncclGetVersion' is not defined
(RayWorkerWrapper pid=1177) ERROR 04-24 09:39:29 worker_base.py:153] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=1177) ERROR 04-24 09:39:29 worker_base.py:153] Traceback (most recent call last):
(RayWorkerWrapper pid=1177) ERROR 04-24 09:39:29 worker_base.py:153] File "/data/zhaoxf4/API/llama3/vllm/vllm/worker/worker_base.py", line 145, in execute_method
(RayWorkerWrapper pid=1177) ERROR 04-24 09:39:29 worker_base.py:153] return executor(*args, **kwargs)
(RayWorkerWrapper pid=1177) ERROR 04-24 09:39:29 worker_base.py:153] File "/data/zhaoxf4/API/llama3/vllm/vllm/worker/worker.py", line 110, in init_device
(RayWorkerWrapper pid=1177) ERROR 04-24 09:39:29 worker_base.py:153] init_worker_distributed_environment(self.parallel_config, self.rank,
(RayWorkerWrapper pid=1177) ERROR 04-24 09:39:29 worker_base.py:153] File "/data/zhaoxf4/API/llama3/vllm/vllm/worker/worker.py", line 301, in init_worker_distributed_environment
(RayWorkerWrapper pid=1177) ERROR 04-24 09:39:29 worker_base.py:153] pynccl_utils.init_process_group(
(RayWorkerWrapper pid=1177) ERROR 04-24 09:39:29 worker_base.py:153] File "/data/zhaoxf4/API/llama3/vllm/vllm/distributed/device_communicators/pynccl_utils.py", line 46, in init_process_group
(RayWorkerWrapper pid=1177) ERROR 04-24 09:39:29 worker_base.py:153] logger.info(f"vLLM is using nccl=={ncclGetVersion()}")
(RayWorkerWrapper pid=1177) ERROR 04-24 09:39:29 worker_base.py:153] NameError: name 'ncclGetVersion' is not defined
Looks like no new logs have been added.
I found out that my libnccl version is 2.15.1. Is this version likely to have an impact on multi-card communication?
(vllm-test) $ ll /usr/lib/x86_64-linux-gnu/libnccl.so.2
lrwxrwxrwx 1 root root 17 Sep 20 2022 /usr/lib/x86_64-linux-gnu/libnccl.so.2 -> libnccl.so.2.15.1*
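For reference, the NCCL version that PyTorch itself was built against can be queried from Python; note this is torch's bundled NCCL, not necessarily the library vLLM's pynccl loads, so treat it only as a side check:
>>> import torch
>>> torch.cuda.nccl.version()  # a tuple such as (2, 18, 1) on recent torch builds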
from vllm.
[pip3] vllm-nccl-cu12==2.18.1.0.3.0
If you have vllm-nccl-cu12 installed, you don't need to specify VLLM_NCCL_SO_PATH. It should just work.
from vllm.
PR #4259 can avoid this problem, but for now you should be able to just ignore the error.
I'll try to reproduce this problem on another machine or in Docker.
from vllm.
Just chiming in to say I'm experiencing a similar issue. Perhaps this is just an issue with how my directories are set up. I have vllm-nccl-cu12==2.18.1.0.4.0 installed, and vLLM is still having trouble finding libnccl.
$ find . | grep libnccl
./usr/local/lib/python3.11/site-packages/nvidia/nccl/lib/libnccl.so.2
Digging through the latest version of find_nccl_library seemed to confirm the issue for me:
>>> find_nccl_library()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 18, in find_nccl_library
File "<stdin>", line 25, in find_library
ValueError: Cannot find libnccl.so.2 in the system.
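For context, here is a rough sketch of the lookup order as I read the vLLM 0.4.x source (a paraphrase under my own assumptions, not the actual implementation):

import glob
import os
import subprocess

def find_nccl_library_sketch() -> str:
    # 1. An explicit VLLM_NCCL_SO_PATH override wins.
    so_file = os.environ.get("VLLM_NCCL_SO_PATH")
    if so_file:
        return so_file
    # 2. Otherwise, the copy that vllm-nccl-cu12 drops under the
    #    *installing* user's home directory.
    candidates = glob.glob(
        os.path.expanduser("~/.config/vllm/nccl/cu12/libnccl.so.*"))
    if candidates:
        return candidates[0]
    # 3. Otherwise, whatever the dynamic linker cache knows about. A pip
    #    copy inside site-packages/nvidia/nccl/lib is not in this cache,
    #    which would explain why the search fails with the error above.
    out = subprocess.check_output(["ldconfig", "-p"]).decode()
    for line in out.splitlines():
        if "libnccl.so.2" in line and "=>" in line:
            return line.split("=>")[-1].strip()
    raise ValueError("Cannot find libnccl.so.2 in the system.")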
I think manually passing in libnccl via VLLM_NCCL_SO_PATH may be the only feasible solution.
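As a stopgap, pointing the override at the pip-installed copy found above should work, e.g. (using the absolute form of the path from the find output, and noting the first comment's warning that this copy carries the memory regression from NVIDIA/nccl#1234):
export VLLM_NCCL_SO_PATH=/usr/local/lib/python3.11/site-packages/nvidia/nccl/lib/libnccl.so.2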
from vllm.