
[Bug]: NameError: name 'ncclGetVersion' is not defined (or Failed to import NCCL library: Cannot find libnccl.so.2 in the system.) · vllm · 32 comments · Closed


Comments (32)

pseudotensor commented on June 12, 2024

It seems nccl2 is not built into the image when following the instructions.

(RayWorkerWrapper pid=6313) INFO 04-24 02:10:53 pynccl_utils.py:17] Failed to import NCCL library: Cannot find libnccl.so.2 in the system.

youkaichao commented on June 12, 2024

Can you try the latest main? It should install vllm-nccl-cu12, which should work and bring the correct nccl version.
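To see which nccl wheels actually landed in the image, a quick check can be run from inside the container. This is a hypothetical diagnostic (not from the thread), using only the standard library:

```python
# Hypothetical diagnostic: list which nccl-related wheels pip actually
# installed inside the container, together with their versions.
from importlib.metadata import distributions

def nccl_packages() -> dict:
    """Return {package_name: version} for installed packages whose name mentions 'nccl'."""
    return {
        dist.metadata["Name"]: dist.version
        for dist in distributions()
        if "nccl" in (dist.metadata["Name"] or "").lower()
    }

if __name__ == "__main__":
    # In a healthy image this should show entries such as nvidia-nccl-cu12
    # and (on recent main) vllm-nccl-cu12.
    for name, version in sorted(nccl_packages().items()):
        print(f"{name}=={version}")
```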

pseudotensor commented on June 12, 2024

I tried:

DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai

or

DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai --build-arg max_jobs=20 --build-arg nvcc_threads=20

following the documentation here:

https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html

Is this command wrong for nccl?

youkaichao commented on June 12, 2024

PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A
Versions of relevant libraries:
[pip3] No relevant packages
[conda] No relevant packages

Your environment must be incorrect.

pseudotensor commented on June 12, 2024

I'm confused. If I build a Docker image, my environment is not relevant.

youkaichao commented on June 12, 2024

Please report the environment inside the Docker container.

pseudotensor commented on June 12, 2024

I don't understand. I am running the docker command to build the image; there is nothing "inside" Docker to run, because it's still being built.

I can of course run it after the fact, but that is not relevant to my environment, since building a Docker image should be independent of the host environment.

youkaichao commented on June 12, 2024

Attach a shell to the docker image and report the environment inside it.

pseudotensor commented on June 12, 2024
ubuntu@compute-permanent-node-171:~/vllm$ docker run -ti    --runtime=nvidia     --gpus '"device=0,1,2,6"'     --shm-size=10.24gb --entrypoint=bash     -p 5004:5004         -e NCCL_IGNORE_DISABLED_P2P=1     -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN     --env "HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN"     -v /etc/passwd:/etc/passwd:ro     -v /etc/group:/etc/group:ro     -u root:root     -v "${HOME}"/.cache:$HOME/.cache/ -v "${HOME}"/.config:$HOME/.config/   -v "${HOME}"/.triton:$HOME/.triton/     --network host     fee8ae2c9682    
WARNING: Published ports are discarded when using host network mode
root@compute-permanent-node-171:/vllm-workspace# apt-get install wget
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  wget
0 upgraded, 1 newly installed, 0 to remove and 38 not upgraded.
Need to get 367 kB of archives.
After this operation, 1008 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 wget amd64 1.21.2-2ubuntu1 [367 kB]
Fetched 367 kB in 1s (283 kB/s)      
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package wget.
(Reading database ... 19331 files and directories currently installed.)
Preparing to unpack .../wget_1.21.2-2ubuntu1_amd64.deb ...
Unpacking wget (1.21.2-2ubuntu1) ...
Setting up wget (1.21.2-2ubuntu1) ...
root@compute-permanent-node-171:/vllm-workspace# wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
--2024-04-24 05:39:32--  https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24877 (24K) [text/plain]
Saving to: 'collect_env.py'

collect_env.py                                              100%[=========================================================================================================================================>]  24.29K  --.-KB/s    in 0s      

2024-04-24 05:39:32 (145 MB/s) - 'collect_env.py' saved [24877/24877]

root@compute-permanent-node-171:/vllm-workspace# /usr/bin/python3.10 collect_env.py
Collecting environment information...
PyTorch version: 2.2.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.5.0-1018-oracle-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3

Nvidia driver version: 535.161.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             224
On-line CPU(s) list:                0-111
Off-line CPU(s) list:               112-223
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) Platinum 8480+
CPU family:                         6
Model:                              143
Thread(s) per core:                 1
Core(s) per socket:                 56
Socket(s):                          2
Stepping:                           8
CPU max MHz:                        3800.0000
CPU min MHz:                        0.0000
BogoMIPS:                           4000.00
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization:                     VT-x
L1d cache:                          5.3 MiB (112 instances)
L1i cache:                          3.5 MiB (112 instances)
L2 cache:                           224 MiB (112 instances)
L3 cache:                           210 MiB (2 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-55
NUMA node1 CPU(s):                  56-111
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Vulnerable
Vulnerability Spectre v1:           Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:           Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Vulnerable
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.19.3
[pip3] torch==2.2.1
[pip3] triton==2.2.0
[pip3] vllm-nccl-cu12==2.18.1.0.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    NIC10   NIC11   NIC12   NIC13   NIC14   NIC15   NIC16   NIC17   CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    PXB     PXB     NODE    NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     0-55    0               N/A
GPU1    NV18     X      NV18    NV18    NODE    NODE    NODE    PXB     PXB     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     0-55    0               N/A
GPU2    NV18    NV18     X      NV18    NODE    NODE    NODE    NODE    NODE    PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     0-55    0               N/A
GPU3    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE    PXB     PXB     NODE    NODE    56-111  1               N/A
NIC0    PXB     NODE    NODE    SYS      X      PIX     NODE    NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC1    PXB     NODE    NODE    SYS     PIX      X      NODE    NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC2    NODE    NODE    NODE    SYS     NODE    NODE     X      NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC3    NODE    PXB     NODE    SYS     NODE    NODE    NODE     X      PIX     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC4    NODE    PXB     NODE    SYS     NODE    NODE    NODE    PIX      X      NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC5    NODE    NODE    PXB     SYS     NODE    NODE    NODE    NODE    NODE     X      PIX     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC6    NODE    NODE    PXB     SYS     NODE    NODE    NODE    NODE    NODE    PIX      X      NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC7    NODE    NODE    NODE    SYS     NODE    NODE    NODE    NODE    NODE    NODE    NODE     X      PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC8    NODE    NODE    NODE    SYS     NODE    NODE    NODE    NODE    NODE    NODE    NODE    PIX      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC9    SYS     SYS     SYS     NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX     NODE    NODE    NODE    NODE    NODE    NODE    NODE
NIC10   SYS     SYS     SYS     NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X      NODE    NODE    NODE    NODE    NODE    NODE    NODE
NIC11   SYS     SYS     SYS     NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE     X      NODE    NODE    NODE    NODE    NODE    NODE
NIC12   SYS     SYS     SYS     NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      PIX     NODE    NODE    NODE    NODE
NIC13   SYS     SYS     SYS     NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX      X      NODE    NODE    NODE    NODE
NIC14   SYS     SYS     SYS     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE     X      PIX     NODE    NODE
NIC15   SYS     SYS     SYS     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE    PIX      X      NODE    NODE
NIC16   SYS     SYS     SYS     NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE    NODE    NODE     X      PIX
NIC17   SYS     SYS     SYS     NODE    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE    NODE    NODE    PIX      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
  NIC10: mlx5_10
  NIC11: mlx5_11
  NIC12: mlx5_12
  NIC13: mlx5_13
  NIC14: mlx5_14
  NIC15: mlx5_15
  NIC16: mlx5_16
  NIC17: mlx5_17
root@compute-permanent-node-171:/vllm-workspace# 

pseudotensor commented on June 12, 2024

Also:

root@compute-permanent-node-171:/vllm-workspace# /usr/bin/python3.10
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> 

It just seems the nccl part of the build is broken.

pseudotensor commented on June 12, 2024

Actually the file is present, but not being found by vllm:

root@compute-permanent-node-171:/vllm-workspace# find / | grep libnccl
/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2
/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
root@compute-permanent-node-171:/vllm-workspace# 

youkaichao commented on June 12, 2024

Which user are you using to run the script? The path vllm uses to find the library is ~/.config/vllm/nccl/. It will change if you change the user.

pseudotensor commented on June 12, 2024

I ran as root so I could install wget, since that is not installed by default in the image you build for vllm.

Do you have a targeted question regarding the actual issue of the vllm startup not finding the nccl library?

youkaichao commented on June 12, 2024

Try adding the environment variable: export VLLM_NCCL_SO_PATH=/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1

youkaichao commented on June 12, 2024

I don't know what the problem is on your side. You can try to debug this function:

def find_nccl_library():
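The later log lines ("Found nccl from environment variable VLLM_NCCL_SO_PATH=...") suggest a lookup order like the following. This is an illustrative sketch of that order, not vllm's actual implementation:

```python
# Sketch of an env-override-then-cache lookup; illustrative only.
import os

def find_nccl_library_sketch() -> str:
    # 1. An explicit VLLM_NCCL_SO_PATH override wins.
    so_path = os.environ.get("VLLM_NCCL_SO_PATH")
    if so_path:
        return so_path
    # 2. Otherwise fall back to the per-user download cache.
    return os.path.expanduser("~/.config/vllm/nccl/cu12/libnccl.so.2.18.1")
```

Note that the fallback in step 2 is again a per-user path, which is where the root vs. non-root discrepancy below comes from.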

pseudotensor commented on June 12, 2024

It's not the function. I showed that it was not able to load the nccl library: it finds the file but is unable to use it.

The function error is just a cascade.
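A quicker check than ldd, and closer to what actually fails at runtime, is to ask the dynamic loader directly. This is a hypothetical helper, not code from the thread:

```python
# Try to dlopen a shared object the same way ctypes-based code would,
# and surface the loader's own error message instead of a generic failure.
import ctypes

def try_load(so_path: str):
    """Return (ok, message) for a dlopen attempt on so_path."""
    try:
        ctypes.CDLL(so_path)
        return True, "loaded"
    except OSError as exc:
        return False, str(exc)
```

Running try_load("/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1") as root and again as a non-root user would distinguish "file not found" from a permission or ABI problem.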

In case it helps:

root@compute-permanent-node-171:/vllm-workspace# ldconfig -v /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
/sbin/ldconfig.real: Path `/usr/local/cuda-12/targets/x86_64-linux/lib' given more than once
(from /etc/ld.so.conf.d/988_cuda-12.conf:1 and /etc/ld.so.conf.d/000_cuda.conf:1)
/sbin/ldconfig.real: Can't stat /usr/local/nvidia/lib: No such file or directory
/sbin/ldconfig.real: Can't stat /usr/local/nvidia/lib64: No such file or directory
/sbin/ldconfig.real: Can't stat /usr/local/lib/x86_64-linux-gnu: No such file or directory
/sbin/ldconfig.real: Path `/usr/lib/x86_64-linux-gnu' given more than once
(from /etc/ld.so.conf.d/x86_64-linux-gnu.conf:4 and /etc/ld.so.conf.d/x86_64-linux-gnu.conf:3)
/sbin/ldconfig.real: Path `/lib/x86_64-linux-gnu' given more than once
(from <builtin>:0 and /etc/ld.so.conf.d/x86_64-linux-gnu.conf:3)
/sbin/ldconfig.real: Path `/usr/lib/x86_64-linux-gnu' given more than once
(from <builtin>:0 and /etc/ld.so.conf.d/x86_64-linux-gnu.conf:3)
/sbin/ldconfig.real: Path `/usr/lib' given more than once
(from <builtin>:0 and <builtin>:0)
/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1: (from <cmdline>:0)
/sbin/ldconfig.real: Can't open directory /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1: Not a directory
/usr/local/cuda/targets/x86_64-linux/lib: (from /etc/ld.so.conf.d/000_cuda.conf:1)
        libcudart.so.12 -> libcudart.so.12.1.55
/usr/lib/x86_64-linux-gnu/libfakeroot: (from /etc/ld.so.conf.d/fakeroot-x86_64-linux-gnu.conf:1)
        libfakeroot-0.so -> libfakeroot-tcp.so
/usr/local/lib: (from /etc/ld.so.conf.d/libc.conf:2)
/lib/x86_64-linux-gnu: (from /etc/ld.so.conf.d/x86_64-linux-gnu.conf:3)
        libsqlite3.so.0 -> libsqlite3.so.0.8.6
        libsasl2.so.2 -> libsasl2.so.2.0.25
        libhistory.so.8 -> libhistory.so.8.1
        libldap-2.5.so.0 -> libldap-2.5.so.0.1.11
        liblber-2.5.so.0 -> liblber-2.5.so.0.1.11
        libksba.so.8 -> libksba.so.8.14.0
        libnpth.so.0 -> libnpth.so.0.1.2
        libreadline.so.8 -> libreadline.so.8.1
        libassuan.so.0 -> libassuan.so.0.8.5
        libubsan.so.1 -> libubsan.so.1.0.0
        libjbig.so.0 -> libjbig.so.0
        libopcodes-2.38-system.so -> libopcodes-2.38-system.so
        libquadmath.so.0 -> libquadmath.so.0.0.0
        libgd.so.3 -> libgd.so.3.0.8
        libmd.so.0 -> libmd.so.0.0.5
        libpython3.10.so.1.0 -> libpython3.10.so.1.0
        libfreetype.so.6 -> libfreetype.so.6.18.1
        libasan.so.6 -> libasan.so.6.0.0
        libXdmcp.so.6 -> libXdmcp.so.6.0.0
        libtiff.so.5 -> libtiff.so.5.7.0
        libgpm.so.2 -> libgpm.so.2
        libnghttp2.so.14 -> libnghttp2.so.14.20.1
        libexpat.so.1 -> libexpat.so.1.8.7
        libgomp.so.1 -> libgomp.so.1.0.0
        libmpfr.so.6 -> libmpfr.so.6.1.0
        libwebp.so.7 -> libwebp.so.7.1.3
        libdeflate.so.0 -> libdeflate.so.0
        libbrotlicommon.so.1 -> libbrotlicommon.so.1.0.9
        libmpdec++.so.3 -> libmpdec++.so.2.5.1
        libedit.so.2 -> libedit.so.2.0.68
        libmpdec.so.3 -> libmpdec.so.2.5.1
        libjpeg.so.8 -> libjpeg.so.8.2.2
        libctf-nobfd.so.0 -> libctf-nobfd.so.0.0.0
        libitm.so.1 -> libitm.so.1.0.0
        libisl.so.23 -> libisl.so.23.1.0
        libcurl-gnutls.so.4 -> libcurl-gnutls.so.4.7.0
        libtsan.so.0 -> libtsan.so.0.0.0
        libctf.so.0 -> libctf.so.0.0.0
        libX11.so.6 -> libX11.so.6.4.0
        libsodium.so.23 -> libsodium.so.23.3.0
        libbrotlidec.so.1 -> libbrotlidec.so.1.0.9
        libmpc.so.3 -> libmpc.so.3.2.1
        librtmp.so.1 -> librtmp.so.1
        libbsd.so.0 -> libbsd.so.0.11.5
        libXext.so.6 -> libXext.so.6.4.0
        libcbor.so.0.8 -> libcbor.so.0.8.0
        libbrotlienc.so.1 -> libbrotlienc.so.1.0.9
        libexpatw.so.1 -> libexpatw.so.1.8.7
        libatomic.so.1 -> libatomic.so.1.2.0
        libxcb.so.1 -> libxcb.so.1.1.0
        libperl.so.5.34 -> libperl.so.5.34.0
        libpsl.so.5 -> libpsl.so.5.3.2
        libfontconfig.so.1 -> libfontconfig.so.1.12.0
        libXau.so.6 -> libXau.so.6.0.0
        libbfd-2.38-system.so -> libbfd-2.38-system.so
        libfido2.so.1 -> libfido2.so.1.10.0
        libXmuu.so.1 -> libXmuu.so.1.0.0
        libssh.so.4 -> libssh.so.4.8.7
        liblsan.so.0 -> liblsan.so.0.0.0
        libpng16.so.16 -> libpng16.so.16.37.0
        libgdbm_compat.so.4 -> libgdbm_compat.so.4.0.0
        libXpm.so.4 -> libXpm.so.4.11.0
        libgdbm.so.6 -> libgdbm.so.6.0.0
        libcc1.so.0 -> libcc1.so.0.0.0
        libnvidia-pkcs11-openssl3.so.535.161.07 -> libnvidia-pkcs11-openssl3.so.535.161.07
        libnvidia-allocator.so.1 -> libnvidia-allocator.so.535.161.07
        libnvidia-pkcs11.so.535.161.07 -> libnvidia-pkcs11.so.535.161.07
        libcuda.so.1 -> libcuda.so.535.161.07
        libnvidia-ml.so.1 -> libnvidia-ml.so.535.161.07
        libnvidia-opencl.so.1 -> libnvidia-opencl.so.535.161.07
        libnvidia-cfg.so.1 -> libnvidia-cfg.so.535.161.07
        libnvidia-nvvm.so.4 -> libnvidia-nvvm.so.535.161.07
        libcudadebugger.so.1 -> libcudadebugger.so.535.161.07
        libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.535.161.07
        libselinux.so.1 -> libselinux.so.1
        libpthread.so.0 -> libpthread.so.0
        libthread_db.so.1 -> libthread_db.so.1
        libk5crypto.so.3 -> libk5crypto.so.3.1
        libudev.so.1 -> libudev.so.1.7.2
        libnsl.so.1 -> libnsl.so.1
        libhogweed.so.6 -> libhogweed.so.6.4
        libBrokenLocale.so.1 -> libBrokenLocale.so.1
        libnettle.so.8 -> libnettle.so.8.4
        libnss_dns.so.2 -> libnss_dns.so.2
        libapt-pkg.so.6.0 -> libapt-pkg.so.6.0.0
        libsemanage.so.2 -> libsemanage.so.2
        libc.so.6 -> libc.so.6
        libbz2.so.1.0 -> libbz2.so.1.0.4
        libanl.so.1 -> libanl.so.1
        libncurses.so.6 -> libncurses.so.6.3
        liblz4.so.1 -> liblz4.so.1.9.3
        libmenuw.so.6 -> libmenuw.so.6.3
        libsmartcols.so.1 -> libsmartcols.so.1.1.0
        libpcreposix.so.3 -> libpcreposix.so.3.13.3
        libmvec.so.1 -> libmvec.so.1
        libgcc_s.so.1 -> libgcc_s.so.1
        libm.so.6 -> libm.so.6
        libnss_compat.so.2 -> libnss_compat.so.2
        libnss_files.so.2 -> libnss_files.so.2
        libc_malloc_debug.so.0 -> libc_malloc_debug.so.0
        libsepol.so.2 -> libsepol.so.2
        libnsl.so.2 -> libnsl.so.2.0.1
        libdl.so.2 -> libdl.so.2
        libe2p.so.2 -> libe2p.so.2.3
        libaudit.so.1 -> libaudit.so.1.0.0
        libapt-private.so.0.0 -> libapt-private.so.0.0.0
        libpamc.so.0 -> libpamc.so.0.82.1
        libtinfo.so.6 -> libtinfo.so.6.3
        librt.so.1 -> librt.so.1
        libgssapi_krb5.so.2 -> libgssapi_krb5.so.2.2
        libkeyutils.so.1 -> libkeyutils.so.1.9
        libformw.so.6 -> libformw.so.6.3
        libtic.so.6 -> libtic.so.6.3
        libxxhash.so.0 -> libxxhash.so.0.8.1
        libpam.so.0 -> libpam.so.0.85.1
        libseccomp.so.2 -> libseccomp.so.2.5.3
        libpcre.so.3 -> libpcre.so.3.13.3
        libutil.so.1 -> libutil.so.1
        libz.so.1 -> libz.so.1.2.11
        libssl.so.3 -> libssl.so.3
        libdebconfclient.so.0 -> libdebconfclient.so.0.0.0
/sbin/ldconfig.real: /lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 is the dynamic linker, ignoring

        ld-linux-x86-64.so.2 -> ld-linux-x86-64.so.2
        libgcrypt.so.20 -> libgcrypt.so.20.3.4
        libpcre2-8.so.0 -> libpcre2-8.so.0.10.4
        libdb-5.3.so -> libdb-5.3.so
        libcap-ng.so.0 -> libcap-ng.so.0.0.0
        libprocps.so.8 -> libprocps.so.8.0.3
        libkrb5.so.3 -> libkrb5.so.3.3
        libncursesw.so.6 -> libncursesw.so.6.3
        libuuid.so.1 -> libuuid.so.1.3.0
        libss.so.2 -> libss.so.2.0
        libcom_err.so.2 -> libcom_err.so.2.1
        libform.so.6 -> libform.so.6.3
        libpcprofile.so -> libpcprofile.so
        libresolv.so.2 -> libresolv.so.2
        libtirpc.so.3 -> libtirpc.so.3.0.0
        libgpg-error.so.0 -> libgpg-error.so.0.32.1
        libblkid.so.1 -> libblkid.so.1.1.0
        libmount.so.1 -> libmount.so.1.1.0
        libgmp.so.10 -> libgmp.so.10.4.1
        libcrypt.so.1 -> libcrypt.so.1.1.0
        libnss_hesiod.so.2 -> libnss_hesiod.so.2
        libcrypto.so.3 -> libcrypto.so.3
        libcap.so.2 -> libcap.so.2.44
        libp11-kit.so.0 -> libp11-kit.so.0.3.0
        libkrb5support.so.0 -> libkrb5support.so.0.1
        libsystemd.so.0 -> libsystemd.so.0.32.0
        libtasn1.so.6 -> libtasn1.so.6.6.2
        libstdc++.so.6 -> libstdc++.so.6.0.30
        libacl.so.1 -> libacl.so.1.1.2301
        libffi.so.8 -> libffi.so.8.1.0
        libpanel.so.6 -> libpanel.so.6.3
        libidn2.so.0 -> libidn2.so.0.3.7
        libpanelw.so.6 -> libpanelw.so.6.3
        libattr.so.1 -> libattr.so.1.1.2501
        libmemusage.so -> libmemusage.so
        libunistring.so.2 -> libunistring.so.2.2.0
        libext2fs.so.2 -> libext2fs.so.2.4
        libpam_misc.so.0 -> libpam_misc.so.0.82.1
        libgnutls.so.30 -> libgnutls.so.30.31.0
        libmenu.so.6 -> libmenu.so.6.3
        liblzma.so.5 -> liblzma.so.5.2.5
        libzstd.so.1 -> libzstd.so.1.4.8
/lib: (from <builtin>:0)
root@compute-permanent-node-171:/vllm-workspace# 

pseudotensor commented on June 12, 2024

Is it possible to know your exact command for building the docker image?

Also, is there a docker image for the pre-release 0.4.1?

pseudotensor commented on June 12, 2024

The relevant error after adding the env you suggested:

ubuntu@compute-permanent-node-171:~/vllm$ docker logs ac406d9bc86d
INFO 04-24 06:02:06 api_server.py:151] vLLM API server version 0.4.1
INFO 04-24 06:02:06 api_server.py:152] args: Namespace(host='0.0.0.0', port=5004, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='databricks/dbrx-instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir='/home/ubuntu/.cache/huggingface/hub', load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', worker_use_ray=True, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=1234, swap_space=4, gpu_memory_utilization=0.98, num_gpu_blocks_override=None, max_num_batched_tokens=32768, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=True, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, model_loader_extra_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=100)
2024-04-24 06:02:28,968 INFO worker.py:1749 -- Started a local Ray instance.
INFO 04-24 06:02:29 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='databricks/dbrx-instruct', speculative_config=None, tokenizer='databricks/dbrx-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir='/home/ubuntu/.cache/huggingface/hub', load_format=auto, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=1234)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-24 06:02:46 utils.py:598] Found nccl from environment variable VLLM_NCCL_SO_PATH=/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
ldd: /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1: No such file or directory
ERROR 04-24 06:02:46 pynccl.py:45] Failed to load NCCL library from /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1 .It is expected if you are not running on NVIDIA/AMD GPUs.Otherwise, the nccl library might not exist, be corrupted or it does not support the current platform Linux-6.5.0-1018-oracle-x86_64-with-glibc2.35.One solution is to download libnccl2 version 2.18 from https://developer.download.nvidia.com/compute/cuda/repos/ and extract the libnccl.so.2 file. If you already have the library, please set the environment variable VLLM_NCCL_SO_PATH to point to the correct nccl library path.
INFO 04-24 06:02:46 pynccl_utils.py:17] Failed to import NCCL library: Failed to load NCCL library from /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1 .
INFO 04-24 06:02:46 pynccl_utils.py:18] It is expected if you are not running on NVIDIA GPUs.
(RayWorkerWrapper pid=6433) INFO 04-24 06:02:46 utils.py:598] Found nccl from environment variable VLLM_NCCL_SO_PATH=/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
(RayWorkerWrapper pid=6433) ldd: /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1: No such file or directory
(RayWorkerWrapper pid=6433) ERROR 04-24 06:02:46 pynccl.py:45] Failed to load NCCL library from /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1 .It is expected if you are not running on NVIDIA/AMD GPUs.Otherwise, the nccl library might not exist, be corrupted or it does not support the current platform Linux-6.5.0-1018-oracle-x86_64-with-glibc2.35.One solution is to download libnccl2 version 2.18 from https://developer.download.nvidia.com/compute/cuda/repos/ and extract the libnccl.so.2 file. If you already have the library, please set the environment variable VLLM_NCCL_SO_PATH to point to the correct nccl library path.
(RayWorkerWrapper pid=6433) INFO 04-24 06:02:46 pynccl_utils.py:17] Failed to import NCCL library: Failed to load NCCL library from /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1 .
(RayWorkerWrapper pid=6433) INFO 04-24 06:02:46 pynccl_utils.py:18] It is expected if you are not running on NVIDIA GPUs.
INFO 04-24 06:02:47 selector.py:28] Using FlashAttention backend.
(RayWorkerWrapper pid=6433) INFO 04-24 06:02:47 selector.py:28] Using FlashAttention backend.

pseudotensor commented on June 12, 2024

I'll try running as root with the env.

youkaichao commented on June 12, 2024

Why did ldd /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1 return an error? I don't know why your environment is broken.

pseudotensor commented on June 12, 2024

I'm just building the image with the documented command. It has nothing to do with my environment; the exact same commands work on the released 0.4.0.post1 docker image.

For a non-root user it only finds /usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2, so I'll set the env var to that.

pseudotensor commented on June 12, 2024

OK, that env var worked for the non-root user with that non-root path:

docker run -d     --runtime=nvidia     --gpus '"device=0,1,2,6"'     --shm-size=10.24gb     -p 5004:5004  -e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2       -e NCCL_IGNORE_DISABLED_P2P=1     -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN     --env "HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN"     -v /etc/passwd:/etc/passwd:ro     -v /etc/group:/etc/group:ro     -u `id -u`:`id -g`     -v "${HOME}"/.cache:$HOME/.cache/ -v "${HOME}"/.config:$HOME/.config/   -v "${HOME}"/.triton:$HOME/.triton/     --network host     fee8ae2c9682         --port=5004         --host=0.0.0.0         --model=databricks/dbrx-instruct         --seed 1234         --trust-remote-code         --tensor-parallel-size=4 --max-num-batched-tokens=32768 --max-log-len=100 --trust-remote-code --worker-use-ray --enforce-eager --gpu-memory-utilization 0.98         --download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.dbrx.txt

I think there must be some bug in the docker image if this is required. You keep blaming my env, but with docker that can't be the cause. I gave you my command; it's the same as documented.

youkaichao commented on June 12, 2024

It might be the case that you are running as a non-root user, which cannot access the /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1 file because it was created by the root user.
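The permission failure can be illustrated with a simplified Unix read check (a hypothetical helper that ignores groups and ACLs; note that in practice /root itself is commonly mode 700, so a non-root user often cannot even traverse into the directory):

```python
import os
import stat

def can_read(path, uid):
    """Simplified Unix check: can this uid read the file at path?

    Only considers the owner and 'other' permission bits, ignoring
    groups and ACLs for brevity. Illustrates why a file created by
    root under /root is unreadable to the container's non-root user.
    """
    st = os.stat(path)
    if uid == 0:
        return True  # root bypasses ordinary permission bits
    if uid == st.st_uid:
        return bool(st.st_mode & stat.S_IRUSR)
    return bool(st.st_mode & stat.S_IROTH)
```

With the typical root-created file mode (e.g. 600 or 644 under a 700 /root directory), can_read returns False for any non-owner uid unless the world-readable bit is set.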

pseudotensor commented on June 12, 2024

I've always run prior docker images as non-root. If the build command is somehow wrong, or the image doesn't correctly support a non-root user, then yes, there could be an issue. But it has nothing to do with my env.

youkaichao commented on June 12, 2024

I think that is the key. You are running the image as non-root.

pseudotensor commented on June 12, 2024

Thanks for your help. I hope the released version doesn't have the same problems.

youkaichao commented on June 12, 2024

The docker image is meant to be run as the root user. That's the kind of issue we fight very hard with nccl :(

Techinix commented on June 12, 2024

@pseudotensor Can you please specify which version of flash attention you are working with?

pseudotensor commented on June 12, 2024

@youkaichao Ok, but I've been running the releases of vllm and my own build of vllm inside h2ogpt as non-root for many months now. So this must be a new problem.

@ttbachyinsda Unsure what you mean; I'm building the docker image using the documented commands, so it's whatever is in the Dockerfile.

youkaichao commented on June 12, 2024

This is a new problem. Until the nccl team addresses NVIDIA/nccl#1234, we will suffer a lot with nccl :(

pseudotensor commented on June 12, 2024

OK, no problem. It would be good to document (in the docker run section) that env var and what to set it to for the two cases: running as root, and running as a non-root user like I shared. Then the issue can be closed.

youkaichao commented on June 12, 2024

> document (in the docker run part) that env and what to set it to for the 2 cases of running as root and running as some user like I shared

will do.
