/usr/bin/mpirun --hostfile HOSTFILE --mca btl tcp,self --mca btl_tcp_if_exclude docker0,lo --bind-to none -N 1 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=all -x LD_LIBRARY_PATH=$HOME/nccl/build/lib:/usr/local/cuda-10.0/lib64:$LD_LIBRARY_PATH nccl-tests/build/all_reduce_perf --minbytes 8 --maxbytes 256M --stepfactor 2 --ngpus 1 --check 0 --nthreads 1
# nThread 1 nGpus 1 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 validation: 0
#
# Using devices
# Rank 0 Pid 82431 on pkb-b52c0368-0 device 0 [0x00] Tesla V100-PCIE-16GB
# Rank 1 Pid 82676 on pkb-b52c0368-1 device 0 [0x00] Tesla V100-PCIE-16GB
pkb-b52c0368-0:82431:82431 [0] NCCL INFO Bootstrap : Using [0]eth0:10.0.0.4<0> [1]eth1:fe80::215:5dff:fe33:ff27%eth1<0>
pkb-b52c0368-0:82431:82431 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
pkb-b52c0368-0:82431:82431 [0] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB eth0:10.0.0.4<0>
NCCL version 2.5.6+cuda10.0
pkb-b52c0368-1:82676:82676 [0] NCCL INFO Bootstrap : Using [0]eth0:10.0.0.5<0> [1]eth1:fe80::215:5dff:fe33:ff70%eth1<0>
pkb-b52c0368-1:82676:82676 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
pkb-b52c0368-1:82676:82676 [0] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB eth0:10.0.0.5<0>
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Setting affinity for GPU 0 to 0fff
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Setting affinity for GPU 0 to 0fff
pkb-b52c0368-0:82431:82448 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 545505 mtu 5 LID 108
pkb-b52c0368-1:82676:82696 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 582757 mtu 5 LID 57
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Intel CPU (PCI 12, InterCpu 8)
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Intel CPU (PCI 12, InterCpu 8)
pkb-b52c0368-0:82431:82448 [0] NCCL INFO /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/df5e6558-97d1-4698-9de8-a5f603d2bef7/pci0002:00/0002:00:02.0 -> 0/0/0/0
pkb-b52c0368-1:82676:82696 [0] NCCL INFO /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/63787d95-43b1-4da0-adf3-de0aba056eb9/pci0002:00/0002:00:02.0 -> 0/0/0/0
pkb-b52c0368-1:82676:82696 [0] NCCL INFO === System : maxWidth 12 maxSpeed 12 ===
pkb-b52c0368-1:82676:82696 [0] NCCL INFO CPU/0
pkb-b52c0368-1:82676:82696 [0] NCCL INFO + PCI[12] - PCI/0
pkb-b52c0368-1:82676:82696 [0] NCCL INFO + PCI[12] - PCI/DE
pkb-b52c0368-1:82676:82696 [0] NCCL INFO + PCI[12] - PCI/47505500
pkb-b52c0368-1:82676:82696 [0] NCCL INFO + PCI[12] - GPU/5B6C00000 (1)
pkb-b52c0368-1:82676:82696 [0] NCCL INFO + PCI[12] - PCI/63787D95
pkb-b52c0368-1:82676:82696 [0] NCCL INFO + PCI[12] - NIC/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO === System : maxWidth 12 maxSpeed 12 ===
pkb-b52c0368-1:82676:82696 [0] NCCL INFO + NET[12] - NET/0 (0)
pkb-b52c0368-1:82676:82696 [0] NCCL INFO ==========================================
pkb-b52c0368-1:82676:82696 [0] NCCL INFO GPU/5B6C00000 :GPU/5B6C00000 (0/5000/0) CPU/0 (4/12/2) NET/0 (5/12/2)
pkb-b52c0368-1:82676:82696 [0] NCCL INFO NET/0 :GPU/5B6C00000 (5/12/2) CPU/0 (5/12/2) NET/0 (0/5000/0)
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 1, speed 24/12, nvlink 1, type 2, sameChannels 1
pkb-b52c0368-1:82676:82696 [0] NCCL INFO 0 : NET/0 GPU/1 NET/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO CPU/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO + PCI[12] - PCI/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO + PCI[12] - PCI/DE
pkb-b52c0368-0:82431:82448 [0] NCCL INFO + PCI[12] - PCI/47505500
pkb-b52c0368-0:82431:82448 [0] NCCL INFO + PCI[12] - GPU/92FB00000 (0)
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 12/12, nvlink 1, type 2, sameChannels 1
pkb-b52c0368-1:82676:82696 [0] NCCL INFO 0 : NET/0 GPU/1 NET/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO + PCI[12] - PCI/DF5E6558
pkb-b52c0368-0:82431:82448 [0] NCCL INFO + PCI[12] - NIC/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO + NET[12] - NET/0 (0)
pkb-b52c0368-0:82431:82448 [0] NCCL INFO ==========================================
pkb-b52c0368-0:82431:82448 [0] NCCL INFO GPU/92FB00000 :GPU/92FB00000 (0/5000/0) CPU/0 (4/12/2) NET/0 (5/12/2)
pkb-b52c0368-0:82431:82448 [0] NCCL INFO NET/0 :GPU/92FB00000 (5/12/2) CPU/0 (5/12/2) NET/0 (0/5000/0)
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 1, speed 24/12, nvlink 1, type 2, sameChannels 1
pkb-b52c0368-0:82431:82448 [0] NCCL INFO 0 : NET/0 GPU/0 NET/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 12/12, nvlink 1, type 2, sameChannels 1
pkb-b52c0368-0:82431:82448 [0] NCCL INFO 0 : NET/0 GPU/0 NET/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Channel 00/02 : 0 1
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Channel 01/02 : 0 1
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Threads per block : 512/640/512
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Latency/AlgBw | Tree/ LL | Tree/ LL128 | Tree/Simple | Ring/ LL | Ring/ LL128 | Ring/Simple |
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Broadcast | 0.0/ 0.0| 0.0/ 0.0| 0.0/ 0.0| 4.5/ 3.0| 6.1/ 11.2| 15.0/ 12.0|
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Reduce | 0.0/ 0.0| 0.0/ 0.0| 0.0/ 0.0| 4.5/ 3.0| 6.1/ 11.2| 15.0/ 12.0|
pkb-b52c0368-0:82431:82448 [0] NCCL INFO AllGather | 0.0/ 0.0| 0.0/ 0.0| 0.0/ 0.0| 4.5/ 6.0| 6.1/ 22.5| 15.0/ 24.0|
pkb-b52c0368-0:82431:82448 [0] NCCL INFO ReduceScatter | 0.0/ 0.0| 0.0/ 0.0| 0.0/ 0.0| 4.5/ 6.0| 6.1/ 22.5| 15.0/ 24.0|
pkb-b52c0368-0:82431:82448 [0] NCCL INFO AllReduce | 14.4/ 3.6| 19.4/ 8.4| 100.0/ 10.8| 5.4/ 3.0| 8.6/ 11.2| 21.6/ 12.0|
pkb-b52c0368-0:82431:82448 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] -1/-1/-1->0->1|1->0->-1/-1/-1
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Threads per block : 512/640/512
pkb-b52c0368-1:82676:82696 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] 0/-1/-1->1->-1|-1->1->0/-1/-1
pkb-b52c0368-0:82431:82448 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 742314 mtu 5 LID 108
pkb-b52c0368-1:82676:82696 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 159581 mtu 5 LID 57
pkb-b52c0368-0:82431:82448 [0] NCCL INFO NET/IB : GPU Direct RDMA Enabled for GPU 92fb00000 / HCA 0 (distance 1 < 2), read 0
pkb-b52c0368-1:82676:82696 [0] NCCL INFO NET/IB : GPU Direct RDMA Enabled for GPU 5b6c00000 / HCA 0 (distance 1 < 2), read 0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Ring 00 : 1[5b6c00000] -> 0[92fb00000] [receive] via NET/IB/0/GDRDMA
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Ring 00 : 0[92fb00000] -> 1[5b6c00000] [receive] via NET/IB/0/GDRDMA
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Ring 00 : 0[92fb00000] -> 1[5b6c00000] [send] via NET/IB/0
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Ring 00 : 1[5b6c00000] -> 0[92fb00000] [send] via NET/IB/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 126471 mtu 5 LID 108
pkb-b52c0368-1:82676:82696 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 424932 mtu 5 LID 57
pkb-b52c0368-0:82431:82448 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 470331 mtu 5 LID 108
pkb-b52c0368-1:82676:82696 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 490679 mtu 5 LID 57
pkb-b52c0368-0:82431:82448 [0] NCCL INFO NET/IB : GPU Direct RDMA Enabled for GPU 92fb00000 / HCA 0 (distance 1 < 2), read 0
pkb-b52c0368-1:82676:82696 [0] NCCL INFO NET/IB : GPU Direct RDMA Enabled for GPU 5b6c00000 / HCA 0 (distance 1 < 2), read 0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Ring 01 : 1[5b6c00000] -> 0[92fb00000] [receive] via NET/IB/0/GDRDMA
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Ring 01 : 0[92fb00000] -> 1[5b6c00000] [receive] via NET/IB/0/GDRDMA
pkb-b52c0368-0:82431:82448 [0] NCCL INFO Ring 01 : 0[92fb00000] -> 1[5b6c00000] [send] via NET/IB/0
pkb-b52c0368-1:82676:82696 [0] NCCL INFO Ring 01 : 1[5b6c00000] -> 0[92fb00000] [send] via NET/IB/0
pkb-b52c0368-0:82431:82448 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 179520 mtu 5 LID 108
pkb-b52c0368-1:82676:82696 [0] NCCL INFO NET/IB: Dev 0 Port 1 qpn 542420 mtu 5 LID 57
pkb-b52c0368-0:82431:82448 [0] NCCL INFO comm 0x7f7724001aa0 rank 0 nranks 2 cudaDev 0 busId 2fb00000 - Init COMPLETE
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
pkb-b52c0368-0:82431:82431 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f774a000000 recvbuff 0x7f773a000000 count 67108864 datatype 7 op 0 root 0 comm 0x7f7724001aa0 [nranks=2] stream 0x132bd40
pkb-b52c0368-0:82431:82431 [0] NCCL INFO Launch mode Parallel
pkb-b52c0368-0:82431:82431 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f774a000000 recvbuff 0x7f773a000000 count 67108864 datatype 7 op 0 root 0 comm 0x7f7724001aa0 [nranks=2] stream 0x132bd40
pkb-b52c0368-0:82431:82431 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f774a000000 recvbuff 0x7f773a000000 count 67108864 datatype 7 op 0 root 0 comm 0x7f7724001aa0 [nranks=2] stream 0x132bd40
pkb-b52c0368-0:82431:82431 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f774a000000 recvbuff 0x7f773a000000 count 67108864 datatype 7 op 0 root 0 comm 0x7f7724001aa0 [nranks=2] stream 0x132bd40
pkb-b52c0368-0:82431:82431 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f774a000000 recvbuff 0x7f773a000000 count 67108864 datatype 7 op 0 root 0 comm 0x7f7724001aa0 [nranks=2] stream 0x132bd40
pkb-b52c0368-1:82676:82696 [0] NCCL INFO comm 0x7fb6cc001aa0 rank 1 nranks 2 cudaDev 0 busId b6c00000 - Init COMPLETE
pkb-b52c0368-1:82676:82676 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fb6a0000000 recvbuff 0x7fb690000000 count 67108864 datatype 7 op 0 root 0 comm 0x7fb6cc001aa0 [nranks=2] stream 0x223dde0
pkb-b52c0368-1:82676:82676 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7fb6a0000000 recvbuff 0x7fb690000000 count 67108864 datatype 7 op 0 root 0 comm 0x7fb6cc001aa0 [nranks=2] stream 0x223dde0
pkb-b52c0368-1:82676:82676 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7fb6a0000000 recvbuff 0x7fb690000000 count 67108864 datatype 7 op 0 root 0 comm 0x7fb6cc001aa0 [nranks=2] stream 0x223dde0
pkb-b52c0368-1:82676:82676 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7fb6a0000000 recvbuff 0x7fb690000000 count 67108864 datatype 7 op 0 root 0 comm 0x7fb6cc001aa0 [nranks=2] stream 0x223dde0
pkb-b52c0368-1:82676:82676 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7fb6a0000000 recvbuff 0x7fb690000000 count 67108864 datatype 7 op 0 root 0 comm 0x7fb6cc001aa0 [nranks=2] stream 0x223dde0
pkb-b52c0368-0:82431:82449 [0] transport/net_ib.cc:774 NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 32631, vendor err 129
pkb-b52c0368-0:82431:82449 [0] NCCL INFO include/net.h:28 -> 2
pkb-b52c0368-0:82431:82449 [0] NCCL INFO transport/net.cc:377 -> 2
pkb-b52c0368-0:82431:82449 [0] NCCL INFO transport.cc:166 -> 2 [Proxy Thread]
pkb-b52c0368-1:82676:82697 [0] transport/net_ib.cc:774 NCCL WARN NET/IB : Got completion with error 12, opcode 671036413, len 0, vendor err 129
pkb-b52c0368-1:82676:82697 [0] NCCL INFO include/net.h:28 -> 2
pkb-b52c0368-1:82676:82697 [0] NCCL INFO transport/net.cc:377 -> 2
pkb-b52c0368-1:82676:82697 [0] NCCL INFO transport.cc:166 -> 2 [Proxy Thread]
pkb-b52c0368-1: Test NCCL failure common.cu:345 'unhandled system error'
.. pkb-b52c0368-1: Test failure common.cu:393
.. pkb-b52c0368-1: Test failure common.cu:492
.. pkb-b52c0368-1: Test failure all_reduce.cu:103
.. pkb-b52c0368-1: Test failure common.cu:518
.. pkb-b52c0368-1: Test failure common.cu:839
pkb-b52c0368-0: Test NCCL failure common.cu:345 'unhandled system error'
.. pkb-b52c0368-0: Test failure common.cu:393
.. pkb-b52c0368-0: Test failure common.cu:492
.. pkb-b52c0368-0: Test failure all_reduce.cu:103
.. pkb-b52c0368-0: Test failure common.cu:518
.. pkb-b52c0368-0: Test failure common.cu:839
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[44150,1],1]
Exit code: 3
--------------------------------------------------------------------------