nvidia-developer-blog / code-samples

Source code examples from the Parallel Forall Blog

License: BSD 3-Clause "New" or "Revised" License

C 17.84% Makefile 4.30% Fortran 1.07% Cuda 26.00% HTML 28.28% CSS 2.01% Perl 0.25% JavaScript 3.14% C++ 9.58% Shell 0.39% MATLAB 3.53% Jupyter Notebook 2.16% R 0.24% Python 1.18% CMake 0.05%

code-samples's Introduction

Parallel Forall Code Samples

This repository contains CUDA, OpenACC, Python, MATLAB, and other source code examples from the NVIDIA Parallel Forall Blog.

License

These examples are released under the BSD open source license. Refer to license.txt in this directory for full details.

List of Code Samples

posts/002-openacc-example: An example of OpenACC directives programming.

code-samples's People

Contributors

abc99lr, angererc, chirayug-nvidia, dependabot[bot], faustomilletari, gabrielle9talavera, gruetsch, habbasian, harrism, jdemouth, jirikraus, jordiblasco, luitjens, mdoijade, muellren, nsakharnykh, nvjonwong, pkashinkunti, pmessmer, rcrovella, vinaydes


code-samples's Issues

ioHelper.cpp:66:5: error: ‘onnx’ has not been declared

When I compiled the example from posts/TensorRT-introduction, I got the following errors:

ioHelper.cpp: In function ‘std::ostream& nvinfer1::operator<<(std::ostream&, nvinfer1::ILogger::Severity)’:
ioHelper.cpp:52:12: warning: enumeration value ‘kVERBOSE’ not handled in switch [-Wswitch]
     switch (severity)
            ^
ioHelper.cpp: In function ‘size_t nvinfer1::readTensorProto(const string&, float*)’:
ioHelper.cpp:66:5: error: ‘onnx’ has not been declared
     onnx::TensorProto tensorProto;
     ^~~~
ioHelper.cpp:67:10: error: ‘tensorProto’ was not declared in this scope
     if (!tensorProto.ParseFromString(data))
          ^~~~~~~~~~~
ioHelper.cpp:67:10: note: suggested alternative: ‘readTensorProto’
     if (!tensorProto.ParseFromString(data))
          ^~~~~~~~~~~
          readTensorProto
In file included from /home/opt/compiler/gcc-8.2/gcc-8.2/include/c++/8.2.0/cassert:44,
                 from /home/work/protobuf/src/google/protobuf/extension_set.h:42,
                 from /home/work/onnx-tensorrt/build/third_party/onnx/onnx/onnx_onnx2trt_onnx-ml.pb.h:33,
                 from /home/work/onnx-tensorrt/build/third_party/onnx/onnx/onnx-ml.pb.h:2,
                 from /home/work/onnx-tensorrt/third_party/onnx/onnx/onnx_pb.h:50,
                 from ioHelper.cpp:32:
ioHelper.cpp:70:12: error: ‘tensorProto’ was not declared in this scope
     assert(tensorProto.has_raw_data());
            ^~~~~~~~~~~
ioHelper.cpp:70:12: note: suggested alternative: ‘readTensorProto’
make: *** [<builtin>: ioHelper.o] Error 1

I have installed onnx-tensorrt and TensorRT successfully, so why can't it find onnx?

My ioHelper.cpp is identical to the one in this repository.

Any reply would be much appreciated!

@harrism @angererc @nsakharnykh
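
For context, onnx_pb.h in the stock ONNX distribution selects between onnx.pb.h and onnx-ml.pb.h based on the ONNX_ML preprocessor define, and the onnx namespace only resolves when that define matches how the protobuf headers were generated. A minimal sketch of that contract (a hedged guess at the cause, not a confirmed fix for this report):

// ONNX_ML must match the flag used when the onnx protobuf headers were
// generated; the sample's Makefile passes -DONNX_ML=1 on the compile line.
#define ONNX_ML 1
#include <onnx/onnx_pb.h>   // pulls in onnx-ml.pb.h, declaring onnx::TensorProto

onnx::TensorProto tensorProto;  // now resolves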

CUDA-aware runtime error

Hi all,
First, thanks a lot for sharing the very valuable code base here.

I am trying to run the CUDA-aware MPI example on a compute node with a P100 device attached. The example was built with the following modules (notably CUDA 10 and OpenMPI 4):

Currently loaded modules:

  1) GCCcore/6.4.0
  2) binutils/2.28-GCCcore-6.4.0
  3) GCC/6.4.0-2.28
  4) zlib/1.2.11-GCCcore-6.4.0
  5) numactl/2.0.11-GCCcore-6.4.0
  6) XZ/5.2.3-GCCcore-6.4.0
  7) libxml2/2.9.7-GCCcore-6.4.0
  8) libpciaccess/0.14-GCCcore-6.4.0
  9) hwloc/2.0.2-GCCcore-6.4.0
 10) CUDA/10.0.130
 11) OpenMPI/4.0.0-GCC-6.4.0-2.28

Below is the error message I receive after executing the first example:

cuda-aware-mpi-example$ mpiexec -np 2 bin/jacobi_cuda_normal_mpi -t 2 1
Topology size: 2 x 1
Local domain size (current node): 4096 x 4096
Global domain size (all nodes): 8192 x 4096
Error: CUDA result "CUDA driver version is insufficient for CUDA runtime version" for call "cudaGetDeviceCount(&devCount)" in file "CUDA_Normal_MPI.c" at line 55. Terminating...
Error: CUDA result "CUDA driver version is insufficient for CUDA runtime version" for call "cudaGetDeviceCount(&devCount)" in file "CUDA_Normal_MPI.c" at line 55. Terminating...

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[15358,1],0]
Exit code: 255

I am using CUDA/10.0.130, and the most recent release is 10.1, so my CUDA module is not that old. Nevertheless, the error complains that the "CUDA driver version is insufficient".

Do you have any idea what might have been wrong?

Regards,
Ehsan
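
For reference, a minimal check of whether the installed driver actually supports the runtime the binary was built against (standard CUDA runtime API calls; this only diagnoses the mismatch, it does not fix it):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int driverVersion = 0, runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);   // highest CUDA version the driver supports
    cudaRuntimeGetVersion(&runtimeVersion); // CUDA version of the linked runtime
    printf("Driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
           driverVersion / 1000, (driverVersion % 100) / 10,
           runtimeVersion / 1000, (runtimeVersion % 100) / 10);
    return 0;
}

The "driver version is insufficient" error means the first number is lower than the second, no matter how recent the loaded CUDA toolkit module is.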

How to calculate TFLOPS in LSTM.cu

The output of this code is a runtime, but what I want to compare is throughput. How do I convert the runtime into TFLOPS?
I mean, how is the computation count related to the other parameters?
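
For reference, a hedged sketch of one common conversion, assuming the usual four-gate LSTM cost model (the exact FLOP count depends on the layer sizes LSTM.cu actually uses; every name here is illustrative):

// FLOPs for one LSTM time step: four input GEMMs plus four recurrent GEMMs,
// counting a multiply-add as 2 FLOPs. Elementwise gate math is ignored.
double lstmTflops(int hiddenSize, int inputSize, int seqLength,
                  int numLayers, int batchSize, double elapsedSeconds)
{
    double flopsPerStep = 2.0 * 4.0 * hiddenSize * (hiddenSize + inputSize);
    double totalFlops   = flopsPerStep * batchSize * seqLength * numLayers;
    return totalFlops / elapsedSeconds / 1e12;   // seconds -> TFLOP/s
}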

Verification Failed on sample for cufft_callbacks

After compiling the code, I got a verification failure for both binaries. I have just started trying to understand why. If anyone has tips on this, I would really appreciate hearing them.

(base) eduardoj@Worksmart:~/Repo/NVIDIA-developer-blog/code-samples/posts/cufft-callbacks$ ./cufft_no_callbacks 
Preparing input: 1000x1024
Computing reference solution
Creating FFT plan
Running 100 iterations
Time for the FFT: 25.354240ms
28000: (-2.59,-8.31) != (-2.59, -8.31)
           *** FAILED ***
!!! Verification Failed !!!
Done
(base) eduardoj@Worksmart:~/Repo/NVIDIA-developer-blog/code-samples/posts/cufft-callbacks$ ./cufft_callbacks 
Preparing input: 1000x1024
Computing reference solution
Creating FFT plan
Running 100 iterations
Time for the FFT: 9.087104ms
1000: (3.56,-0.35) != (3.56, -0.35)
           *** FAILED ***
!!! Verification Failed !!!
Done
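
For what it's worth, the printed pairs agree to the two decimals shown, so the actual difference is invisible in this output. A hedged sketch of a check that prints full precision and applies a relative tolerance (illustrative names, not the sample's own verification code):

#include <cmath>
#include <cstdio>

bool nearlyEqual(float ref, float val, float relTol = 1e-4f)
{
    float diff  = std::fabs(ref - val);
    float scale = std::fmax(std::fabs(ref), std::fabs(val));
    if (diff > relTol * scale) {
        printf("mismatch: %.9g vs %.9g (diff %.3g)\n", ref, val, diff);
        return false;
    }
    return true;
}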

TensorFlow to TensorRT

Hi,

I am trying to convert a TensorFlow model to TensorRT. The import succeeds, but the convolution layer output is completely different. I guess that is due to the different ordering of convolution-layer weights in TF (RSCK) and TRT (KCRS).

Have you encountered the same problem before?

Thanks
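
For reference, a minimal sketch of the reordering the question describes, from TensorFlow's RSCK (filter height, filter width, input channels, output channels) to TensorRT's KCRS. The buffer names and the plain nested-loop approach are illustrative:

void rsckToKcrs(const float* rsck, float* kcrs, int R, int S, int C, int K)
{
    for (int r = 0; r < R; ++r)
        for (int s = 0; s < S; ++s)
            for (int c = 0; c < C; ++c)
                for (int k = 0; k < K; ++k)
                    kcrs[((k * C + c) * R + r) * S + s] =
                        rsck[((r * S + s) * C + c) * K + k];
}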

CUDA-aware MPI example complains about CUDA runtime version

Hi,

Firstly, thanks for sharing the great examples with the community.
I am interested in starting with the CUDA-aware MPI example to test our GPU nodes, where each node has 4xP100 devices.

I recently installed OpenMPI v. 4.0.0 together with CUDA/10.0.130. The nvidia driver is version 418.40.04. The example is compiled using these modules.
Now, I would like to test the node ...

bin$ mpirun -np 4 jacobi_cuda_normal_mpi -t 2 2
Topology size: 2 x 2
Local domain size (current node): 4096 x 4096
Global domain size (all nodes): 8192 x 8192
Error: CUDA result "CUDA driver version is insufficient for CUDA runtime version" for call "cudaGetDeviceCount(&devCount)" in file "CUDA_Normal_MPI.c" at line 55. Terminating...
Error: CUDA result "CUDA driver version is insufficient for CUDA runtime version" for call "cudaGetDeviceCount(&devCount)" in file "CUDA_Normal_MPI.c" at line 55. Terminating...
Error: CUDA result "CUDA driver version is insufficient for CUDA runtime version" for call "cudaGetDeviceCount(&devCount)" in file "CUDA_Normal_MPI.c" at line 55. Terminating...
Error: CUDA result "CUDA driver version is insufficient for CUDA runtime version" for call "cudaGetDeviceCount(&devCount)" in file "CUDA_Normal_MPI.c" at line 55. Terminating...

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[41647,1],2]
Exit code: 255


Do you have any idea why this goes wrong?

With best regards,
Ehsan

Wrong bandwidth when using cuda-aware-mpi-example

I tried to run jacobi_cuda_aware_mpi and jacobi_cuda_normal_mpi on an HPC system, using two A100s with 40 GB of memory each. The maximum GPU memory bandwidth is 1,555 GB/s, but the benchmark reports 2.52 TB/s. Also, with both GPUs on the same node, the CUDA-aware run is slower than the normal one...

This is the normal MPI result from two Nvidia A100s on the same node:
Topology size: 2 x 1
Local domain size (current node): 20480 x 20480
Global domain size (all nodes): 40960 x 20480
normal-ID= 0
normal-ID= 1
Starting Jacobi run with 2 processes using "A100-SXM4-40GB" GPUs (ECC enabled: 2 / 2):
Iteration: 0 - Residue: 0.250000
Iteration: 100 - Residue: 0.002397
Iteration: 200 - Residue: 0.001204
Iteration: 300 - Residue: 0.000804
Iteration: 400 - Residue: 0.000603
Iteration: 500 - Residue: 0.000483
Iteration: 600 - Residue: 0.000403
Iteration: 700 - Residue: 0.000345
Iteration: 800 - Residue: 0.000302
Iteration: 900 - Residue: 0.000269
Iteration: 1000 - Residue: 0.000242
Iteration: 1100 - Residue: 0.000220
Iteration: 1200 - Residue: 0.000201
Iteration: 1300 - Residue: 0.000186
Iteration: 1400 - Residue: 0.000173
Iteration: 1500 - Residue: 0.000161
Iteration: 1600 - Residue: 0.000151
Iteration: 1700 - Residue: 0.000142
Iteration: 1800 - Residue: 0.000134
Iteration: 1900 - Residue: 0.000127
Stopped after 2000 iterations with residue 0.000121
Total Jacobi run time: 21.3250 sec.
Average per-process communication time: 0.2794 sec.
Measured lattice updates: 78.66 GLU/s (total), 39.33 GLU/s (per process)
Measured FLOPS: 393.31 GFLOPS (total), 196.66 GFLOPS (per process)
Measured device bandwidth: 5.03 TB/s (total), 2.52 TB/s (per process)

This is the CUDA-aware MPI result from two Nvidia A100s on the same node:
Topology size: 2 x 1
Local domain size (current node): 20480 x 20480
Global domain size (all nodes): 40960 x 20480
Starting Jacobi run with 2 processes using "A100-SXM4-40GB" GPUs (ECC enabled: 2 / 2):
Iteration: 0 - Residue: 0.250000
Iteration: 100 - Residue: 0.002397
Iteration: 200 - Residue: 0.001204
Iteration: 300 - Residue: 0.000804
Iteration: 400 - Residue: 0.000603
Iteration: 500 - Residue: 0.000483
Iteration: 600 - Residue: 0.000403
Iteration: 700 - Residue: 0.000345
Iteration: 800 - Residue: 0.000302
Iteration: 900 - Residue: 0.000269
Iteration: 1000 - Residue: 0.000242
Iteration: 1100 - Residue: 0.000220
Iteration: 1200 - Residue: 0.000201
Iteration: 1300 - Residue: 0.000186
Iteration: 1400 - Residue: 0.000173
Iteration: 1500 - Residue: 0.000161
Iteration: 1600 - Residue: 0.000151
Iteration: 1700 - Residue: 0.000142
Iteration: 1800 - Residue: 0.000134
Iteration: 1900 - Residue: 0.000127
Stopped after 2000 iterations with residue 0.000121
Total Jacobi run time: 51.8048 sec.
Average per-process communication time: 4.4083 sec.
Measured lattice updates: 32.38 GLU/s (total), 16.19 GLU/s (per process)
Measured FLOPS: 161.90 GFLOPS (total), 80.95 GFLOPS (per process)
Measured device bandwidth: 2.07 TB/s (total), 1.04 TB/s (per process)

I ran both on the same node with the same GPUs. Because I use the sbatch system, I changed the ENV_LOCAL_RANK flag to "SLURM_LOCALID"; I also tried "OMPI_COMM_WORLD_LOCAL_RANK" since I use OpenMPI. Either way, CUDA-aware MPI was much slower than normal MPI when the GPUs are on the same node (though with each GPU on a different node, CUDA-aware MPI is a little faster than the normal one). Maybe I didn't activate CUDA-aware support?

Does anyone have any idea about this? Thanks a lot!
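
One plausible reading of the bandwidth numbers, as a hedged sketch: if the benchmark derives "device bandwidth" by multiplying lattice updates per second by a fixed bytes-per-update constant, it is reporting an effective bandwidth in which cached neighbor reads count at full price, so it can legitimately exceed the 1,555 GB/s DRAM peak. The constant 64 below is a guess, but it is consistent with the output above (78.66 GLU/s × 64 B ≈ 5.03 TB/s, and 39.33 GLU/s × 64 B ≈ 2.52 TB/s per process):

double effectiveBandwidthTBs(double latticeUpdatesPerSec)
{
    const double bytesPerUpdate = 64.0;  // e.g. 8 double accesses per 5-point update
    return latticeUpdatesPerSec * bytesPerUpdate / 1e12;
}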

memtype_cache.c:137 UCX WARN destroying inuse address

Hey guys,

I have installed OpenMPI v3.1.4 with UCX, gdrcopy, and CUDA from source. When I load the module, here is the list of dependencies:

(screenshot of the loaded modules omitted)

When I try the jacobi_cuda_normal_mpi with 4 processes, I get an output, and a lot of (similar) warning messages at the bottom. The warnings look like this (full stdout/stderr attached below):

[1557488926.059031] [r24g38:186803:0]  memtype_cache.c:137  UCX  WARN  destroying inuse address:0x2b11a8000000 

Should I be concerned about UCX? Is there a way to inspect this further and fix the issue, so that stderr stays clean?

Kindest regards,
Ehsan

log.txt

CUDA aware Jacobi examples fail using PGI

I've been able to run the CUDA aware and CUDA normal Jacobi examples using hpcx-2.4.0 and HPE MPT (2.20r173) using the GNU-8.2.0 compilers. However, I get a segfault with the following trace when using pgi-19.5.

pgcc --version
pgcc 19.5-0 LLVM 64-bit target on x86-64 Linux -tp skylake

MPT: #1 0x00002aaaab8d7b96 in mpi_sgi_system (
MPT: #2 MPI_SGI_stacktraceback (
MPT: header=header@entry=0x7fffffffbd40 "MPT ERROR: Rank 1(g:1) received signal SIGSEGV(11).\n\tProcess ID: 41511, Host: r101i0n0, Program: /nobackupp16/swbuild/dkokron/cuda/bin/jacobi_cuda_normal_mpi\n\tMPT Version: HPE MPT 2.20 05/28/19 04:16"...) at sig.c:340
MPT: #3 0x00002aaaab8d7d92 in first_arriver_handler (signo=signo@entry=11,
MPT: stack_trace_sem=stack_trace_sem@entry=0x2aaaaf380080) at sig.c:489
MPT: #4 0x00002aaaab8d812b in slave_sig_handler (signo=11,
MPT: siginfo=, extra=) at sig.c:565
MPT: #5
MPT: #6 0x0000000000404053 in CallJacobiKernel ()
MPT: #7 0x00000000004038b2 in RunJacobi (cartComm=3, rank=1, size=2,
MPT: domSize=0x7fffffffd100, topIndex=0x7fffffffd0d8, neighbors=0x7fffffffd0e4,
MPT: useFastSwap=0, devBlocks=0x7fffffffd160, devSideEdges=0x7fffffffd150,
MPT: devHaloLines=0x7fffffffd140, hostSendLines=0x7fffffffd130,
MPT: hostRecvLines=0x7fffffffd120, devResidue=0x2aeaf0220000,
MPT: copyStream=0xa152e90, iterations=0x7fffffffd174,
MPT: avgTransferTime=0x7fffffffd178) at Host.c:470
MPT: #8 0x0000000000401d8e in main (argc=4, argv=0x7fffffffd298) at Jacobi.c:78

Is CHW format mandatory in inference?

The model was trained with Keras/TensorFlow, so its input is certainly HWC; however, during inference I see that both the testing and conversion scripts force CHW input. Why is that, and how is that possible?

compile "TensorRT-introduction"

I compile "TensorRT-introduction", but have some error as follow:

CMakeFiles/main.dir/ioHelper.cpp.o: In function `nvinfer1::readTensorProto(std::string const&, float*)':
ioHelper.cpp:(.text+0x131): undefined reference to `onnx::TensorProto::TensorProto()'
ioHelper.cpp:(.text+0x232): undefined reference to `onnx::TensorProto::~TensorProto()'
ioHelper.cpp:(.text+0x255): undefined reference to `onnx::TensorProto::~TensorProto()'
collect2: error: ld returned 1 exit status
make[2]: *** [CMakeFiles/main.dir/build.make:201: main] Error 1
make[1]: *** [CMakeFiles/Makefile2:95: CMakeFiles/main.dir/all] Error 2
make: *** [Makefile:103: all] Error 2

CMakeLists.txt

cmake_minimum_required(VERSION 2.8)

project(simpleOnnx)

add_executable(main simpleOnnx_2.cpp cudaWrapper.h ioHelper.cpp ioHelper.h logger.cpp)

include_directories(/usr/local/include)

include_directories(/home/developer/zhangcc/download/TensorRT-7.0.0.11/include)
include_directories(/home/developer/zhangcc/download/TensorRT-7.0.0.11/samples/common)
set(TENSORRT_DIR /home/developer/zhangcc/download/TensorRT-7.0.0.11/lib)

find_package(CUDA 10.2 REQUIRED)
message(STATUS "CUDA status:")
message(STATUS "    include path: ${CUDA_INCLUDE_DIRS}")
message(STATUS "    libraries: ${CUDA_LIBRARIES}")
include_directories(${CUDA_INCLUDE_DIRS})
target_link_libraries(main ${CUDA_LIBRARIES})

find_package(Protobuf REQUIRED)
message(STATUS "Protobuf library status:")
message(STATUS "    version: ${Protobuf_VERSION}")
message(STATUS "    libraries: ${Protobuf_LIBS}")
message(STATUS "    include path: ${Protobuf_INCLUDE_DIRS}")

target_link_libraries(main /usr/local/lib/libprotobuf-lite.so)
target_link_libraries(main /usr/local/lib/libonnxifi.so)
target_link_libraries(main /usr/local/lib/libprotoc.so)

target_link_libraries(main ${TENSORRT_DIR}/libnvinfer.so ${TENSORRT_DIR}/libnvinfer_plugin.so ${TENSORRT_DIR}/libnvonnxparser.so ${TENSORRT_DIR}/libnvparsers.so)

Cannot detect the CUDA compiler ABI

When I try to build the test case with CMake, I get this error:

[cmake] -- The CUDA compiler identification is unknown
[cmake] -- Detecting CUDA compiler ABI info
[cmake] -- Detecting CUDA compiler ABI info - failed
[cmake] -- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc
[cmake] -- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - broken
[cmake] -- Configuring incomplete, errors occurred!
[cmake] See also "/home/XXX/文档/test/cuda/build/CMakeFiles/CMakeOutput.log".
[cmake] See also "/home/XXX/文档/test/cuda/build/CMakeFiles/CMakeError.log".
[cmake] CMake Error at /usr/local/share/cmake-3.22/Modules/CMakeTestCUDACompiler.cmake:56 (message):
[cmake]   The CUDA compiler
[cmake] 
[cmake]     "/usr/local/cuda/bin/nvcc"
[cmake] 
[cmake]   is not able to compile a simple test program.
[cmake] 
[cmake]   It fails with the following output:
[cmake] 
[cmake]     Change Dir: /home/XXX/文档/test/cuda/build/CMakeFiles/CMakeTmp
[cmake]     
[cmake]     Run Build Command(s):/usr/bin/gmake -f Makefile cmTC_77547/fast && /usr/bin/gmake  -f CMakeFiles/cmTC_77547.dir/build.make CMakeFiles/cmTC_77547.dir/build
[cmake]     gmake[1]: Entering directory '/home/XXX/文档/test/cuda/build/CMakeFiles/CMakeTmp'
[cmake]     Building CUDA object CMakeFiles/cmTC_77547.dir/main.cu.o
[cmake]     /usr/local/cuda/bin/nvcc      -c /home/XXX/文档/test/cuda/build/CMakeFiles/CMakeTmp/main.cu -o CMakeFiles/cmTC_77547.dir/main.cu.o
[cmake]     /usr/include/stdio.h(189): error: attribute "__malloc__" does not take arguments
[cmake]     
[cmake]     /usr/include/stdio.h(201): error: attribute "__malloc__" does not take arguments
[cmake]     
[cmake]     /usr/include/stdio.h(223): error: attribute "__malloc__" does not take arguments
[cmake]     
[cmake]     /usr/include/stdio.h(260): error: attribute "__malloc__" does not take arguments
[cmake]     
[cmake]     /usr/include/stdio.h(285): error: attribute "__malloc__" does not take arguments
[cmake]     
[cmake]     /usr/include/stdio.h(294): error: attribute "__malloc__" does not take arguments
[cmake]     
[cmake]     /usr/include/stdio.h(303): error: attribute "__malloc__" does not take arguments
[cmake]     
[cmake]     /usr/include/stdio.h(309): error: attribute "__malloc__" does not take arguments
[cmake]     
[cmake]     /usr/include/stdio.h(315): error: attribute "__malloc__" does not take arguments
[cmake]     
[cmake]     /usr/include/stdio.h(830): error: attribute "__malloc__" does not take arguments
[cmake]     
[cmake]     /usr/include/stdlib.h(566): error: attribute "__malloc__" does not take arguments
[cmake]     
[cmake]     /usr/include/stdlib.h(570): error: attribute "__malloc__" does not take arguments
[cmake]     
[cmake]     /usr/include/stdlib.h(799): error: attribute "__malloc__" does not take arguments
[cmake]     
[cmake]     13 errors detected in the compilation of "/home/XXX/文档/test/cuda/build/CMakeFiles/CMakeTmp/main.cu".
[cmake]     gmake[1]: *** [CMakeFiles/cmTC_77547.dir/build.make:78:CMakeFiles/cmTC_77547.dir/main.cu.o] Error 1
[cmake]     gmake[1]: Leaving directory '/home/XXX/文档/test/cuda/build/CMakeFiles/CMakeTmp'
[cmake]     gmake: *** [Makefile:127:cmTC_77547/fast] Error 2

I think the key symptom is "Detecting CUDA compiler ABI info - failed".
The CMakeLists.txt begins with:

cmake_minimum_required(VERSION 3.10)
project(cmake_and_cuda LANGUAGES CXX CUDA)
find_package(CUDA)

Test environment:

  • Ubuntu 22.04
  • CMake 3.22
  • CUDA 11.4
  • g++/gcc 11.3.0

submodule is broken

Could someone fix the submodules, please?
I get the following error messages:

fatal: reference is not a tree: 93696c4bce447b71c4bd0b25d1e26f1247341c04
fatal: reference is not a tree: 93696c4bce447b71c4bd0b25d1e26f1247341c04
Unable to checkout '93696c4bce447b71c4bd0b25d1e26f1247341c04' in submodule path 'posts/american-options/external/cub'
Unable to checkout '93696c4bce447b71c4bd0b25d1e26f1247341c04' in submodule path 'posts/parallel_reduction_with_shfl/cub'

Getting errors running tensor-cores example

Running the example from the posts/tensor-cores folder as discussed at https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/, it appears the numbers are not as close as expected. I get the following output:

./TCGemm 

M = 16384, N = 16384, K = 16384. alpha = 2.000000, beta = 2.000000

Running with wmma...
Running with cuBLAS...

Checking results...
8266.587891 8267.766602
8240.230469 8241.420898
8242.393555 8243.574219
8209.478516 8210.649414
8100.519043 8101.664062
8251.499023 8252.675781
8189.156738 8190.297852
8260.410156 8261.580078
8311.802734 8313.015625
WMMA does not agree with cuBLAS! 268435456 errors!

tensor core example result mismatch with that of cublas

While learning from the Tensor Core low-level code example using wmma::, I found a mismatch in the resulting C matrix where the Tensor Core and cuBLAS sums were compared; a large number of mismatches occurred. Because I am still learning Tensor Cores, I could not find where the error is. Can you investigate?
code-samples/posts/tensor-cores


[root@localhost tensor-cores]# ls -l
total 832
-rw-r--r--. 1 root root   1685 Feb  5 08:01 Makefile
-rw-r--r--. 1 root root    317 Feb  5 08:01 README.md
-rwxr-xr-x. 1 root root 830920 Feb  5 08:01 TCGemm
-rw-r--r--. 1 root root  11380 Feb  5 08:01 simpleTensorCoreGEMM.cu
[root@localhost tensor-cores]# make && ./TCGemm
nvcc -o TCGemm -arch=sm_70 -lcublas -lcurand simpleTensorCoreGEMM.cu

M = 16384, N = 16384, K = 16384. alpha = 2.000000, beta = 2.000000

Running with wmma...
Running with cuBLAS...

Checking results...
8266.587891 8267.766602
8240.230469 8241.420898
8242.393555 8243.574219
8209.478516 8210.649414
8100.519043 8101.664062
8251.499023 8252.675781
8189.156738 8190.297852
8260.410156 8261.580078
8311.802734 8313.015625
WMMA does not agree with cuBLAS! 268435456 errors!
[root@localhost tensor-cores]# git remote -v
origin  https://github.com/NVIDIA-developer-blog/code-samples.git (fetch)
origin  https://github.com/NVIDIA-developer-blog/code-samples.git (push)

simpleTensorCoreGEMM has errors in output when compiled with CUDA10 for Turing GPUs

simpleTensorCoreGEMM produces output errors (beyond the additive tolerance of 1e-5 and the multiplicative tolerance of 1.01) when compiled with CUDA 10 for a Turing GPU (arch=sm_70, RTX 2080 Ti).

I did not modify any datatypes in the run, and both the WMMA-based explicit GEMM implementation and the cublasGemmEx call use the Tensor Cores.

What might be causing the errors beyond the specified tolerance limits?
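
A note on the three reports above: the printed pairs differ only in the fourth or fifth significant digit, which is roughly the accuracy half-precision inputs can deliver for K = 16384 dot products, so an exact or very tight comparison will flag every element. A hedged sketch of a tolerance that allows for fp16 rounding (illustrative, not the sample's actual check):

#include <cmath>

bool wmmaMatchesCublas(float wmma, float cublas, float relTol = 1e-2f)
{
    float scale = std::fmax(std::fabs(wmma), std::fabs(cublas));
    return std::fabs(wmma - cublas) <= relTol * scale;   // relative comparison
}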

Some questions about the unified-memory dataElem.cu file

In the unified-memory example (the dataElem.cu file), consider the code below:

// Copy up each piece separately, including new “text” pointer value
①cudaMemcpy(d_elem, elem, sizeof(DataElement), cudaMemcpyHostToDevice);
②cudaMemcpy(d_name, elem->name, namelen, cudaMemcpyHostToDevice);
③cudaMemcpy(&(d_elem->name), &d_name, sizeof(char*), cudaMemcpyHostToDevice);

// Finally we can launch our kernel, but CPU & GPU use different copies of “elem”
④Kernel<<< 1, 1 >>>(d_elem);

⑤cudaMemcpy(&(elem->value), &(d_elem->value), sizeof(int), cudaMemcpyDeviceToHost);
⑥cudaMemcpy(elem->name, d_name, namelen, cudaMemcpyDeviceToHost);

In steps ② and ③, why not copy directly from elem->name to d_elem->name? And given that I first copy from elem->name to d_name, why does step ③ use cudaMemcpyHostToDevice rather than cudaMemcpyDeviceToDevice?
Also, after the kernel has executed, why does step ⑥ copy from d_name to elem->name rather than from d_elem->name to elem->name?
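
For contrast, a minimal runnable sketch of the managed-memory version that the same post builds toward: with cudaMallocManaged there is a single copy of the struct valid on both processors, so none of the staged copies ① through ⑥ (and no separate d_elem/d_name) are needed. The kernel body here is illustrative:

#include <cstring>
#include <cstdio>
#include <cuda_runtime.h>

struct DataElement { char* name; int value; };

__global__ void Kernel(DataElement* elem)
{
    elem->value++;                           // device write, visible to the host later
}

int main()
{
    const char text[] = "hello";
    DataElement* elem;
    cudaMallocManaged(&elem, sizeof(DataElement));
    cudaMallocManaged(&elem->name, sizeof(text));
    memcpy(elem->name, text, sizeof(text));  // host writes the managed buffer directly
    elem->value = 10;
    Kernel<<<1, 1>>>(elem);
    cudaDeviceSynchronize();                 // required before the host touches elem again
    printf("%s = %d\n", elem->name, elem->value);
    cudaFree(elem->name);
    cudaFree(elem);
    return 0;
}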

Cannot reproduce the results on parallel reduce with shfl

First, let me say that this post is excellent. However, when I try to build the code in my learning repository, it produces the following errors.

nvcc -O3 main.cu -o reduce -arch=sm_80
In file included from /usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_namespace.cuh:41,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_arch.cuh:37,
                 from cub/block/specializations/../../block/../util_type.cuh:49,
                 from cub/block/specializations/../../block/../util_ptx.cuh:37,
                 from cub/block/specializations/../../block/block_exchange.cuh:37,
                 from cub/block/specializations/../../block/block_radix_sort.cuh:37,
                 from cub/block/specializations/block_histogram_sort.cuh:36,
                 from cub/block/block_histogram.cuh:36,
                 from cub/cub.cuh:40,
                 from main.cu:33:
/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/version.cuh:46: warning: "CUB_VERSION" redefined
   46 | #define CUB_VERSION 200001
      | 
In file included from cub/util_namespace.cuh:41,
                 from cub/util_arch.cuh:37,
                 from cub/config.cuh:35,
                 from cub/cub.cuh:37,
                 from main.cu:33:
cub/version.cuh:46: note: this is the location of the previous definition
   46 | #define CUB_VERSION 101600
      | 
/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_macro.cuh(60): error: function template "cub::min" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_macro.cuh(68): error: function template "cub::max" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_arch.cuh(141): error: class template "cub::RegBoundScaling" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_arch.cuh(154): error: class template "cub::MemBoundScaling" has already been defined

cub/block/specializations/../../block/../util_debug.cuh(64): error: function "cub::Debug" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(105): error: class template "cub::If" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(122): error: class template "cub::Equals" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(141): error: class template "cub::Log2" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(150): error: class template "cub::Log2" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(163): error: class template "cub::PowerOfTwo" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(181): error: class template "cub::IsPointer" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(198): error: class template "cub::IsVolatile" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(217): error: class template "cub::RemoveQualifiers" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(232): error: invalid redeclaration of type name "cub::NullType"
cub/block/specializations/../../block/../util_type.cuh(231): here

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(249): error: class template "cub::Int2Type" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(280): error: class template "cub::FutureValue" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(299): error: class template "cub::detail::InputValue" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(338): error: class template "cub::AlignBytes" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(364): error: class "cub::AlignBytes<short4>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(365): error: class "cub::AlignBytes<ushort4>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(366): error: class "cub::AlignBytes<int2>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(367): error: class "cub::AlignBytes<uint2>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(368): error: class "cub::AlignBytes<long long>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(369): error: class "cub::AlignBytes<unsigned long long>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(370): error: class "cub::AlignBytes<float2>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(371): error: class "cub::AlignBytes<double>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(376): error: class "cub::AlignBytes<long2>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(377): error: class "cub::AlignBytes<ulong2>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(379): error: class "cub::AlignBytes<int4>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(380): error: class "cub::AlignBytes<uint4>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(381): error: class "cub::AlignBytes<float4>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(382): error: class "cub::AlignBytes<long4>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(383): error: class "cub::AlignBytes<ulong4>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(384): error: class "cub::AlignBytes<longlong2>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(385): error: class "cub::AlignBytes<ulonglong2>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(386): error: class "cub::AlignBytes<double2>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(387): error: class "cub::AlignBytes<longlong4>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(388): error: class "cub::AlignBytes<ulonglong4>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(389): error: class "cub::AlignBytes<double4>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(392): error: class template "cub::AlignBytes" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(393): error: class template "cub::AlignBytes" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(394): error: class template "cub::AlignBytes" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(399): error: class template "cub::UnitWord" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(447): error: class "cub::UnitWord<float2>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(457): error: class "cub::UnitWord<float4>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(468): error: class "cub::UnitWord<char2>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(477): error: class template "cub::UnitWord" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(478): error: class template "cub::UnitWord" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(479): error: class template "cub::UnitWord" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(496): error: "MAX_VEC_ELEMENTS" has already been declared in the current scope

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(504): error: class template "cub::CubVector" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(516): error: class template "cub::CubVector" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(529): error: class template "cub::CubVector" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(543): error: class template "cub::CubVector" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(640): error: class "cub::CubVector<char, 1>" has already been defined

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(640): error: argument list for class template "cub::CubVector" is missing (repeated four times)

/usr/local/cuda/bin/../targets/x86_64-linux/include/cub/util_type.cuh(640): error: no instance of constructor "CubVector" matches the argument list (repeated twice)

[... the same group of "has already been defined", "argument list for class template "cub::CubVector" is missing", and "no instance of constructor "CubVector" matches the argument list" errors repeats for cub::CubVector<char, 2>, <char, 3>, <char, 4>, <signed char, 1>, <signed char, 2>, and <signed char, 3> ...]

Error limit reached.
100 errors detected in the compilation of "main.cu".
Compilation terminated.
make: *** [Makefile:28: reduce] Error 4

I have been using CUB v1.16.0.
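
A hedged guess at the cause, for context: CUDA 11.x toolkits already bundle CUB (CUB_VERSION 200001 in the log above), so putting the checked-out v1.16.0 copy on the include path makes two incompatible copies visible at once. The sketch below assumes that diagnosis; the fix would be to include exactly one copy:

// Include only the toolkit's CUB and drop the vendored copy from the include
// path (i.e. no -I./cub on the nvcc line) when building with CUDA 11+.
#include <cub/cub.cuh>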

change double to int in coalescing-global to avoid compiler warnings

Here:
https://github.com/parallel-forall/code-samples/blob/master/series/cuda-cpp/coalescing-global/coalescing.cu

lines 54 and 70 are passing a double for the second argument:

checkCuda( cudaMemset(d_a, 0.0, n * sizeof(T)) );

This is not really sensible and may trigger compiler warnings like this:

coalescing.cu:70: warning: passing ‘double’ for argument 2 to ‘cudaError_t cudaMemset(void*, int, size_t)’

I would recommend changing those two instances of 0.0 to just 0, as shown below.
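
That is, the corrected calls would read (checkCuda is the sample's own error-checking helper):

checkCuda( cudaMemset(d_a, 0, n * sizeof(T)) );  // second argument is an int byte value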

ERRORS: in simpleOnnx_*.cpp

I followed this article and attempted to import an ONNX model into TensorRT: https://devblogs.nvidia.com/speed-up-inference-tensorrt/#disqus_thread

I ran into these errors after running make in code-samples/posts/TensorRT-introduction:

simpleOnnx_1.cpp: In lambda function:
simpleOnnx_1.cpp:86:127: error: ‘exp’ was not declared in this scope
double expSum = accumulate(batchVector, batchVector + batchElements, 0.0, [=](double acc, float value) { return acc + exp(value - maxValue); });
^~~
simpleOnnx_1.cpp:86:127: note: suggested alternative: ‘exit’
double expSum = accumulate(batchVector, batchVector + batchElements, 0.0, [=](double acc, float value) { return acc + exp(value - maxValue); });
^~~
exit
simpleOnnx_1.cpp: In function ‘void softmax(std::vector&, int)’:
simpleOnnx_1.cpp:86:25: error: ‘accumulate’ was not declared in this scope
double expSum = accumulate(batchVector, batchVector + batchElements, 0.0, [=](double acc, float value) { return acc + exp(value - maxValue); });
^~~~~~~~~~
simpleOnnx_1.cpp: In lambda function:
simpleOnnx_1.cpp:88:124: error: ‘exp’ is not a member of ‘std’
transform(batchVector, batchVector + batchElements, batchVector, [=](float input) { return static_cast(std::exp(input - maxValue) / expSum); });
^~~
simpleOnnx_1.cpp: In function ‘int main(int, char**)’:
simpleOnnx_1.cpp:144:23: error: ‘accumulate’ was not declared in this scope
size_t size = accumulate(dims.d, dims.d + dims.nbDims, batchSize, multiplies<size_t>());
^~~~~~~~~~
In file included from /usr/include/c++/7/algorithm:62:0,
from simpleOnnx_1.cpp:31:
/usr/include/c++/7/bits/stl_algo.h: In instantiation of ‘_OIter std::transform(_IIter, _IIter, _OIter, _UnaryOperation) [with _IIter = float*; _OIter = float*; _UnaryOperation = softmax(std::vector&, int)::<lambda(float)>]’:
simpleOnnx_1.cpp:88:158: required from here
/usr/include/c++/7/bits/stl_algo.h:4306:12: error: void value not ignored as it ought to be
__result = __unary_op(__first);

<builtin>: recipe for target 'simpleOnnx_1.o' failed
make: *** [simpleOnnx_1.o] Error 1

Could you please let me know how this can be fixed? 

Thanks!
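
A hedged sketch of the usual fix for these three errors, assuming the declarations were pulled in transitively on other toolchains: add the headers that declare the missing names near the top of simpleOnnx_1.cpp:

#include <cmath>    // std::exp (also provides ::exp)
#include <numeric>  // std::accumulate

using std::accumulate;  // the sample calls these unqualified
using std::exp;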

[grCUDA] vulnerability issues in package dependencies

The versions specified in package-lock.json contain a number of security vulnerabilities.

$ cd code-samples/posts/grcuda/mandelbrot 
$ npm audit                   
                                                                                
                       === npm audit security report ===                        
                                                                                
# Run  npm update minimist --depth 3  to resolve 1 vulnerability
┌───────────────┬──────────────────────────────────────────────────────────────┐
│ Moderate      │ Prototype Pollution                                          │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Package       │ minimist                                                     │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Dependency of │ standard [dev]                                               │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Path          │ standard > standard-engine > minimist                        │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ More info     │ https://npmjs.com/advisories/1179                            │
└───────────────┴──────────────────────────────────────────────────────────────┘


┌──────────────────────────────────────────────────────────────────────────────┐
│                                Manual Review                                 │
│            Some vulnerabilities require your attention to resolve            │
│                                                                              │
│         Visit https://go.npm.me/audit-guide for additional guidance          │
└──────────────────────────────────────────────────────────────────────────────┘
┌───────────────┬──────────────────────────────────────────────────────────────┐
│ Moderate      │ Prototype Pollution                                          │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Package       │ minimist                                                     │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Patched in    │ >=1.2.3                                                      │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Dependency of │ standard [dev]                                               │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Path          │ standard > eslint > file-entry-cache > flat-cache > write >  │
│               │ mkdirp > minimist                                            │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ More info     │ https://npmjs.com/advisories/1179                            │
└───────────────┴──────────────────────────────────────────────────────────────┘
┌───────────────┬──────────────────────────────────────────────────────────────┐
│ Moderate      │ Prototype Pollution                                          │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Package       │ minimist                                                     │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Patched in    │ >=1.2.3                                                      │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Dependency of │ standard [dev]                                               │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Path          │ standard > eslint > mkdirp > minimist                        │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ More info     │ https://npmjs.com/advisories/1179                            │
└───────────────┴──────────────────────────────────────────────────────────────┘
found 3 moderate severity vulnerabilities in 645 scanned packages
  run `npm audit fix` to fix 1 of them.
  2 vulnerabilities require manual review. See the full report for details.

Error in simpleOnnx_1.cpp while running on a Jetson Nano 4 GB

I am following the article https://developer.nvidia.com/blog/speeding-up-deep-learning-inference-using-tensorrt/

I ran into these errors after running make in code-samples/posts/TensorRT-introduction:

g++ -std=c++11 -DONNX_ML=1 -Wall -I/usr/local/cuda/include -c -o simpleOnnx_1.o simpleOnnx_1.cpp
In file included from simpleOnnx_1.cpp:29:0:
ioHelper.h:42:18: error: looser throw specifier for ‘virtual void nvinfer1::Logger::log(nvinfer1::ILogger::Severity, const char*)’
virtual void log(Severity severity, const char* msg) override
^~~
In file included from /usr/include/aarch64-linux-gnu/NvInferLegacyDims.h:53:0,
from /usr/include/aarch64-linux-gnu/NvInfer.h:53,
from simpleOnnx_1.cpp:27:
/usr/include/aarch64-linux-gnu/NvInferRuntimeCommon.h:1222:18: error: overriding ‘virtual void nvinfer1::ILogger::log(nvinfer1::ILogger::Severity, const AsciiChar*) noexcept’
virtual void log(Severity severity, AsciiChar const* msg) noexcept = 0;
^~~
simpleOnnx_1.cpp: In function ‘nvinfer1::ICudaEngine* createCudaEngine(const string&, int)’:
simpleOnnx_1.cpp:75:60: warning: ‘nvinfer1::ICudaEngine* nvinfer1::IBuilder::buildEngineWithConfig(nvinfer1::INetworkDefinition&, nvinfer1::IBuilderConfig&)’ is deprecated [-Wdeprecated-declarations]
return builder->buildEngineWithConfig(network, config);
^
In file included from simpleOnnx_1.cpp:27:0:
/usr/include/aarch64-linux-gnu/NvInfer.h:7990:43: note: declared here
TRT_DEPRECATED nvinfer1::ICudaEngine* buildEngineWithConfig(
^~~~~~~~~~~~~~~~~~~~~
simpleOnnx_1.cpp: In function ‘void verifyOutput(const std::vector&, const std::vector&, int)’:
simpleOnnx_1.cpp:94:26: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
for (size_t i = 0; i < size; ++i)
~~^~~~~~
In file included from simpleOnnx_1.cpp:29:0:
ioHelper.h: In instantiation of ‘void nvinfer1::Destroy<T>::operator()(T*) const [with T = nvinfer1::IBuilder]’:
/usr/include/c++/7/bits/unique_ptr.h:263:17: required from ‘std::unique_ptr<_Tp, _Dp>::~unique_ptr() [with _Tp = nvinfer1::IBuilder; _Dp = nvinfer1::Destroy<nvinfer1::IBuilder>]’
simpleOnnx_1.cpp:55:110: required from here
ioHelper.h:53:9: warning: ‘void nvinfer1::IBuilder::destroy()’ is deprecated [-Wdeprecated-declarations]
t->destroy();
^
In file included from simpleOnnx_1.cpp:27:0:
/usr/include/aarch64-linux-gnu/NvInfer.h:7929:25: note: declared here
TRT_DEPRECATED void destroy() noexcept
^~~~~~~
In file included from simpleOnnx_1.cpp:29:0:
ioHelper.h: In instantiation of ‘void nvinfer1::Destroy<T>::operator()(T*) const [with T = nvinfer1::INetworkDefinition]’:
/usr/include/c++/7/bits/unique_ptr.h:263:17: required from ‘std::unique_ptr<_Tp, _Dp>::~unique_ptr() [with _Tp = nvinfer1::INetworkDefinition; _Dp = nvinfer1::Destroy<nvinfer1::INetworkDefinition>]’
simpleOnnx_1.cpp:56:132: required from here
ioHelper.h:53:9: warning: ‘void nvinfer1::INetworkDefinition::destroy()’ is deprecated [-Wdeprecated-declarations]
t->destroy();
^
In file included from simpleOnnx_1.cpp:27:0:
/usr/include/aarch64-linux-gnu/NvInfer.h:5856:25: note: declared here
TRT_DEPRECATED void destroy() noexcept
^~~~~~~
In file included from simpleOnnx_1.cpp:29:0:
ioHelper.h: In instantiation of ‘void nvinfer1::Destroy<T>::operator()(T*) const [with T = nvonnxparser::IParser]’:
/usr/include/c++/7/bits/unique_ptr.h:263:17: required from ‘std::unique_ptr<_Tp, _Dp>::~unique_ptr() [with _Tp = nvonnxparser::IParser; _Dp = nvinfer1::Destroy<nvonnxparser::IParser>]’
simpleOnnx_1.cpp:57:123: required from here
ioHelper.h:53:9: warning: ‘virtual void nvonnxparser::IParser::destroy()’ is deprecated [-Wdeprecated-declarations]
t->destroy();
^
In file included from simpleOnnx_1.cpp:30:0:
/usr/include/aarch64-linux-gnu/NvOnnxParser.h:197:33: note: declared here
TRT_DEPRECATED virtual void destroy() = 0;
^~~~~~~
In file included from simpleOnnx_1.cpp:29:0:
ioHelper.h: In instantiation of ‘void nvinfer1::Destroy<T>::operator()(T*) const [with T = nvinfer1::IBuilderConfig]’:
/usr/include/c++/7/bits/unique_ptr.h:263:17: required from ‘std::unique_ptr<_Tp, _Dp>::~unique_ptr() [with _Tp = nvinfer1::IBuilderConfig; _Dp = nvinfer1::Destroy<nvinfer1::IBuilderConfig>]’
simpleOnnx_1.cpp:58:113: required from here
ioHelper.h:53:9: warning: ‘void nvinfer1::IBuilderConfig::destroy()’ is deprecated [-Wdeprecated-declarations]
t->destroy();
^
In file included from simpleOnnx_1.cpp:27:0:
/usr/include/aarch64-linux-gnu/NvInfer.h:7535:25: note: declared here
TRT_DEPRECATED void destroy() noexcept
^~~~~~~
In file included from simpleOnnx_1.cpp:29:0:
ioHelper.h: In instantiation of ‘void nvinfer1::Destroy<T>::operator()(T*) const [with T = nvinfer1::ICudaEngine]’:
/usr/include/c++/7/bits/unique_ptr.h:263:17: required from ‘std::unique_ptr<_Tp, _Dp>::~unique_ptr() [with _Tp = nvinfer1::ICudaEngine; _Dp = nvinfer1::Destroy<nvinfer1::ICudaEngine>]’
simpleOnnx_1.cpp:135:65: required from here
ioHelper.h:53:9: warning: ‘void nvinfer1::ICudaEngine::destroy()’ is deprecated [-Wdeprecated-declarations]
t->destroy();
^
In file included from /usr/include/aarch64-linux-gnu/NvInfer.h:54:0,
from simpleOnnx_1.cpp:27:
/usr/include/aarch64-linux-gnu/NvInferRuntime.h:1434:25: note: declared here
TRT_DEPRECATED void destroy() noexcept
^~~~~~~
In file included from simpleOnnx_1.cpp:29:0:
ioHelper.h: In instantiation of ‘void nvinfer1::Destroy<T>::operator()(T*) const [with T = nvinfer1::IExecutionContext]’:
/usr/include/c++/7/bits/unique_ptr.h:263:17: required from ‘std::unique_ptr<_Tp, _Dp>::~unique_ptr() [with _Tp = nvinfer1::IExecutionContext; _Dp = nvinfer1::Destroy<nvinfer1::IExecutionContext>]’
simpleOnnx_1.cpp:137:78: required from here
ioHelper.h:53:9: warning: ‘void nvinfer1::IExecutionContext::destroy()’ is deprecated [-Wdeprecated-declarations]
t->destroy();
^
In file included from /usr/include/aarch64-linux-gnu/NvInfer.h:54:0,
from simpleOnnx_1.cpp:27:
/usr/include/aarch64-linux-gnu/NvInferRuntime.h:1888:25: note: declared here
TRT_DEPRECATED void destroy() noexcept
^~~~~~~
<builtin>: recipe for target 'simpleOnnx_1.o' failed
make: *** [simpleOnnx_1.o] Error 1

How can I solve this?
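
A hedged sketch of the usual fix: TensorRT 8 declares ILogger::log(...) noexcept (see the NvInferRuntimeCommon.h line quoted above), so the override in ioHelper.h must add the specifier to match. Assuming the rest of the logger is unchanged:

#include <NvInferRuntimeCommon.h>

class Logger : public nvinfer1::ILogger
{
public:
    void log(Severity severity, const char* msg) noexcept override
    {
        // ... same filtering/printing body as the original ioHelper.h ...
    }
};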

[TensorRT] ERROR: UFFParser: Parser error: input_2: Invalid number of Dimensions 0

I ran code-samples/TensorRT3.1/convert.ipynb, but got this error:

[TensorRT] INFO: UFFParser: parsing input_2
[TensorRT] ERROR: UFFParser: Parser error: input_2: Invalid number of Dimensions 0
[TensorRT] ERROR: Failed to parse UFF model stream
  File "/usr/local/lib/python3.5/dist-packages/tensorrt/utils/_utils.py", line 186, in uff_to_trt_engine
    assert(parser_result)
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorrt/utils/_utils.py", line 186, in uff_to_trt_engine
    assert(parser_result)
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "convert.py", line 124, in <module>
    create_and_save_inference_engine()
  File "convert.py", line 99, in create_and_save_inference_engine
    trt.infer.DataType.FLOAT
  File "/usr/local/lib/python3.5/dist-packages/tensorrt/utils/_utils.py", line 194, in uff_to_trt_engine
    raise AssertionError('UFF parsing failed on line {} in statement {}'.format(line, text))
AssertionError: UFF parsing failed on line 186 in statement assert(parser_result)

We registered input_1 as the input, so why does it still try to parse input_2?
Any reply would be much appreciated!

fatal error: cuda_runtime.h

In file included from simpleOnnx.cpp:28:0:
cudaWrapper.h:30:10: fatal error: cuda_runtime.h: No such file or directory
#include <cuda_runtime.h>
^~~~~~~~~~~~~~~~
compilation terminated.

I have set the paths as:
export PATH=/usr/local/cuda-10.0/bin:/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
