
zjin-lcf / hecbench

185 stars, 5 watchers, 70 forks, 237.15 MB

Home Page: https://software.intel.com/content/www/us/en/develop/articles/repo-evaluating-performance-productivity-oneapi.html

License: BSD 3-Clause "New" or "Revised" License

Makefile 5.57% C 24.88% Cuda 25.46% C++ 42.07% Shell 0.56% Python 0.10% Roff 1.08% CMake 0.01% M4 0.22% Perl 0.05% Gnuplot 0.01%
sycl openmp cuda gpu-computing hip benchmark scientific-computing hpc-applications test-driven-development


hecbench's Issues

No folder called data for HotSpot3D

Hi, In "make" file of Hotspot3D there is command named "../data/hotspot3D/power_512x8". But I am unable to find out the data folder in the Benchmark folder.

fft-hip/cuda: false positive verification if all kernel launches fail

If all kernel launches fail here, the verification still "passes". This happened to me on chipStar, which reported kernel launch errors in the logs. To avoid the false-positive verification, the fft-hip and fft-cuda benchmarks should probably check for launch errors (e.g. call cudaGetLastError()/hipGetLastError() after the kernel launches).
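
A minimal sketch of such a check, assuming the HIP variant (the CUDA variant would use cudaGetLastError() analogously); the kernel name and launch configuration below are placeholders, not the benchmark's actual code:

#include <cstdio>
#include <cstdlib>
#include <hip/hip_runtime.h>

__global__ void fft_kernel(float *data) { /* placeholder kernel body */ }

void launch_and_check(float *d_data, dim3 grid, dim3 block)
{
  fft_kernel<<<grid, block>>>(d_data);
  // A failed launch (bad arch, invalid config, driver error) is caught here,
  // so a broken run cannot silently "pass" verification.
  hipError_t err = hipGetLastError();
  if (err != hipSuccess) {
    fprintf(stderr, "kernel launch failed: %s\n", hipGetErrorString(err));
    exit(EXIT_FAILURE);
  }
}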

Lanczos SYCL

The Lanczos SYCL version isn't working: it returns -nan in all positions.

data.tar.gz problem from HeCBench/minibude-sycl

Hello zjin-lcf

I am trying to run the minibude-sycl project. However, I find that the content of the data files (.in) in data/bm1 and data/bm2 is mostly garbled text. Could you upload correct data?

Makefile issue for winograd-hip

Hello!

When compiling winograd-hip with make, we are not getting the optimization flag even though it is turned on (by default) in the Makefile. We don't see anything in the Makefile that could cause this, and it doesn't happen with other benchmarks either...

The compilation line we got with make VERIFY=yes DEBUG=yes is the following:

hipcc  -std=c++14 -Wall -DMAP_SIZE=1024 -g -c main.cu -o main.o

hipcc  -std=c++14 -Wall -DMAP_SIZE=1024 -g -c utils.cu -o utils.o

hipcc  -std=c++14 -Wall -DMAP_SIZE=1024 -g main.o utils.o -o main  -g

The -O3 flag had to be explicitly added to the Makefile in order for it to take effect. We added -O3 to the line

$(CC) $(CFLAGS) -c $< -o $@
and successfully got the following compilation lines:

hipcc  -std=c++14 -Wall -DMAP_SIZE=1024 -g -c main.cu -o main.o -O3

hipcc  -std=c++14 -Wall -DMAP_SIZE=1024 -g -c utils.cu -o utils.o -O3

hipcc  -std=c++14 -Wall -DMAP_SIZE=1024 -g main.o utils.o -o main  -g

Thank you!!

1/3 of the content is omitted by github servers.

Hi,

When the page loads, it shows the following warning: "Sorry, we had to truncate this directory to 1,000 files. 556 entries were omitted from the list." It seems that about 1/3 of your content is being omitted by the GitHub servers, probably for page-loading optimization. I suggest restructuring the project so that all of your work is visible to everyone.

cheerfully,
MELLOUKY Mohamed

"common.h" file not found

I'm trying to build the miniFE-sycl port, but the "common.h" file is missing. I guess this is some primary header file, so I need help here. I'm using the DPC++ compiler. Below is the output from make with no arguments.

./get_common_files
./generate_info_header "clang++" "-O3" "miniFE" "MINIFE"
CXX: /users/wrs505/scratch/indv-project/sycl-workspace/llvm/build/bin/clang++
Compiler version: clang version 15.0.0 (/users/wrs505/scratch/indv-project/sycl-workspace/llvm/clang 4a794dfa230c3a823ac552bc304123e668b49796)
clang++ -O3 --gcc-toolchain="/opt/apps/easybuild/software/GCCcore/11.2.0" -I. -I../utils -I../fem -DMINIFE_SCALAR=double -DMINIFE_LOCAL_ORDINAL=int -DMINIFE_GLOBAL_ORDINAL=int -DMINIFE_RESTRICT=__restrict__ -I../../include/ -DMINIFE_CSR_MATRIX -std=c++17 -Wall -fsycl -DUSE_GPU -DMINIFE_INFO=1 -DMINIFE_KERNELS=0 -c main.cpp
In file included from main.cpp:56:
In file included from ./driver.hpp:51:
./SparseMatrix_functions.hpp:38:10: fatal error: 'common.h' file not found
#include "common.h"
         ^~~~~~~~~~
1 error generated.
make: *** [main.o] Error 1

resnet-kernels-hip: compilation error

src/resnet-kernels-hip> make run
hipcc -x hip -std=c++14 -Wall -O3 -c main.c -o main.o
clang-16: error: unknown argument: '-c-o'
clang-16: error: no such file or directory: 'main.o'

The source file is a C file while the compilation command specifies -std=c++14; please recheck the Makefile configuration.

Adding pthread to link flag list

For the benchmarks that use the pthread library, the -lpthread flag is not included in the link flags in their Makefiles...

So far we have encountered three benchmarks that use it but don't link against it:
mmcsf-hip
sssp-hip
bm3d-hip

I can also create a pull request to include them if that helps (and if adding them to the Makefile sounds good :))

Thank you!

simpleMultiDevice: compilation errors


The line referenced above causes a compilation error and needs to be removed. After removing it, the app runs smoothly.

src/simpleMultiDevice-hip> make run
hipcc -std=c++14 -DMAX_GPU_COUNT=4 -Wall -O3 -c main.cu -o main.o
main.cu:26:1: error: expected external declaration

^
1 error generated when compiling for .

failed to execute:/home/kballeda/local/llvm-16_0816/bin/clang++ -I//home/kballeda/local/chipstar_0914/include -std=c++14 -DMAX_GPU_COUNT=4 -Wall -O3 -c -o main.o -x hip main.cu -D__HIP_PLATFORM_SPIRV__= --offload=spirv64 -nohipwrapperinc --
src/simpleMultiDevice-hip> vim main.cu
src/simpleMultiDevice-hip> make run
hipcc -std=c++14 -DMAX_GPU_COUNT=4 -Wall -O3 -c main.cu -o main.o

hipcc -std=c++14 -DMAX_GPU_COUNT=4 -Wall -O3 main.o -o main

./main 1000
Starting simpleMultiDevice
GPU device count: 1
Generating input data of size 33554432 ...

Computing with 1 GPUs...
Average GPU Processing time: 4197.601074 (us)

Computing with Host CPU...

Comparing GPU and Host CPU results...
GPU sum: 16777296.000000
CPU sum: 16777294.395033
Relative difference: 9.566307E-08

Verification result is hardcoded.

if( in.simulation_method == EVENT_BASED )
{
small = 945990;
large = 952131;
}
else if( in.simulation_method == HISTORY_BASED )
{
small = 941535;
large = 954318;
}

The expected values are hardcoded for the default input parameters; if the number of lookups changes, verification fails here even though the computed result is correct.

matrixT-hip case is not compiling

matrixT-hip does not compile because the cooperative_groups (cg) namespace is not found. Please take a look.

:~/ANL_WORK/HeCBench/matrixT-hip> make run
hipcc -std=c++14 -Wall -O3 -c main.cu -o main.o
main.cu:33:16: error: expected namespace name
namespace cg = cooperative_groups;
^
main.cu:81:3: error: use of undeclared identifier 'cg'
cg::thread_block cta = cg::this_thread_block();
^
main.cu:81:26: error: use of undeclared identifier 'cg'
cg::thread_block cta = cg::this_thread_block();
^
main.cu:97:3: error: use of undeclared identifier 'cg'
cg::sync(cta);
^
main.cu:138:3: error: use of undeclared identifier 'cg'
cg::thread_block cta = cg::this_thread_block();
^
main.cu:138:26: error: use of undeclared identifier 'cg'
cg::thread_block cta = cg::this_thread_block();
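
For what it's worth, a minimal sketch of what the source seems to expect; on ROCm the cooperative-groups API lives in hip/hip_cooperative_groups.h, so a missing include (or a toolchain that does not ship that header) would produce exactly these errors. This is an assumption, not a verified fix; the kernel below is only a skeleton:

#include <hip/hip_runtime.h>
#include <hip/hip_cooperative_groups.h>

namespace cg = cooperative_groups;

// Illustrative skeleton only; the real main.cu does a tiled transpose.
__global__ void transpose_skeleton(float *odata, const float *idata, int width)
{
  cg::thread_block cta = cg::this_thread_block();
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x < width && y < width)
    odata[x * width + y] = idata[y * width + x];
  cg::sync(cta);
}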

std::bad_alloc when running miniFE-sycl port

I'm trying to run the miniFE-sycl code compiled with hipSYCL, and it throws an error when the CG solver starts. Below is the stdout.

[ri-wshilpage@login-01 src]$ ./miniFE.x
MiniFE Mini-App, OpenMP Peer Implementation
Creating OpenMP Thread Pool...
Counted: 16 threads.
Running MiniFE Mini-App...
      creating/filling mesh...0.000133991s, total time: 0.000133991
generating matrix structure...0.000746965s, total time: 0.000880957
         assembling FE data...0.000336885s, total time: 0.00121784
      imposing Dirichlet BC...0.000154972s, total time: 0.00137281
      imposing Dirichlet BC...0.000139952s, total time: 0.00151277
making matrix indices local...0s, total time: 0.00151277
Starting CG solver ... 
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

The same code runs successfully when compiled with Intel DPC++. Any leads on what could be wrong?

Missing reference.h from the STDDEV (Standard Deviation) Benchmark Program

Hi,
In the main.cu file of the stddev-cuda folder, we can see code like the following:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda.h>

#include "reference.h"

While compiling the code, we get an error stating: "fatal error: reference.h: No such file or directory".
The cause is that the "reference.h" file is missing from that folder.
Can you please help us with that?
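
As context, the missing header presumably supplies a host reference that the GPU result is validated against; a purely hypothetical stand-in (the interface is a guess, not the repository's actual reference.h) might look like:

#include <cmath>

// Hypothetical host reference: per-column standard deviation of a
// rows x cols matrix stored in row-major order.
void reference(const float *data, float *std_dev, int rows, int cols)
{
  for (int c = 0; c < cols; c++) {
    float mean = 0.f;
    for (int r = 0; r < rows; r++) mean += data[r * cols + c];
    mean /= rows;
    float var = 0.f;
    for (int r = 0; r < rows; r++) {
      float d = data[r * cols + c] - mean;
      var += d * d;
    }
    std_dev[c] = sqrtf(var / (rows - 1));
  }
}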

Missing header files for testSNAP benchmark

It seems that several header files are missing for the testSNAP benchmark (this applies to the hip, cuda, sycl, and omp versions):

#if REFDATA_TWOJ == 14
#include "refdata_2J14_W.h"
#elif REFDATA_TWOJ == 8
#include "refdata_2J8_W.h"
#elif REFDATA_TWOJ == 4
#include "refdata_2J4_W.h"
#else
#include "refdata_2J2_W.h"
#endif

I searched the entire repository for these header files but could not find them...
Thank you!

dxtc1-cuda generates incorrect results, validation fails.

We found that dxtc1-cuda fails CPU/GPU validation.
dxtc1-cuda$ make run

./main ../dxtc1-sycl/data/lena_std.ppm ../dxtc1-sycl/data/lena_ref.dds 100
Loaded '../dxtc1-sycl/data/lena_std.ppm', 512 x 512 pixels

Running DXT Compression on 512 x 512 image...

16384 Workgroups, 64 Work Items per Workgroup, 1048576 Work Items in NDRange...

Average kernel execution time 0.000545 (s)

Comparing against Host/C++ computation...
RMS(reference, result) = 12.503659

FAIL

@zjin-lcf verified the other flows (HIP and chipStar); both pass.

adam-sycl: kernel does not compute anything, verification is broken

I was wondering why SYCL was much faster than HIP on chipStar, with both targeting the same device through OpenCL. I looked into the SYCL kernel's bitcode and found that it effectively does not compute anything meaningful. The kernel bitcode looks like this:

; Function Attrs: norecurse nounwind
define weak_odr dso_local spir_kernel void @_ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_E6kernel(i32 noundef %0) local_unnamed_addr #0 comdat !srcloc !60 !kernel_arg_buffer_location !61 !sycl_fixed_targets !58 !sycl_kernel_omit_args !62 {
  call void @__itt_offload_wi_start_wrapper()
  %2 = load i64, ptr addrspace(1) @__spirv_BuiltInWorkgroupSize, align 32, !noalias !63
  %3 = load i64, ptr addrspace(1) @__spirv_BuiltInNumWorkgroups, align 32, !noalias !70
  %4 = load i64, ptr addrspace(1) @__spirv_BuiltInGlobalInvocationId, align 32, !noalias !75
  %5 = sext i32 %0 to i64
  %6 = icmp ult i64 %4, 2147483648
  tail call void @llvm.assume(i1 %6)
  %7 = icmp ult i64 %3, 2147483648
  tail call void @llvm.assume(i1 %7)
  %8 = icmp ult i64 %2, 2147483648
  tail call void @llvm.assume(i1 %8)
  %9 = mul nuw nsw i64 %3, %2
  %10 = shl i64 %9, 32
  %11 = ashr exact i64 %10, 32
  br label %12

12:                                               ; preds = %15, %1
  %13 = phi i64 [ %4, %1 ], [ %16, %15 ]
  %14 = icmp ult i64 %13, %5
  br i1 %14, label %15, label %17

15:                                               ; preds = %12
  %16 = add i64 %13, %11
  br label %12, !llvm.loop !80

17:                                               ; preds = %12
  call void @__itt_offload_wi_finish_wrapper()
  ret void
}

It further turned out that the verification is broken too: it gives a pass even though the SYCL kernel doesn't compute anything. The verification is also broken in the adam-hip benchmark (it passes even when the kernel launches are removed from the code).
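
One way to make such a verification meaningful (a sketch under assumptions, not the repository's actual check) is to fill the device output with a sentinel such as NaN before the launch and then compare element-wise against a CPU reference, so an untouched buffer can never pass:

#include <cmath>
#include <cstdio>
#include <vector>

// gpu_out is the buffer copied back from the device (pre-filled with NaN
// before the kernel launch); cpu_ref is the host reference of the same size.
bool verify(const std::vector<float> &gpu_out,
            const std::vector<float> &cpu_ref, float tol = 1e-3f)
{
  for (size_t i = 0; i < cpu_ref.size(); i++) {
    if (std::isnan(gpu_out[i]) || std::fabs(gpu_out[i] - cpu_ref[i]) > tol) {
      printf("mismatch at %zu: gpu=%f cpu=%f\n", i, gpu_out[i], cpu_ref[i]);
      return false;
    }
  }
  return true;
}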

Texture cache performance investigation

Hello @zjin-lcf
I am investigating the performance impact of using the texture cache on Nvidia GPUs in SYCL and CUDA. I have noticed that some benchmarks (clink, convolution1D, convolutionsSeparable, page-rank, swish, all-pairs-distance) already have explicit ldg loads added to them. How was this selection of benchmarks chosen? Have you tested other benchmarks and simply not seen any performance benefit there? If so, which other benchmarks did you test, and on which GPU architectures?
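
For context, the explicit read-only loads in question look roughly like this in CUDA (a hedged illustration, not code taken from the listed benchmarks):

__global__ void scale(float *out, const float *__restrict__ in, float a, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    // __ldg routes the read through the read-only (texture) data cache
    // on architectures that support it.
    out[i] = a * __ldg(&in[i]);
  }
}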

lebesgue-hip: atomicMax compilation error

The lebesgue-hip benchmark fails to compile with chipStar due to an atomicMax prototype difference, whereas the same benchmark builds with AMD HIP.

/src/lebesgue-hip> make run
hipcc -std=c++14 -Wall -O3 -c ../lebesgue-cuda/main.cpp -o main.o

hipcc -std=c++14 -Wall -O3 -c ../lebesgue-cuda/utils.cpp -o utils.o

hipcc -std=c++14 -Wall -O3 -c kernels.cu -o kernels.o
kernels.cu:40:3: error: no matching function for call to 'atomicMax'
atomicMax(lmax, t);
^~~~~~~~~
hip/devicelib/atomics.hh:217:36: note: candidate function not viable: no known conversion from 'double *__restrict' to 'int *' for 1st argument
extern "C++" inline __device__ int atomicMax(int *address, int val) {
^
hip/devicelib/atomics.hh:223:45: note: candidate function not viable: no known conversion from 'double *__restrict' to 'unsigned int *' for 1st argument
extern "C++" inline __device__ unsigned int atomicMax(unsigned int *address,
^
/hip/devicelib/atomics.hh:231:1: note: candidate function not viable: no known conversion from 'double *__restrict' to 'unsigned long long *' for 1st argument
atomicMax(unsigned long long *address, unsigned long long val) {

AMD-Hip Compilation log:
HeCBench/src/lebesgue-hip$ make run
hipcc -std=c++14 -Wall -O3 -c ../lebesgue-cuda/main.cpp -o main.o
hipcc -std=c++14 -Wall -O3 -c ../lebesgue-cuda/utils.cpp -o utils.o
hipcc -std=c++14 -Wall -O3 -c kernels.cu -o kernels.o
hipcc -std=c++14 -Wall -O3 main.o utils.o kernels.o -o main
./main 1000000 2
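
If a portable fallback is wanted where no double overload of atomicMax is available, the usual workaround is a compare-and-swap loop; a sketch under that assumption (not the repository's code):

__device__ double atomicMaxDouble(double *address, double val)
{
  unsigned long long *addr_as_ull =
      reinterpret_cast<unsigned long long *>(address);
  unsigned long long old = *addr_as_ull, assumed;
  do {
    assumed = old;
    if (__longlong_as_double(assumed) >= val) break;  // already >= val
    old = atomicCAS(addr_as_ull, assumed, __double_as_longlong(val));
  } while (assumed != old);
  return __longlong_as_double(old);
}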

ga-hip compilation issue

This is simply a typo: printf is coded as printn. After fixing this, the application compiles smoothly.

:~/ANL_WORK/HeCBench/ga-hip> make
hipcc -std=c++14 -Wall -I../ga-cuda -O3 -c main.cu -o main.o
main.cu:117:3: error: use of undeclared identifier 'printn'
printn("Total kernel execution time %f (s)\n", total_time * 1e-9f);
^
1 error generated when compiling for .

[DPC++] lavaMD SYCL kernel performance issue for NVPTX targets

Hello,

We occasionally run the benchmarks to look for and analyse cases where DPC++ SYCL performance is slower than that of CUDA on Nvidia GPUs.

Following an investigation into why the lavaMD SYCL kernel performs significantly worse than CUDA on the Nvidia/PTX target, we have a few suggestions that can help achieve parity.

Having inspected the target PTX device code, in our case on an Nvidia GeForce RTX 2060 with arch=sm_60, we found that the compiler produces device code that executes an excessive number of local/global memory instructions.

For comparison, here are the timings of the native CUDA kernel and the original SYCL kernel.

Original CUDA

Kernel execution time: 8.2192 ms

Original SYCL

Kernel execution time: 32.2551 ms

We found that the main issue comes from the following inner for-loop computation.

// loop for the number of particles in the current nei box
for (j=0; j<NUMBER_PAR_PER_BOX; j++){
  r2 = rA_shared[wtx].v + rB_shared[j].v - DOT(rA_shared[wtx],rB_shared[j]); 
  u2 = a2*r2;
  vij= cl::sycl::exp(-u2);
  fs = 2*vij;
  d.x = rA_shared[wtx].x  - rB_shared[j].x;
  fxij=fs*d.x;
  d.y = rA_shared[wtx].y  - rB_shared[j].y;
  fyij=fs*d.y;
  d.z = rA_shared[wtx].z  - rB_shared[j].z;
  fzij=fs*d.z;
  d_fv_gpu_acc[first_i+wtx].v +=  qB_shared[j]*vij;
  d_fv_gpu_acc[first_i+wtx].x +=  qB_shared[j]*fxij;
  d_fv_gpu_acc[first_i+wtx].y +=  qB_shared[j]*fyij;
  d_fv_gpu_acc[first_i+wtx].z +=  qB_shared[j]*fzij;
}

// increment work thread index
wtx = wtx + NUMBER_THREADS;
[...]

The heavy stalling instruction sequence pattern detected in the inner for-loop look like this in PTX device code:

ld.shared.f32 %f50, [%rd70];
fma.rn.f32 %f57, %f50, %f42, %f57;
st.global.f32 [%rd16], %f57;
ld.shared.f32 %f51, [%rd70];
fma.rn.f32 %f56, %f51, %f45, %f56;
st.global.f32 [%rd16+4], %f56;
ld.shared.f32 %f52, [%rd70];
fma.rn.f32 %f55, %f52, %f47, %f55;
st.global.f32 [%rd16+8], %f55;
ld.shared.f32 %f53, [%rd70];
fma.rn.f32 %f54, %f53, %f49, %f54;
st.global.f32 [%rd16+12], %f54;

This inefficient instruction sequence is generated as a side effect of accumulating in the for-loop. The shared-memory accessor semantics treat the data as volatile, so the compiler cannot safely load the scalar value from shared memory into a register once and reuse it across the consecutively scheduled FMAs (+= shared*constant).
The same applies to the stores to the global vector, resulting in instruction interleaving and many wasted cycles between instructions while waiting for the global/local memory queues to free up.

Suggestions

  1. Hoisting the volatile shared memory data into a scalar register.
diff --git a/kernel.sycl b/kernel.sycl
index 30e264b..df064df 100644
--- a/kernel.sycl
+++ b/kernel.sycl
@@ -106,10 +106,11 @@ if(bx<dim_cpu_number_boxes) {
         fyij=fs*d.y;
         d.z = rA_shared[wtx].z  - rB_shared[j].z;
         fzij=fs*d.z;
-        d_fv_gpu_acc[first_i+wtx].v +=  qB_shared[j]*vij;
-        d_fv_gpu_acc[first_i+wtx].x +=  qB_shared[j]*fxij;
-        d_fv_gpu_acc[first_i+wtx].y +=  qB_shared[j]*fyij;
-        d_fv_gpu_acc[first_i+wtx].z +=  qB_shared[j]*fzij;
+        auto N = qB_shared[j];
+        d_fv_gpu_acc[first_i+wtx].v += N*vij;
+        d_fv_gpu_acc[first_i+wtx].x += N*fxij;
+        d_fv_gpu_acc[first_i+wtx].y += N*fyij;
+        d_fv_gpu_acc[first_i+wtx].z += N*fzij;
       }

       // increment work thread index

Kernel execution time: 17.3396 ms

  2. Hoisting the global memory buffer data into a register, accumulating into that register, and then writing the result to global memory after the loop.
    This is the best approach here because the global-memory index first_i+wtx is loop-invariant for the inner loop that performs the actual computation.
diff --git a/lavaMD-sycl/kernel.sycl b/lavaMD-sycl/kernel.sycl
index 4263aead..6c8fcbfc 100644
--- a/lavaMD-sycl/kernel.sycl
+++ b/lavaMD-sycl/kernel.sycl
@@ -92,7 +92,7 @@ if(bx<dim_cpu_number_boxes) {
 
     // loop for the number of particles in the home box
     while(wtx<NUMBER_PAR_PER_BOX){
-
+      auto out_buf_accum = d_fv_gpu_acc[first_i+wtx];
       // loop for the number of particles in the current nei box
       for (j=0; j<NUMBER_PAR_PER_BOX; j++){
 
@@ -106,12 +106,16 @@ if(bx<dim_cpu_number_boxes) {
         fyij=fs*d.y;
         d.z = rA_shared[wtx].z  - rB_shared[j].z;
         fzij=fs*d.z;
-        d_fv_gpu_acc[first_i+wtx].v +=  qB_shared[j]*vij;
-        d_fv_gpu_acc[first_i+wtx].x +=  qB_shared[j]*fxij;
-        d_fv_gpu_acc[first_i+wtx].y +=  qB_shared[j]*fyij;
-        d_fv_gpu_acc[first_i+wtx].z +=  qB_shared[j]*fzij;
+
+       const auto N = qB_shared[j];
+        out_buf_accum.v +=  N*vij;
+        out_buf_accum.x +=  N*fxij;
+        out_buf_accum.y +=  N*fyij;
+        out_buf_accum.z +=  N*fzij;
       }
 
+      d_fv_gpu_acc[first_i+wtx] = out_buf_accum;
+
       // increment work thread index

Kernel execution time: 9.0887 ms

  3. Also, with __attribute__((aligned(16))) on the FOUR_VECTOR struct, LLVM is able to produce vectorized loads and stores for device data of that type and save a little more latency compared to the consecutive scalar accesses, hence achieving even better kernel performance.

In any case, the expectation is that production code should use properly aligned vector types; otherwise users should not expect guaranteed vectorization. The code could also be rewritten using the sycl::vec types (e.g. sycl::float4); see the sketch after the diff below.

diff --git a/lavaMD-sycl/main.h b/lavaMD-sycl/main.h
index afbe6403..2072b981 100644
--- a/lavaMD-sycl/main.h
+++ b/lavaMD-sycl/main.h
@@ -37,7 +37,7 @@ typedef struct
 {
        fp v, x, y, z;
 
-} FOUR_VECTOR;
+} __attribute__((aligned(16))) FOUR_VECTOR;
 
 typedef struct nei_str
 {

Kernel execution time: 8.81664 ms, much closer to the native CUDA.
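
For illustration, the sycl::vec alternative mentioned above could look roughly like this, assuming fp is float and accepting that member access changes from .v/.x/.y/.z to component indexing (a sketch under those assumptions, not a tested change):

#include <sycl/sycl.hpp>

// sycl::float4 guarantees 16-byte alignment, so the backend can emit
// vectorized loads/stores without the manual aligned(16) attribute.
using FOUR_VECTOR = sycl::float4;

void accumulate(FOUR_VECTOR &acc, float N,
                float vij, float fxij, float fyij, float fzij)
{
  // Components 0..3 play the roles of v, x, y and z.
  acc += FOUR_VECTOR{N * vij, N * fxij, N * fyij, N * fzij};
}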

Timing SYCL

To time kernel execution more accurately in SYCL, you would want to enable the profiling property on the sycl::queue and use the cl::sycl::event class to query the command-group submission and kernel execution timestamps.

e.g. create the sycl::queue as follows:

const cl::sycl::property_list queueProps = {cl::sycl::property::queue::enable_profiling()};
cl::sycl::queue q(dev_sel, queueProps);

The total time consists of the command-group submission time and the kernel execution time. In the SYCL variants of the benchmark programs in the HeCBench repo, using a wall clock to time the kernel inflates the measured kernel execution time, since it includes command-group submission as well. The larger the data set, the less noticeable the submission overhead becomes.

More on this can be read on Codeplay's developer blog post about SYCL Profiling.
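
A minimal end-to-end sketch of that approach, assuming a DPC++-style compiler with USM support; the kernel and buffer below are placeholders:

#include <CL/sycl.hpp>
#include <iostream>

int main() {
  // Profiling must be requested when the queue is constructed.
  const cl::sycl::property_list queueProps =
      {cl::sycl::property::queue::enable_profiling()};
  cl::sycl::queue q(cl::sycl::default_selector{}, queueProps);

  const size_t N = 1 << 20;
  float *data = cl::sycl::malloc_device<float>(N, q);

  auto e = q.parallel_for(cl::sycl::range<1>(N), [=](cl::sycl::id<1> i) {
    size_t idx = i[0];
    data[idx] = 2.0f * idx;
  });
  e.wait();

  // Timestamps are device-side and reported in nanoseconds.
  auto t0 = e.get_profiling_info<
      cl::sycl::info::event_profiling::command_start>();
  auto t1 = e.get_profiling_info<
      cl::sycl::info::event_profiling::command_end>();
  std::cout << "Kernel execution time: " << (t1 - t0) * 1e-6 << " ms\n";

  cl::sycl::free(data, q);
  return 0;
}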

Possible data race

In the folder lsqt-cuda, in the file vector.cu:

#ifndef CPU_ONLY
__device__ void warp_reduce(volatile real* s, int t)
{
  s[t] += s[t + 32];
  s[t] += s[t + 16];
  s[t] += s[t + 8];
  s[t] += s[t + 4];
  s[t] += s[t + 2];
  s[t] += s[t + 1];
}
#endif

On newer Nvidia architectures (Volta and above) this code has a data race. We can use __syncwarp() between the memory operations to fix the problem. Further, the CUDA compiler may elide some of these synchronization instructions in the final generated code depending on the target architecture (e.g. on pre-Volta architectures), as per the blog post.
It can be solved like this:

#ifndef CPU_ONLY
__device__ void warp_reduce(volatile real* s, int t)
{
  // accumulate in a register, synchronizing the warp between each
  // shared-memory access
  real v = s[t];
  v += s[t + 32];   __syncwarp();
  s[t] = v;         __syncwarp();
  v += s[t + 16];   __syncwarp();
  s[t] = v;         __syncwarp();
  v += s[t + 8];    __syncwarp();
  s[t] = v;         __syncwarp();
  v += s[t + 4];    __syncwarp();
  s[t] = v;         __syncwarp();
  v += s[t + 2];    __syncwarp();
  s[t] = v;         __syncwarp();
  v += s[t + 1];    __syncwarp();
  s[t] = v;
}
#endif

Please let me know if you'd like me to create a PR.
@cogumbreiro

help with autohecbench.py

Hey, could you please elaborate on the usage of the ./autohecbench.py script?

I have tried the following as stated in the readme:

./autohecbench.py sycl -o sycl.csv

But all I get is failures to compile.

for example:

Failed compilation in /home/cerqueira/faculdade/thesis/code/HeCBench/src/scripts/../bh-sycl.
Command '['make', 'GCC_TOOLCHAIN=""', 'CUDA=yes', 'CUDA_ARCH=sm_60']' returned non-zero exit status 2.

I am using Intel oneAPI and would like to compare the runtime on my iGPU and my CPU.

Benchmarks missing input files

Hello!

Here is a list of benchmarks for which I couldn't find the input files specified in the Makefiles. I have looked around for them but still can't find them...

cmp-hip (Makefile says: /cmp-cuda/data/simple-synthetic.su)
mmcsf-hip (Makefile says: /mmcsf-cuda/output.tns)
cc-hip (Makefile says: /cc-cuda/delaunay_n24.egr)
chi2-hip (Makefile says: /chi2-cuda/traindata)
bfs-hip (Makefile says: /data/bfs/graph1MW_6.txt)
b+tree-hip (Makefile says: [Input File]: /data/b+tree/mil.txt [Command File]: /data/b+tree/command.txt)
bmf-hip (Makefile says: /bmf-cuda/data/MNIST.in)

Thank you! : )

Shared memory reference may go out of bounds.

s_Hist[threadPos + __mul24( (dataTemp >> 2) & 63, NUMTHREADS)]++;

The shared memory is allocated as:

__shared__ unsigned char s_Hist[MEMPERBLOCK];

Which:

#define   NUMBINS       32
#define   NUMTHREADS    128
#define   MAXNUMBLOCKS  16384
#define   MAXBLOCKSEND  32
#define   DATAPERBLOCK  (NUMTHREADS * 63)
#define   MEMPERBLOCK   (NUMTHREADS * NUMBINS)

MEMPERBLOCK is 32 * 128 = 4K.

When this shared memory is referenced,

s_Hist[threadPos + __mul24( (dataTemp >>  2) & 63, NUMTHREADS)]++;

The maximum possible index is:

threadPos + 63 * NUMTHREADS = threadPos + 63*128;

This exceeds the size of the shared memory array.
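
One hedged way to make the allocation consistent with the indexing, assuming the kernel is really meant to keep 64 per-thread bins (the alternative would be to mask with NUMBINS - 1 instead of 63); a simplified skeleton, not the repository's kernel:

#define NUMTHREADS   128
#define HISTBINS     64                        // matches the (dataTemp >> 2) & 63 index
#define MEMPERBLOCK  (NUMTHREADS * HISTBINS)   // 8 KB of shared memory per block

__global__ void histogram_skeleton(const unsigned int *d_data, int n)
{
  __shared__ unsigned char s_Hist[MEMPERBLOCK];
  const int threadPos = threadIdx.x;

  for (int i = threadPos; i < MEMPERBLOCK; i += NUMTHREADS)
    s_Hist[i] = 0;
  __syncthreads();

  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x) {
    unsigned int dataTemp = d_data[i];
    // maximum index: threadPos + 63 * NUMTHREADS = MEMPERBLOCK - 1, now in bounds
    s_Hist[threadPos + __mul24((dataTemp >> 2) & (HISTBINS - 1), NUMTHREADS)]++;
  }
  // (merging per-thread bins into the global histogram is omitted here)
}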

black-scholes-sycl uses constants of double type

The whole example uses only the float type, but some constant values written without the 'f' suffix are treated by the compiler as double values.
In some cases (when the left-hand operand is such a constant) calculations are performed in double precision and the results are cast back to float.
This impacts the performance of this workload by about 10% (with the dpcpp compiler).
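
A hedged illustration of the pattern (not the actual Black-Scholes code):

#include <math.h>

float discount(float r, float t)
{
  // float slow = exp(-1.0 * r * t);   // 1.0 is a double: the whole product is
  //                                   // evaluated in double, then truncated back
  float fast = expf(-1.0f * r * t);    // stays in single precision
  return fast;
}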

Patch attached
black-scholes-sycl.zip

nms-hip: input data related

nms-hip refers to data present in nms-cuda/detections.txt; please update the Makefile to refer to this path instead of the local directory.

easyWave-sycl compilation issue

Looks like an application-specific issue; do we have a fix for this?

main.cpp:893:17: error: use of undeclared identifier 'NodeD'
if( NodeD(m, iD) == 0 ) return;
^
main.cpp:894:13: error: use of undeclared identifier 'NodeD'
NodeD(m, iH) = NodeD(m, iH) - NodeD(m, iR1)*( NodeD(m, iM) - NodeD(m-NLat, iM) + NodeD(m, iN)*d_R6[j] - NodeD(m-1, iN)*d_R6[j-1] );

prna-sycl missing header file

/src/prna-sycl> make run
icpx -D__CUDACC__ -DFLOAT -std=c++17 -Wall -fsycl -O3 -ffast-math -DUSE_GPU -c main.c -o main.o
icpx: warning: treating 'c' input as 'c++' when -fsycl is used [-Wexpected-file-type]
In file included from main.c:10:
In file included from ./prna.h:4:
In file included from ./param.h:5:
./real.h:39:10: fatal error: 'common.h' file not found
#include "common.h"

@zjin-lcf: Please check whether a source file is missing here for the prna test case.

Define issue - compilation error

Hi!

For the cm-hip benchmark (using the utils.h header file from cm-cuda), we are getting the following compilation error:

../cm-cuda/utils.h:99:32: error: no type named 'string' in the global namespace; did you mean 'std::string'?
int changeToDirectory(const std::string &);
                               ^~~~~~~~
                               std::string

It seems to be caused by this code section in the utils.h header file:

HeCBench/cm-cuda/utils.h

Lines 17 to 24 in 6da926c

#if !defined(__CUDACC__)
// Define the keywords, so that the IDE does not complain about them
#define __global__
#define __device__
#define __shared__
#define std
#define __host__
#endif

__CUDACC__ is not defined (which is expected, since we are using HIP/AMD), so std is defined as empty (line 22), and that results in std::string being interpreted as ::string.
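
One possible fix, assuming the IDE-only workaround is still wanted: also skip the block when compiling with HIP (hipcc defines __HIPCC__, analogous to __CUDACC__) and drop the #define std line entirely, since an empty std macro erases the namespace qualifier:

#if !defined(__CUDACC__) && !defined(__HIPCC__)
// Define the keywords so that the IDE does not complain about them
#define __global__
#define __device__
#define __shared__
#define __host__
#endif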

Thank you!
