zjin-lcf / hecbench
License: BSD 3-Clause "New" or "Revised" License
Hi, the Makefile of Hotspot3D references the input file "../data/hotspot3D/power_512x8", but I am unable to find the data folder in the benchmark folder.
If all kernel launches fail here, the verification still "passes". This happened to me on chipStar, which reported kernel launch errors in the logs. To avoid the false-positive verification, the benchmarks fft-hip and fft-cuda should probably check for launch errors (use cuda/hipGetLastError() after the kernel launches).
Lanczos sycl isn't working
it returns -nan in all positions
Hello zjin-lcf
I am trying to run the minibude-sycl project. However, the contents of the data files (.in files) in data/bm1 and data/bm2 appear to be mostly garbled text. Could you upload the correct data?
Hi, the Goulash-cuda benchmark program needs a utils.h header file, but it is not available there. Can you please help with this?
Hello!
When compiling winograd-hip with make, the optimization flag is not applied even though it is turned on (by default) in the Makefile. We don't see any problem in the Makefile that could cause this, and it doesn't happen with other benchmarks either...
The compilation line we got with make VERIFY=yes DEBUG=yes is the following:
hipcc -std=c++14 -Wall -DMAP_SIZE=1024 -g -c main.cu -o main.o
hipcc -std=c++14 -Wall -DMAP_SIZE=1024 -g -c utils.cu -o utils.o
hipcc -std=c++14 -Wall -DMAP_SIZE=1024 -g main.o utils.o -o main -g
The -O3 flag had to be explicitly added to the Makefile again in order for it to take effect. We added the -O3 on line
HeCBench/winograd-hip/Makefile
Line 50 in d837474
hipcc -std=c++14 -Wall -DMAP_SIZE=1024 -g -c main.cu -o main.o -O3
hipcc -std=c++14 -Wall -DMAP_SIZE=1024 -g -c utils.cu -o utils.o -O3
hipcc -std=c++14 -Wall -DMAP_SIZE=1024 -g main.o utils.o -o main -g
Thank you!!
Hi,
When the page loads, it shows the following warning: Sorry, we had to truncate this directory to 1,000 files. 556 entries were omitted from the list.
It seems that about 1/3 of your content is being omitted by the GitHub servers, maybe to optimize page loading. I suggest restructuring your project so that all of your work is visible to everyone.
cheerfully,
MELLOUKY Mohamed
I'm trying to build the miniFE-sycl port, but the "common.h" file is missing. I guess this is some primary header file, so I need help here. I'm using the DPC++ compiler. Below is the output from make with no arguments.
./get_common_files
./generate_info_header "clang++" "-O3" "miniFE" "MINIFE"
CXX: /users/wrs505/scratch/indv-project/sycl-workspace/llvm/build/bin/clang++
Compiler version: clang version 15.0.0 (/users/wrs505/scratch/indv-project/sycl-workspace/llvm/clang 4a794dfa230c3a823ac552bc304123e668b49796)
clang++ -O3 --gcc-toolchain="/opt/apps/easybuild/software/GCCcore/11.2.0" -I. -I../utils -I../fem -DMINIFE_SCALAR=double -DMINIFE_LOCAL_ORDINAL=int -DMINIFE_GLOBAL_ORDINAL=int -DMINIFE_RESTRICT=__restrict__ -I../../include/ -DMINIFE_CSR_MATRIX -std=c++17 -Wall -fsycl -DUSE_GPU -DMINIFE_INFO=1 -DMINIFE_KERNELS=0 -c main.cpp
In file included from main.cpp:56:
In file included from ./driver.hpp:51:
./SparseMatrix_functions.hpp:38:10: fatal error: 'common.h' file not found
#include "common.h"
^~~~~~~~~~
1 error generated.
make: *** [main.o] Error 1
HeCBench/src/hotspot3D-cuda/Makefile
Line 58 in 64c65e7
src/resnet-kernels-hip> make run
hipcc -x hip -std=c++14 -Wall -O3 -c main.c -o main.o
clang-16: error: unknown argument: '-c-o'
clang-16: error: no such file or directory: 'main.o'
Source file is of C type and compilation command points to c++14, please recheck the Makefile configuration
For the benchmarks that use the pthread library, the -lpthread flag is not included in the link flag list in the Makefiles...
So far we encountered three benchmarks that use it (but don't have it included in the link flag list):
mmcsf-hip
sssp-hip
bm3d-hip
I can also create a pull request to include them if that helps (and if adding them to the Makefile sounds good :))
Thank you!
^
1 error generated when compiling for .
failed to execute:/home/kballeda/local/llvm-16_0816/bin/clang++ -I//home/kballeda/local/chipstar_0914/include -std=c++14 -DMAX_GPU_COUNT=4 -Wall -O3 -c -o main.o -x hip main.cu -D__HIP_PLATFORM_SPIRV__= --offload=spirv64 -nohipwrapperinc --
src/simpleMultiDevice-hip> vim main.cu
src/simpleMultiDevice-hip> make run
hipcc -std=c++14 -DMAX_GPU_COUNT=4 -Wall -O3 -c main.cu -o main.o
hipcc -std=c++14 -DMAX_GPU_COUNT=4 -Wall -O3 main.o -o main
./main 1000
Starting simpleMultiDevice
GPU device count: 1
Generating input data of size 33554432 ...
Computing with 1 GPUs...
Average GPU Processing time: 4197.601074 (us)
Computing with Host CPU...
Comparing GPU and Host CPU results...
GPU sum: 16777296.000000
CPU sum: 16777294.395033
Relative difference: 9.566307E-08
Lines 72 to 81 in 5abb471
matrixT-hip does not compile because the cooperative_groups namespace (aliased as cg) is not declared. Please take a look.
:~/ANL_WORK/HeCBench/matrixT-hip> make run
hipcc -std=c++14 -Wall -O3 -c main.cu -o main.o
main.cu:33:16: error: expected namespace name
namespace cg = cooperative_groups;
^
main.cu:81:3: error: use of undeclared identifier 'cg'
cg::thread_block cta = cg::this_thread_block();
^
main.cu:81:26: error: use of undeclared identifier 'cg'
cg::thread_block cta = cg::this_thread_block();
^
main.cu:97:3: error: use of undeclared identifier 'cg'
cg::sync(cta);
^
main.cu:138:3: error: use of undeclared identifier 'cg'
cg::thread_block cta = cg::this_thread_block();
^
main.cu:138:26: error: use of undeclared identifier 'cg'
cg::thread_block cta = cg::this_thread_block();
I'm trying to run the miniFE-sycl code compiled using hipSYCL and it throws an error when CG Solver starts. Below is the stdout.
[ri-wshilpage@login-01 src]$ ./miniFE.x
MiniFE Mini-App, OpenMP Peer Implementation
Creating OpenMP Thread Pool...
Counted: 16 threads.
Running MiniFE Mini-App...
creating/filling mesh...0.000133991s, total time: 0.000133991
generating matrix structure...0.000746965s, total time: 0.000880957
assembling FE data...0.000336885s, total time: 0.00121784
imposing Dirichlet BC...0.000154972s, total time: 0.00137281
imposing Dirichlet BC...0.000139952s, total time: 0.00151277
making matrix indices local...0s, total time: 0.00151277
Starting CG solver ...
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)
The same code runs successfully when compiled with Intel DPC++. Any idea what could be wrong?
Hi,
In the main.cu file of stddev-cuda folder, we can see a code like the following:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda.h>
#include "reference.h"
While compiling the code, we get an error stating: "fatal error: reference.h: No such file or directory".
The cause is the lack of "reference.h" file in that particular folder.
Can you please help us with that?
It seems like several header files are missing for the testSNAP benchmark: (applies to hip, cuda, sycl, and omp)
Lines 27 to 35 in c8f6564
I tried to find the header files in the entire repository, but could not find them...
Thank you!
Found that dxtc1-cuda is failing in CPU/GPU validation.
dxtc1-cuda$ make run
./main ../dxtc1-sycl/data/lena_std.ppm ../dxtc1-sycl/data/lena_ref.dds 100
Loaded '../dxtc1-sycl/data/lena_std.ppm', 512 x 512 pixels
Running DXT Compression on 512 x 512 image...
16384 Workgroups, 64 Work Items per Workgroup, 1048576 Work Items in NDRange...
Average kernel execution time 0.000545 (s)
Comparing against Host/C++ computation...
RMS(reference, result) = 12.503659
FAIL
@zjin-lcf verified other flows HIP/ChipStar, both are passing.
The mandelbrot-sycl directory no longer has a Makefile.
It might have been lost in the recent Makefile refactoring.
HeCBench/src/bn-cuda/kernels.cu
Line 44 in 94d1eed
The variable j may reach NODE_N, causing an out-of-bounds access to the array.
Missing makefile for tc benchmark in all variants i.e., hip/sycl/cuda
I was wondering why SYCL was much faster than HIP on chipStar, both targeting the same device through OpenCL. I looked into the SYCL kernel's bitcode and found that it effectively does not compute anything meaningful. The kernel bitcode looks like this:
; Function Attrs: norecurse nounwind
define weak_odr dso_local spir_kernel void @_ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_E6kernel(i32 noundef %0) local_unnamed_addr #0 comdat !srcloc !60 !kernel_arg_buffer_location !61 !sycl_fixed_targets !58 !sycl_kernel_omit_args !62 {
call void @__itt_offload_wi_start_wrapper()
%2 = load i64, ptr addrspace(1) @__spirv_BuiltInWorkgroupSize, align 32, !noalias !63
%3 = load i64, ptr addrspace(1) @__spirv_BuiltInNumWorkgroups, align 32, !noalias !70
%4 = load i64, ptr addrspace(1) @__spirv_BuiltInGlobalInvocationId, align 32, !noalias !75
%5 = sext i32 %0 to i64
%6 = icmp ult i64 %4, 2147483648
tail call void @llvm.assume(i1 %6)
%7 = icmp ult i64 %3, 2147483648
tail call void @llvm.assume(i1 %7)
%8 = icmp ult i64 %2, 2147483648
tail call void @llvm.assume(i1 %8)
%9 = mul nuw nsw i64 %3, %2
%10 = shl i64 %9, 32
%11 = ashr exact i64 %10, 32
br label %12
12: ; preds = %15, %1
%13 = phi i64 [ %4, %1 ], [ %16, %15 ]
%14 = icmp ult i64 %13, %5
br i1 %14, label %15, label %17
15: ; preds = %12
%16 = add i64 %13, %11
br label %12, !llvm.loop !80
17: ; preds = %12
call void @__itt_offload_wi_finish_wrapper()
ret void
}
It further turned out that the verification is broken too. It reports a pass even though the SYCL kernel doesn't compute anything. The verification is also broken in the adam-hip benchmark (it passes even when the kernel launches are removed from the code).
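One way a verification step can be made sensitive to kernels that never ran is to poison the output buffer before the device run and reject any poisoned value afterwards. The sketch below is only an illustration on the host side; `poison_output` and `verify_output` are hypothetical helper names, not functions from HeCBench, and it assumes the poisoned buffer is copied to the device before the launch and copied back afterwards.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: fill the output buffer with NaN so that
// "the kernel never wrote anything" is distinguishable from valid results.
inline void poison_output(std::vector<float>& out) {
  out.assign(out.size(), std::numeric_limits<float>::quiet_NaN());
}

// Hypothetical verifier: fails on size mismatch, on any NaN, and on any
// element that differs from the reference by more than tol.
// The explicit isnan check matters: fabs(NaN - ref) > tol is false,
// so a naive tolerance test alone would accept NaN values.
inline bool verify_output(const std::vector<float>& out,
                          const std::vector<float>& ref, float tol) {
  if (out.size() != ref.size()) return false;
  for (std::size_t i = 0; i < out.size(); ++i) {
    if (std::isnan(out[i])) return false;
    if (std::fabs(out[i] - ref[i]) > tol) return false;
  }
  return true;
}
```

With this pattern, a run in which every launch fails leaves the sentinel values in place and the check reports a failure instead of a pass.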
Hello @zjin-lcf
I am investigating the performance impact of using the texture cache on Nvidia GPUs in SYCL and CUDA. I have noticed that some benchmarks (clink, convolution1D, convolutionsSeparable, page-rank, swish, all-pairs-distance) already have explicit ldg instructions added to them. My question is how this selection of benchmarks was chosen. Have you tested other benchmarks and simply not seen any performance benefit there? If so, which other benchmarks did you test, and on which GPU architectures?
Benchmark: lebesgue-hip fails in compilation due to atomicMax prototype difference in ChipStar whereas the same benchmark passes using AMD hip.
/src/lebesgue-hip> make run
hipcc -std=c++14 -Wall -O3 -c ../lebesgue-cuda/main.cpp -o main.o
hipcc -std=c++14 -Wall -O3 -c ../lebesgue-cuda/utils.cpp -o utils.o
hipcc -std=c++14 -Wall -O3 -c kernels.cu -o kernels.o
kernels.cu:40:3: error: no matching function for call to 'atomicMax'
atomicMax(lmax, t);
^~~~~~~~~
hip/devicelib/atomics.hh:217:36: note: candidate function not viable: no known conversion from 'double *__restrict' to 'int *' for 1st argument
extern "C++" inline device int atomicMax(int *address, int val) {
^
hip/devicelib/atomics.hh:223:45: note: candidate function not viable: no known conversion from 'double *__restrict' to 'unsigned int *' for 1st argument
extern "C++" inline device unsigned int atomicMax(unsigned int *address,
^
/hip/devicelib/atomics.hh:231:1: note: candidate function not viable: no known conversion from 'double *__restrict' to 'unsigned long long *' for 1st argument
atomicMax(unsigned long long *address, unsigned long long val) {
AMD-Hip Compilation log:
HeCBench/src/lebesgue-hip$ make run
hipcc -std=c++14 -Wall -O3 -c ../lebesgue-cuda/main.cpp -o main.o
hipcc -std=c++14 -Wall -O3 -c ../lebesgue-cuda/utils.cpp -o utils.o
hipcc -std=c++14 -Wall -O3 -c kernels.cu -o kernels.o
hipcc -std=c++14 -Wall -O3 main.o utils.o kernels.o -o main
./main 1000000 2
This benchmark has a symlink to locations.txt pointing to geodesic-sycl/locations.txt. One must first untar locations.tar in geodesic-sycl.
The problem here is that printf is misspelled as printn; after fixing this, the application compiles smoothly.
:~/ANL_WORK/HeCBench/ga-hip> make
hipcc -std=c++14 -Wall -I../ga-cuda -O3 -c main.cu -o main.o
main.cu:117:3: error: use of undeclared identifier 'printn'
printn("Total kernel execution time %f (s)\n", total_time * 1e-9f);
^
1 error generated when compiling for .
Hello,
We occasionally run the benchmarks to look for and analyse cases where DPC++ SYCL performance is slower than that of CUDA on Nvidia GPUs.
Following an investigation into why the lavaMD SYCL kernel performs significantly worse than CUDA on the NVidia/PTX target, we have a few suggestions that can help achieve parity.
Having inspected the target PTX device code, in our case on an Nvidia GeForce RTX 2060 with arch=sm_60, we found that the compiler produces device code that stalls very heavily on local/global memory instructions.
For comparison here are the timings of the native CUDA kernel program.
Kernel execution time: 8.2192 ms
Kernel execution time: 32.2551 ms
So, we found the main issue comes from the following inner for-loop computation.
// loop for the number of particles in the current nei box
for (j=0; j<NUMBER_PAR_PER_BOX; j++){
r2 = rA_shared[wtx].v + rB_shared[j].v - DOT(rA_shared[wtx],rB_shared[j]);
u2 = a2*r2;
vij= cl::sycl::exp(-u2);
fs = 2*vij;
d.x = rA_shared[wtx].x - rB_shared[j].x;
fxij=fs*d.x;
d.y = rA_shared[wtx].y - rB_shared[j].y;
fyij=fs*d.y;
d.z = rA_shared[wtx].z - rB_shared[j].z;
fzij=fs*d.z;
d_fv_gpu_acc[first_i+wtx].v += qB_shared[j]*vij;
d_fv_gpu_acc[first_i+wtx].x += qB_shared[j]*fxij;
d_fv_gpu_acc[first_i+wtx].y += qB_shared[j]*fyij;
d_fv_gpu_acc[first_i+wtx].z += qB_shared[j]*fzij;
}
// increment work thread index
wtx = wtx + NUMBER_THREADS;
[...]
The heavily stalling instruction sequence pattern detected in the inner for-loop looks like this in PTX device code:
ld.shared.f32 %f50, [%rd70];
fma.rn.f32 %f57, %f50, %f42, %f57;
st.global.f32 [%rd16], %f57;
ld.shared.f32 %f51, [%rd70];
fma.rn.f32 %f56, %f51, %f45, %f56;
st.global.f32 [%rd16+4], %f56;
ld.shared.f32 %f52, [%rd70];
fma.rn.f32 %f55, %f52, %f47, %f55;
st.global.f32 [%rd16+8], %f55;
ld.shared.f32 %f53, [%rd70];
fma.rn.f32 %f54, %f53, %f49, %f54;
st.global.f32 [%rd16+12], %f54;
This inefficient sequence of instructions is generated as a side effect of the accumulation in the for-loop. The shared memory language semantics classify it as volatile, and for that reason the compiler in our case is unable to safely load the scalar value from shared memory into a register just once and reuse it efficiently in the consecutively scheduled FMAs (+= shared*constant).
The same goes for the stores to the global vector, resulting in instruction interleaving and many wasted cycles between instructions, waiting on global/local memory queues to free up.
diff --git a/kernel.sycl b/kernel.sycl
index 30e264b..df064df 100644
--- a/kernel.sycl
+++ b/kernel.sycl
@@ -106,10 +106,11 @@ if(bx<dim_cpu_number_boxes) {
fyij=fs*d.y;
d.z = rA_shared[wtx].z - rB_shared[j].z;
fzij=fs*d.z;
- d_fv_gpu_acc[first_i+wtx].v += qB_shared[j]*vij;
- d_fv_gpu_acc[first_i+wtx].x += qB_shared[j]*fxij;
- d_fv_gpu_acc[first_i+wtx].y += qB_shared[j]*fyij;
- d_fv_gpu_acc[first_i+wtx].z += qB_shared[j]*fzij;
+ auto N = qB_shared[j];
+ d_fv_gpu_acc[first_i+wtx].v += N*vij;
+ d_fv_gpu_acc[first_i+wtx].x += N*fxij;
+ d_fv_gpu_acc[first_i+wtx].y += N*fyij;
+ d_fv_gpu_acc[first_i+wtx].z += N*fzij;
}
// increment work thread index
Kernel execution time: 17.3396 ms
first_i+wtx is loop invariant for the inner loop that performs the actual computation.
diff --git a/lavaMD-sycl/kernel.sycl b/lavaMD-sycl/kernel.sycl
index 4263aead..6c8fcbfc 100644
--- a/lavaMD-sycl/kernel.sycl
+++ b/lavaMD-sycl/kernel.sycl
@@ -92,7 +92,7 @@ if(bx<dim_cpu_number_boxes) {
// loop for the number of particles in the home box
while(wtx<NUMBER_PAR_PER_BOX){
-
+ auto out_buf_accum = d_fv_gpu_acc[first_i+wtx];
// loop for the number of particles in the current nei box
for (j=0; j<NUMBER_PAR_PER_BOX; j++){
@@ -106,12 +106,16 @@ if(bx<dim_cpu_number_boxes) {
fyij=fs*d.y;
d.z = rA_shared[wtx].z - rB_shared[j].z;
fzij=fs*d.z;
- d_fv_gpu_acc[first_i+wtx].v += qB_shared[j]*vij;
- d_fv_gpu_acc[first_i+wtx].x += qB_shared[j]*fxij;
- d_fv_gpu_acc[first_i+wtx].y += qB_shared[j]*fyij;
- d_fv_gpu_acc[first_i+wtx].z += qB_shared[j]*fzij;
+
+ const auto N = qB_shared[j];
+ out_buf_accum.v += N*vij;
+ out_buf_accum.x += N*fxij;
+ out_buf_accum.y += N*fyij;
+ out_buf_accum.z += N*fzij;
}
+ d_fv_gpu_acc[first_i+wtx] = out_buf_accum;
+
// increment work thread index
Kernel execution time: 9.0887 ms
With __attribute__((aligned(16))) for the FOUR_VECTOR struct, LLVM is able to produce vectorized loads and stores (for device data of that type) and save a little more latency compared with the consecutive scalar ones, hence achieving even better kernel performance.
Anyway, the expectation is that production code should use properly aligned vector types; otherwise users shouldn't expect guaranteed vectorization. The code can also be rewritten using the sycl::vec types (e.g. sycl::float4).
diff --git a/lavaMD-sycl/main.h b/lavaMD-sycl/main.h
index afbe6403..2072b981 100644
--- a/lavaMD-sycl/main.h
+++ b/lavaMD-sycl/main.h
@@ -37,7 +37,7 @@ typedef struct
{
fp v, x, y, z;
-} FOUR_VECTOR;
+} __attribute__((aligned(16))) FOUR_VECTOR;
typedef struct nei_str
{
Kernel execution time: 8.81664 ms, much closer to the native CUDA time.
To time kernel execution in SYCL more accurately, you would want to enable the profiling property on the sycl::queue and use the cl::sycl::event class to gather the command group submission and kernel execution start points and time them.
E.g. create the sycl::queue as follows:
const cl::sycl::property_list queueProps = {cl::sycl::property::queue::enable_profiling()};
cl::sycl::queue q(dev_sel, queueProps);
The total time consists of the command group submission and kernel execution times. In the case of the SYCL variant of the benchmark programs in the HeCBench repo, using a wall clock to time the kernel ends up inflating the actual kernel execution time as it includes command group submission too. The larger the data set the less noticeable the impact of the submission will be.
More on this can be read on Codeplay's developer blog post about SYCL Profiling.
In the folder lsqt-cuda, in the file vector.cu:
#ifndef CPU_ONLY
__device__ void warp_reduce(volatile real* s, int t)
{
s[t] += s[t + 32];
s[t] += s[t + 16];
s[t] += s[t + 8];
s[t] += s[t + 4];
s[t] += s[t + 2];
s[t] += s[t + 1];
}
#endif
On newer NVIDIA architectures (Volta and above) the code has a data race. We can use __syncwarp() between each memory operation to fix this problem. Further, the CUDA compiler may elide some of these synchronization instructions in the final generated code depending on the target architecture (e.g. on pre-Volta architectures), as per the blog post.
It can be solved like this:
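#ifndef CPU_ONLY
__device__ void warp_reduce(volatile real* s, int t)
{
  // Start from this thread's own partial sum so that s[t] is accumulated
  // (as in the original s[t] += ...) rather than overwritten.
  real v = s[t];
  // Each shared-memory read and write is separated by __syncwarp().
  v += s[t + 32]; __syncwarp();
  s[t] = v;       __syncwarp();
  v += s[t + 16]; __syncwarp();
  s[t] = v;       __syncwarp();
  v += s[t + 8];  __syncwarp();
  s[t] = v;       __syncwarp();
  v += s[t + 4];  __syncwarp();
  s[t] = v;       __syncwarp();
  v += s[t + 2];  __syncwarp();
  s[t] = v;       __syncwarp();
  v += s[t + 1];  __syncwarp();
  s[t] = v;
}
#endif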
Please let me know if you'd like me to create a PR.
@cogumbreiro
Hey, could you please elaborate on the usage of the ./autohecbench.py script?
I have tried the following as stated in the readme:
./autohecbench.py sycl -o sycl.csv
But all I get is failures to compile.
for example:
Failed compilation in /home/cerqueira/faculdade/thesis/code/HeCBench/src/scripts/../bh-sycl.
Command '['make', 'GCC_TOOLCHAIN=""', 'CUDA=yes', 'CUDA_ARCH=sm_60']' returned non-zero exit status 2.
I am using Intel oneAPI and would like to compare the runtime on my iGPU and my CPU.
Hello!
Here is a list of benchmarks for which I couldn't find the input files specified in the Makefiles. I have looked around for them but still can't find them...
cmp-hip (Makefile says: /cmp-cuda/data/simple-synthetic.su)
mmcsf-hip (Makefile says: /mmcsf-cuda/output.tns)
cc-hip (Makefile says: /cc-cuda/delaunay_n24.egr)
chi2-hip (Makefile says: /chi2-cuda/traindata)
bfs-hip (Makefile says: /data/bfs/graph1MW_6.txt)
b+tree-hip (Makefile says: [Input File]: /data/b+tree/mil.txt [Command File]: /data/b+tree/command.txt)
bmf-hip (Makefile says: /bmf-cuda/data/MNIST.in)
Thank you! : )
HeCBench/src/tpacf-cuda/histogram_kernel.cu
Line 143 in 86bd0c5
__shared__ unsigned char s_Hist[MEMPERBLOCK];
where:
#define NUMBINS 32
#define NUMTHREADS 128
#define MAXNUMBLOCKS 16384
#define MAXBLOCKSEND 32
#define DATAPERBLOCK (NUMTHREADS * 63)
#define MEMPERBLOCK (NUMTHREADS * NUMBINS)
MEMPERBLOCK is 32 * 128 = 4K.
When this shared memory is referenced:
s_Hist[threadPos + __mul24( (dataTemp >> 2) & 63, NUMTHREADS)]++;
The maximum possible index is:
threadPos + 63 * NUMTHREADS = threadPos + 63*128;
Which exceeds the size of shared memory.
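The arithmetic above can be checked with a small host-side sketch. The constants are copied from the quoted macros; `max_index` and `in_bounds` are hypothetical helpers introduced only for illustration, and the sketch assumes threadPos lies in the range [0, NUMTHREADS).

```cpp
#include <cassert>

// Constants copied from the macros quoted in the issue.
constexpr int NUMBINS = 32;
constexpr int NUMTHREADS = 128;
constexpr int MEMPERBLOCK = NUMTHREADS * NUMBINS;  // 4096 bytes of s_Hist

// Largest value produced by (dataTemp >> 2) & 63.
constexpr int kMaxBin = 63;

// Hypothetical helper: index used in s_Hist[threadPos + bin * NUMTHREADS].
constexpr int max_index(int thread_pos, int bin) {
  return thread_pos + bin * NUMTHREADS;
}

// Hypothetical helper: true when idx is a valid index into s_Hist.
constexpr bool in_bounds(int idx) {
  return idx >= 0 && idx < MEMPERBLOCK;
}

static_assert(MEMPERBLOCK == 4096, "s_Hist holds 4096 elements");
// Even with threadPos == 0, the largest bin value lands outside the array.
static_assert(!in_bounds(max_index(0, kMaxBin)), "63 * 128 = 8064 >= 4096");
// Only bin values below NUMBINS stay in range for every assumed threadPos.
static_assert(in_bounds(max_index(NUMTHREADS - 1, NUMBINS - 1)), "31 * 128 + 127 = 4095");
```

Under these assumptions the worst-case index is 8191, roughly twice the size of the array.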
The whole example uses only the float type, but because some constants are written without the 'f' suffix, the compiler treats them as double values.
In some cases (when the left-hand operand is such a constant) the calculation is performed in double precision and the result is then cast to float.
This affects the performance of this workload by about 10% (with the dpcpp compiler).
Patch attached
black-scholes-sycl.zip
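The promotion described in this issue follows from the C++ usual arithmetic conversions, and can be illustrated with a minimal sketch. The function names below are hypothetical and are not taken from the benchmark or from the attached patch.

```cpp
#include <cassert>
#include <type_traits>

// An unsuffixed literal such as 0.5 has type double, so the product with a
// float is evaluated in double precision.
static_assert(std::is_same<decltype(0.5 * 1.0f), double>::value,
              "double literal * float yields double");
// With the 'f' suffix the whole expression stays in single precision.
static_assert(std::is_same<decltype(0.5f * 1.0f), float>::value,
              "float literal * float yields float");

// Hypothetical example: the multiply happens in double, then the result is
// narrowed back to float on return.
inline float half_via_double(float x) { return 0.5 * x; }

// Hypothetical example: the multiply happens entirely in float.
inline float half_via_float(float x) { return 0.5f * x; }
```

Both functions return the same value for exactly representable inputs; the difference is only in which precision the intermediate arithmetic uses, which is what can cost performance on devices with slower double-precision units.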
nms-hip needs the data file located at nms-cuda/detections.txt; please update the Makefile to refer to this path instead of the local directory.
This looks like an application-specific issue. Do we have a fix for it?
main.cpp:893:17: error: use of undeclared identifier 'NodeD'
if( NodeD(m, iD) == 0 ) return;
^
main.cpp:894:13: error: use of undeclared identifier 'NodeD'
NodeD(m, iH) = NodeD(m, iH) - NodeD(m, iR1)*( NodeD(m, iM) - NodeD(m-NLat, iM) + NodeD(m, iN)*d_R6[j] - NodeD(m-1, iN)*d_R6[j-1] );
The input data link for this benchmark is not valid. Please take a look.
Following is the input data link present in README: https://web.njit.edu/~usman/courses/cs677_spring19/traindata.gz
/src/prna-sycl> make run
icpx -D__CUDACC__ -DFLOAT -std=c++17 -Wall -fsycl -O3 -ffast-math -DUSE_GPU -c main.c -o main.o
icpx: warning: treating 'c' input as 'c++' when -fsycl is used [-Wexpected-file-type]
In file included from main.c:10:
In file included from ./prna.h:4:
In file included from ./param.h:5:
./real.h:39:10: fatal error: 'common.h' file not found
#include "common.h"
@zjin-lcf: Please check whether a source file is missing here for the prna test case.
Hi!
For the cm-hip benchmark (using the utils.h header file from cm-cuda), we are getting the following compilation error:
../cm-cuda/utils.h:99:32: error: no type named 'string' in the global namespace; did you mean 'std::string'?
int changeToDirectory(const std::string &);
^~~~~~~~
std::string
It seems like it is because in this code section in the utils.h header file:
Lines 17 to 24 in 6da926c
__CUDACC__ is not defined (which is valid because we are using hip/AMD), and therefore std is defined as empty (line 22), which results in std::string being interpreted as ::string.
Thank you!