uob-hpc / minibude

A BUDE virtual-screening benchmark, in many programming models

License: Apache License 2.0

Makefile 1.38% C 61.89% Cuda 1.83% CMake 6.88% C++ 18.74% Python 1.18% Julia 8.05% Shell 0.05%
hpc benchmark performance-portability

minibude's Introduction

miniBUDE

This mini-app is an implementation of the core computation of the Bristol University Docking Engine (BUDE) in different HPC programming models. The benchmark is a virtual screening run of the NDM-1 protein and runs the energy evaluation for a single generation of poses repeatedly, for a configurable number of iterations. Increasing the iteration count has similar performance effects to docking multiple ligands back-to-back in a production BUDE docking run.

Structure

The top-level data directory contains the input common to all implementations. The top-level makedeck directory contains an input deck generation program and a set of mol2/bhff input files. Each of the other subdirectories contains a separate C/C++ implementation.

We also include implementations in emerging programming languages as direct ports of miniBUDE.

Building

To build with the default options, type make in an implementation directory. There are options to choose the compiler used and the architecture targeted.
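For example, a build of one of the Make-based implementations might look like the following (the directory name and the COMPILER/ARCH variable names are typical but hypothetical; check the implementation's README for the exact options):

cd openmp                      # or any other implementation directory
make                           # default compiler and target
make COMPILER=GNU ARCH=native  # hypothetical variable names; see the README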

Refer to each implementation's README for further build instructions.

Running

To run with the default options, run the binary without any flags. To adjust the run time, use -i to set the number of iterations. For very short runs, e.g. for simulation, use -n 1024 to reduce the number of poses.
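For example, assuming the binary is named bude (the exact name varies between implementations):

./bude           # default options
./bude -i 32     # longer run: more iterations
./bude -n 1024   # very short run with fewer poses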

Refer to each implementation's README for further run instructions.

Benchmarks

Two input decks are included in this repository:

  • bm1 is a short benchmark (~100 ms/iteration on a 64-core ThunderX2 node) based on a small ligand (26 atoms)
  • bm2 is a long benchmark (~25 s/iteration on a 64-core ThunderX2 node) based on a big ligand (2672 atoms)
  • bm2_long is a very long benchmark based on bm2 but with 1048576 poses instead of 65536

They are located in the data directory, and bm1 is run by default. All implementations accept a --deck parameter to specify an input deck directory. See makedeck for how to generate additional input decks.
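For example, again assuming a binary named bude:

./bude --deck ../data/bm2        # run the long benchmark
./bude --deck ../data/bm2_long   # run the very long benchmark (1048576 poses)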

Citing

Please cite miniBUDE using the following reference:

Andrei Poenaru, Wei-Chen Lin and Simon McIntosh-Smith. ‘A Performance Analysis of Modern Parallel Programming Models Using a Compute-Bound Application’. In: 36th International Conference, ISC High Performance 2021. Frankfurt, Germany, 2021. In press.
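An equivalent BibTeX entry might look like the following (fields taken from the reference above; the citation key is arbitrary):

@inproceedings{poenaru2021minibude,
  author    = {Andrei Poenaru and Wei-Chen Lin and Simon McIntosh-Smith},
  title     = {A Performance Analysis of Modern Parallel Programming Models Using a Compute-Bound Application},
  booktitle = {36th International Conference, ISC High Performance 2021},
  address   = {Frankfurt, Germany},
  year      = {2021},
  note      = {In press}
}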

minibude's People

Contributors

andreipoe, tom91136

minibude's Issues

Problem to Run SYCL Part

Hi,

Sorry to disturb you, but I have tried to run miniBUDE using SYCL and I got an error:

Available SYCL devices:
0. Host Device(host)
1. NVIDIA GeForce RTX 2080 Ti(gpu)

Device     : NVIDIA GeForce RTX 2080 Ti
Type       : gpu
Profile    : FULL_PROFILE
Version    : OpenCL 3.0 CUDA
Vendor     : NVIDIA Corporation
Driver     : 510.73.05
Poses      : 65536
Iterations : 8
Ligands    : 26
Proteins   : 938
Deck       : ../data/bm1
WG         : 4 (use nd_range:true)
free(): invalid pointer
Aborted

I managed to narrow the failure down to the function clCreateProgramWithBinary.

Can you help me overcome this problem?

Rui

Implement validation

All the implementations need a validation procedure against a known good output.
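A sketch of what such a check could look like, assuming the computed energies and a set of known-good reference energies are available as arrays (the names and the relative tolerance below are illustrative, not a final design):

#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch: count poses whose energy deviates from the reference by
// more than a relative tolerance, skipping entries where both values are tiny.
bool validate(const std::vector<float> &energies,
              const std::vector<float> &reference,
              float tolerancePct = 0.025f) {
  std::size_t failed = 0;
  for (std::size_t i = 0; i < energies.size(); ++i) {
    if (std::fabs(energies[i]) < 1.f && std::fabs(reference[i]) < 1.f) continue;
    float diffPct = std::fabs((energies[i] - reference[i]) / reference[i]) * 100.f;
    if (diffPct > tolerancePct) failed++;
  }
  return failed == 0;
}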

kokkos version: cmake error

Using cmake 2.12.2, cmake errors out with the following when attempting to build the kokkos version:

CMake Error at CMakeLists.txt:86 (target_link_libraries):
  The plain signature for target_link_libraries has already been used with
  the target "bude".  All uses of target_link_libraries with a target must be
  either all-keyword or all-plain.

  The uses of the plain signature are here:

   * CMakeLists.txt:76 (target_link_libraries)

It seems that when there are multiple target_link_libraries calls for a target, they must either all use the keyword signature or none of them can. In your kokkos/CMakeLists.txt, some calls use the PUBLIC keyword and one does not. I made the following change to get a successful build:

--- a/kokkos/CMakeLists.txt
+++ b/kokkos/CMakeLists.txt
@@ -73,7 +73,7 @@ if(DEFINED OLD_CMAKE_CXX_FLAGS) # restore if overwritten before, as required by
     set(CMAKE_CXX_FLAGS ${OLD_CMAKE_CXX_FLAGS})
 endif()
 
-target_link_libraries(bude Kokkos::kokkos)
+target_link_libraries(bude PUBLIC Kokkos::kokkos)
 
 if (${CMAKE_VERSION} VERSION_LESS "3.13.0")
     message(WARNING "target_link_options is only available in CMake >= 3.13.0, using fallback target_link_libraries, this may cause issues with some compilers")

This may not be the only possible solution, but it worked for me -- I am no cmake expert... I can PR if you like.

SYCL Performance Regression in f527c4c

Commit f527c4c decreases performance by more than 2X on Cascade Lake:

< 099e6ed
---
> f527c4c
12,13c12,13
< - Total time:     5603.552 ms
< - Average time:   700.444 ms
---
> - Total time:     12375.770 ms
> - Average time:   1546.971 ms

CUDA shared case (bug ?)

For the CUDA example

// Get index of first TD
int ix = blockIdx.x * blockDim.x * NUM_TD_PER_THREAD + threadIdx.x;

// Have extra threads do the last member instead of return.
// A return would disable use of barriers, so not using return is better
ix = ix < numTransforms ? ix : numTransforms - NUM_TD_PER_THREAD;

#ifdef USE_SHARED
extern __shared__ FFParams forcefield[];
if (ix < num_atom_types)
{
    forcefield[ix] = global_forcefield[ix];
}
#else

I think the index used in the shared case should be threadIdx.x rather than ix. Shouldn't it be?
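If so, a corrected version of that block might look like this (a sketch of the suggested change, not a verified fix; a barrier is still needed before the shared data is read):

#ifdef USE_SHARED
extern __shared__ FFParams forcefield[];
// Index by the thread's position within the block so that every block fills its
// own shared copy of the forcefield, independent of the global pose index ix.
if (threadIdx.x < num_atom_types)
{
    forcefield[threadIdx.x] = global_forcefield[threadIdx.x];
}
#else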

Problems to execute benchmark on Nvidia GPU

Hi,

I'm using the computecpp 2.0.0, ubuntu 20.04 and a GPU Nvidia 1070.

When I try to build your benchmark, the system returns the following error:
ptxas fatal : Unresolved extern function '_Z4fabsf'

What could be wrong?

Regards,

Rui

SYCL version won't build for oneAPI

It fails to compile (see below):
In file included from /opt/intel/oneapi/compiler/2021.3.0/linux/bin/../include/sycl/CL/sycl/detail/generic_type_traits.hpp:16:
/opt/intel/oneapi/compiler/2021.3.0/linux/bin/../include/sycl/CL/sycl/half_type.hpp:79:9: fatal error: cannot assign to non-static data member within const member function 'operator-'
Buf ^= 0x8000;
~~~ ^
/opt/intel/oneapi/compiler/2021.3.0/linux/bin/../include/sycl/CL/sycl/half_type.hpp:78:19: note: member function 'sycl::detail::host_half_impl::half::operator-' is declared const here
constexpr half &operator-() {

1 error generated.

Here is oneAPI version,
Intel(R) oneAPI DPC++/C++ Compiler 2021.3.0 (2021.3.0.20210619)

After changing the C++ standard from C++11 to C++17 (-std=c++11 => -std=c++17), it compiles.

numposes only works when set to 65536

At least for the C++ implementations, if the numposes parameter is set to anything other than 65536, the benchmark terminates with the message bad poses: N.

UoB-HPC / miniBUDE Public/data problem

Hello tom91136

I am trying to build the miniBUDE/sycl project with oneAPI. However, I find that the data files (.in) under data/bm1 and data/bm2 look almost like garbled text. Could you upload correct data?

Energy verification for < 1.f entries

For the energy verification code, the different implementations seem to disagree on whether entries are skipped based on the reference values or on the values the implementation actually computed.
For CUDA, CL, and omp-target, values are skipped based on the computed values.
For SYCL and omp, values are skipped based on the reference values.
We probably want to verify that both numbers are less than 1.f before ignoring an entry. Something like this:

if (fabs(resultsImpl[i]) < 1.f && fabs(resultsRef[i]) < 1.f) continue;

[Kokkos] Build system does not support Fujitsu Compiler in clang mode

Summary

The Fujitsu compiler for the A64FX has two modes of operation: trad (the default), which uses a proprietary frontend and CLI flags, and clang, based on the Clang frontend. To use clang mode, every compiler invocation needs to include -Nclang and use clang-style (not trad-style) flags. The current BUDE Kokkos build system doesn't provide a way to invoke the Fujitsu compiler in clang mode.

What should happen

The user should be able to specify the compiler to be either FCC (for trad mode) or FCC -Nclang (for clang mode). The selected mode needs to be used for both compiling objects and linking.

What actually happens

Including -Nclang in CXX_EXTRA_FLAGS results in CMake detecting (and using) FCC in the default trad mode:

$ cmake -DCMAKE_CXX_COMPILER='FCC'  -DCXX_EXTRA_FLAGS=-Nclang ...
-- The CXX compiler identification is Fujitsu

$ make VERBOSE=1
[  4%] Building CXX object kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_CPUDiscovery.cpp.o
/opt/FJSVstclanga/cp-1.0.20.04/bin/FCC -I... -fopenmp -march=armv8.2-a+sve -std=c++14 -o ...

Notice that the -Nclang flag is not included above. This will lead to objects being compiled in trad mode. The linker, however, will be passed the CXX_EXTRA_FLAGS, and so will try to link in clang mode, which will fail with unresolved symbols in the standard library.

On the other hand, manually setting CXXFLAGS detects the compiler as clang and applies the right flags:

$ CXX=FCC CXXFLAGS=-Nclang cmake ...
-- The CXX compiler identification is Clang 7.1.0

$ make VERBOSE=1
[  4%] Building CXX object kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_CPUDiscovery.cpp.o
/opt/FJSVstclanga/cp-1.0.20.04/bin/FCC -I...  -Nclang --mcpu=a64fx -O3 -fopenmp=libomp ...

This leads to a complete build, but setting the environment variable this way completely overrides the flags passed to CXX_EXTRA_FLAGS, which is used in our portability scripts.

Setting CMAKE_CXX_COMPILER='FCC -Nclang' is not accepted by CMake.

Proposed fix

Ideally, the CXX_EXTRA_FLAGS should be passed to CMake early, so that it detects FCC in clang mode successfully. If this is not possible, then we will need to have a documented workaround for this specific compiler...
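One possible shape for such a workaround (an untested sketch, not an agreed fix) would be to splice CXX_EXTRA_FLAGS into the CXXFLAGS environment variable before the project() call in kokkos/CMakeLists.txt, so that compiler identification sees -Nclang, mirroring the working CXX=FCC CXXFLAGS=-Nclang invocation above:

# Untested sketch: place before the project()/enable_language() call so that
# compiler identification picks up -Nclang and detects FCC as Clang.
if (DEFINED CXX_EXTRA_FLAGS)
    string(REPLACE ";" " " EXTRA_FLAGS_STR "${CXX_EXTRA_FLAGS}")
    set(ENV{CXXFLAGS} "$ENV{CXXFLAGS} ${EXTRA_FLAGS_STR}")
endif ()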

V2 fails to prevent invalid wgsizes from launching

If we try to launch the benchmark with a WGSIZE for which no kernel exists, the program produces an invalid result instead of reporting this and terminating early:

miniBUDE:  
compile_commands:
   - "/opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/bin/nvcc -forward-unknown-to-host-compiler -DCUDA -DMEM=MANAGED -DUSE_PPWI="1\\,2\\,4\\,8\\,16\\,32\\,64\\,128" --options-file <OUT>/includes_CUDA.rsp  -std=c++17 -forward-unknown-to-host-compiler -arch=sm_61 -use_fast_math -restrict -keep   -DNDEBUG -std=c++17 -O3 -march=native -x cu -c <SRC>/main.cpp -o <OUT>/src/main.cpp.o"
vcs:
  commit:  e7339d6cd9b832f0ba59ed73d2bc406e4345d495*
  author:  "Tom Lin ([email protected])"
  date:    "2023-10-02 15:21:22 +0100"
  subject: "Prevent NVHPC from optimising away task barrier (likely a bug)"
host_cpu:
  ~
time: { epoch_s:1698373309, formatted: "Fri Oct 27 02:21:49 2023 GMT" }
deck:
  path:         "../data/bm1"
  poses:        65536
  proteins:     938
  ligands:      26
  forcefields:  34
config:
  iterations:   8
  poses:        65536
  ppwi:
    available:  [1,2,4,8,16,32,64,128]
    selected:   [64]
  wgsize:       [512]
device: { index: 0,  name: "NVIDIA TITAN X (Pascal) (12189MB;sm_61)" }
# Device and kernel cc: sm_61
# Verification failed for ppwi=64, wgsize=512; difference exceeded tolerance (0.025%)
# Bad energies (failed/total=58671/65536, showing first 8): 
# index,actual,expected,difference_%
# 0,0,865.523,100
# 1,0,25.0715,100
# 2,0,368.434,100
# 3,0,14.6651,100
# 4,0,574.987,100
# 5,0,707.354,100
# 6,0,33.947,100
# 7,0,135.588,100
# (ppwi=64,wgsize=512,valid=0)
results:
  - outcome:             { valid: false, max_diff_%: 100.000 }
    param:               { ppwi: 64, wgsize: 512 }
    raw_iterations:      [3.50847,0.00114,0.00047,0.00039,0.00041,0.00038,0.00036,0.00037,0.00034,0.00039]
    context_ms:          0.635100
    sum_ms:              0.003
    avg_ms:              0.000
    min_ms:              0.000
    max_ms:              0.000
    stddev_ms:           0.000
    giga_interactions/s: 4111361.976
    gflop/s:             124067012.898
    gfinst/s:            102784049.389
    energies:            
      - 0.00
      - 0.00
      - 0.00
      - 0.00
      - 0.00
      - 0.00
      - 0.00
      - 0.00
best: { min_ms: 0.00, max_ms: 0.00, sum_ms: 0.00, avg_ms: 0.00, ppwi: 64, wgsize: 512 }

We also need to add a hint to the error message explaining how the missing WGSIZE can be added.
Thanks to @jhdavis8 for discovering this.
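A guard along these lines in the driver would catch the problem before launching (an illustrative C++ sketch; the names, including availableWgsizes, are hypothetical):

#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <vector>

// Illustrative sketch (hypothetical names): refuse to launch a wgsize the binary
// has no kernel for, and hint at how to add it, instead of running anyway and
// only reporting a verification failure afterwards.
void checkWgsize(std::size_t wgsize, const std::vector<std::size_t> &availableWgsizes) {
  if (std::find(availableWgsizes.begin(), availableWgsizes.end(), wgsize) !=
      availableWgsizes.end())
    return;
  std::cerr << "No kernel compiled for wgsize=" << wgsize << "; available wgsizes:";
  for (auto w : availableWgsizes) std::cerr << " " << w;
  std::cerr << "\nHint: add the missing wgsize to the build-time kernel list and rebuild.\n";
  std::exit(EXIT_FAILURE);
}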

OpenCL version segfaults on Intel NEO

Trying the current (37a6bd8) OpenCL version on Intel UHD630 with the NEO CL driver from Intel produces a segfault:

Running OpenCL
[New Thread 0x7fffef5db700 (LWP 219939)]
Using device: Intel(R) Gen9 HD Graphics NEO

Thread 1 "bude-opencl" received signal SIGSEGV, Segmentation fault.
0x00007fffe0d989bf in clang::serialization::BasicReaderBase<clang::ASTRecordReader>::readDeclarationName() () from /lib64/../lib64/libclang-cpp.so.10
Missing separate debuginfos, use: dnf debuginfo-install clang-libs-10.0.1-2.fc32.x86_64 intel-gmmlib-20.2.2-1.fc32.x86_64 intel-igc-core-1.0.4241-1.fc32.x86_64 intel-igc-opencl-1.0.4241-1.fc32.x86_64 intel-opencl-20.28.17293-1.fc32.x86_64 intel-opencl-clang-10.0.12-1.fc32.x86_64 libedit-3.1-32.20191231cvs.fc32.x86_64 libffi-3.1-24.fc32.x86_64 libgcc-10.2.1-1.fc32.x86_64 libgomp-10.2.1-1.fc32.x86_64 libstdc++-10.2.1-1.fc32.x86_64 libva-2.7.1-1.fc32.x86_64 llvm-libs-10.0.1-4.fc32.x86_64 ncurses-libs-6.1-15.20191109.fc32.x86_64 nvidia-driver-cuda-libs-455.28-1.fc32.x86_64 ocl-icd-2.2.13-1.fc32.x86_64 spirv-llvm-translator-10.0.12-1.fc32.x86_64 zlib-1.2.11-21.fc32.x86_64
(gdb) backtrace
#0  0x00007fffe0d989bf in clang::serialization::BasicReaderBase<clang::ASTRecordReader>::readDeclarationName() () from /lib64/../lib64/libclang-cpp.so.10
#1  0x00007fffe0dd8c6e in clang::ASTDeclReader::VisitNamedDecl(clang::NamedDecl*) () from /lib64/../lib64/libclang-cpp.so.10
#2  0x00007fffe0dd9285 in clang::ASTDeclReader::VisitValueDecl(clang::ValueDecl*) () from /lib64/../lib64/libclang-cpp.so.10
#3  0x00007fffe0dd9319 in clang::ASTDeclReader::VisitDeclaratorDecl(clang::DeclaratorDecl*) () from /lib64/../lib64/libclang-cpp.so.10
#4  0x00007fffe0de94f7 in clang::ASTDeclReader::VisitFunctionDecl(clang::FunctionDecl*) () from /lib64/../lib64/libclang-cpp.so.10
#5  0x00007fffe0df07f6 in clang::ASTDeclReader::Visit(clang::Decl*) () from /lib64/../lib64/libclang-cpp.so.10
#6  0x00007fffe0df0c2b in clang::ASTReader::ReadDeclRecord(unsigned int) () from /lib64/../lib64/libclang-cpp.so.10
#7  0x00007fffe0d8da91 in clang::ASTReader::GetDecl(unsigned int) () from /lib64/../lib64/libclang-cpp.so.10
#8  0x00007fffe0db087e in clang::ASTReader::ReadASTBlock(clang::serialization::ModuleFile&, unsigned int) () from /lib64/../lib64/libclang-cpp.so.10
#9  0x00007fffe0dbade3 in clang::ASTReader::ReadAST(llvm::StringRef, clang::serialization::ModuleKind, clang::SourceLocation, unsigned int, llvm::SmallVectorImpl<clang::ASTReader::ImportedSubmodule>*) () from /lib64/../lib64/libclang-cpp.so.10
#10 0x00007fffe0f1bd50 in clang::CompilerInstance::loadModuleFile(llvm::StringRef) () from /lib64/../lib64/libclang-cpp.so.10
#11 0x00007fffe0f5c49c in clang::FrontendAction::BeginSourceFile(clang::CompilerInstance&, clang::FrontendInputFile const&) () from /lib64/../lib64/libclang-cpp.so.10
#12 0x00007fffe0f13269 in clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) () from /lib64/../lib64/libclang-cpp.so.10
#13 0x00007fffe0fcd12c in clang::ExecuteCompilerInvocation(clang::CompilerInstance*) () from /lib64/../lib64/libclang-cpp.so.10
#14 0x00007fffe1ddf6b2 in Compile () from /lib64/libopencl-clang.so.10
#15 0x00007fffed498137 in TC::CClangTranslationBlock::TranslateClang(TC::TranslateClangArgs const*, TC::STB_TranslateOutputArgs*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, char const*) () from /lib64/libigdfcl.so.1
#16 0x00007fffed499a5e in TC::CClangTranslationBlock::Translate(TC::STB_TranslateInputArgs const*, TC::STB_TranslateOutputArgs*) () from /lib64/libigdfcl.so.1
#17 0x00007fffed49d969 in IGC::FclOclTranslationCtx<0ul>::Impl::Translate(unsigned long, CIF::Builtins::Buffer<1ul>*, CIF::Builtins::Buffer<1ul>*, CIF::Builtins::Buffer<1ul>*, CIF::Builtins::Buffer<1ul>*, unsigned int) () from /lib64/libigdfcl.so.1
#18 0x00007ffff2153a7a in NEO::CompilerInterface::build(NEO::Device const&, NEO::TranslationInput const&, NEO::TranslationOutput&) ()
   from /usr/lib64/intel-opencl/libigdrcl.so
#19 0x00007ffff1f9bbef in NEO::Program::build(unsigned int, _cl_device_id* const*, char const*, void (*)(_cl_program*, void*), void*, bool) ()
   from /usr/lib64/intel-opencl/libigdrcl.so
#20 0x00007ffff1f3d7c8 in clBuildProgram () from /usr/lib64/intel-opencl/libigdrcl.so
#21 0x00007ffff7f80472 in clBuildProgram () from /lib64/libOpenCL.so.1
#22 0x0000000000402908 in initCL () at bude.c:674
#23 0x0000000000402a7f in runOpenCL (results=results@entry=0x4243a0) at bude.c:266
#24 0x000000000040130e in main (argc=<optimized out>, argv=<optimized out>) at bude.c:97

So the kernel compilation crashes at runtime; this looks like a CL runtime bug on Intel's side, to be honest.

As a sanity check, I've run the exact same binary on an Nvidia Quadro P1000, and the result was correct:

./bude-opencl --device 1 -w 4 -p 1 -i 8                                                             

Running C/OpenMP
- Total time:     1699.10 ms
- Average time:    212.39 ms
- Interactions/s:    0.47 billion
- GFLOP/s:          19.29

Running OpenCL
Using device: Quadro P1000
- Total time:      642.43 ms
- Average time:     80.30 ms
- Interactions/s:    1.24 billion
- GFLOP/s:          51.03

 OpenMP      OpenCL   (diff)
 865.52  vs  865.52  ( 0.00%)
  25.07  vs   25.07  ( 0.00%)
 368.43  vs  368.43  ( 0.00%)
  14.67  vs   14.67  ( 0.00%)
 574.99  vs  574.99  ( 0.00%)
 707.35  vs  707.35  ( 0.00%)
  33.95  vs   33.95  ( 0.00%)
 135.59  vs  135.59  ( 0.00%)

Largest difference was 0.000%

Add CLI flags for input data

Currently, the benchmark depends on a fixed relative path of the input data. A CLI flag should be available to specify a different location. This is pre-requisite to having multiple input decks.
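A minimal sketch of such a flag for the C driver (the --deck name matches the flag described in the README above; the default path, function name, and parsing details are illustrative):

#include <string.h>

/* Illustrative sketch: parse a --deck flag so the input data directory can be
   overridden on the command line (default path and names are hypothetical). */
static const char *parseDeckDir(int argc, char *argv[]) {
  const char *deckDir = "../data/bm1"; /* previously a fixed relative path */
  for (int i = 1; i < argc; i++) {
    if (strcmp(argv[i], "--deck") == 0 && i + 1 < argc) {
      deckDir = argv[++i];
    }
  }
  return deckDir;
}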

Remove OpenMP run from OpenCL implementation

The current OpenCL implementation also runs an (older, less optimised) OpenMP implementation.
The results of the two runs are then compared for validation purposes.

After standalone validation is implemented for #1, the OpenMP run in the OpenCL version should be removed.

Consolidate all driver code

We should consolidate all driver code to use the C++ version; this brings name-based matching and unified argument parsing across all C/C++ implementations.
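A hypothetical sketch of what such a unified driver could look like (all names here are illustrative, not the actual design): each implementation registers itself by name, and a common main() handles argument parsing and dispatch.

#include <functional>
#include <map>
#include <string>

// Hypothetical shared-driver sketch: implementations register a run function
// under a name, and the common driver looks them up by that name.
struct Params { int iterations; int poses; std::string deck; };
using RunFn = std::function<void(const Params &)>;

std::map<std::string, RunFn> &registry() {
  static std::map<std::string, RunFn> r;
  return r;
}

// e.g. registry()["cuda"] = runCuda;  registry()["omp"] = runOpenMP;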

Better names for WG_SIZE/NUM_TD_PER_THREAD

We should eventually consolidate all the different terms. Maybe keep only two, so something like:

  • UNROLL_ITER for the static vector/loop unrolling factor; this should be available in all implementations.
  • WG_SIZE, which only exists for implementations with an nd_range, such as CL, CUDA, and SYCL.

And no more NUM_TD_PER_THREAD or ppWI.
