uob-hpc / minibude

A BUDE virtual-screening benchmark, in many programming models

License: Apache License 2.0

Makefile 1.38% C 61.89% Cuda 1.83% CMake 6.88% C++ 18.74% Python 1.18% Julia 8.05% Shell 0.05%
hpc benchmark performance-portability

minibude's Introduction

miniBUDE

This mini-app is an implementation of the core computation of the Bristol University Docking Engine (BUDE) in different HPC programming models. The benchmark is a virtual screening run of the NDM-1 protein and runs the energy evaluation for a single generation of poses repeatedly, for a configurable number of iterations. Increasing the iteration count has similar performance effects to docking multiple ligands back-to-back in a production BUDE docking run.

Structure

The top-level data directory contains the input common to all implementations. The top-level makedeck directory contains an input deck generation program and a set of mol2/bhff input files. Each of the other subdirectories contains a separate C/C++ implementation.

We also include implementations in emerging programming languages as direct ports of miniBUDE.

Building

To build with the default options, type make in an implementation directory. There are options to choose the compiler used and the architecture targeted.
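For example, a build of one of the Make-based implementations might look like the following (the directory name and the COMPILER/ARCH variable names are typical but hypothetical; check the implementation's README for the exact options):

cd openmp                      # or any other implementation directory
make                           # default compiler and target
make COMPILER=GNU ARCH=native  # hypothetical variable names; see the README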

Refer to each implementation's README for further build instructions.

Running

To run with the default options, run the binary without any flags. To adjust the run time, use -i to set the number of iterations. For very short runs, e.g. for simulation, use -n 1024 to reduce the number of poses.
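For example, assuming the binary is named bude (the exact name varies between implementations):

./bude           # default options
./bude -i 32     # longer run: more iterations
./bude -n 1024   # very short run with fewer poses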

Refer to each implementation's README for further run instructions.

Benchmarks

Two input decks are included in this repository:

  • bm1 is a short benchmark (~100 ms/iteration on a 64-core ThunderX2 node) based on a small ligand (26 atoms)
  • bm2 is a long benchmark (~25 s/iteration on a 64-core ThunderX2 node) based on a big ligand (2672 atoms)
  • bm2_long is a very long benchmark based on bm2 but with 1048576 poses instead of 65536

They are located in the data directory, and bm1 is run by default. All implementations accept a --deck parameter to specify an input deck directory. See makedeck for how to generate additional input decks.
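For example, again assuming a binary named bude:

./bude --deck ../data/bm2        # run the long benchmark
./bude --deck ../data/bm2_long   # run the very long benchmark (1048576 poses)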

Citing

Please cite miniBUDE using the following reference:

Andrei Poenaru, Wei-Chen Lin and Simon McIntosh-Smith. ‘A Performance Analysis of Modern Parallel Programming Models Using a Compute-Bound Application’. In: 36th International Conference, ISC High Performance 2021. Frankfurt, Germany, 2021. In press.
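An equivalent BibTeX entry might look like the following (fields taken from the reference above; the citation key is arbitrary):

@inproceedings{poenaru2021minibude,
  author    = {Andrei Poenaru and Wei-Chen Lin and Simon McIntosh-Smith},
  title     = {A Performance Analysis of Modern Parallel Programming Models Using a Compute-Bound Application},
  booktitle = {36th International Conference, ISC High Performance 2021},
  address   = {Frankfurt, Germany},
  year      = {2021},
  note      = {In press}
}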

minibude's People

Contributors

andreipoe, tom91136

minibude's Issues

Problem to Run SYCL Part

Hi,

Sorry to disturb you, but I have tried to run miniBUDE using SYCL and I got an error:

Available SYCL devices:
0. Host Device(host)
1. NVIDIA GeForce RTX 2080 Ti(gpu)

Device     : NVIDIA GeForce RTX 2080 Ti
Type       : gpu
Profile    : FULL_PROFILE
Version    : OpenCL 3.0 CUDA
Vendor     : NVIDIA Corporation
Driver     : 510.73.05
Poses      : 65536
Iterations : 8
Ligands    : 26
Proteins   : 938
Deck       : ../data/bm1
WG         : 4 (use nd_range:true)
free(): invalid pointer
Aborted

I managed to narrow the failure down to the function clCreateProgramWithBinary.

Can you help me overcome this problem?

Rui

Implement validation

All the implementations need a validation procedure against a known good output.
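A sketch of what such a check could look like, assuming the computed energies and a set of known-good reference energies are available as arrays (the names and the relative tolerance below are illustrative, not a final design):

#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch: count poses whose energy deviates from the reference by
// more than a relative tolerance, skipping entries where both values are tiny.
bool validate(const std::vector<float> &energies,
              const std::vector<float> &reference,
              float tolerancePct = 0.025f) {
  std::size_t failed = 0;
  for (std::size_t i = 0; i < energies.size(); ++i) {
    if (std::fabs(energies[i]) < 1.f && std::fabs(reference[i]) < 1.f) continue;
    float diffPct = std::fabs((energies[i] - reference[i]) / reference[i]) * 100.f;
    if (diffPct > tolerancePct) failed++;
  }
  return failed == 0;
}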

kokkos version: cmake error

Using cmake 2.12.2, cmake errors out with the following when attempting to build the kokkos version:

CMake Error at CMakeLists.txt:86 (target_link_libraries):
  The plain signature for target_link_libraries has already been used with
  the target "bude".  All uses of target_link_libraries with a target must be
  either all-keyword or all-plain.

  The uses of the plain signature are here:

   * CMakeLists.txt:76 (target_link_libraries)

It seems that when there are multiple target_link_libraries calls for a target, they must either all use the keyword signature or none of them can. In your kokkos/CMakeLists.txt, some calls use the PUBLIC keyword and one does not. I made the following change to get a successful build:

--- a/kokkos/CMakeLists.txt
+++ b/kokkos/CMakeLists.txt
@@ -73,7 +73,7 @@ if(DEFINED OLD_CMAKE_CXX_FLAGS) # restore if overwritten before, as required by
     set(CMAKE_CXX_FLAGS ${OLD_CMAKE_CXX_FLAGS})
 endif()
 
-target_link_libraries(bude Kokkos::kokkos)
+target_link_libraries(bude PUBLIC Kokkos::kokkos)
 
 if (${CMAKE_VERSION} VERSION_LESS "3.13.0")
     message(WARNING "target_link_options is only available in CMake >= 3.13.0, using fallback target_link_libraries, this may cause issues with some compilers")

This may not be the only possible solution, but it worked for me -- I am no cmake expert... I can PR if you like.

SYCL Performance Regression in f527c4c

Commit f527c4c decreases performance by more than 2X on Cascade Lake:

< 099e6ed
---
> f527c4c
12,13c12,13
< - Total time:     5603.552 ms
< - Average time:   700.444 ms
---
> - Total time:     12375.770 ms
> - Average time:   1546.971 ms

CUDA shared case (bug ?)

For the CUDA example

// Get index of first TD
int ix = blockIdx.x * blockDim.x * NUM_TD_PER_THREAD + threadIdx.x;

// Have extra threads do the last member instead of return.
// A return would disable use of barriers, so not using return is better
ix = ix < numTransforms ? ix : numTransforms - NUM_TD_PER_THREAD;

#ifdef USE_SHARED
extern __shared__ FFParams forcefield[];
if (ix < num_atom_types)
{
    forcefield[ix] = global_forcefield[ix];
}
#else

I think the index used in the shared case should be threadIdx.x rather than ix. Shouldn't it be?
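If so, a corrected version of that block might look like this (a sketch of the suggested change, not a verified fix; a barrier is still needed before the shared data is read):

#ifdef USE_SHARED
extern __shared__ FFParams forcefield[];
// Index by the thread's position within the block so that every block fills its
// own shared copy of the forcefield, independent of the global pose index ix.
if (threadIdx.x < num_atom_types)
{
    forcefield[threadIdx.x] = global_forcefield[threadIdx.x];
}
#else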

Problems to execute benchmark on Nvidia GPU

Hi,

I'm using the computecpp 2.0.0, ubuntu 20.04 and a GPU Nvidia 1070.

When I try to build your benchmark, the system returns the following error:
ptxas fatal : Unresolved extern function '_Z4fabsf'

What could be wrong?

Regards,

Rui

SYCL version won't build for oneAPI

It fails to compile (see below):
In file included from /opt/intel/oneapi/compiler/2021.3.0/linux/bin/../include/sycl/CL/sycl/detail/generic_type_traits.hpp:16:
/opt/intel/oneapi/compiler/2021.3.0/linux/bin/../include/sycl/CL/sycl/half_type.hpp:79:9: fatal error: cannot assign to non-static data member within const member function 'operator-'
Buf ^= 0x8000;
~~~ ^
/opt/intel/oneapi/compiler/2021.3.0/linux/bin/../include/sycl/CL/sycl/half_type.hpp:78:19: note: member function 'sycl::detail::host_half_impl::half::operator-' is declared const here
constexpr half &operator-() {

1 error generated.

Here is oneAPI version,
Intel(R) oneAPI DPC++/C++ Compiler 2021.3.0 (2021.3.0.20210619)

After changing the C++ standard from C++11 to C++17 (-std=c++11 => -std=c++17), it compiles.

numposes only works when set to 65536

At least for the C++ implementations, if the numposes parameter is set to anything other than 65536, the benchmark terminates with the message bad poses: N.

UoB-HPC / miniBUDE Public/data problem

Hello tom91136

I am trying to build the miniBUDE/sycl project with oneAPI. However, I find that the data files (.in) under data/bm1 and data/bm2 look almost like garbled text. Could you upload correct data?

Energy verification for < 1.f entries

For the energy verification code, the different implementations seem to disagree on whether entries are skipped based on the reference values or on the values the implementation actually computed.
For CUDA, CL, and omp-target, values are skipped based on the computed values.
For SYCL and omp, values are skipped based on the reference values.
We probably want to verify that both numbers are less than 1.f before ignoring an entry. Something like this:

if (fabs(resultsImpl[i]) < 1.f && fabs(resultsRef[i]) < 1.f) continue;

[Kokkos] Build system does not support Fujitsu Compiler in clang mode

Summary

The Fujitsu compiler for the A64FX has two modes of operation: trad (the default), which uses a proprietary frontend and CLI flags, and clang, based on the Clang frontend. To use clang mode, every compiler invocation needs to include -Nclang and use clang-style (not trad-style) flags. The current BUDE Kokkos build system doesn't provide a way to invoke the Fujitsu compiler in clang mode.

What should happen

The user should be able to specify the compiler to be either FCC (for trad mode) or FCC -Nclang (for clang mode). The selected mode needs to be used for both compiling objects and linking.

What actually happens

Including -Nclang in CXX_EXTRA_FLAGS results in CMake detecting (and using) FCC in the default trad mode:

$ cmake -DCMAKE_CXX_COMPILER='FCC'  -DCXX_EXTRA_FLAGS=-Nclang ...
-- The CXX compiler identification is Fujitsu

$ make VERBOSE=1
[  4%] Building CXX object kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_CPUDiscovery.cpp.o
/opt/FJSVstclanga/cp-1.0.20.04/bin/FCC -I... -fopenmp -march=armv8.2-a+sve -std=c++14 -o ...

Notice that the -Nclang flag is not included above. This will lead to objects being compiled in trad mode. The linker, however, will be passed the CXX_EXTRA_FLAGS, and so will try to link in clang mode, which will fail with unresolved symbols in the standard library.

On the other hand, manually setting CXXFLAGS detects the compiler as clang and applies the right flags:

$ CXX=FCC CXXFLAGS=-Nclang cmake ...
-- The CXX compiler identification is Clang 7.1.0

$ make VERBOSE=1
[  4%] Building CXX object kokkos/core/src/CMakeFiles/kokkoscore.dir/impl/Kokkos_CPUDiscovery.cpp.o
/opt/FJSVstclanga/cp-1.0.20.04/bin/FCC -I...  -Nclang --mcpu=a64fx -O3 -fopenmp=libomp ...

This leads to a complete build, but setting the environment variable this way completely overrides the flags passed to CXX_EXTRA_FLAGS, which is used in our portability scripts.

Setting CMAKE_CXX_COMPILER='FCC -Nclang' is not accepted by CMake.

Proposed fix

Ideally, the CXX_EXTRA_FLAGS should be passed to CMake early, so that it detects FCC in clang mode successfully. If this is not possible, then we will need to have a documented workaround for this specific compiler...
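One possible shape for such a workaround (an untested sketch, not an agreed fix) would be to splice CXX_EXTRA_FLAGS into the CXXFLAGS environment variable before the project() call in kokkos/CMakeLists.txt, so that compiler identification sees -Nclang, mirroring the working CXX=FCC CXXFLAGS=-Nclang invocation above:

# Untested sketch: place before the project()/enable_language() call so that
# compiler identification picks up -Nclang and detects FCC as Clang.
if (DEFINED CXX_EXTRA_FLAGS)
    string(REPLACE ";" " " EXTRA_FLAGS_STR "${CXX_EXTRA_FLAGS}")
    set(ENV{CXXFLAGS} "$ENV{CXXFLAGS} ${EXTRA_FLAGS_STR}")
endif ()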

V2 fails to prevent invalid wgsizes from launching

If we try to launch the benchmark with a WGSIZE for which no kernel exists, the program produces an invalid result instead of reporting this and terminating early:

miniBUDE:  
compile_commands:
   - "/opt/nvidia/hpc_sdk/Linux_x86_64/23.5/compilers/bin/nvcc -forward-unknown-to-host-compiler -DCUDA -DMEM=MANAGED -DUSE_PPWI="1\\,2\\,4\\,8\\,16\\,32\\,64\\,128" --options-file <OUT>/includes_CUDA.rsp  -std=c++17 -forward-unknown-to-host-compiler -arch=sm_61 -use_fast_math -restrict -keep   -DNDEBUG -std=c++17 -O3 -march=native -x cu -c <SRC>/main.cpp -o <OUT>/src/main.cpp.o"
vcs:
  commit:  e7339d6cd9b832f0ba59ed73d2bc406e4345d495*
  author:  "Tom Lin ([email protected])"
  date:    "2023-10-02 15:21:22 +0100"
  subject: "Prevent NVHPC from optimising away task barrier (likely a bug)"
host_cpu:
  ~
time: { epoch_s:1698373309, formatted: "Fri Oct 27 02:21:49 2023 GMT" }
deck:
  path:         "../data/bm1"
  poses:        65536
  proteins:     938
  ligands:      26
  forcefields:  34
config:
  iterations:   8
  poses:        65536
  ppwi:
    available:  [1,2,4,8,16,32,64,128]
    selected:   [64]
  wgsize:       [512]
device: { index: 0,  name: "NVIDIA TITAN X (Pascal) (12189MB;sm_61)" }
# Device and kernel cc: sm_61
# Verification failed for ppwi=64, wgsize=512; difference exceeded tolerance (0.025%)
# Bad energies (failed/total=58671/65536, showing first 8): 
# index,actual,expected,difference_%
# 0,0,865.523,100
# 1,0,25.0715,100
# 2,0,368.434,100
# 3,0,14.6651,100
# 4,0,574.987,100
# 5,0,707.354,100
# 6,0,33.947,100
# 7,0,135.588,100
# (ppwi=64,wgsize=512,valid=0)
results:
  - outcome:             { valid: false, max_diff_%: 100.000 }
    param:               { ppwi: 64, wgsize: 512 }
    raw_iterations:      [3.50847,0.00114,0.00047,0.00039,0.00041,0.00038,0.00036,0.00037,0.00034,0.00039]
    context_ms:          0.635100
    sum_ms:              0.003
    avg_ms:              0.000
    min_ms:              0.000
    max_ms:              0.000
    stddev_ms:           0.000
    giga_interactions/s: 4111361.976
    gflop/s:             124067012.898
    gfinst/s:            102784049.389
    energies:            
      - 0.00
      - 0.00
      - 0.00
      - 0.00
      - 0.00
      - 0.00
      - 0.00
      - 0.00
best: { min_ms: 0.00, max_ms: 0.00, sum_ms: 0.00, avg_ms: 0.00, ppwi: 64, wgsize: 512 }

We also need to add a hint to the error message explaining how the missing WGSIZE can be added.
Thanks to @jhdavis8 for discovering this.
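A guard along these lines in the driver would catch the problem before launching (an illustrative C++ sketch; the names, including availableWgsizes, are hypothetical):

#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <vector>

// Illustrative sketch (hypothetical names): refuse to launch a wgsize the binary
// has no kernel for, and hint at how to add it, instead of running anyway and
// only reporting a verification failure afterwards.
void checkWgsize(std::size_t wgsize, const std::vector<std::size_t> &availableWgsizes) {
  if (std::find(availableWgsizes.begin(), availableWgsizes.end(), wgsize) !=
      availableWgsizes.end())
    return;
  std::cerr << "No kernel compiled for wgsize=" << wgsize << "; available wgsizes:";
  for (auto w : availableWgsizes) std::cerr << " " << w;
  std::cerr << "\nHint: add the missing wgsize to the build-time kernel list and rebuild.\n";
  std::exit(EXIT_FAILURE);
}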

OpenCL version segfaults on Intel NEO

Trying the current (37a6bd8) OpenCL version on Intel UHD630 with the NEO CL driver from Intel produces a segfault:

Running OpenCL
[New Thread 0x7fffef5db700 (LWP 219939)]
Using device: Intel(R) Gen9 HD Graphics NEO

Thread 1 "bude-opencl" received signal SIGSEGV, Segmentation fault.
0x00007fffe0d989bf in clang::serialization::BasicReaderBase<clang::ASTRecordReader>::readDeclarationName() () from /lib64/../lib64/libclang-cpp.so.10
Missing separate debuginfos, use: dnf debuginfo-install clang-libs-10.0.1-2.fc32.x86_64 intel-gmmlib-20.2.2-1.fc32.x86_64 intel-igc-core-1.0.4241-1.fc32.x86_64 intel-igc-opencl-1.0.4241-1.fc32.x86_64 intel-opencl-20.28.17293-1.fc32.x86_64 intel-opencl-clang-10.0.12-1.fc32.x86_64 libedit-3.1-32.20191231cvs.fc32.x86_64 libffi-3.1-24.fc32.x86_64 libgcc-10.2.1-1.fc32.x86_64 libgomp-10.2.1-1.fc32.x86_64 libstdc++-10.2.1-1.fc32.x86_64 libva-2.7.1-1.fc32.x86_64 llvm-libs-10.0.1-4.fc32.x86_64 ncurses-libs-6.1-15.20191109.fc32.x86_64 nvidia-driver-cuda-libs-455.28-1.fc32.x86_64 ocl-icd-2.2.13-1.fc32.x86_64 spirv-llvm-translator-10.0.12-1.fc32.x86_64 zlib-1.2.11-21.fc32.x86_64
(gdb) backtrace
#0  0x00007fffe0d989bf in clang::serialization::BasicReaderBase<clang::ASTRecordReader>::readDeclarationName() () from /lib64/../lib64/libclang-cpp.so.10
#1  0x00007fffe0dd8c6e in clang::ASTDeclReader::VisitNamedDecl(clang::NamedDecl*) () from /lib64/../lib64/libclang-cpp.so.10
#2  0x00007fffe0dd9285 in clang::ASTDeclReader::VisitValueDecl(clang::ValueDecl*) () from /lib64/../lib64/libclang-cpp.so.10
#3  0x00007fffe0dd9319 in clang::ASTDeclReader::VisitDeclaratorDecl(clang::DeclaratorDecl*) () from /lib64/../lib64/libclang-cpp.so.10
#4  0x00007fffe0de94f7 in clang::ASTDeclReader::VisitFunctionDecl(clang::FunctionDecl*) () from /lib64/../lib64/libclang-cpp.so.10
#5  0x00007fffe0df07f6 in clang::ASTDeclReader::Visit(clang::Decl*) () from /lib64/../lib64/libclang-cpp.so.10
#6  0x00007fffe0df0c2b in clang::ASTReader::ReadDeclRecord(unsigned int) () from /lib64/../lib64/libclang-cpp.so.10
#7  0x00007fffe0d8da91 in clang::ASTReader::GetDecl(unsigned int) () from /lib64/../lib64/libclang-cpp.so.10
#8  0x00007fffe0db087e in clang::ASTReader::ReadASTBlock(clang::serialization::ModuleFile&, unsigned int) () from /lib64/../lib64/libclang-cpp.so.10
#9  0x00007fffe0dbade3 in clang::ASTReader::ReadAST(llvm::StringRef, clang::serialization::ModuleKind, clang::SourceLocation, unsigned int, llvm::SmallVectorImpl<clang::ASTReader::ImportedSubmodule>*) () from /lib64/../lib64/libclang-cpp.so.10
#10 0x00007fffe0f1bd50 in clang::CompilerInstance::loadModuleFile(llvm::StringRef) () from /lib64/../lib64/libclang-cpp.so.10
#11 0x00007fffe0f5c49c in clang::FrontendAction::BeginSourceFile(clang::CompilerInstance&, clang::FrontendInputFile const&) () from /lib64/../lib64/libclang-cpp.so.10
#12 0x00007fffe0f13269 in clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) () from /lib64/../lib64/libclang-cpp.so.10
#13 0x00007fffe0fcd12c in clang::ExecuteCompilerInvocation(clang::CompilerInstance*) () from /lib64/../lib64/libclang-cpp.so.10
#14 0x00007fffe1ddf6b2 in Compile () from /lib64/libopencl-clang.so.10
#15 0x00007fffed498137 in TC::CClangTranslationBlock::TranslateClang(TC::TranslateClangArgs const*, TC::STB_TranslateOutputArgs*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, char const*) () from /lib64/libigdfcl.so.1
#16 0x00007fffed499a5e in TC::CClangTranslationBlock::Translate(TC::STB_TranslateInputArgs const*, TC::STB_TranslateOutputArgs*) () from /lib64/libigdfcl.so.1
#17 0x00007fffed49d969 in IGC::FclOclTranslationCtx<0ul>::Impl::Translate(unsigned long, CIF::Builtins::Buffer<1ul>*, CIF::Builtins::Buffer<1ul>*, CIF::Builtins::Buffer<1ul>*, CIF::Builtins::Buffer<1ul>*, unsigned int) () from /lib64/libigdfcl.so.1
#18 0x00007ffff2153a7a in NEO::CompilerInterface::build(NEO::Device const&, NEO::TranslationInput const&, NEO::TranslationOutput&) ()
   from /usr/lib64/intel-opencl/libigdrcl.so
#19 0x00007ffff1f9bbef in NEO::Program::build(unsigned int, _cl_device_id* const*, char const*, void (*)(_cl_program*, void*), void*, bool) ()
   from /usr/lib64/intel-opencl/libigdrcl.so
#20 0x00007ffff1f3d7c8 in clBuildProgram () from /usr/lib64/intel-opencl/libigdrcl.so
#21 0x00007ffff7f80472 in clBuildProgram () from /lib64/libOpenCL.so.1
#22 0x0000000000402908 in initCL () at bude.c:674
#23 0x0000000000402a7f in runOpenCL (results=results@entry=0x4243a0) at bude.c:266
#24 0x000000000040130e in main (argc=<optimized out>, argv=<optimized out>) at bude.c:97

So the kernel compilation crashes at runtime; this looks like a CL runtime bug on Intel's side, to be honest.

As a sanity check, I've run the exact same binary on an Nvidia Quadro P1000, and the result was correct:

./bude-opencl --device 1 -w 4 -p 1 -i 8                                                             

Running C/OpenMP
- Total time:     1699.10 ms
- Average time:    212.39 ms
- Interactions/s:    0.47 billion
- GFLOP/s:          19.29

Running OpenCL
Using device: Quadro P1000
- Total time:      642.43 ms
- Average time:     80.30 ms
- Interactions/s:    1.24 billion
- GFLOP/s:          51.03

 OpenMP      OpenCL   (diff)
 865.52  vs  865.52  ( 0.00%)
  25.07  vs   25.07  ( 0.00%)
 368.43  vs  368.43  ( 0.00%)
  14.67  vs   14.67  ( 0.00%)
 574.99  vs  574.99  ( 0.00%)
 707.35  vs  707.35  ( 0.00%)
  33.95  vs   33.95  ( 0.00%)
 135.59  vs  135.59  ( 0.00%)

Largest difference was 0.000%

Add CLI flags for input data

Currently, the benchmark depends on a fixed relative path of the input data. A CLI flag should be available to specify a different location. This is pre-requisite to having multiple input decks.
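A minimal sketch of such a flag for the C driver (the --deck name matches the flag described in the README above; the default path, function name, and parsing details are illustrative):

#include <string.h>

/* Illustrative sketch: parse a --deck flag so the input data directory can be
   overridden on the command line (default path and names are hypothetical). */
static const char *parseDeckDir(int argc, char *argv[]) {
  const char *deckDir = "../data/bm1"; /* previously a fixed relative path */
  for (int i = 1; i < argc; i++) {
    if (strcmp(argv[i], "--deck") == 0 && i + 1 < argc) {
      deckDir = argv[++i];
    }
  }
  return deckDir;
}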

Remove OpenMP run from OpenCL implementation

The current OpenCL implementation also runs an (older, less optimised) OpenMP implementation.
The results of the two runs are then compared for validation purposes.

After standalone validation is implemented for #1, the OpenMP run in the OpenCL version should be removed.

Consolidate all driver code

We should consolidate all driver code to use the C++ version; this brings name-based matching and unified argument parsing across all C/C++ implementations.
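A hypothetical sketch of what such a unified driver could look like (all names here are illustrative, not the actual design): each implementation registers itself by name, and a common main() handles argument parsing and dispatch.

#include <functional>
#include <map>
#include <string>

// Hypothetical shared-driver sketch: implementations register a run function
// under a name, and the common driver looks them up by that name.
struct Params { int iterations; int poses; std::string deck; };
using RunFn = std::function<void(const Params &)>;

std::map<std::string, RunFn> &registry() {
  static std::map<std::string, RunFn> r;
  return r;
}

// e.g. registry()["cuda"] = runCuda;  registry()["omp"] = runOpenMP;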

Better names for WG_SIZE/NUM_TD_PER_THREAD

We should eventually consolidate all the different terms. Maybe keep only two, so something like:

  • UNROLL_ITER for the static vector/loop unrolling factor; this should be available in all implementations.
  • WG_SIZE, which only exists for implementations with an nd_range, such as CL, CUDA, and SYCL.

And no more NUM_TD_PER_THREAD or ppWI.
