GithubHelp home page GithubHelp logo

markstock / nvortexhip Goto Github PK

View Code? Open in Web Editor NEW
0.0 2.0 0.0 122 KB

Simple direct solver for vortex particle methods using HIP for GPU

License: GNU General Public License v3.0

Makefile 0.86% C++ 93.70% Shell 0.70% CMake 4.73%
gpgpu gpgpu-physics hip nbody nbody-simulation performance-test

nvortexhip's Introduction

nvortexHip

Simple direct solver for vortex particle methods using HIP for GPU

Compile and run

As long Rocm is installed and hipcc is in your PATH, you should be able to do:

make
./nvHip05.bin -n=100000

Or with cmake, make a build directory and try (with the right GPU target):

cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_COMPILER=hipcc -DCMAKE_CXX_FLAGS=" -fopenmp --amdgpu-target=gfx90a" ..
make

Hipify

The first four of these codes were created from nvortexCuda using:

hipify-clang nvCuda04.cu -o=nvHip04.cpp -v --cuda-path=[..] -I [..]/include

Description

This repository contains a few progressive examples of a compute-only n-body calculation of the Biot-Savart influence of N desingularized vorticies on one another.

Program nvHip01 is the simplest implementation. On the CPU side, the program parallelizes with a basic OpenMP parallel for loop over the target particles. On the GPU side, we use HIP without unified or pinned memory (full transfers), with one target particle per "thread."

Program nvHip02 speeds this up considerably. The CPU now uses omp simd to vectorize the inner loop over source particles. The GPU uses shared memory to load blocks of source particles in a coalesced manner before all threads operate on that block. This program represents the "80" part of the "80-20 rule": that you can go most of the way with some simple methods.

Program nvHip03 adds some enhancements in an attempt to eke out even more performance, though only on the GPU side. First, we moved the GPU timers to not count allocation and deallocation, in order to be more consistent with the CPU timers. Second, we now break the computation up along the source-particle dimension, to allow for greater concurrency, which requires atomicAdd to write results back to main GPU memory. Finally, we added support for multiple GPU systems.

nvHip04 saves one flop per inner loop by presquaring the target radius, but adds six more by performing Kahan summation on the accumulators. This further reduces errors inherent in summing large arrays of numbers, but seems incompatible with the omp simd clause.

nvHip05 speeds up the CPU calculation by blocking (putting into hierarchical "blocks"), which improves memory locality, especially for large N. This has the additional benefit of reducing the errors inherent in summing large sequences of numbers (so we remove Kahan summation). It also rearranges some of the asynchronous GPU calls to reduce total memory-move-and-execution time by a few percent for multi-gpu runs.

3D Gravitation versions

ngHip04 is a modification of ngHip05 that uses Kahan summation for improved accuracy - but the Kahan summation seems to be optimized out, as the results between these two are identical.

ngHip05 is the baseline 3D gravitation code, made from the nvHip05 code, which includes blocking for the CPU calculation and using shared memory to store chunks of source particles for re-use.

ngHip06 is a gravitation version of the N-body code, and adds the __launch_bounds__ keyword, dispatches asynchronous calls to streams from multiple threads, and times the allocations.

ngHip08 is identical to ngHip05 except that is uses warp shuffle commands to read source particle data instead of shared memory. This has been reported to be faster, but here it is not.

ngHip09 is also identical to ngHip05, but it explicitly uses float2 operations in an effort to eke out more performance on MI-250X. No other Instinct accelerator has special float2 units.

ngHip10 does the same as ngHip09 but uses explicit float4 variables to further unroll the loops. Surprisingly, this improves performance further.

Building on Cray

module load PrgEnv-amd
module use /global/opt/modulefiles
module load rocm/5.2.0

Run on a slurm system with

srun -N1 -n1 --exclusive --cpus-per-task=64 --threads-per-core=1 ./ngHip06.bin -n=800000 -c

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.