edoyango / grasph

Basic WCSPH with RK4 time integration. Written in Fortran.

License: MIT License

Fortran 82.88% Batchfile 2.75% MATLAB 7.11% Shell 2.25% Cuda 1.37% Python 3.64%

grasph's Introduction

GraSPH

GraSPH is an SPH program originally intended for simulations of bulk granular materials as well as fluids. This repo contains Fortran source code in src_CAF and src_GPU, separated by the level of parallelism. The code is an upgraded version of that used in [1, 2, 3], incorporating more of Fortran's features, namely derived types and coarrays, as well as structural changes that enable faster run times and an option for GPU acceleration. The code currently only simulates water via the classic "weakly compressible" approach; simple granular models will be implemented soon.

  • src_CAF contains code intended to run in a multi-core configuration, enabled by the coarray features of Fortran 2008 (confirmed working with gfortran 9.4.0 and ifort 2021.8.0). Serial runs can also be set up (explained later).
  • src_GPU contains code intended to run on a CUDA-enabled GPU.
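
As a small taste of the Fortran 2008 features mentioned above (derived types and coarrays), the purely illustrative sketch below stores a hypothetical particle type in an allocatable coarray, so that each image holds its own block of particles; it does not reproduce GraSPH's actual types.

! Illustrative only: a derived type for particle data held in an allocatable coarray.
module particle_mod
   implicit none
   type :: particle_t
      real :: x(3), v(3)   ! position and velocity
      real :: rho, p       ! density and pressure
   end type particle_t
end module particle_mod

program coarray_demo
   use particle_mod
   implicit none
   type(particle_t), allocatable :: parts(:)[:]   ! coarray of derived types
   allocate(parts(1000)[*])                       ! collective allocation on every image
   if (this_image() == 1) print *, 'running with', num_images(), 'images'
end program coarray_demo

With OpenCoarrays this could be built with caf and run with cafrun, as in the example build further below.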

Beginners may wish to read the getting started page on my website.

The repo is set up to run the dam-break case, which uses the example/dambreak.h5 input HDF5 file. See the inputs page for how this HDF5 file is structured. Example scripts written in Python and MATLAB are also included in the example folder.

NOTE: This is a hobby project and is actively being developed. Major changes to the main branch can occur.

Prerequisites

Compilation of all source code requires the HDF5 libraries; the serial and CUDA code additionally requires the high-level (HL) HDF5 libraries.

  • hdf5 v1.10.7 or above (built with MPI)
  • make
  • An appropriate compiler:
    • nvfortran (tested with v21.9.0 to 23.0.0) - mandatory for GPU code;
    • gfortran (tested with v9.4.0 to v11.2.0); or
    • ifort (tested with v2021.8.0)/ifx (tested with v2023.0.0).
  • An MPI implementation (for the MPI code only)
  • If using gfortran, the OpenCoarrays library is also needed.

Compiling

The code is compiled via the makefiles in the makefiles directory, invoked as described below.

Coarray Fortran

The compilation command is:

make FC=<compiler> compiler=<gnu/intel> mode=<serial/caf> extras=<opt,dev,debug,singlecaf> -f makefiles/makefile.caf

where:

  • FC specifies the compiler command to use, e.g. gfortran, ifort, caf, mpifort, or mpiifort (default: gfortran);
  • compiler specifies the compiler family, i.e. either gnu or intel (default: gnu);
  • mode specifies whether to compile in serial or with coarrays (caf); and
  • extras adds extra options:

  • opt adds further optimisation options (which reduce usefulness of error messages)
  • dev adds compiler warnings
  • debug adds more detailed error messages and checks (which reduces performance)
  • singlecaf is useful only when compiling with the Intel compiler (compiler=intel). It compiles the executable with -coarray=single but uses parallel IO, i.e., it needs to link to the HDF5 parallel IO libraries. This allows the sph executable to be run with mpiexec.

CUDA Fortran

The compilation command is:

make -f makefiles/makefile.cuda

This makefile assumes nvfortran as the compiler and currently doesn't allow extra customisation (unless you modify the makefile). It will soon be merged with the CAF makefile.

Important Environment Variables

FCFLAGS is used to specify compiler options. Set it so that the compiler can find the HDF5 Fortran module files, e.g.

export FCFLAGS="-I/usr/include/hdf5/openmpi"

points the compiler to the HDF5 OpenMPI Fortran modules installed with apt-get.

LDFLAGS is used to specify linker options. Set it so that the linker can find the HDF5 shared libraries, e.g.

export LDFLAGS="-L/usr/lib/x86_64-linux-gnu/hdf5/openmpi"

points the linker to the HDF5 OpenMPI shared libraries installed by apt-get and needed by GraSPH. Note that -lhdf5 is included by default, but you may also need to add options for the high-level libraries (libhdf5_hl_fortran.so or libhdf5hl_fortran.so, in the case of compiling with mode=serial) or the HDF5 Fortran libraries (libhdf5_fortran.so).

Running

Running the program requires running the executable and supplying three integer arguments, e.g.:

./sph-serial <max timesteps> <print interval> <write interval>

where <max timesteps> is the maximum number of timesteps to run the simulation for, <print interval> is the number of steps between prints to the terminal, and <write interval> is the number of steps between writes of data to disk. The first compilation is set up to run the classic dam-break experiment with 65,000 SPH particles, which can be run with

./sph-serial 7500 100 100

Currently geometry and simulation parameters are hardcoded. These can be controlled via:

  • the virt_part subroutine in input.f90 (boundary geometry);
  • the input subroutine in input.f90 (initial geometry); and
  • param.f90 (simulation parameters). Work is ongoing to improve this; a sketch of what such a hardcoded parameter module looks like is shown after this list.
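
For illustration only, a hardcoded parameter module typically looks something like the following sketch; the names and values are hypothetical and are not the actual contents of GraSPH's param.f90.

! Hypothetical sketch of a hardcoded parameter module (not GraSPH's param.f90).
module param_sketch
   implicit none
   integer, parameter :: f = kind(1.d0)    ! working real kind
   integer, parameter :: dims = 2          ! number of spatial dimensions
   real(f), parameter :: rho0 = 1000._f    ! reference density (kg/m^3)
   real(f), parameter :: c0   = 40._f      ! artificial speed of sound (m/s)
   real(f), parameter :: dx   = 0.002_f    ! initial particle spacing (m)
   real(f), parameter :: dt   = 1.e-4_f    ! timestep (s)
end module param_sketch

Because these are compile-time parameters, changing any of them requires recompiling the code.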

CAF (mode=caf)

Built with OpenCoarrays:

cafrun -n <ncpus> ./sph 100000 1000 1000

Built with ifort/ifx:

FOR_COARRAY_NUM_IMAGES=<ncpus> ./sph 100000 1000 1000

or, if built with extras=singlecaf:

mpiexec -n <ncpus> ./sph 100000 1000 1000

Example Build and Run (with gfortran and OpenCoarrays)

Installing opencoarrays and openmpi via Spack (Homebrew on Linux is also a good option)

$ spack install opencoarrays

Making sure caf and cafrun are in my path

$ spack load opencoarrays openmpi
$ which caf
/usr/local/spack/opt/spack/linux-ubuntu20.04-skylake/gcc-9.4.0/opencoarrays-2.7.1-wiecvev57rcwa6wdobdhmk2fukurcm6d/bin/caf
$ which cafrun
/usr/local/spack/opt/spack/linux-ubuntu20.04-skylake/gcc-9.4.0/opencoarrays-2.7.1-wiecvev57rcwa6wdobdhmk2fukurcm6d/bin/cafrun

If the above which commands don't return anything, you'll need to add the caf bin directory to your PATH.

Compiling the program

$ cd ~/GraSPH
# Compile using caf (FC=caf), with coarrays active (mode=caf), and with extra optimisation and development warnings (extras=opt,dev)
$ make FC=caf mode=caf extras=opt,dev -f makefiles/makefile.caf

Running the program

$ cafrun -n 4 sph 100000 1000 1000

Visualisation

Currently, visualisation is done via a MATLAB script, Plot_hdf5.m, which parses the output HDF5 files. This workflow will be improved in the future.


grasph's Issues

gpu version in single precision doesn't gel with HDF5

sph-cuda: /tmp/edwardy/spack-stage/spack-stage-hdf5-1.12.2-yvixmh5ii4vsbjbt2wgafuzmnxksnfxo/spack-src/src/H5MM.c:614: H5MM_memcpy: Assertion `(char *)dest >= (const char *)src + n || (const char *)src >= (char *)dest + n' failed.

use input config file

Currently, param.f90 is used to control parameters.

This is OK, as the understanding of Fortran required is pretty limited, but maybe something like a TOML/YAML/INI file would be better.

The downside is that a lot of the parameters become run-time variables, which could degrade performance; this needs investigating.
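
One lightweight alternative, not raised in the issue itself, is Fortran's built-in namelist input; the sketch below (with hypothetical parameter names and file name) shows how run-time parameters could be read from a plain-text file without an external TOML/YAML parser.

! Hypothetical sketch: reading run-time parameters from a namelist file.
program read_config_sketch
   implicit none
   integer :: dims, u, ios
   real :: rho0, c0, dx, dt
   namelist /sim_params/ dims, rho0, c0, dx, dt

   ! defaults, used if the file omits a value or is missing
   dims = 2; rho0 = 1000.; c0 = 40.; dx = 0.002; dt = 1.e-4

   open(newunit=u, file='config.nml', status='old', action='read', iostat=ios)
   if (ios == 0) then
      read(u, nml=sim_params)
      close(u)
   end if
   print *, 'dims =', dims, ' rho0 =', rho0
end program read_config_sketch

where config.nml would contain something like

&sim_params
   dims = 2
   rho0 = 1000.0
   dx = 0.001
/

This sidesteps the need for an external parser, although the parameters do become run-time variables, as noted above.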

Add more advanced models

At the moment, I'm mainly interested in

  • delta-SPH
  • elasto-plastic type models
  • fluid granular models, e.g. the μ(I) rheology.

use precompiler directives to merge makefiles

Currently, a bash script calls separate makefiles to compile the GPU, CPU serial, and CPU parallel codes. Once the CAF CPU parallel code is done, the CPU serial and CPU parallel codes could be merged into one code base using conditional compilation with preprocessor directives.

After that, a scheme is needed to compile the GPU code within the same makefile too.

The code bases could potentially be merged using preprocessor directives, e.g., there could be a single main.f90 file and, depending on the preprocessor directives, make could compile/link the CPU or GPU version of time_integration.f90 etc.

The code base probably couldn't be completely merged, though, because of the significantly different calls made in the core computation subroutines of the GPU and CPU versions.
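
As a minimal sketch of this conditional-compilation idea (module and file names here are hypothetical, not the actual GraSPH sources), a preprocessed .F90 file could select between CPU and GPU implementations at compile time:

! sketch.F90: the capital .F90 extension makes the compiler run the preprocessor.
! Hypothetical module and subroutine names only.
module time_integration_cpu
   implicit none
contains
   subroutine time_integrate()
      print *, 'CPU time integration'
   end subroutine time_integrate
end module time_integration_cpu

module time_integration_gpu
   implicit none
contains
   subroutine time_integrate()
      print *, 'GPU time integration (would launch CUDA Fortran kernels)'
   end subroutine time_integrate
end module time_integration_gpu

program main_sketch
#ifdef GPU
   use time_integration_gpu, only: time_integrate
#else
   use time_integration_cpu, only: time_integrate
#endif
   implicit none
   call time_integrate()
end program main_sketch

Building with gfortran sketch.F90 takes the CPU path, while gfortran -DGPU sketch.F90 (or the nvfortran equivalent) takes the GPU path; a merged makefile would only need to add -DGPU for the GPU target.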

Choose appropriate chunk size when writing data

Write speed is a bit problematic when writing output data, and chunk size influences this. It would be nice for the parallel_hdf5_io_helper module to choose a reasonably good chunk_size that optimises for speed and compression. Observations so far (run on 56 images across 2 nodes on Milton):

  • writing all datasets as single chunks takes 8.5 s;
  • writing all datasets with chunk_size = n/numImages or chunk_size = [1, n/numImages] takes the write time down to 1.6 s;
  • writing all datasets with chunk_size of approximately 100 kB takes the write time to 2.4 s.
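
For reference, the chunk layout is set through a dataset-creation property list when the dataset is created. A minimal serial sketch is below (the dataset name, extent, and chunk size are illustrative, not those used by parallel_hdf5_io_helper); it can be compiled with the h5fc wrapper.

! Minimal serial sketch of creating a chunked HDF5 dataset in Fortran.
! Dataset name, extent, and chunk size are illustrative only.
program chunked_write_sketch
   use hdf5
   implicit none
   integer(hid_t) :: file_id, space_id, plist_id, dset_id
   integer(hsize_t) :: dims(1), chunk(1)
   real(kind(1.d0)), allocatable :: x(:)
   integer :: ierr

   dims = [100000_hsize_t]
   chunk = [4096_hsize_t]                    ! the chunk size being tuned
   allocate(x(dims(1)))
   x = 0.d0

   call h5open_f(ierr)
   call h5fcreate_f('sketch.h5', H5F_ACC_TRUNC_F, file_id, ierr)
   call h5screate_simple_f(1, dims, space_id, ierr)

   ! dataset-creation property list carrying the chunk layout
   call h5pcreate_f(H5P_DATASET_CREATE_F, plist_id, ierr)
   call h5pset_chunk_f(plist_id, 1, chunk, ierr)

   call h5dcreate_f(file_id, 'x', H5T_NATIVE_DOUBLE, space_id, dset_id, ierr, plist_id)
   call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, x, dims, ierr)

   call h5dclose_f(dset_id, ierr)
   call h5pclose_f(plist_id, ierr)
   call h5sclose_f(space_id, ierr)
   call h5fclose_f(file_id, ierr)
   call h5close_f(ierr)
end program chunked_write_sketch

The same property-list calls apply in the parallel case; the open question in this issue is simply what value of chunk to choose.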

Make use of MPI aware CUDA

I'd like to implement this after all the existing issues are addressed.

The current obstacle is that I don't have access to multiple GPUs to develop on. The GPUs at my work are on CUDA 12, which requires NVHPC 22.3, which in turn requires a newer OS than what we have.

Parse hdf5 as inputs

The currently hardcoded virtual and real particle generation is sub-optimal and places a greater coding burden on the user.

Allowing an input HDF5 file to be parsed would make things much easier.

A user could generate one via Python/C++/Fortran as needed. Maybe a helper Python package could be created too.

CAF ORB is slow on multiple nodes

It turns out that each image reading data from the other images' sub-copies of the grid is not ideal.

4 images across two nodes:

$ srun --nodes=2 --tasks-per-node=2 --mem-per-cpu=1G --mpi=pmi2 --nodelist=med-n18,med-n29 sph 0 1 1
srun: job 10922111 queued and waiting for resources
srun: job 10922111 has been allocated resources

                               Date = 17/03/2023
                              Time = 16:19:02.026
Executing code in parallel with    4 images!
Running       0 step(s).
Printing summary to screen every       1 step(s).
Writing output to disc every       1 step(s).
Total simulation size of 62500 physical particles, and
                         188112 virtual particles.
_______________________________________________________________________________
INFO: Image subdomain boundaries being updated
      Current timestep: 0
      Repartition mode: Updating cut locations only
      Checking whether diffusion is needed...
      No diffusion needed. Continuing...
_______________________________________________________________________________

                               Date = 17/03/2023
                              Time = 16:19:14.464
================================= TIME SUMMARY ================================
Average Wall time (s)      =      0.0000002
Average Partition time (s) =     11.2053136   <------
Average Send/recv time (s) =      0.0322832
Average Output time (s)    =      0.0000000
============================== PARTITION SUMMARY ==============================
            Number of Partitions = 1
    Avg Timesteps B/N Partitions = N/A
    Max Timesteps B/N Partitions = N/A
    Min Timesteps B/N Partitions = N/A

Number of Cut Axis Reorientation = 1
Avg Timesteps B/N Reorientations = N/A
Max Timesteps B/N Reorientations = N/A
Min Timesteps B/N Reorientations = N/A

4 images on 1 node:

$ srun --nodes=1 --tasks-per-node=4 --mem-per-cpu=1G --mpi=pmi2 --contiguous sph 0 1 1
srun: job 10922109 queued and waiting for resources
srun: job 10922109 has been allocated resources

                               Date = 17/03/2023
                              Time = 16:18:01.957
Executing code in parallel with    4 images!
Running       0 step(s).
Printing summary to screen every       1 step(s).
Writing output to disc every       1 step(s).
Total simulation size of 62500 physical particles, and
                         188112 virtual particles.
_______________________________________________________________________________
INFO: Image subdomain boundaries being updated
      Current timestep: 0
      Repartition mode: Updating cut locations only
      Checking whether diffusion is needed...
      No diffusion needed. Continuing...
_______________________________________________________________________________

                               Date = 17/03/2023
                              Time = 16:18:03.137
================================= TIME SUMMARY ================================
Average Wall time (s)      =      0.0000003
Average Partition time (s) =      0.1104427   <------
Average Send/recv time (s) =      0.0153661
Average Output time (s)    =      0.0000000
============================== PARTITION SUMMARY ==============================
            Number of Partitions = 1
    Avg Timesteps B/N Partitions = N/A
    Max Timesteps B/N Partitions = N/A
    Min Timesteps B/N Partitions = N/A

Number of Cut Axis Reorientation = 1
Avg Timesteps B/N Reorientations = N/A
Max Timesteps B/N Reorientations = N/A
Min Timesteps B/N Reorientations = N/A
