edoyango / grasph

Basic WCSPH with RK4 time integration. Written in Fortran.

License: MIT License

Fortran 82.88% Batchfile 2.75% MATLAB 7.11% Shell 2.25% Cuda 1.37% Python 3.64%

grasph's Introduction

GraSPH

GraSPH is an SPH program originally intended for simulations of bulk granular materials as well as fluids. This repo contains Fortran source code in src_CAF and src_GPU, separated by the level of parallelism. The code is an upgraded version of that used in [1, 2, 3], incorporating more of Fortran's features, namely derived types and coarrays, as well as structural changes that enable faster run times and an option for GPU acceleration. The code currently only simulates water via the classic "weakly compressible" approach; simple granular models will be implemented soon.

  • src_CAF contains code intended to run in a multi-core configuration, enabled by the coarray features of Fortran 2008 (confirmed working with gfortran 9.4.0 and ifort 2021.8.0). Serial runs can also be set up (explained later).
  • src_GPU contains code intended to run on a CUDA-enabled GPU.
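
As a small taste of the Fortran 2008 features mentioned above (derived types and coarrays), the purely illustrative sketch below stores a hypothetical particle type in an allocatable coarray, so that each image holds its own block of particles; it does not reproduce GraSPH's actual types.

! Illustrative only: a derived type for particle data held in an allocatable coarray.
module particle_mod
   implicit none
   type :: particle_t
      real :: x(3), v(3)   ! position and velocity
      real :: rho, p       ! density and pressure
   end type particle_t
end module particle_mod

program coarray_demo
   use particle_mod
   implicit none
   type(particle_t), allocatable :: parts(:)[:]   ! coarray of derived types
   allocate(parts(1000)[*])                       ! collective allocation on every image
   if (this_image() == 1) print *, 'running with', num_images(), 'images'
end program coarray_demo

With OpenCoarrays this could be built with caf and run with cafrun, as in the example build further below.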

Beginners may wish to read the getting started page on my website.

The repo is set up to run the dam-break case, which uses the example/dambreak.h5 input HDF5 file. See the inputs page for how this HDF5 file is structured. Example scripts written in Python and MATLAB are also included in the example folder.

NOTE: This is a hobby project and is actively being developed. Major changes to the main branch can occur.

Prerequisites

Compilation of all source code requires the HDF5 libraries; the serial and CUDA code additionally requires the high-level (HL) HDF5 libraries.

  • hdf5 v1.10.7 or above (built with MPI)
  • make
  • An appropriate compiler:
    • nvfortran (tested with v21.9.0 to 23.0.0) - mandatory for GPU code;
    • gfortran (tested with v9.4.0 to v11.2.0); or
    • ifort (tested with v2021.8.0)/ifx (tested with v2023.0.0).
  • An MPI implementation (for the MPI code only)
  • If using gfortran, the OpenCoarrays library is also needed.

Compiling

The code is compiled via the makefiles in the makefiles directory, invoked as described below.

Coarray Fortran

The compilation command is:

make FC=<compiler> compiler=<gnu/intel> mode=<serial/caf> extras=<opt,dev,debug,singlecaf> -f makefiles/makefile.caf

where:

  • FC specifies the compiler command to use, e.g. gfortran, ifort, caf, mpifort, or mpiifort (default: gfortran);
  • compiler specifies the compiler family, i.e. either gnu or intel (default: gnu);
  • mode specifies whether to compile in serial or with coarrays (caf); and
  • extras adds extra options:

  • opt adds further optimisation options (which reduce usefulness of error messages)
  • dev adds compiler warnings
  • debug adds more detailed error messages and checks (which reduces performance)
  • singlecaf is useful only when compiling with the Intel compiler (compiler=intel). It compiles the executable with -coarray=single but uses parallel IO, i.e., it needs to link to the HDF5 parallel IO libraries. This allows the sph executable to be run with mpiexec.

CUDA Fortran

The compilation command is:

make -f makefiles/makefile.cuda

This makefile assumes nvfortran as the compiler and currently doesn't allow extra customisation (unless you modify the makefile). It will soon be merged with the CAF makefile.

Important Environment Variables

FCFLAGS is used to specify compiler options. Set it so that the compiler can find the HDF5 Fortran module files, e.g.

export FCFLAGS="-I/usr/include/hdf5/openmpi"

points the compiler to the HDF5 OpenMPI Fortran modules installed with apt-get.

LDFLAGS is used to specify linker options. Set it so that the linker can find the HDF5 shared libraries, e.g.

export LDFLAGS="-L/usr/lib/x86_64-linux-gnu/hdf5/openmpi"

points the linker to the HDF5 OpenMPI shared libraries installed by apt-get and needed by GraSPH. Note that -lhdf5 is included by default, but you may also need to add options for the high-level libraries (libhdf5_hl_fortran.so or libhdf5hl_fortran.so, in the case of compiling with mode=serial) or the HDF5 Fortran libraries (libhdf5_fortran.so).

Running

Running the program requires running the executable and supplying three integer arguments, e.g.:

./sph-serial <max timesteps> <print interval> <write interval>

where <max timesteps> is the maximum number of timesteps to run the simulation for, <print interval> is the number of steps between prints to the terminal, and <write interval> is the number of steps between writes of data to disk. The first compilation is set up to run the classic dam-break experiment with 65,000 SPH particles, which can be run with

./sph-serial 7500 100 100

Currently geometry and simulation parameters are hardcoded. These can be controlled via:

  • the virt_part subroutine in input.f90 (boundary geometry);
  • the input subroutine in input.f90 (initial geometry); and
  • param.f90 (simulation parameters). Work is ongoing to improve this; a sketch of what such a hardcoded parameter module looks like is shown after this list.
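
For illustration only, a hardcoded parameter module typically looks something like the following sketch; the names and values are hypothetical and are not the actual contents of GraSPH's param.f90.

! Hypothetical sketch of a hardcoded parameter module (not GraSPH's param.f90).
module param_sketch
   implicit none
   integer, parameter :: f = kind(1.d0)    ! working real kind
   integer, parameter :: dims = 2          ! number of spatial dimensions
   real(f), parameter :: rho0 = 1000._f    ! reference density (kg/m^3)
   real(f), parameter :: c0   = 40._f      ! artificial speed of sound (m/s)
   real(f), parameter :: dx   = 0.002_f    ! initial particle spacing (m)
   real(f), parameter :: dt   = 1.e-4_f    ! timestep (s)
end module param_sketch

Because these are compile-time parameters, changing any of them requires recompiling the code.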

CAF (mode=caf)

Built with OpenCoarrays:

cafrun -n <ncpus> ./sph 100000 1000 1000

Built with ifort/ifx:

FOR_COARRAY_NUM_IMAGES=<ncpus> ./sph 100000 1000 1000

or, if built with extras=singlecaf:

mpiexec -n <ncpus> ./sph 100000 1000 1000

Example Build and Run (with gfortran and OpenCoarrays)

Installing opencoarrays and openmpi via Spack (Homebrew on Linux is also a good option)

$ spack install opencoarrays

Making sure caf and cafrun are in my path

$ spack load opencoarrays openmpi
$ which caf
/usr/local/spack/opt/spack/linux-ubuntu20.04-skylake/gcc-9.4.0/opencoarrays-2.7.1-wiecvev57rcwa6wdobdhmk2fukurcm6d/bin/caf
$ which cafrun
/usr/local/spack/opt/spack/linux-ubuntu20.04-skylake/gcc-9.4.0/opencoarrays-2.7.1-wiecvev57rcwa6wdobdhmk2fukurcm6d/bin/cafrun

If the above which commands don't return anything, you'll need to add the caf bin directory to your PATH.

Compiling the program

$ cd ~/GraSPH
# Compile using caf (FC=caf), with coarrays active (mode=caf), and with extra optimisation and development warnings (extras=opt,dev)
$ make FC=caf mode=caf extras=opt,dev -f makefiles/makefile.caf

Running the program

$ cafrun -n 4 sph 100000 1000 1000

Visualisation

Currently, visualisation is done via a MATLAB script, Plot_hdf5.m, which parses the output HDF5 files. This workflow will be improved in the future.


grasph's Issues

gpu version in single precision doesn't gel with HDF5

sph-cuda: /tmp/edwardy/spack-stage/spack-stage-hdf5-1.12.2-yvixmh5ii4vsbjbt2wgafuzmnxksnfxo/spack-src/src/H5MM.c:614: H5MM_memcpy: Assertion `(char *)dest >= (const char *)src + n || (const char *)src >= (char *)dest + n' failed.

use input config file

Currently, param.f90 is used to control parameters.

This is OK, as the understanding of Fortran required is pretty limited, but maybe something like a TOML/YAML/INI file would be better.

The downside is that a lot of the parameters become run-time variables, which could degrade performance; this needs investigating.
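
One lightweight alternative, not raised in the issue itself, is Fortran's built-in namelist input; the sketch below (with hypothetical parameter names and file name) shows how run-time parameters could be read from a plain-text file without an external TOML/YAML parser.

! Hypothetical sketch: reading run-time parameters from a namelist file.
program read_config_sketch
   implicit none
   integer :: dims, u, ios
   real :: rho0, c0, dx, dt
   namelist /sim_params/ dims, rho0, c0, dx, dt

   ! defaults, used if the file omits a value or is missing
   dims = 2; rho0 = 1000.; c0 = 40.; dx = 0.002; dt = 1.e-4

   open(newunit=u, file='config.nml', status='old', action='read', iostat=ios)
   if (ios == 0) then
      read(u, nml=sim_params)
      close(u)
   end if
   print *, 'dims =', dims, ' rho0 =', rho0
end program read_config_sketch

where config.nml would contain something like

&sim_params
   dims = 2
   rho0 = 1000.0
   dx = 0.001
/

This sidesteps the need for an external parser, although the parameters do become run-time variables, as noted above.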

Add more advanced models

At the moment, I'm mainly interested in

  • delta-SPH
  • elasto-plastic type models
  • fluid granular models, e.g. the μ(I) rheology.

use precompiler directives to merge makefiles

Currently, a bash script calls separate makefiles to compile the GPU, CPU serial, and CPU parallel codes. Once the CAF CPU parallel code is done, the CPU serial and CPU parallel codes could be merged into one code base using conditional compilation with preprocessor directives.

After that, a scheme is needed to compile the GPU code within the same makefile too.

The code bases could potentially be merged using preprocessor directives, e.g., there could be a single main.f90 file and, depending on the preprocessor directives, make could compile/link the CPU or GPU version of time_integration.f90 etc.

The code base probably couldn't be completely merged, though, because of the significantly different calls made in the core computation subroutines of the GPU and CPU versions.
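
As a minimal sketch of this conditional-compilation idea (module and file names here are hypothetical, not the actual GraSPH sources), a preprocessed .F90 file could select between CPU and GPU implementations at compile time:

! sketch.F90: the capital .F90 extension makes the compiler run the preprocessor.
! Hypothetical module and subroutine names only.
module time_integration_cpu
   implicit none
contains
   subroutine time_integrate()
      print *, 'CPU time integration'
   end subroutine time_integrate
end module time_integration_cpu

module time_integration_gpu
   implicit none
contains
   subroutine time_integrate()
      print *, 'GPU time integration (would launch CUDA Fortran kernels)'
   end subroutine time_integrate
end module time_integration_gpu

program main_sketch
#ifdef GPU
   use time_integration_gpu, only: time_integrate
#else
   use time_integration_cpu, only: time_integrate
#endif
   implicit none
   call time_integrate()
end program main_sketch

Building with gfortran sketch.F90 takes the CPU path, while gfortran -DGPU sketch.F90 (or the nvfortran equivalent) takes the GPU path; a merged makefile would only need to add -DGPU for the GPU target.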

Choose appropriate chunk size when writing data

Write speed is a bit problematic when writing output data, and chunk size influences this. It would be nice for the parallel_hdf5_io_helper module to choose a reasonably good chunk_size that optimises for speed and compression. Observations so far (run on 56 images across 2 nodes on Milton):

  • writing all datasets as single chunks takes 8.5 s;
  • writing all datasets with chunk_size = n/numImages or chunk_size = [1, n/numImages] takes the write time down to 1.6 s;
  • writing all datasets with chunk_size of approximately 100 kB takes the write time to 2.4 s.
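
For reference, the chunk layout is set through a dataset-creation property list when the dataset is created. A minimal serial sketch is below (the dataset name, extent, and chunk size are illustrative, not those used by parallel_hdf5_io_helper); it can be compiled with the h5fc wrapper.

! Minimal serial sketch of creating a chunked HDF5 dataset in Fortran.
! Dataset name, extent, and chunk size are illustrative only.
program chunked_write_sketch
   use hdf5
   implicit none
   integer(hid_t) :: file_id, space_id, plist_id, dset_id
   integer(hsize_t) :: dims(1), chunk(1)
   real(kind(1.d0)), allocatable :: x(:)
   integer :: ierr

   dims = [100000_hsize_t]
   chunk = [4096_hsize_t]                    ! the chunk size being tuned
   allocate(x(dims(1)))
   x = 0.d0

   call h5open_f(ierr)
   call h5fcreate_f('sketch.h5', H5F_ACC_TRUNC_F, file_id, ierr)
   call h5screate_simple_f(1, dims, space_id, ierr)

   ! dataset-creation property list carrying the chunk layout
   call h5pcreate_f(H5P_DATASET_CREATE_F, plist_id, ierr)
   call h5pset_chunk_f(plist_id, 1, chunk, ierr)

   call h5dcreate_f(file_id, 'x', H5T_NATIVE_DOUBLE, space_id, dset_id, ierr, plist_id)
   call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, x, dims, ierr)

   call h5dclose_f(dset_id, ierr)
   call h5pclose_f(plist_id, ierr)
   call h5sclose_f(space_id, ierr)
   call h5fclose_f(file_id, ierr)
   call h5close_f(ierr)
end program chunked_write_sketch

The same property-list calls apply in the parallel case; the open question in this issue is simply what value of chunk to choose.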

Make use of MPI aware CUDA

I'd like to implement this after all the existing issues are addressed.

The current obstacle is that I don't have access to multiple GPUs to develop on. The GPUs at my work are on CUDA 12, which requires NVHPC 22.3, which in turn requires a newer OS than what we have.

Parse hdf5 as inputs

The currently hardcoded virtual and real particle generation is sub-optimal and places a greater coding burden on the user.

Allowing an input HDF5 file to be parsed would make things much easier.

A user could generate one via Python/C++/Fortran as needed. Maybe a helper Python package could be created too.

CAF ORB is slow on multiple nodes

It turns out that each image reading data from the other images' sub-copies of the grid is not ideal.

4 images across two nodes:

$ srun --nodes=2 --tasks-per-node=2 --mem-per-cpu=1G --mpi=pmi2 --nodelist=med-n18,med-n29 sph 0 1 1
srun: job 10922111 queued and waiting for resources
srun: job 10922111 has been allocated resources

                               Date = 17/03/2023
                              Time = 16:19:02.026
Executing code in parallel with    4 images!
Running       0 step(s).
Printing summary to screen every       1 step(s).
Writing output to disc every       1 step(s).
Total simulation size of 62500 physical particles, and
                         188112 virtual particles.
_______________________________________________________________________________
INFO: Image subdomain boundaries being updated
      Current timestep: 0
      Repartition mode: Updating cut locations only
      Checking whether diffusion is needed...
      No diffusion needed. Continuing...
_______________________________________________________________________________

                               Date = 17/03/2023
                              Time = 16:19:14.464
================================= TIME SUMMARY ================================
Average Wall time (s)      =      0.0000002
Average Partition time (s) =     11.2053136   <------
Average Send/recv time (s) =      0.0322832
Average Output time (s)    =      0.0000000
============================== PARTITION SUMMARY ==============================
            Number of Partitions = 1
    Avg Timesteps B/N Partitions = N/A
    Max Timesteps B/N Partitions = N/A
    Min Timesteps B/N Partitions = N/A

Number of Cut Axis Reorientation = 1
Avg Timesteps B/N Reorientations = N/A
Max Timesteps B/N Reorientations = N/A
Min Timesteps B/N Reorientations = N/A

4 images on 1 node:

$ srun --nodes=1 --tasks-per-node=4 --mem-per-cpu=1G --mpi=pmi2 --contiguous sph 0 1 1
srun: job 10922109 queued and waiting for resources
srun: job 10922109 has been allocated resources

                               Date = 17/03/2023
                              Time = 16:18:01.957
Executing code in parallel with    4 images!
Running       0 step(s).
Printing summary to screen every       1 step(s).
Writing output to disc every       1 step(s).
Total simulation size of 62500 physical particles, and
                         188112 virtual particles.
_______________________________________________________________________________
INFO: Image subdomain boundaries being updated
      Current timestep: 0
      Repartition mode: Updating cut locations only
      Checking whether diffusion is needed...
      No diffusion needed. Continuing...
_______________________________________________________________________________

                               Date = 17/03/2023
                              Time = 16:18:03.137
================================= TIME SUMMARY ================================
Average Wall time (s)      =      0.0000003
Average Partition time (s) =      0.1104427   <------
Average Send/recv time (s) =      0.0153661
Average Output time (s)    =      0.0000000
============================== PARTITION SUMMARY ==============================
            Number of Partitions = 1
    Avg Timesteps B/N Partitions = N/A
    Max Timesteps B/N Partitions = N/A
    Min Timesteps B/N Partitions = N/A

Number of Cut Axis Reorientation = 1
Avg Timesteps B/N Reorientations = N/A
Max Timesteps B/N Reorientations = N/A
Min Timesteps B/N Reorientations = N/A
