eth-cscs / ImplicitGlobalGrid.jl

Almost trivial distributed parallelization of stencil-based GPU and CPU applications on a regular staggered grid

License: BSD 3-Clause "New" or "Revised" License

Language: Julia 100.00%
Topics: multi-gpu, julia, distributed, julia-mpi-wrapper, mpi, cuda, stencil-codes, staggered-grids, gpu

ImplicitGlobalGrid.jl's Introduction

ImplicitGlobalGrid.jl


ImplicitGlobalGrid is an outcome of a collaboration of the Swiss National Supercomputing Centre, ETH Zurich (Dr. Samuel Omlin) with Stanford University (Dr. Ludovic Räss) and the Swiss Geocomputing Centre (Prof. Yuri Podladchikov). It renders the distributed parallelization of stencil-based GPU and CPU applications on a regular staggered grid almost trivial and enables close to ideal weak scaling of real-world applications on thousands of GPUs [1, 2, 3]:

Weak scaling Piz Daint

ImplicitGlobalGrid relies on the Julia MPI wrapper (MPI.jl) to perform halo updates close to the hardware limit and leverages CUDA-aware or ROCm-aware MPI for GPU applications. The communication can straightforwardly be hidden behind computation [1, 3] (how this can be done automatically when using ParallelStencil.jl is shown in [3]; a general approach particularly suited for CUDA C applications is explained in [4]).

A particularity of ImplicitGlobalGrid is the automatic implicit creation of the global computational grid based on the number of processes the application is run with (and based on the process topology, which can be explicitly chosen by the user or automatically defined). As a consequence, the user only needs to write code that solves the problem on one GPU/CPU (local grid); then, as few as three functions are enough to transform a single-GPU/CPU application into a massively scaling Multi-GPU/CPU application (see the example below). 1-D, 2-D and 3-D grids are supported. Here is a sketch of the global grid that results from running a 2-D solver with 4 processes (P1-P4) (a 2x2 process topology is created by default in this case):

Implicit global grid


Multi-GPU with three functions

Only three functions are required to perform halo updates close to the hardware limit (a minimal usage sketch follows this list):

  • init_global_grid
  • update_halo!
  • finalize_global_grid
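
For orientation, here is a minimal structural sketch (with a placeholder standing in for a real stencil computation; this is an illustration, not taken from the package documentation) of how these three calls wrap an existing single-process solver; the heat diffusion example below shows the same pattern with real physics:

using ImplicitGlobalGrid

nx, ny, nz = 64, 64, 64                        # local number of grid points on each process
init_global_grid(nx, ny, nz);                  # 1) set up the implicit global grid (and MPI)
A = zeros(nx, ny, nz);                         # local array, including the halo cells
for it = 1:100
    A[2:end-1,2:end-1,2:end-1] .= it;          # placeholder for the stencil computation of the inner points
    update_halo!(A);                           # 2) exchange the halo with the neighboring processes
end
finalize_global_grid();                        # 3) tear down the implicit global grid (and MPI)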

Three additional functions are provided to query Cartesian coordinates with respect to the global computational grid if required:

  • x_g
  • y_g
  • z_g

Moreover, the following three functions can be used to query the size of the global grid (a short sketch combining these query functions follows the list):

  • nx_g
  • ny_g
  • nz_g
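
For illustration, here is a short sketch (illustrative usage mirroring the example below, not taken from the package documentation) of how these queries are typically combined: the global grid size defines the grid spacing, and x_g maps a local index of an array to its global x-coordinate:

using ImplicitGlobalGrid

lx         = 10.0                              # global domain length in dimension x
nx, ny, nz = 32, 32, 32                        # local number of grid points on each process
init_global_grid(nx, ny, nz);
dx = lx/(nx_g()-1);                            # grid spacing computed from the global (not local) size
A  = zeros(nx, ny, nz);
x  = [x_g(ix, dx, A) for ix = 1:size(A,1)];    # global x-coordinates of the local grid points of A
finalize_global_grid();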

The following Multi-GPU 3-D heat diffusion solver illustrates how these functions enable the creation of massively parallel applications.

50-line Multi-GPU example

This simple Multi-GPU 3-D heat diffusion solver uses ImplicitGlobalGrid. It relies fully on the broadcasting capabilities of CUDA.jl's CuArray type to perform the stencil computations with maximal simplicity (CUDA.jl also enables writing explicit GPU kernels, which can lead to significantly better performance for these computations).

using CUDA                # Import CUDA before ImplicitGlobalGrid to activate its CUDA device support
using ImplicitGlobalGrid

@views d_xa(A) = A[2:end  , :     , :     ] .- A[1:end-1, :     , :     ];
@views d_xi(A) = A[2:end  ,2:end-1,2:end-1] .- A[1:end-1,2:end-1,2:end-1];
@views d_ya(A) = A[ :     ,2:end  , :     ] .- A[ :     ,1:end-1, :     ];
@views d_yi(A) = A[2:end-1,2:end  ,2:end-1] .- A[2:end-1,1:end-1,2:end-1];
@views d_za(A) = A[ :     , :     ,2:end  ] .- A[ :     , :     ,1:end-1];
@views d_zi(A) = A[2:end-1,2:end-1,2:end  ] .- A[2:end-1,2:end-1,1:end-1];
@views  inn(A) = A[2:end-1,2:end-1,2:end-1]

@views function diffusion3D()
    # Physics
    lam        = 1.0;                                       # Thermal conductivity
    cp_min     = 1.0;                                       # Minimal heat capacity
    lx, ly, lz = 10.0, 10.0, 10.0;                          # Length of domain in dimensions x, y and z

    # Numerics
    nx, ny, nz = 256, 256, 256;                             # Number of gridpoints in dimensions x, y and z
    nt         = 100000;                                    # Number of time steps
    init_global_grid(nx, ny, nz);                           # Initialize the implicit global grid
    dx         = lx/(nx_g()-1);                             # Space step in dimension x
    dy         = ly/(ny_g()-1);                             # ...        in dimension y
    dz         = lz/(nz_g()-1);                             # ...        in dimension z

    # Array initializations
    T     = CUDA.zeros(Float64, nx,   ny,   nz  );
    Cp    = CUDA.zeros(Float64, nx,   ny,   nz  );
    dTedt = CUDA.zeros(Float64, nx-2, ny-2, nz-2);
    qx    = CUDA.zeros(Float64, nx-1, ny-2, nz-2);
    qy    = CUDA.zeros(Float64, nx-2, ny-1, nz-2);
    qz    = CUDA.zeros(Float64, nx-2, ny-2, nz-1);

    # Initial conditions (heat capacity and temperature with two Gaussian anomalies each)
    Cp .= cp_min .+ CuArray([5*exp(-((x_g(ix,dx,Cp)-lx/1.5))^2-((y_g(iy,dy,Cp)-ly/2))^2-((z_g(iz,dz,Cp)-lz/1.5))^2) +
                             5*exp(-((x_g(ix,dx,Cp)-lx/3.0))^2-((y_g(iy,dy,Cp)-ly/2))^2-((z_g(iz,dz,Cp)-lz/1.5))^2) for ix=1:size(T,1), iy=1:size(T,2), iz=1:size(T,3)])
    T  .= CuArray([100*exp(-((x_g(ix,dx,T)-lx/2)/2)^2-((y_g(iy,dy,T)-ly/2)/2)^2-((z_g(iz,dz,T)-lz/3.0)/2)^2) +
                    50*exp(-((x_g(ix,dx,T)-lx/2)/2)^2-((y_g(iy,dy,T)-ly/2)/2)^2-((z_g(iz,dz,T)-lz/1.5)/2)^2) for ix=1:size(T,1), iy=1:size(T,2), iz=1:size(T,3)])

    # Time loop
    dt = min(dx*dx,dy*dy,dz*dz)*cp_min/lam/8.1;                                               # Time step for the 3D Heat diffusion
    for it = 1:nt
        qx    .= -lam.*d_xi(T)./dx;                                                           # Fourier's law of heat conduction: q_x   = -λ δT/δx
        qy    .= -lam.*d_yi(T)./dy;                                                           # ...                               q_y   = -λ δT/δy
        qz    .= -lam.*d_zi(T)./dz;                                                           # ...                               q_z   = -λ δT/δz
        dTedt .= 1.0./inn(Cp).*(-d_xa(qx)./dx .- d_ya(qy)./dy .- d_za(qz)./dz);               # Conservation of energy:           δT/δt = 1/cₚ (-δq_x/δx - δq_y/δy - δq_z/δz)
        T[2:end-1,2:end-1,2:end-1] .= inn(T) .+ dt.*dTedt;                                    # Update of temperature             T_new = T_old + δT/δt
        update_halo!(T);                                                                      # Update the halo of T
    end

    finalize_global_grid();                                                                   # Finalize the implicit global grid
end

diffusion3D()

The corresponding file can be found here. A basic CPU-only example is available here (no use of multi-threading).

Straightforward in-situ visualization / monitoring

ImplicitGlobalGrid provides a function to gather an array from each process into one large array on a single process, assembled according to the global grid:

  • gather!

This enables straightforward in-situ visualization or monitoring of Multi-GPU/CPU applications using e.g. the Julia Plots package as shown in the following (the GR backend is used as it is particularly fast according to the Julia Plots documentation). It is enough to add a couple of lines to the previous example (omitted unmodified lines are represented with #(...)):

using CUDA                       # Import CUDA before ImplicitGlobalGrid to activate its CUDA device support
using ImplicitGlobalGrid, Plots
#(...)

@views function diffusion3D()
    # Physics
    #(...)

    # Numerics
    #(...)
    me, dims   = init_global_grid(nx, ny, nz);              # Initialize the implicit global grid
    #(...)

    # Array initializations
    #(...)

    # Initial conditions (heat capacity and temperature with two Gaussian anomalies each)
    #(...)

    # Preparation of visualisation
    gr()
    ENV["GKSwstype"]="nul"
    anim = Animation();
    nx_v = (nx-2)*dims[1];
    ny_v = (ny-2)*dims[2];
    nz_v = (nz-2)*dims[3];
    T_v  = zeros(nx_v, ny_v, nz_v);
    T_nohalo = zeros(nx-2, ny-2, nz-2);

    # Time loop
    #(...)
    for it = 1:nt
        if mod(it, 1000) == 1                                                                 # Visualize only every 1000th time step
            T_nohalo .= Array(T[2:end-1,2:end-1,2:end-1]);                                    # Copy data to CPU removing the halo
            gather!(T_nohalo, T_v)                                                            # Gather data on process 0 (could be interpolated/sampled first)
            if (me==0) heatmap(transpose(T_v[:,ny_v÷2,:]), aspect_ratio=1); frame(anim); end  # Visualize it on process 0
        end
        #(...)
    end

    # Postprocessing
    if (me==0) gif(anim, "diffusion3D.gif", fps = 15) end                                     # Create a gif movie on process 0
    if (me==0) mp4(anim, "diffusion3D.mp4", fps = 15) end                                     # Create a mp4 movie on process 0
    finalize_global_grid();                                                                   # Finalize the implicit global grid
end

diffusion3D()

Here is the resulting movie when running the application on 8 GPUs, solving 3-D heat diffusion with heterogeneous heat capacity (two Gaussian anomalies) on a global computational grid of 510x510x510 grid points. It shows the x-z plane in the middle of dimension y:

Implicit global grid

The simulation producing this movie - including the in-situ visualization - took 29 minutes on 8 NVIDIA® Tesla® P100 GPUs on Piz Daint (an optimized solution using CUDA.jl's native kernel programming capabilities can be more than 10 times faster). The complete example can be found here. A corresponding basic CPU-only example is available here (no use of multi-threading); a movie of a simulation with 254x254x254 grid points, which it produced within 34 minutes using 8 Intel® Xeon® E5-2690 v3 processors (8 processes, no multi-threading), is found here.

Seamless interoperability with MPI.jl

ImplicitGlobalGrid is seamlessly interoperable with MPI.jl. The Cartesian MPI communicator it uses is created by default when calling init_global_grid and can then be obtained as follows (variable comm_cart):

me, dims, nprocs, coords, comm_cart = init_global_grid(nx, ny, nz);
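
For instance, a minimal sketch (illustrative usage, not taken from the package documentation) of combining ImplicitGlobalGrid with a plain MPI.jl collective on this communicator, here a global maximum reduction over all processes:

using MPI
using ImplicitGlobalGrid

nx, ny, nz = 32, 32, 32
me, dims, nprocs, coords, comm_cart = init_global_grid(nx, ny, nz);
T     = rand(nx, ny, nz);                              # some local field
T_max = MPI.Allreduce(maximum(T), MPI.MAX, comm_cart); # global maximum of T over all processes via MPI.jl
if (me==0) println("Global maximum of T: ", T_max) end
finalize_global_grid();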

Moreover, the automatic initialization and finalization of MPI can be deactivated in order to replace them with direct calls to MPI.jl:

init_global_grid(nx, ny, nz; init_MPI=false);
finalize_global_grid(;finalize_MPI=false)

In addition, init_global_grid makes every argument it passes to an MPI.jl function customizable via its keyword arguments (see the sketch below).
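
For example, a sketch combining a few of these keyword arguments with manual MPI management (the keyword names appear in the init_global_grid signature; the concrete values, and the reading of dimx/dimy/dimz as the process topology and periodx as periodicity in x, are illustrative assumptions):

using MPI
using ImplicitGlobalGrid

MPI.Init()                                           # initialize MPI manually ...
nx, ny, nz = 32, 32, 32
me, dims, nprocs, coords, comm_cart =
    init_global_grid(nx, ny, nz;
                     dimx=2, dimy=2, dimz=1,         # request a fixed 2x2x1 process topology (run with 4 processes)
                     periodx=1,                      # periodic boundaries in dimension x
                     init_MPI=false);                # ... so IGG must not initialize it again
#(...)                                               # computation and update_halo! calls
finalize_global_grid(;finalize_MPI=false);
MPI.Finalize()                                       # finalize MPI manually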

CUDA-aware/ROCm-aware MPI support

If the system supports CUDA-aware/ROCm-aware MPI, it may be activated for ImplicitGlobalGrid by setting an environment variable as specified in the module documentation callable from the Julia REPL or in IJulia (see next section).
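
The environment variable can also be set from within Julia; a small sketch (this assumes, in line with the module documentation quoted below, that the variable only needs to be set at latest before the call to init_global_grid):

ENV["IGG_CUDAAWARE_MPI"] = "1"    # for CUDA-aware MPI (Nvidia GPUs); set before calling init_global_grid
# or
ENV["IGG_ROCMAWARE_MPI"] = "1"    # for ROCm-aware MPI (AMD GPUs)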

Module documentation callable from the Julia REPL / IJulia

The module documentation can be called from the Julia REPL or in IJulia:

julia> using ImplicitGlobalGrid
julia>?
help?> ImplicitGlobalGrid
search: ImplicitGlobalGrid

  Module ImplicitGlobalGrid

  Renders the distributed parallelization of stencil-based GPU and CPU applications on a
  regular staggered grid almost trivial and enables close to ideal weak scaling of
  real-world applications on thousands of GPUs.

  General overview and examples
  ≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡

  https://github.com/eth-cscs/ImplicitGlobalGrid.jl

  Functions
  ≡≡≡≡≡≡≡≡≡≡≡

    •    init_global_grid

    •    finalize_global_grid

    •    update_halo!

    •    gather!

    •    select_device

    •    nx_g

    •    ny_g

    •    nz_g

    •    x_g

    •    y_g

    •    z_g

    •    tic

    •    toc

  To see a description of a function type ?<functionname>.

  │ Activation of device support
  │
  │  The support for a device type (CUDA or AMDGPU) is activated by importing the corresponding module (CUDA or AMDGPU) before
  │  importing ImplicitGlobalGrid (the corresponding extension will be loaded).

  │ Performance note
  │
  │  If the system supports CUDA-aware MPI (for Nvidia GPUs) or ROCm-aware MPI (for AMD GPUs), it may be activated for
  │  ImplicitGlobalGrid by setting one of the following environment variables (at latest before the call to init_global_grid):
  │
  │  shell> export IGG_CUDAAWARE_MPI=1
  │
  │  shell> export IGG_ROCMAWARE_MPI=1

julia>

Dependencies

ImplicitGlobalGrid relies on the Julia MPI wrapper (MPI.jl), the Julia CUDA package (CUDA.jl [5, 6]) and the Julia AMDGPU package (AMDGPU.jl).

Installation

ImplicitGlobalGrid may be installed directly with the Julia package manager from the REPL:

julia>]
  pkg> add ImplicitGlobalGrid
  pkg> test ImplicitGlobalGrid

References

[1] Räss, L., Omlin, S., & Podladchikov, Y. Y. (2019). Porting a Massively Parallel Multi-GPU Application to Julia: a 3-D Nonlinear Multi-Physics Flow Solver. JuliaCon Conference, Baltimore, USA.

[2] Räss, L., Omlin, S., & Podladchikov, Y. Y. (2019). A Nonlinear Multi-Physics 3-D Solver: From CUDA C + MPI to Julia. PASC19 Conference, Zurich, Switzerland.

[3] Omlin, S., Räss, L., Kwasniewski, G., Malvoisin, B., & Podladchikov, Y. Y. (2020). Solving Nonlinear Multi-Physics on GPU Supercomputers with Julia. JuliaCon Conference, virtual.

[4] Räss, L., Omlin, S., & Podladchikov, Y. Y. (2019). Resolving Spontaneous Nonlinear Multi-Physics Flow Localisation in 3-D: Tackling Hardware Limit. GPU Technology Conference 2019, San Jose, Silicon Valley, CA, USA.

[5] Besard, T., Foket, C., & De Sutter, B. (2018). Effective Extensible Programming: Unleashing Julia on GPUs. IEEE Transactions on Parallel and Distributed Systems, 30(4), 827-841. doi: 10.1109/TPDS.2018.2872064

[6] Besard, T., Churavy, V., Edelman, A., & De Sutter B. (2019). Rapid software prototyping for heterogeneous and distributed platforms. Advances in Engineering Software, 132, 29-46. doi: 10.1016/j.advengsoft.2019.02.002

ImplicitGlobalGrid.jl's People

Contributors

jgphpc, luraess, omlins, utkinis, vchuravy


ImplicitGlobalGrid.jl's Issues

First try to compute efficiency is not very good

I have tried running one of the examples on 1, 2 and 4 GPUs to reproduce some scaling results. I started with diffusion3D_multigpu_CuArrays_novis.jl with a coarser grid to see how it does. Below you will find my results.

For this weak-scaling problem I compute the efficiencies on 2 and 4 GPUs to be 57 and 35 percent, respectively. I realize this grid is much coarser, but is this surprising?

I have ensured that the system does have CUDA-aware MPI, so I don't think that would be a problem.

$ mpiexec -np 1 julia --project diffusion3D_multigpu_CuArrays_novis.jl 
Global grid: 32x32x32 (nprocs: 1, dims: 1x1x1)
Simulation time = 41.547354999

$ mpiexec -np 2 julia --project diffusion3D_multigpu_CuArrays_novis.jl 
Global grid: 62x32x32 (nprocs: 2, dims: 2x1x1)
Simulation time = Simulation time = 72.528739102
72.494481903
$ mpiexec -np 4 julia --project diffusion3D_multigpu_CuArrays_novis.jl 
Global grid: 62x62x32 (nprocs: 4, dims: 2x2x1)
Simulation time = Simulation time = Simulation time = Simulation time = 116.549162142
116.549022381

Specifying manual subdomains with halos

Currently, ImplicitGlobalGrid automatically takes care of the grid "chunking" by allocating appropriate arrays on each process and ensuring the arrays have a suitable halo around them.

In some instances, it may be useful to force a part of the domain to be split into its own chunk with a proper halo, even if it's not the only chunk on the process. This is particularly relevant when building heterogeneous stencil codes, where part of the domain may have more complicated materials, etc. (and thus a more complicated stencil). By ensuring these chunks have halos, one could update them, along with all other chunks, asynchronously on a GPU, then perform a global halo update and move on to the next set of kernels.

For example, suppose a wave-physics problem has perfectly-matched layers (PML). These layers only exist on the edges of the computational domain (but can be several pixels thick in each dimension) and require several auxiliary variables (i.e. arrays) in addition to those of the normal PDE. Naively, one might just allocate these extra arrays for the entire domain and use a single stencil for everything. However, for memory-bandwidth-bound codes this is a bad idea! It would be much better to have one kernel for regions with auxiliary fields and one for regions without!

These kernels can be run in tandem (and asynchronously) as long as their halos are updated before the next iteration, just like a set of arrays living on two different processes.

I realize this removes some of the "implicit" within IGG... but I also think it extends its capabilities to efficient large-scale codes with highly practical applications.

Support for Complex arrays

Relating to this issue in ParallelStencil.jl, it would be convenient to add support for Complex arrays in ImplicitGlobalGrid.jl.

The current working hack is to split the complex array into two temporary real arrays and exchange those with update_halo!() (a sketch follows).
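
For illustration, a sketch of that hack (hypothetical helper, not part of ImplicitGlobalGrid; it assumes the complex array has the shape of a regular IGG field):

using ImplicitGlobalGrid

function update_halo_complex!(A::AbstractArray{<:Complex})   # hypothetical helper
    Ar = real.(A)              # temporary real part
    Ai = imag.(A)              # temporary imaginary part (as a real array)
    update_halo!(Ar, Ai)       # exchange the halos of the two real arrays
    A .= Ar .+ im.*Ai          # recombine into the complex array
    return A
end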

Bump AMDGPU version to make IGG run on AMD with non-ROCm APU

On my AMD Ryzen laptop, the AMDGPU version allowed by IGG's current compat bound errors with (and thus IGG also errors):

julia> using AMDGPU                                                                                                                                                                                                                                             
ERROR: InitError: could not load library "libhsa-runtime64.so.1"                                                                                                                                                                                                
libhsa-runtime64.so.1: cannot open shared object file: No such file or directory                                                                                                                                                                                
Stacktrace:                                                                                                                                                                                                                                                     
  [1] hsa_init                                                                                                                                                                                                                                                  
    @ ~/.julia/packages/AMDGPU/FXTo5/src/hsa/LibHSARuntime.jl:71 [inlined]                                                                                                                                                                                      
  [2] __init__()                                                                                                                                                                                                                                                
    @ AMDGPU ~/.julia/packages/AMDGPU/FXTo5/src/AMDGPU.jl:169                                                                                                                                                                                                   
...

However, v0.6 and higher just throw a warning. Can the version bound on AMDGPU of IGG be relaxed to at least v0.6?

Question/clarification: How fast is this package compared to CUDA.jl?

I wanted to start using ImplicitGlobalGrid and ParallelStencil, but I got confused about the following statement from the docs:

The simulation producing this movie - including the in-situ visualization - took 29 minutes on 8 NVIDIA® Tesla® P100 GPUs on Piz Daint (an optimized solution using CUDA.jl's native kernel programming capabilities can be more than 10 times faster).

Does this mean that CUDA.jl can be 10 times faster than ImplicitGlobalGrid and ParallelStencil? Or am I misunderstanding it?

IGG on Julia 1.10

On Julia 1.10-rc2, the following issue occurs during initialization of IGG. Possibly similar to omlins/ParallelStencil.jl#125.

Stacktrace:

julia> init_global_grid(4, 3, 2;device_type="AMDGPU")
Global grid: 4x3x2 (nprocs: 1, dims: 1x1x1)
ERROR: No function of the module can be called before init_global_grid() or after finalize_global_grid().
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] check_initialized()
   @ ImplicitGlobalGrid /pfs/lustrep2/scratch/project_465000557/lurass/ImplicitGlobalGrid.jl/src/shared.jl:85
 [3] global_grid()
   @ ImplicitGlobalGrid /pfs/lustrep2/scratch/project_465000557/lurass/ImplicitGlobalGrid.jl/src/shared.jl:82
 [4] cuda_enabled()
   @ ImplicitGlobalGrid /pfs/lustrep2/scratch/project_465000557/lurass/ImplicitGlobalGrid.jl/src/shared.jl:107
 [5] select_device()
   @ ImplicitGlobalGrid /pfs/lustrep2/scratch/project_465000557/lurass/ImplicitGlobalGrid.jl/src/select_device.jl:16
 [6] _select_device()
   @ ImplicitGlobalGrid /pfs/lustrep2/scratch/project_465000557/lurass/ImplicitGlobalGrid.jl/src/select_device.jl:38
 [7] init_global_grid(nx::Int64, ny::Int64, nz::Int64; dimx::Int64, dimy::Int64, dimz::Int64, periodx::Int64, periody::Int64, periodz::Int64, overlaps::Tuple{…}, halowidths::Tuple{…}, disp::Int64, reorder::Int64, comm::MPI.Comm, init_MPI::Bool, device_type::String, select_device::Bool, quiet::Bool)
   @ ImplicitGlobalGrid /pfs/lustrep2/scratch/project_465000557/lurass/ImplicitGlobalGrid.jl/src/init_global_grid.jl:105
 [8] top-level scope
   @ REPL[7]:1
 [9] top-level scope
   @ /scratch/project_465000557/lurass/julia_local/julia_depot/packages/AMDGPU/goZLq/src/tls.jl:200
Some type information was truncated. Use `show(err)` to see complete types.

julia> 

Problem with non-MPICH MPI

Hello,

I use ImplicitGlobalGrid and ParallelStencil for my simulation, and everything is fine when I use the default MPI (MPICH).
It does not work properly when OpenMPI is selected via MPIPreferences (MPI.jl itself seems to run OK, but ImplicitGlobalGrid complains about a mismatching MPI version and runs the same single-process problem N times).

Are there any particular settings I should use for ImplicitGlobalGrid.jl so that it picks up the proper MPI library?

julia 1.9.3

(Acoustic3DFD) pkg> status
Project Acoustic3DFD v0.1.0
Status `~/travail/triscale/git/TransfoUS.jl/Acoustic3DFD.jl/Project.toml`
⌅ [052768ef] CUDA v4.4.1
  [5789e2e9] FileIO v1.16.1
  [f67ccb44] HDF5 v0.17.1
  [4d7a3746] ImplicitGlobalGrid v0.13.0
  [033835bb] JLD2 v0.4.35
  [682c06a0] JSON v0.21.4
⌃ [da04e1cc] MPI v0.20.14
  [3da0fdf6] MPIPreferences v0.1.9
  [94395366] ParallelStencil v0.9.0
  [92933f4c] ProgressMeter v1.9.0
  [90137ffa] StaticArrays v1.6.5
  [37e2e46d] LinearAlgebra

Intel GPU

Hi,

I tried the diffusion3D example (attached file) on one Intel GPU using oneAPI/oneArray and did not use any of the ImplicitGlobalGrid native functions, as the signatures only accept CUDA/AMD arrays.

For nt=100000, I get ZE_RESULT_ERROR_DEVICE_LOST (device hung, reset, was removed, or driver update occurred). For nt=100, it works, but the results are not correct.

Any idea what could possibly go wrong?

Thanks!

P.S. oneAPI.versioninfo()

Binary dependencies:

  • NEO: 23.17.26241+0
  • libigc: 1.0.13822+0
  • gmmlib: 22.3.0+0
  • SPIRV_LLVM_Translator_unified: 0.3.0+0
  • SPIRV_Tools: 2023.2.0+0

Toolchain:

  • Julia: 1.9.3
  • LLVM: 14.0.6

1 driver:

  • 00000000-0000-0000-178a-f44f01036794 (v1.3.26516, API v1.3.0)

2 devices:

  • Intel(R) Data Center GPU Max 1100
  • Intel(R) Data Center GPU Max 1100

MPI failing on TitanXm with CUDA.jl v3.5

Getting the following error in the update_halo! function when running IGG with CUDA.jl v3.5 on Titan Xm GPUs (Driver Version: 470.42.01, CUDA Version: 11.4). Using a system CUDA install, and either the Julia MPICH MPI artifact or system OpenMPI without CUDA-aware support.

Interestingly, this error does not occur:

  • if using the same CUDA install with Tesla V100 GPUs (without CUDA-aware support)
  • if using CUDA-aware MPI and setting (export IGG_CUDAAWARE_MPI=1) on both Titan Xm and Tesla V100

Tmp Fix: downgrading to CUDA.jl v3.3.6 👀

ERROR: ERROR: LoadError: LoadError: ERROR: LoadError: ERROR: LoadError: CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)
Stacktrace:
  [1] CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)
Stacktrace:
  [1] throw_api_error(CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)
Stacktrace:
  [1] throw_api_error(res::res::CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)
Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/YpW0k/lib/cudadrv/throw_api_error(error.jl:91
  [2] macro expansion
    @ ~/.julia/packages/CUDA/YpW0k/lib/cudadrv/error.jl:101 [inlined]
  [3] cuMemHostGetDevicePointer_v2(pdptr::Base.RefValue{CUDA.CuPtr{Nothing}}, p::Ptr{Nothing}, Flags::Int64)
    @ CUDA ~/.julia/packages/CUDA/YpW0k/lib/utils/call.jl:26
  [4] convert
    @ ~/.julia/packages/CUDA/YpW0k/lib/cudadrv/memory.jl:130 [inlined]
  [5] unsafe_convert
    @ ~/.julia/packages/CUDA/YpW0k/src/array.jl:320 [inlined]
  [6] pointer
    @ ~/.julia/packages/CUDA/YpW0k/src/array.jl:275 [inlined]
  [7] unsafe_convert
    @ ~/.julia/packages/CUDA/YpW0k/src/array.jl:327 [inlined]
  [8] adapt_structure
    @ ~/.julia/packages/CUDA/YpW0k/src/compiler/execution.jl:139 [inlined]
  [9] adapt
    @ ~/.julia/packages/Adapt/RGNRk/src/Adapt.jl:40 [inlined]
 [10] cudaconvert(arg::CUDA.CuArray{Float64, 1, CUDA.Mem.HostBuffer})
    @ CUDA ~/.julia/packages/CUDA/YpW0k/src/compiler/execution.jl:152
 [11] map
    @ ./tuple.jl:216 [inlined]
 [12] macro expansion
    @ ~/.julia/packages/CUDA/YpW0k/src/compiler/execution.jl:100 [inlined]
 [13] iwrite_sendbufs!(n::Int64, dim::Int64, A::CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, i::Int64)
    @ ImplicitGlobalGrid ~/.julia/packages/ImplicitGlobalGrid/b8fz5/src/update_halo.jl:347
 [14] _update_halo!(fields::CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer})
    @ ImplicitGlobalGrid ~/.julia/packages/ImplicitGlobalGrid/b8fz5/src/update_halo.jl:38
 [15] update_halo!(A::CUDA ~/.julia/packages/CUDA/YpW0k/lib/cudadrv/CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer})
    @ ImplicitGlobalGrid ~/.julia/packages/ImplicitGlobalGrid/b8fz5/src/update_halo.jl:27
 [16] res::error.jl:91
  [2] macro expansion
    @ ~/.julia/packages/CUDA/YpW0k/lib/cudadrv/error.jl:101 [inlined]
  [3] cuMemHostGetDevicePointer_v2(pdptr::Base.RefValue{CUDA.CuPtr{Nothing}}, p::Ptr{Nothing}, Flags::Int64)
    @ CUDA ~/.julia/packages/CUDA/YpW0k/lib/utils/call.jl:26
  [4] convert
    @ ~/.julia/packages/CUDA/YpW0k/lib/cudadrv/memory.jl:130 [inlined]
  [5] unsafe_convert
    @ ~/.julia/packages/CUDA/YpW0k/src/array.jl:320 [inlined]
  [6] pointer
    @ ~/.julia/packages/CUDA/YpW0k/src/array.jl:275 [inlined]
  [7] unsafe_convert
    @ ~/.julia/packages/CUDA/YpW0k/src/array.jl:327 [inlined]
  [8] adapt_structure
    @ ~/.julia/packages/CUDA/YpW0k/src/compiler/execution.jl:139 [inlined]
  [9] adapt
    @ ~/.julia/packages/Adapt/RGNRk/src/Adapt.jl:40 [inlined]
 [10] cudaconvert(arg::CUDA.CuArray{Float64, 1, CUDA.Mem.HostBuffer})
    @ CUDA ~/.julia/packages/CUDA/YpW0k/src/compiler/execution.jl:152
 [11] map
    @ ./tuple.jl:216 [inlined]
 [12] macro expansion
    @ ~/.julia/packages/CUDA/YpW0k/src/compiler/execution.jl:100 [inlined]
 [13] iwrite_sendbufs!(n::Int64, dim::Int64, A::CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, i::Int64)
    @ ImplicitGlobalGrid ~/.julia/packages/ImplicitGlobalGrid/b8fz5/src/update_halo.jl:347
 [14] _update_halo!(fields::CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer})
    @ ImplicitGlobalGrid ~/.julia/packages/ImplicitGlobalGrid/b8fz5/src/update_halo.jl:38
 [15] update_halo!(A::CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer})
    @ ImplicitGlobalGrid ~/.julia/packages/ImplicitGlobalGrid/b8fz5/src/update_halo.jl:27
 [16] diffusion_2D(; do_visu::Bool)
    @ Main /scratch/lraess/course-101-0250-00/lecture08_teachers/diffusion_2D_perf_multixpu.jl:62
 [17] top-level scope
    @ /scratch/lraess/course-101-0250-00/lecture08_teachers/diffusion_2D_perf_multixpu.jl:84
in expression starting at /scratch/lraess/course-101-0250-00/lecture08_teachers/diffusion_2D_perf_multixpu.jl:84
diffusion_2D(; do_visu::Bool)
    @ Main /scratch/lraess/course-101-0250-00/lecture08_teachers/diffusion_2D_perf_multixpu.jl:62
 [17] top-level scope
    @ /scratch/lraess/course-101-0250-00/lecture08_teachers/diffusion_2D_perf_multixpu.jl:84
in expression starting at /scratch/lraess/course-101-0250-00/lecture08_teachers/diffusion_2D_perf_multixpu.jl:84

CPU Multithreading Question

Hi,

If one runs the example provided here for CPU, only one thread per rank will be used (?). This is based on my observation, and setting 'JULIA_NUM_THREADS=n' or using 'julia --threads n' did not change it. If that's the case, would the solution for using multiple MPI ranks and multithreading within each rank involve combining ParallelStencil and ImplicitGlobalGrid, as demonstrated in the example here?

Thanks!

MPI is already initialized, finalize_global_grid() doesn't finalize

Hi,
I am trying to do multiple runs in a for loop, increasing the grid size with each iteration.
So I initialize the global grid at the start of the loop with init_global_grid(nx, ny, 1; init_MPI=true);
and make sure to finalize the global grid at the end with finalize_global_grid(;finalize_MPI=true);
but it always crashes on the second iteration:

ERROR: LoadError: MPI is already initialized. Set the argument 'init_MPI=false'

Setting the argument didn't help, so I checked after the finalize call and found out that finalize_global_grid() didn't actually finalize MPI.

Am I doing it the wrong way, or is it impossible to launch an MPI task multiple times?
It might also be that this comes from MPI.jl, but I am not sure, so I post it here.

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

Update Halo for Higher dimension Tensors?

Sometimes there are a lot of tensors whose halos need to be updated. Is it possible to update the halo of a higher-dimensional tensor A with size (nx, ny, nz, m), where nx, ny, nz are the grid sizes?

I know it is equivalent to update_halo!(A_1, A_2, ..., A_m); I just want to know if it can be done more easily when m is a large number.
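
For illustration, one pattern that works with the existing varargs form is to store the m fields as a vector of 3-D arrays rather than one 4-D tensor and splat it (a sketch, not an IGG feature):

using CUDA
using ImplicitGlobalGrid

m          = 5
nx, ny, nz = 32, 32, 32
init_global_grid(nx, ny, nz);
As = [CUDA.zeros(Float64, nx, ny, nz) for _ = 1:m]   # m fields as separate 3-D arrays
#(...)                                               # compute on each As[k]
update_halo!(As...);                                 # splat: equivalent to update_halo!(A_1, ..., A_m)
finalize_global_grid();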

Thank you in advance!

Unexpected halo behavior

I'm attempting to port a Fortran finite volume CFD code over to Julia and use ImplicitGlobalGrid as part of it. I'm having a bit of a hard time following the behavior of this package and need a few pointers.

Suppose I have a vector A = [1, 2, 3, 4, 5, 6, 7, 8] with periodic BCs and I want to exchange the 2 halo elements on either side (where [1, 2] and [7, 8] are the halo elements):

julia> using ImplicitGlobalGrid

julia> init_global_grid(8,1,1,periodx=1)
Global grid: 6x1x1 (nprocs: 1, dims: 1x1x1)
(0, [1, 1, 1], 1, [0, 0, 0], MPI.Comm(-2080374784))

julia> A = collect(1:8)
8-element Vector{Int64}:
 1
 2
 3
 4
 5
 6
 7
 8

julia> update_halo!(A)

julia> A
8-element Vector{Int64}:
 7
 2
 3
 4
 5
 6
 7
 2

This only swaps out 1 element, even though the overlap is 2. After running update_halo!(A), I'm expecting it to turn into A = [5, 6, 3, 4, 5, 6, 3, 4] but it doesn't. What am I missing?

Running on multiple GPUs with non-CUDA-aware MPI

First, thanks for writing this code and sharing it with the community. The results are really amazing and I plan to learn a lot from what you have done.

I am starting to run your code on a server that I have access to, which has multiple GPUs. The server reports that MPI.has_cuda() is false, which makes me believe that I would not be able to run the code on multiple GPUs. However, when I tried one of your examples with a coarse grid, just so it's faster, it seemed to run without any problems. Does this mean that we can run your code with non-CUDA-aware MPI?

$ mpiexec -np 2 julia --project diffusion3D_multigpu_CuArrays_
diffusion3D_multigpu_CuArrays_novis.jl    diffusion3D_multigpu_CuArrays_onlyvis.jl  
[fpoulin@cdr353 examples]$ mpiexec -np 2 julia --project diffusion3D_multigpu_CuArrays_novis.jl 
┌ Warning: The NVIDIA driver on this system only supports up to CUDA 11.1.0.
│ For performance reasons, it is recommended to upgrade to a driver that supports CUDA 11.2 or higher.
└ @ CUDA ~/.julia/packages/CUDA/lwSps/src/initialization.jl:42
┌ Warning: The NVIDIA driver on this system only supports up to CUDA 11.1.0.
│ For performance reasons, it is recommended to upgrade to a driver that supports CUDA 11.2 or higher.
└ @ CUDA ~/.julia/packages/CUDA/lwSps/src/initialization.jl:42
Global grid: 30x16x16 (nprocs: 2, dims: 2x1x1)

Install size grows from 89M (released) to 535M (master)

The currently released version of ImplicitGlobalGrid.jl, v0.1, installs 72 dependencies taking 89 megabytes of space in a clean Julia depot.

$ export JULIA_DEPOT_PATH=`mktemp -d`; julia -e 'using Pkg; Pkg.add("ImplicitGlobalGrid")' && du -hcs $JULIA_DEPOT_PATH/*
...
  72 dependencies successfully precompiled in 74 seconds
20M	/tmp/tmp.VnWvq05cMG/artifacts
36M	/tmp/tmp.VnWvq05cMG/compiled
32K	/tmp/tmp.VnWvq05cMG/environments
16K	/tmp/tmp.VnWvq05cMG/logs
29M	/tmp/tmp.VnWvq05cMG/packages
4.0K	/tmp/tmp.VnWvq05cMG/prefs
5.0M	/tmp/tmp.VnWvq05cMG/registries
24K	/tmp/tmp.VnWvq05cMG/scratchspaces
89M	total

The master branch after #34 expands this footprint to 95 dependencies taking 535 megabytes of space in a clean Julia depot.

$ export JULIA_DEPOT_PATH=`mktemp -d`; julia -e 'using Pkg; Pkg.add(name="ImplicitGlobalGrid", rev="master")' && du -hcs $JULIA_DEPOT_PATH/*
...
  95 dependencies successfully precompiled in 83 seconds (14 already precompiled)
450M	/tmp/tmp.kVz3DOLI0s/artifacts
7.1M	/tmp/tmp.kVz3DOLI0s/clones
41M	/tmp/tmp.kVz3DOLI0s/compiled
40K	/tmp/tmp.kVz3DOLI0s/environments
20K	/tmp/tmp.kVz3DOLI0s/logs
33M	/tmp/tmp.kVz3DOLI0s/packages
4.0K	/tmp/tmp.kVz3DOLI0s/prefs
5.0M	/tmp/tmp.kVz3DOLI0s/registries
32K	/tmp/tmp.kVz3DOLI0s/scratchspaces
535M	total

I found this while exploring omlins/CellArrays.jl#16 (comment), and I also wanted to document this here. My suggestion there was to split the core code into a small subdirectory package with fewer dependencies, then add that as a dependency of the main package.

async d2h and h2d copy issue in HIP

In IGG v0.13, write_d2h_async! and write_h2d_async! for ROCArrays (AMDGPU) need a fix.

Currently fixing the error when pos!=(0,0,0) results in an offset by one in the copied results (see ROCm/HIP#3289).
