
mantevo / minife


MiniFE Finite Element Mini-Application

Home Page: http://www.mantevo.org

License: GNU Lesser General Public License v3.0

Languages: C++ 58.89%, Shell 5.00%, Makefile 9.80%, CMake 0.55%, C 20.26%, M4 3.59%, Perl 1.04%, Cuda 0.81%, Roff 0.07%
Topics: miniapp, minife, ecp, finite-elements, solvers, solver, snl-mini-apps

minife's Introduction

Mantevo

Core Mantevo Repository containing common resources

minife's People

Contributors

crtrott, jwillenbring, nmhamster, npe9, pwxy


minife's Issues

How to build MiniFE

I want to build a Docker image that contains MiniFE, and I tried to download v2.1.0, but the source code doesn't contain a top-level makefile. How do I build MiniFE?
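
Not an official answer, but for what it's worth: there appears to be no single top-level makefile; each variant keeps its own Makefile under its src directory (the reference version is ref/src, the OpenMP variant used in another issue below is openmp45/src). A minimal sketch, assuming an MPI toolchain is available inside the image and using the same commands that appear later on this page (the problem size is arbitrary):

$ git clone https://github.com/Mantevo/miniFE.git
$ cd miniFE/ref/src
$ make
$ mpirun -n 2 ./miniFE.x nx=64 ny=64 nz=64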

__syncthreads() needed for reduction?

For the following code in miniFE, the comments state that __syncthreads() is not needed within a warp. However, I think __syncthreads() is actually needed to produce correct sums; I get incorrect results when it is omitted. Can you reproduce the issue? Thank you for your comments.

template<typename Vector>
__global__ void dot_kernel(const Vector x, const Vector y, typename TypeTraits<typename Vector::ScalarType>::magnitude_type *d) {

  typedef typename TypeTraits<typename Vector::ScalarType>::magnitude_type magnitude;
  const int BLOCK_SIZE=512;

  magnitude sum=0;
  for(int idx=blockIdx.x*blockDim.x+threadIdx.x;idx<x.n;idx+=gridDim.x*blockDim.x) {
    sum+=x.coefs[idx]*y.coefs[idx];
  }

  //Do a shared memory reduction on the dot product
  __shared__ volatile magnitude red[BLOCK_SIZE];
  red[threadIdx.x]=sum;
  //__syncthreads(); if(threadIdx.x<512) {sum+=red[threadIdx.x+512]; red[threadIdx.x]=sum;}
  __syncthreads(); if(threadIdx.x<256)  {sum+=red[threadIdx.x+256]; red[threadIdx.x]=sum;}
  __syncthreads(); if(threadIdx.x<128)  {sum+=red[threadIdx.x+128]; red[threadIdx.x]=sum;}
  __syncthreads(); if(threadIdx.x<64)   {sum+=red[threadIdx.x+64];  red[threadIdx.x]=sum;}
  __syncthreads(); if(threadIdx.x<32)   {sum+=red[threadIdx.x+32];  red[threadIdx.x]=sum;}
  //the remaining ones don't need syncthreads because they are warp synchronous
                   if(threadIdx.x<16)   {sum+=red[threadIdx.x+16];  red[threadIdx.x]=sum;}
                   if(threadIdx.x<8)    {sum+=red[threadIdx.x+8];   red[threadIdx.x]=sum;}
                   if(threadIdx.x<4)    {sum+=red[threadIdx.x+4];   red[threadIdx.x]=sum;}
                   if(threadIdx.x<2)    {sum+=red[threadIdx.x+2];   red[threadIdx.x]=sum;}
                   if(threadIdx.x<1)    {sum+=red[threadIdx.x+1];}

  //save partial dot products
  if(threadIdx.x==0) d[blockIdx.x]=sum;
}

template<typename Scalar>
__global__ void dot_final_reduce_kernel(Scalar *d) {
  const int BLOCK_SIZE=1024;
  Scalar sum=d[threadIdx.x];
  __shared__ volatile Scalar red[BLOCK_SIZE];

  red[threadIdx.x]=sum;
  __syncthreads(); if(threadIdx.x<512)  {sum+=red[threadIdx.x+512]; red[threadIdx.x]=sum;}
  __syncthreads(); if(threadIdx.x<256)  {sum+=red[threadIdx.x+256]; red[threadIdx.x]=sum;}
  __syncthreads(); if(threadIdx.x<128)  {sum+=red[threadIdx.x+128]; red[threadIdx.x]=sum;}
  __syncthreads(); if(threadIdx.x<64)   {sum+=red[threadIdx.x+64];  red[threadIdx.x]=sum;}
  __syncthreads(); if(threadIdx.x<32)   {sum+=red[threadIdx.x+32];  red[threadIdx.x]=sum;}
  //the remaining ones don't need syncthreads because they are warp synchronous
                   if(threadIdx.x<16)   {sum+=red[threadIdx.x+16];  red[threadIdx.x]=sum;}
                   if(threadIdx.x<8)    {sum+=red[threadIdx.x+8];   red[threadIdx.x]=sum;}
                   if(threadIdx.x<4)    {sum+=red[threadIdx.x+4];   red[threadIdx.x]=sum;}
                   if(threadIdx.x<2)    {sum+=red[threadIdx.x+2];   red[threadIdx.x]=sum;}
                   if(threadIdx.x<1)    {sum+=red[threadIdx.x+1];}

  //save final dot product at the front
                   if(threadIdx.x==0) d[0]=sum;
}

Here is a standalone reproducer:

#define BLOCK_SIZE  256

#include <stdio.h>
#include <stdlib.h>   // for srand()/rand()
#include <cuda.h>

__global__ void dot_kernel(const int n, const int* x, const int* y, int *d) {

  int sum=0;
  for(int idx=blockIdx.x*blockDim.x+threadIdx.x;idx<n;idx+=gridDim.x*blockDim.x) {
    sum+=x[idx]*y[idx];
  }

  //Do a shared memory reduction on the dot product
  __shared__ int red[BLOCK_SIZE];
  red[threadIdx.x]=sum;
  #pragma unroll
  for (int n = 128; n > 0; n = n/2) {   // incorrect results when __syncthreads() is omitted within a warp
     __syncthreads();
     if(threadIdx.x<n)  {sum+=red[threadIdx.x+n]; red[threadIdx.x]=sum;}
  }

  //save partial dot products
  if(threadIdx.x==0) d[blockIdx.x]=sum;
}

__global__ void final(int *d) {
  int sum=d[threadIdx.x];
  __shared__ int red[BLOCK_SIZE];

  red[threadIdx.x]=sum;
  #pragma unroll
  for (int n = 128; n > 0; n = n/2) {    
     __syncthreads();
     if(threadIdx.x<n)  {sum+=red[threadIdx.x+n]; red[threadIdx.x]=sum;}
  }
  //save final dot product at the front
  if(threadIdx.x==0) d[0]=sum;
}

#define LEN 1025
int main() {
  int a[LEN];
  int b[LEN];
  int r[256];
  srand(2);
  int sum = 0;
  int d_sum = 0;

// sum on the host
  for (int i = 0; i < LEN; i++) {
    a[i] = rand() % 3;
    b[i] = rand() % 3;
    sum += a[i]*b[i];
  }

// sum on the device
  int *da, *db;
  int *dr;
  const int n = LEN;
  cudaMalloc((void**)&da, sizeof(int)*LEN);
  cudaMalloc((void**)&db, sizeof(int)*LEN);
  cudaMalloc((void**)&dr, sizeof(int)*256);
  cudaMemset(dr, 0, sizeof(int)*256);  // final<<<1,256>>> reads all 256 slots, but only (n+255)/256 blocks write them
  cudaMemcpy(da, a, sizeof(int)*LEN, cudaMemcpyHostToDevice);
  cudaMemcpy(db, b, sizeof(int)*LEN, cudaMemcpyHostToDevice);
  dot_kernel<<<(n+255)/256, 256 >>>(n, da,db,dr);
  final<<<1, 256>>>(dr);
  cudaMemcpy(&d_sum, dr, sizeof(int), cudaMemcpyDeviceToHost);
  printf("%d %d\n", sum ,d_sum);
  cudaFree(da);
  cudaFree(db);
  cudaFree(dr);
  return 0;
}
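
For what it's worth, my working hypothesis (an assumption on my part, not something stated in miniFE): on GPUs with independent thread scheduling (Volta and newer) the warp-synchronous tail is no longer guaranteed to be correct even with volatile shared memory, which would explain the wrong sums. Below is a minimal sketch of the same block reduction with the last-warp steps synchronized explicitly via __syncwarp(), assuming a block size of 256 and 32-thread warps (dot_kernel_syncwarp is a hypothetical name, not code from the repository):

// Sketch only, not miniFE's code: same block reduction as dot_kernel above,
// but the warp-level steps are synchronized explicitly so the kernel is also
// valid under independent thread scheduling (Volta and later).
__global__ void dot_kernel_syncwarp(const int n, const int* x, const int* y, int* d) {
  // grid-stride partial sum, same as above
  int sum = 0;
  for (int idx = blockIdx.x*blockDim.x + threadIdx.x; idx < n; idx += gridDim.x*blockDim.x)
    sum += x[idx]*y[idx];

  __shared__ volatile int red[256];   // assumes blockDim.x == 256
  red[threadIdx.x] = sum;

  // cross-warp steps: a full block barrier is required
  __syncthreads(); if (threadIdx.x < 128) { sum += red[threadIdx.x+128]; red[threadIdx.x] = sum; }
  __syncthreads(); if (threadIdx.x <  64) { sum += red[threadIdx.x+ 64]; red[threadIdx.x] = sum; }
  __syncthreads(); if (threadIdx.x <  32) { sum += red[threadIdx.x+ 32]; red[threadIdx.x] = sum; }
  // last warp: __syncwarp() makes the "warp-synchronous" assumption explicit
  __syncwarp();    if (threadIdx.x <  16) { sum += red[threadIdx.x+ 16]; red[threadIdx.x] = sum; }
  __syncwarp();    if (threadIdx.x <   8) { sum += red[threadIdx.x+  8]; red[threadIdx.x] = sum; }
  __syncwarp();    if (threadIdx.x <   4) { sum += red[threadIdx.x+  4]; red[threadIdx.x] = sum; }
  __syncwarp();    if (threadIdx.x <   2) { sum += red[threadIdx.x+  2]; red[threadIdx.x] = sum; }
  __syncwarp();    if (threadIdx.x <   1) { sum += red[threadIdx.x+  1]; }

  if (threadIdx.x == 0) d[blockIdx.x] = sum;   // one partial dot product per block
}

Using __syncthreads() between every step, as in the reproducer above, is equally correct and only marginally slower for a kernel of this size.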

MPI_Irecv and MPI_Send use the same buffer at the same time

Hi,

I ran miniFE's ref version with Intel MPI under the message checker from ITAC (Intel Trace Analyzer and Collector). The message checker detected LOCAL:MEMORY:OVERLAP issues, and further LOCAL:MEMORY:ILLEGAL_MODIFICATION issues, in ref/src/make_local_matrix.hpp, where the same buffers are used for sending and receiving at the same time. From what I saw, all other miniFE versions should also be affected if they execute the corresponding code.

The affected code from ref/src/make_local_matrix.hpp is in lines 257ff:

  std::vector<MPI_Request> request(num_send_neighbors);
  for(int i=0; i<num_send_neighbors; ++i) {
    MPI_Irecv(&tmp_buffer[i], 1, mpi_dtype, MPI_ANY_SOURCE, MPI_MY_TAG,
              MPI_COMM_WORLD, &request[i]);
  }

  // send messages

  for(int i=0; i<num_recv_neighbors; ++i) {
    MPI_Send(&tmp_buffer[i], 1, mpi_dtype, recv_list[i], MPI_MY_TAG,
             MPI_COMM_WORLD);
  }

If both loops have a trip count > 0, then some elements of the tmp_buffer array are used for sending and receiving at the same time.
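
One possible way to avoid the aliasing, sketched here only to illustrate the point (this is not a patch from the miniFE developers; send_buffer is a hypothetical name, I assume the values currently in tmp_buffer are what the sends are meant to transmit, and the int element type matches the MPI_INT datatype shown in the trace below):

  // Copy the outgoing values into their own buffer before posting the
  // receives, so no element of tmp_buffer is simultaneously an active
  // receive buffer and a send buffer.
  std::vector<int> send_buffer(tmp_buffer.begin(),
                               tmp_buffer.begin() + num_recv_neighbors);

  std::vector<MPI_Request> request(num_send_neighbors);
  for(int i=0; i<num_send_neighbors; ++i) {
    MPI_Irecv(&tmp_buffer[i], 1, mpi_dtype, MPI_ANY_SOURCE, MPI_MY_TAG,
              MPI_COMM_WORLD, &request[i]);
  }

  // send messages, now from the separate copy
  for(int i=0; i<num_recv_neighbors; ++i) {
    MPI_Send(&send_buffer[i], 1, mpi_dtype, recv_list[i], MPI_MY_TAG,
             MPI_COMM_WORLD);
  }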

The complete output and commands for reproducing:

$ git clone https://github.com/Mantevo/miniFE.git
$ cd miniFE/ref/src
$ # loaded module for intelmpi and itac
$ make
$ mpiexec -check-mpi -n 2 ./miniFE.x
...
      creating/filling mesh...0.000828028s, total time: 0.000828981
generating matrix structure...0.00868297s, total time: 0.00951195
         assembling FE data...0.00850797s, total time: 0.0180199
      imposing Dirichlet BC...0.00221992s, total time: 0.0202398
      imposing Dirichlet BC...0.00244904s, total time: 0.0226889
making matrix indices local...
[0] WARNING: LOCAL:MEMORY:OVERLAP: warning
[0] WARNING:    New send buffer overlaps with currently active receive buffer at address 0x17f0730.
[0] WARNING:    Control over active buffer was transferred to MPI at:
[0] WARNING:       MPI_Irecv(*buf=0x17f0730, count=1, datatype=MPI_INT, source=MPI_ANY_SOURCE, tag=99, comm=MPI_COMM_WORLD, *request=0x1c04470)
[0] WARNING:       _ZN6miniFE17make_local_matrixINS_9CSRMatrixIdiiEEEEvRT_ (/home/xyz/projects/miniFE/ref/src/./make_local_matrix.hpp:259)
[0] WARNING:       _ZN6miniFE6driverIdiiEEiRK3BoxRS1_RNS_10ParametersER8YAML_Doc (/home/xyz/projects/miniFE/ref/src/./driver.hpp:228)
[0] WARNING:       main (/home/xyz/projects/miniFE/ref/src/main.cpp:154)
[0] WARNING:       __libc_start_main (/usr/lib64/libc-2.28.so)
[0] WARNING:       _start (/home/xyz/projects/miniFE/ref/src/miniFE.x)
[0] WARNING:    Control over new buffer is about to be transferred to MPI at:
[0] WARNING:       MPI_Send(*buf=0x17f0730, count=1, datatype=MPI_INT, dest=1, tag=99, comm=MPI_COMM_WORLD)
[0] WARNING:       _ZN6miniFE17make_local_matrixINS_9CSRMatrixIdiiEEEEvRT_ (/home/xyz/projects/miniFE/ref/src/./make_local_matrix.hpp:266)
[0] WARNING:       _ZN6miniFE6driverIdiiEEiRK3BoxRS1_RNS_10ParametersER8YAML_Doc (/home/xyz/projects/miniFE/ref/src/./driver.hpp:228)
[0] WARNING:       main (/home/xyz/projects/miniFE/ref/src/main.cpp:154)
[0] WARNING:       __libc_start_main (/usr/lib64/libc-2.28.so)
[0] WARNING:       _start (/home/xyz/projects/miniFE/ref/src/miniFE.x)

[1] WARNING: LOCAL:MEMORY:OVERLAP: warning
[1] WARNING:    New send buffer overlaps with currently active receive buffer at address 0x11d48a0.
[1] WARNING:    Control over active buffer was transferred to MPI at:
[1] WARNING:       MPI_Irecv(*buf=0x11d48a0, count=1, datatype=MPI_INT, source=MPI_ANY_SOURCE, tag=99, comm=MPI_COMM_WORLD, *request=0x1219dc0)
[1] WARNING:       _ZN6miniFE17make_local_matrixINS_9CSRMatrixIdiiEEEEvRT_ (/home/xyz/projects/miniFE/ref/src/./make_local_matrix.hpp:259)
[1] WARNING:       _ZN6miniFE6driverIdiiEEiRK3BoxRS1_RNS_10ParametersER8YAML_Doc (/home/xyz/projects/miniFE/ref/src/./driver.hpp:228)
[1] WARNING:       main (/home/xyz/projects/miniFE/ref/src/main.cpp:154)
[1] WARNING:       __libc_start_main (/usr/lib64/libc-2.28.so)
[1] WARNING:       _start (/home/xyz/projects/miniFE/ref/src/miniFE.x)
[1] WARNING:    Control over new buffer is about to be transferred to MPI at:
[1] WARNING:       MPI_Send(*buf=0x11d48a0, count=1, datatype=MPI_INT, dest=0, tag=99, comm=MPI_COMM_WORLD)
[1] WARNING:       _ZN6miniFE17make_local_matrixINS_9CSRMatrixIdiiEEEEvRT_ (/home/xyz/projects/miniFE/ref/src/./make_local_matrix.hpp:266)
[1] WARNING:       _ZN6miniFE6driverIdiiEEiRK3BoxRS1_RNS_10ParametersER8YAML_Doc (/home/xyz/projects/miniFE/ref/src/./driver.hpp:228)
[1] WARNING:       main (/home/xyz/projects/miniFE/ref/src/main.cpp:154)
[1] WARNING:       __libc_start_main (/usr/lib64/libc-2.28.so)
[1] WARNING:       _start (/home/xyz/projects/miniFE/ref/src/miniFE.x)
1.09176s, total time: 1.11445
Starting CG solver ...
Initial Residual = 11.0289
Iteration = 20   Residual = 1.23424e-08
Final Resid Norm: 2.06977e-16

[0] INFO: LOCAL:MEMORY:OVERLAP: found 2 times (0 errors + 2 warnings), 0 reports were suppressed
[0] INFO: Found 2 problems (0 errors + 2 warnings), 0 reports were suppressed.

If I use more than 2 processes, e.g. 72, then some OVERLAP warnings turn into ILLEGAL_MODIFICATION errors:

[54] ERROR: LOCAL:MEMORY:ILLEGAL_MODIFICATION: error
[54] ERROR:    Read-only buffer was modified while owned by MPI.
[54] ERROR:    Control over buffer was transferred to MPI at:
[54] ERROR:       MPI_Send(*buf=0x9693c4, count=1, datatype=MPI_INT, dest=22, tag=99, comm=MPI_COMM_WORLD)
[54] ERROR:       _ZN6miniFE17make_local_matrixINS_9CSRMatrixIdiiEEEEvRT_ (/home/xyz/projects/miniFE/ref/src/./make_local_matrix.hpp:266)
[54] ERROR:       _ZN6miniFE6driverIdiiEEiRK3BoxRS1_RNS_10ParametersER8YAML_Doc (/home/xyz/projects/miniFE/ref/src/./driver.hpp:228)
[54] ERROR:       main (/home/xyz/projects/miniFE/ref/src/main.cpp:154)
[54] ERROR:       __libc_start_main (/usr/lib64/libc-2.28.so)
[54] ERROR:       _start (/home/xyz/projects/miniFE/ref/src/miniFE.x)
[54] ERROR:    Modified buffer detected at:
[54] ERROR:       MPI_Send(*buf=0x9693c4, count=1, datatype=MPI_INT, dest=22, tag=99, comm=MPI_COMM_WORLD)
[54] ERROR:       _ZN6miniFE17make_local_matrixINS_9CSRMatrixIdiiEEEEvRT_ (/home/xyz/projects/miniFE/ref/src/./make_local_matrix.hpp:266)
[54] ERROR:       _ZN6miniFE6driverIdiiEEiRK3BoxRS1_RNS_10ParametersER8YAML_Doc (/home/xyz/projects/miniFE/ref/src/./driver.hpp:228)
[54] ERROR:       main (/home/xyz/projects/miniFE/ref/src/main.cpp:154)
[54] ERROR:       __libc_start_main (/usr/lib64/libc-2.28.so)
[54] ERROR:       _start (/home/xyz/projects/miniFE/ref/src/miniFE.x)

x * y * z * 3^3 / ppn can exceed 2^31 on modern machines

Running on a 192 GB dual-socket machine, using the MPI + OpenMP version in miniFE_openmp_opt:

export OMP_NUM_THREADS=11
mpirun -n 4 -ppn 4 ./miniFE.x nx=682 ny=682 nz=682

throws an exception because nrows_max in CSRMatrix.hpp turns negative due to int overflow; packed_cols.reserve(nrows_max) doesn't like negative numbers ;-)

mpirun -n 4 -ppn 4 ./miniFE.x nx=680 ny=680 nz=680 # works

Unfortunately making MINIFE_GLOBAL_ORDINAL a long is not sufficient to address the issue.
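
A rough back-of-the-envelope check (my own approximation, not the exact formula used in CSRMatrix.hpp: I assume roughly (nx+1)^3 matrix rows split evenly across the 4 ranks and about 27 nonzeros per row) shows why 682 tips over while 680 still fits:

#include <cstdint>
#include <cstdio>

int main() {
  const long long ranks = 4;                                 // -ppn 4 as in the report
  const long long sizes[] = {680, 682};
  for (long long nx : sizes) {
    const long long rows  = (nx + 1) * (nx + 1) * (nx + 1);  // ~global matrix rows
    const long long local = rows / ranks;                    // rows per MPI rank
    const long long nnz   = local * 27;                      // ~27 nonzeros per row (3^3 stencil)
    std::printf("nx=%lld: local nonzeros ~ %lld, INT32_MAX = %d\n",
                nx, nnz, INT32_MAX);
  }
  return 0;
}
// nx=680 -> ~2.132e9 (still below 2^31 - 1), nx=682 -> ~2.151e9 (wraps negative
// in a 32-bit int), matching the observed behaviour.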

problems building miniFE with clang-16

I am trying to build miniFE with clang-16, but I am getting errors of various kinds. For example: "#include nested too deeply", or, for all of the math functions: error: no member named 'acos' in the global namespace
using ::acos;
Has anybody had luck with clang-16 and can share tips on how to build?
Thanks in advance, Gabriele

How to estimate memory footprint from input parameter values?

Assuming we are only interested in OpenMP (miniFE/openmp45/src), not MPI.

How to estimate the memory required for a given set of input sizes (nx, ny and nz values)?

[miniFE/openmp45/src]$ ./miniFE.x -nx 512 -ny 512 -nz 512

MiniFE Mini-App, OpenMP Peer Implementation
Creating OpenMP Thread Pool...
Counted: 160 threads.
Running MiniFE Mini-App...
      creating/filling mesh...1.08418s, total time: 1.08418
generating matrix structure...Segmentation fault  **** 

Using 1024x1024x1024 generates "running out of memory"

./miniFE.x -nx 1024 -ny 1024 -nz 1024
MiniFE Mini-App, OpenMP Peer Implementation
Creating OpenMP Thread Pool...
Counted: 160 threads.
Running MiniFE Mini-App...
      creating/filling mesh...6.70102s, total time: 6.70103

generating matrix structure...proc 0 threw an exception in generate_matrix_structure, probably due to running out of memory. 
2.3012s, total time: 9.00222
         assembling FE data...
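
A rough, hedged estimate of where the memory goes (my own approximation, not a formula from the miniFE sources: double-precision matrix values, 4-byte column indices, roughly (nx+1)(ny+1)(nz+1) rows with ~27 nonzeros each, ignoring the mesh and the assembly temporaries, which add more on top):

#include <cstdio>

int main() {
  const long long nx = 512, ny = 512, nz = 512;
  const long long rows = (nx + 1) * (ny + 1) * (nz + 1);  // ~number of matrix rows
  const long long nnz  = rows * 27;                       // ~27 nonzeros per row
  // 8-byte double value + 4-byte column index per nonzero, plus a few
  // row-length vectors (x, b, residual, search direction, ...)
  const double matrix_gib  = nnz  * (8.0 + 4.0) / (1024.0*1024.0*1024.0);
  const double vectors_gib = rows * 8.0 * 4     / (1024.0*1024.0*1024.0);
  std::printf("rows = %lld, nnz ~ %lld, matrix ~ %.0f GiB, vectors ~ %.0f GiB\n",
              rows, nnz, matrix_gib, vectors_gib);
  return 0;
}
// For 512^3 this gives ~135M rows, ~3.6G nonzeros and ~41 GiB for the matrix
// alone; 1024^3 is roughly 8x that, so peak memory (including assembly
// temporaries) can easily exceed what is available, which is consistent with
// the failures above.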

MiniFE does not parse argv correctly in main(..)

During Trinity bring-up it appears that a path containing nx, ny, or nz (actually in argv[0]) can be parsed as an argument. This has created a significant number of issues for full-workload performance evaluation.
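
An illustrative sketch of the failure mode only (this is not miniFE's actual parsing code; find_int_arg is a hypothetical name): a parameter scan that searches every argv entry, including argv[0], can match the substring "nx" inside the executable's path, and because argv[0] is scanned first the bogus match shadows the real nx=... argument:

#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical parser used only to illustrate the bug class described above.
static int find_int_arg(int argc, char** argv, const char* name, int default_val) {
  for (int i = 0; i < argc; ++i) {                  // bug: should start at i = 1
    const char* p = std::strstr(argv[i], name);
    if (p != nullptr) {
      const char* v = p + std::strlen(name);
      if (*v == '=') ++v;                           // accept "nx=682"
      return std::atoi(v);                          // garbage if the match came from a path
    }
  }
  return default_val;
}

int main(int argc, char** argv) {
  // e.g. argv[0] == "/scratch/lynx_runs/miniFE.x" already contains "nx"
  std::printf("nx = %d\n", find_int_arg(argc, argv, "nx", 10));
  return 0;
}

Starting the loop at i = 1, or requiring an exact name=value token, avoids picking values out of the executable's own path.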

Differences between openmp and openmp-opt?

From the commit history, I can tell that one major difference between the openmp-opt and openmp versions is the use of MPI_THREAD_MULTIPLE in openmp-opt. Are there any other differences between the two versions? If so, is there a range of commits I can look at to find these differences?

Analytic solution is incorrect

The analytic solution defined in fem/analytic_soln.hpp corresponds to a case where the internal heat source term Q is zero. In the miniapp code, however, the source term is set to 1 (Hex8::sourceVector: Scalar Q = 1.0;). Consequently, model error does not decrease as resolution is increased (e.g., absolute error at (0.5, 0.5, 0.5) is always between 0.05 and 0.06).

[Figure: minife_100x100x100_solution_comparison] Comparison of the analytic and numerical solutions on the z = 0.5 plane.
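
For reference, my hedged reading of this in terms of the continuous problem: if the miniapp is viewed as discretizing the steady heat-conduction (Poisson-type) equation

    -\nabla^2 u = Q   in the unit cube, with Dirichlet data on the boundary,

then fem/analytic_soln.hpp implements the series solution of the homogeneous case Q = 0, while Hex8::sourceVector assembles the load for Q = 1. The two continuous solutions differ by the particular solution of -\nabla^2 w = 1 with zero boundary data, so mesh refinement drives the numerical solution toward the Q = 1 problem, and the pointwise gap to the tabulated analytic solution plateaus (the reported 0.05 to 0.06 offset at (0.5, 0.5, 0.5)) instead of converging to zero.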
