
mantevo / minife


MiniFE Finite Element Mini-Application

Home Page: http://www.mantevo.org

License: GNU Lesser General Public License v3.0

Languages: C++ 58.89%, Shell 5.00%, Makefile 9.80%, CMake 0.55%, C 20.26%, M4 3.59%, Perl 1.04%, Cuda 0.81%, Roff 0.07%
Topics: miniapp, minife, ecp, finite-elements, solvers, solver, snl-mini-apps

minife's Introduction

Mantevo

Core Mantevo Repository containing common resources

minife's People

Contributors

crtrott, jwillenbring, nmhamster, npe9, pwxy


minife's Issues

How to build MiniFE

I want to build a Docker image that contains MiniFE, and I tried to download v2.1.0, but the source code doesn't contain a top-level makefile. How do I build MiniFE?
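
Not an official answer, but for what it's worth: there appears to be no single top-level makefile; each variant keeps its own Makefile under its src directory (the reference version is ref/src, the OpenMP variant used in another issue below is openmp45/src). A minimal sketch, assuming an MPI toolchain is available inside the image and using the same commands that appear later on this page (the problem size is arbitrary):

$ git clone https://github.com/Mantevo/miniFE.git
$ cd miniFE/ref/src
$ make
$ mpirun -n 2 ./miniFE.x nx=64 ny=64 nz=64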

__syncthreads() needed for reduction?

For the following code in miniFE, the comments state that __syncthreads() is not needed within a warp. However, I think __syncthreads() is actually needed to produce correct sums; I get incorrect results when it is omitted. Can you reproduce the issue? Thank you for your comments.

template<typename Vector>
__global__ void dot_kernel(const Vector x, const Vector y, typename TypeTraits<typename Vector::ScalarType>::magnitude_type *d) {

  typedef typename TypeTraits<typename Vector::ScalarType>::magnitude_type magnitude;
  const int BLOCK_SIZE=512;

  magnitude sum=0;
  for(int idx=blockIdx.x*blockDim.x+threadIdx.x;idx<x.n;idx+=gridDim.x*blockDim.x) {
    sum+=x.coefs[idx]*y.coefs[idx];
  }

  //Do a shared memory reduction on the dot product
  __shared__ volatile magnitude red[BLOCK_SIZE];
  red[threadIdx.x]=sum;
  //__syncthreads(); if(threadIdx.x<512) {sum+=red[threadIdx.x+512]; red[threadIdx.x]=sum;}
  __syncthreads(); if(threadIdx.x<256)  {sum+=red[threadIdx.x+256]; red[threadIdx.x]=sum;}
  __syncthreads(); if(threadIdx.x<128)  {sum+=red[threadIdx.x+128]; red[threadIdx.x]=sum;}
  __syncthreads(); if(threadIdx.x<64)   {sum+=red[threadIdx.x+64];  red[threadIdx.x]=sum;}
  __syncthreads(); if(threadIdx.x<32)   {sum+=red[threadIdx.x+32];  red[threadIdx.x]=sum;}
  //the remaining ones don't need syncthreads because they are warp synchronous
                   if(threadIdx.x<16)   {sum+=red[threadIdx.x+16];  red[threadIdx.x]=sum;}
                   if(threadIdx.x<8)    {sum+=red[threadIdx.x+8];   red[threadIdx.x]=sum;}
                   if(threadIdx.x<4)    {sum+=red[threadIdx.x+4];   red[threadIdx.x]=sum;}
                   if(threadIdx.x<2)    {sum+=red[threadIdx.x+2];   red[threadIdx.x]=sum;}
                   if(threadIdx.x<1)    {sum+=red[threadIdx.x+1];}

  //save partial dot products
  if(threadIdx.x==0) d[blockIdx.x]=sum;
}

template<typename Scalar>
__global__ void dot_final_reduce_kernel(Scalar *d) {
  const int BLOCK_SIZE=1024;
  Scalar sum=d[threadIdx.x];
  __shared__ volatile Scalar red[BLOCK_SIZE];

  red[threadIdx.x]=sum;
  __syncthreads(); if(threadIdx.x<512)  {sum+=red[threadIdx.x+512]; red[threadIdx.x]=sum;}
  __syncthreads(); if(threadIdx.x<256)  {sum+=red[threadIdx.x+256]; red[threadIdx.x]=sum;}
  __syncthreads(); if(threadIdx.x<128)  {sum+=red[threadIdx.x+128]; red[threadIdx.x]=sum;}
  __syncthreads(); if(threadIdx.x<64)   {sum+=red[threadIdx.x+64];  red[threadIdx.x]=sum;}
  __syncthreads(); if(threadIdx.x<32)   {sum+=red[threadIdx.x+32];  red[threadIdx.x]=sum;}
  //the remaining ones don't need syncthreads because they are warp synchronous
                   if(threadIdx.x<16)   {sum+=red[threadIdx.x+16];  red[threadIdx.x]=sum;}
                   if(threadIdx.x<8)    {sum+=red[threadIdx.x+8];   red[threadIdx.x]=sum;}
                   if(threadIdx.x<4)    {sum+=red[threadIdx.x+4];   red[threadIdx.x]=sum;}
                   if(threadIdx.x<2)    {sum+=red[threadIdx.x+2];   red[threadIdx.x]=sum;}
                   if(threadIdx.x<1)    {sum+=red[threadIdx.x+1];}

  //save final dot product at the front
                   if(threadIdx.x==0) d[0]=sum;
}

Here is a standalone reproducer:

#define BLOCK_SIZE  256

#include <stdio.h>
#include <stdlib.h>   // for srand()/rand()
#include <cuda.h>

__global__ void dot_kernel(const int n, const int* x, const int* y, int *d) {

  int sum=0;
  for(int idx=blockIdx.x*blockDim.x+threadIdx.x;idx<n;idx+=gridDim.x*blockDim.x) {
    sum+=x[idx]*y[idx];
  }

  //Do a shared memory reduction on the dot product
  __shared__ int red[BLOCK_SIZE];
  red[threadIdx.x]=sum;
  #pragma unroll
  for (int n = 128; n > 0; n = n/2) {   // incorrect results when __syncthreads() is omitted within a warp
     __syncthreads();
     if(threadIdx.x<n)  {sum+=red[threadIdx.x+n]; red[threadIdx.x]=sum;}
  }

  //save partial dot products
  if(threadIdx.x==0) d[blockIdx.x]=sum;
}

__global__ void final(int *d) {
  int sum=d[threadIdx.x];
  __shared__ int red[BLOCK_SIZE];

  red[threadIdx.x]=sum;
  #pragma unroll
  for (int n = 128; n > 0; n = n/2) {    
     __syncthreads();
     if(threadIdx.x<n)  {sum+=red[threadIdx.x+n]; red[threadIdx.x]=sum;}
  }
  //save final dot product at the front
  if(threadIdx.x==0) d[0]=sum;
}

#define LEN 1025
int main() {
  int a[LEN];
  int b[LEN];
  int r[256];
  srand(2);
  int sum = 0;
  int d_sum = 0;

// sum on the host
  for (int i = 0; i < LEN; i++) {
    a[i] = rand() % 3;
    b[i] = rand() % 3;
    sum += a[i]*b[i];
  }

// sum on the device
  int *da, *db;
  int *dr;
  const int n = LEN;
  cudaMalloc((void**)&da, sizeof(int)*LEN);
  cudaMalloc((void**)&db, sizeof(int)*LEN);
  cudaMalloc((void**)&dr, sizeof(int)*256);
  cudaMemset(dr, 0, sizeof(int)*256);  // final<<<1,256>>> reads all 256 slots, but only (n+255)/256 blocks write them
  cudaMemcpy(da, a, sizeof(int)*LEN, cudaMemcpyHostToDevice);
  cudaMemcpy(db, b, sizeof(int)*LEN, cudaMemcpyHostToDevice);
  dot_kernel<<<(n+255)/256, 256 >>>(n, da,db,dr);
  final<<<1, 256>>>(dr);
  cudaMemcpy(&d_sum, dr, sizeof(int), cudaMemcpyDeviceToHost);
  printf("%d %d\n", sum ,d_sum);
  cudaFree(da);
  cudaFree(db);
  cudaFree(dr);
  return 0;
}
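
For what it's worth, my working hypothesis (an assumption on my part, not something stated in miniFE): on GPUs with independent thread scheduling (Volta and newer) the warp-synchronous tail is no longer guaranteed to be correct even with volatile shared memory, which would explain the wrong sums. Below is a minimal sketch of the same block reduction with the last-warp steps synchronized explicitly via __syncwarp(), assuming a block size of 256 and 32-thread warps (dot_kernel_syncwarp is a hypothetical name, not code from the repository):

// Sketch only, not miniFE's code: same block reduction as dot_kernel above,
// but the warp-level steps are synchronized explicitly so the kernel is also
// valid under independent thread scheduling (Volta and later).
__global__ void dot_kernel_syncwarp(const int n, const int* x, const int* y, int* d) {
  // grid-stride partial sum, same as above
  int sum = 0;
  for (int idx = blockIdx.x*blockDim.x + threadIdx.x; idx < n; idx += gridDim.x*blockDim.x)
    sum += x[idx]*y[idx];

  __shared__ volatile int red[256];   // assumes blockDim.x == 256
  red[threadIdx.x] = sum;

  // cross-warp steps: a full block barrier is required
  __syncthreads(); if (threadIdx.x < 128) { sum += red[threadIdx.x+128]; red[threadIdx.x] = sum; }
  __syncthreads(); if (threadIdx.x <  64) { sum += red[threadIdx.x+ 64]; red[threadIdx.x] = sum; }
  __syncthreads(); if (threadIdx.x <  32) { sum += red[threadIdx.x+ 32]; red[threadIdx.x] = sum; }
  // last warp: __syncwarp() makes the "warp-synchronous" assumption explicit
  __syncwarp();    if (threadIdx.x <  16) { sum += red[threadIdx.x+ 16]; red[threadIdx.x] = sum; }
  __syncwarp();    if (threadIdx.x <   8) { sum += red[threadIdx.x+  8]; red[threadIdx.x] = sum; }
  __syncwarp();    if (threadIdx.x <   4) { sum += red[threadIdx.x+  4]; red[threadIdx.x] = sum; }
  __syncwarp();    if (threadIdx.x <   2) { sum += red[threadIdx.x+  2]; red[threadIdx.x] = sum; }
  __syncwarp();    if (threadIdx.x <   1) { sum += red[threadIdx.x+  1]; }

  if (threadIdx.x == 0) d[blockIdx.x] = sum;   // one partial dot product per block
}

Using __syncthreads() between every step, as in the reproducer above, is equally correct and only marginally slower for a kernel of this size.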

MPI_Irecv and MPI_Send use the same buffer at the same time

Hi,

I ran miniFE's ref version with Intel MPI under the message checker from ITAC (Intel Trace Analyzer and Collector). The message checker detected LOCAL:MEMORY:OVERLAP issues, and further LOCAL:MEMORY:ILLEGAL_MODIFICATION issues, in ref/src/make_local_matrix.hpp, where the same buffers are used for sending and receiving at the same time. From what I saw, all other miniFE versions should also be affected if they execute the corresponding code.

The affected code from ref/src/make_local_matrix.hpp is in lines 257ff:

  std::vector<MPI_Request> request(num_send_neighbors);
  for(int i=0; i<num_send_neighbors; ++i) {
    MPI_Irecv(&tmp_buffer[i], 1, mpi_dtype, MPI_ANY_SOURCE, MPI_MY_TAG,
              MPI_COMM_WORLD, &request[i]);
  }

  // send messages

  for(int i=0; i<num_recv_neighbors; ++i) {
    MPI_Send(&tmp_buffer[i], 1, mpi_dtype, recv_list[i], MPI_MY_TAG,
             MPI_COMM_WORLD);
  }

If both loops have a trip count > 0, then some elements of the tmp_buffer array are used for sending and receiving at the same time.
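
One possible way to avoid the aliasing, sketched here only to illustrate the point (this is not a patch from the miniFE developers; send_buffer is a hypothetical name, I assume the values currently in tmp_buffer are what the sends are meant to transmit, and the int element type matches the MPI_INT datatype shown in the trace below):

  // Copy the outgoing values into their own buffer before posting the
  // receives, so no element of tmp_buffer is simultaneously an active
  // receive buffer and a send buffer.
  std::vector<int> send_buffer(tmp_buffer.begin(),
                               tmp_buffer.begin() + num_recv_neighbors);

  std::vector<MPI_Request> request(num_send_neighbors);
  for(int i=0; i<num_send_neighbors; ++i) {
    MPI_Irecv(&tmp_buffer[i], 1, mpi_dtype, MPI_ANY_SOURCE, MPI_MY_TAG,
              MPI_COMM_WORLD, &request[i]);
  }

  // send messages, now from the separate copy
  for(int i=0; i<num_recv_neighbors; ++i) {
    MPI_Send(&send_buffer[i], 1, mpi_dtype, recv_list[i], MPI_MY_TAG,
             MPI_COMM_WORLD);
  }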

The complete output and commands for reproducing:

$ git clone https://github.com/Mantevo/miniFE.git
$ cd miniFE/ref/src
$ # loaded module for intelmpi and itac
$ make
$ mpiexec -check-mpi -n 2 ./miniFE.x
...
      creating/filling mesh...0.000828028s, total time: 0.000828981
generating matrix structure...0.00868297s, total time: 0.00951195
         assembling FE data...0.00850797s, total time: 0.0180199
      imposing Dirichlet BC...0.00221992s, total time: 0.0202398
      imposing Dirichlet BC...0.00244904s, total time: 0.0226889
making matrix indices local...
[0] WARNING: LOCAL:MEMORY:OVERLAP: warning
[0] WARNING:    New send buffer overlaps with currently active receive buffer at address 0x17f0730.
[0] WARNING:    Control over active buffer was transferred to MPI at:
[0] WARNING:       MPI_Irecv(*buf=0x17f0730, count=1, datatype=MPI_INT, source=MPI_ANY_SOURCE, tag=99, comm=MPI_COMM_WORLD, *request=0x1c04470)
[0] WARNING:       _ZN6miniFE17make_local_matrixINS_9CSRMatrixIdiiEEEEvRT_ (/home/xyz/projects/miniFE/ref/src/./make_local_matrix.hpp:259)
[0] WARNING:       _ZN6miniFE6driverIdiiEEiRK3BoxRS1_RNS_10ParametersER8YAML_Doc (/home/xyz/projects/miniFE/ref/src/./driver.hpp:228)
[0] WARNING:       main (/home/xyz/projects/miniFE/ref/src/main.cpp:154)
[0] WARNING:       __libc_start_main (/usr/lib64/libc-2.28.so)
[0] WARNING:       _start (/home/xyz/projects/miniFE/ref/src/miniFE.x)
[0] WARNING:    Control over new buffer is about to be transferred to MPI at:
[0] WARNING:       MPI_Send(*buf=0x17f0730, count=1, datatype=MPI_INT, dest=1, tag=99, comm=MPI_COMM_WORLD)
[0] WARNING:       _ZN6miniFE17make_local_matrixINS_9CSRMatrixIdiiEEEEvRT_ (/home/xyz/projects/miniFE/ref/src/./make_local_matrix.hpp:266)
[0] WARNING:       _ZN6miniFE6driverIdiiEEiRK3BoxRS1_RNS_10ParametersER8YAML_Doc (/home/xyz/projects/miniFE/ref/src/./driver.hpp:228)
[0] WARNING:       main (/home/xyz/projects/miniFE/ref/src/main.cpp:154)
[0] WARNING:       __libc_start_main (/usr/lib64/libc-2.28.so)
[0] WARNING:       _start (/home/xyz/projects/miniFE/ref/src/miniFE.x)

[1] WARNING: LOCAL:MEMORY:OVERLAP: warning
[1] WARNING:    New send buffer overlaps with currently active receive buffer at address 0x11d48a0.
[1] WARNING:    Control over active buffer was transferred to MPI at:
[1] WARNING:       MPI_Irecv(*buf=0x11d48a0, count=1, datatype=MPI_INT, source=MPI_ANY_SOURCE, tag=99, comm=MPI_COMM_WORLD, *request=0x1219dc0)
[1] WARNING:       _ZN6miniFE17make_local_matrixINS_9CSRMatrixIdiiEEEEvRT_ (/home/xyz/projects/miniFE/ref/src/./make_local_matrix.hpp:259)
[1] WARNING:       _ZN6miniFE6driverIdiiEEiRK3BoxRS1_RNS_10ParametersER8YAML_Doc (/home/xyz/projects/miniFE/ref/src/./driver.hpp:228)
[1] WARNING:       main (/home/xyz/projects/miniFE/ref/src/main.cpp:154)
[1] WARNING:       __libc_start_main (/usr/lib64/libc-2.28.so)
[1] WARNING:       _start (/home/xyz/projects/miniFE/ref/src/miniFE.x)
[1] WARNING:    Control over new buffer is about to be transferred to MPI at:
[1] WARNING:       MPI_Send(*buf=0x11d48a0, count=1, datatype=MPI_INT, dest=0, tag=99, comm=MPI_COMM_WORLD)
[1] WARNING:       _ZN6miniFE17make_local_matrixINS_9CSRMatrixIdiiEEEEvRT_ (/home/xyz/projects/miniFE/ref/src/./make_local_matrix.hpp:266)
[1] WARNING:       _ZN6miniFE6driverIdiiEEiRK3BoxRS1_RNS_10ParametersER8YAML_Doc (/home/xyz/projects/miniFE/ref/src/./driver.hpp:228)
[1] WARNING:       main (/home/xyz/projects/miniFE/ref/src/main.cpp:154)
[1] WARNING:       __libc_start_main (/usr/lib64/libc-2.28.so)
[1] WARNING:       _start (/home/xyz/projects/miniFE/ref/src/miniFE.x)
1.09176s, total time: 1.11445
Starting CG solver ...
Initial Residual = 11.0289
Iteration = 20   Residual = 1.23424e-08
Final Resid Norm: 2.06977e-16

[0] INFO: LOCAL:MEMORY:OVERLAP: found 2 times (0 errors + 2 warnings), 0 reports were suppressed
[0] INFO: Found 2 problems (0 errors + 2 warnings), 0 reports were suppressed.

If I use more than 2 processes, e.g. 72, then some OVERLAP warnings turn into ILLEGAL_MODIFICATION errors:

[54] ERROR: LOCAL:MEMORY:ILLEGAL_MODIFICATION: error
[54] ERROR:    Read-only buffer was modified while owned by MPI.
[54] ERROR:    Control over buffer was transferred to MPI at:
[54] ERROR:       MPI_Send(*buf=0x9693c4, count=1, datatype=MPI_INT, dest=22, tag=99, comm=MPI_COMM_WORLD)
[54] ERROR:       _ZN6miniFE17make_local_matrixINS_9CSRMatrixIdiiEEEEvRT_ (/home/xyz/projects/miniFE/ref/src/./make_local_matrix.hpp:266)
[54] ERROR:       _ZN6miniFE6driverIdiiEEiRK3BoxRS1_RNS_10ParametersER8YAML_Doc (/home/xyz/projects/miniFE/ref/src/./driver.hpp:228)
[54] ERROR:       main (/home/xyz/projects/miniFE/ref/src/main.cpp:154)
[54] ERROR:       __libc_start_main (/usr/lib64/libc-2.28.so)
[54] ERROR:       _start (/home/xyz/projects/miniFE/ref/src/miniFE.x)
[54] ERROR:    Modified buffer detected at:
[54] ERROR:       MPI_Send(*buf=0x9693c4, count=1, datatype=MPI_INT, dest=22, tag=99, comm=MPI_COMM_WORLD)
[54] ERROR:       _ZN6miniFE17make_local_matrixINS_9CSRMatrixIdiiEEEEvRT_ (/home/xyz/projects/miniFE/ref/src/./make_local_matrix.hpp:266)
[54] ERROR:       _ZN6miniFE6driverIdiiEEiRK3BoxRS1_RNS_10ParametersER8YAML_Doc (/home/xyz/projects/miniFE/ref/src/./driver.hpp:228)
[54] ERROR:       main (/home/xyz/projects/miniFE/ref/src/main.cpp:154)
[54] ERROR:       __libc_start_main (/usr/lib64/libc-2.28.so)
[54] ERROR:       _start (/home/xyz/projects/miniFE/ref/src/miniFE.x)

x * y * z * 3^3 / ppn can exceed 2^31 on modern machines

Running on a 192 GB dual-socket machine, using the MPI + OpenMP version in miniFE_openmp_opt:

export OMP_NUM_THREADS=11
mpirun -n 4 -ppn 4 ./miniFE.x nx=682 ny=682 nz=682

throws an exception because nrows_max in CSRMatrix.hpp turns negative due to int overflow; packed_cols.reserve(nrows_max) doesn't like negative numbers ;-)

mpirun -n 4 -ppn 4 ./miniFE.x nx=680 ny=680 nz=680 # works

Unfortunately making MINIFE_GLOBAL_ORDINAL a long is not sufficient to address the issue.
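
A rough back-of-the-envelope check (my own approximation, not the exact formula used in CSRMatrix.hpp: I assume roughly (nx+1)^3 matrix rows split evenly across the 4 ranks and about 27 nonzeros per row) shows why 682 tips over while 680 still fits:

#include <cstdint>
#include <cstdio>

int main() {
  const long long ranks = 4;                                 // -ppn 4 as in the report
  const long long sizes[] = {680, 682};
  for (long long nx : sizes) {
    const long long rows  = (nx + 1) * (nx + 1) * (nx + 1);  // ~global matrix rows
    const long long local = rows / ranks;                    // rows per MPI rank
    const long long nnz   = local * 27;                      // ~27 nonzeros per row (3^3 stencil)
    std::printf("nx=%lld: local nonzeros ~ %lld, INT32_MAX = %d\n",
                nx, nnz, INT32_MAX);
  }
  return 0;
}
// nx=680 -> ~2.132e9 (still below 2^31 - 1), nx=682 -> ~2.151e9 (wraps negative
// in a 32-bit int), matching the observed behaviour.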

problems building miniFE with clang-16

I am trying to build miniFE with clang-16, but I am getting errors of various kinds. For example: "#include nested too deeply", or, for all of the math functions: error: no member named 'acos' in the global namespace
using ::acos;
Has anybody had luck with clang-16 and can share tips on how to build?
Thanks in advance, Gabriele

How to estimate memory footprint from input parameter values?

Assuming we are only interested in OpenMP (miniFE/openmp45/src), not MPI.

How to estimate the memory required for a given set of input sizes (nx, ny and nz values)?

[miniFE/openmp45/src]$ ./miniFE.x -nx 512 -ny 512 -nz 512

MiniFE Mini-App, OpenMP Peer Implementation
Creating OpenMP Thread Pool...
Counted: 160 threads.
Running MiniFE Mini-App...
      creating/filling mesh...1.08418s, total time: 1.08418
generating matrix structure...Segmentation fault  **** 

Using 1024x1024x1024 generates "running out of memory"

./miniFE.x -nx 1024 -ny 1024 -nz 1024
MiniFE Mini-App, OpenMP Peer Implementation
Creating OpenMP Thread Pool...
Counted: 160 threads.
Running MiniFE Mini-App...
      creating/filling mesh...6.70102s, total time: 6.70103

generating matrix structure...proc 0 threw an exception in generate_matrix_structure, probably due to running out of memory. 
2.3012s, total time: 9.00222
         assembling FE data...
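
A rough, hedged estimate of where the memory goes (my own approximation, not a formula from the miniFE sources: double-precision matrix values, 4-byte column indices, roughly (nx+1)(ny+1)(nz+1) rows with ~27 nonzeros each, ignoring the mesh and the assembly temporaries, which add more on top):

#include <cstdio>

int main() {
  const long long nx = 512, ny = 512, nz = 512;
  const long long rows = (nx + 1) * (ny + 1) * (nz + 1);  // ~number of matrix rows
  const long long nnz  = rows * 27;                       // ~27 nonzeros per row
  // 8-byte double value + 4-byte column index per nonzero, plus a few
  // row-length vectors (x, b, residual, search direction, ...)
  const double matrix_gib  = nnz  * (8.0 + 4.0) / (1024.0*1024.0*1024.0);
  const double vectors_gib = rows * 8.0 * 4     / (1024.0*1024.0*1024.0);
  std::printf("rows = %lld, nnz ~ %lld, matrix ~ %.0f GiB, vectors ~ %.0f GiB\n",
              rows, nnz, matrix_gib, vectors_gib);
  return 0;
}
// For 512^3 this gives ~135M rows, ~3.6G nonzeros and ~41 GiB for the matrix
// alone; 1024^3 is roughly 8x that, so peak memory (including assembly
// temporaries) can easily exceed what is available, which is consistent with
// the failures above.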

MiniFE does not parse argv correctly in main(..)

During Trinity bring-up it appears that a path containing nx, ny, or nz (actually in argv[0]) can be parsed as an argument. This has created a significant number of issues for full-workload performance evaluation.
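
An illustrative sketch of the failure mode only (this is not miniFE's actual parsing code; find_int_arg is a hypothetical name): a parameter scan that searches every argv entry, including argv[0], can match the substring "nx" inside the executable's path, and because argv[0] is scanned first the bogus match shadows the real nx=... argument:

#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical parser used only to illustrate the bug class described above.
static int find_int_arg(int argc, char** argv, const char* name, int default_val) {
  for (int i = 0; i < argc; ++i) {                  // bug: should start at i = 1
    const char* p = std::strstr(argv[i], name);
    if (p != nullptr) {
      const char* v = p + std::strlen(name);
      if (*v == '=') ++v;                           // accept "nx=682"
      return std::atoi(v);                          // garbage if the match came from a path
    }
  }
  return default_val;
}

int main(int argc, char** argv) {
  // e.g. argv[0] == "/scratch/lynx_runs/miniFE.x" already contains "nx"
  std::printf("nx = %d\n", find_int_arg(argc, argv, "nx", 10));
  return 0;
}

Starting the loop at i = 1, or requiring an exact name=value token, avoids picking values out of the executable's own path.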

Differences between openmp and openmp-opt?

From the commit history, I can tell that one major difference between the openmp-opt and openmp versions is the use of MPI_THREAD_MULTIPLE in openmp-opt. Are there any other differences between the two versions? If so, is there a range of commits I can look at to find these differences?

Analytic solution is incorrect

The analytic solution defined in fem/analytic_soln.hpp corresponds to a case where the internal heat source term Q is zero. In the miniapp code, however, the source term is set to 1 (Hex8::sourceVector: Scalar Q = 1.0;). Consequently, model error does not decrease as resolution is increased (e.g., absolute error at (0.5, 0.5, 0.5) is always between 0.05 and 0.06).

[Figure: minife_100x100x100_solution_comparison] Comparison of the analytic and numerical solutions on the z = 0.5 plane.
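
For reference, my hedged reading of this in terms of the continuous problem: if the miniapp is viewed as discretizing the steady heat-conduction (Poisson-type) equation

    -\nabla^2 u = Q   in the unit cube, with Dirichlet data on the boundary,

then fem/analytic_soln.hpp implements the series solution of the homogeneous case Q = 0, while Hex8::sourceVector assembles the load for Q = 1. The two continuous solutions differ by the particular solution of -\nabla^2 w = 1 with zero boundary data, so mesh refinement drives the numerical solution toward the Q = 1 problem, and the pointwise gap to the tabulated analytic solution plateaus (the reported 0.05 to 0.06 offset at (0.5, 0.5, 0.5)) instead of converging to zero.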
