Core Mantevo Repository containing common resources
mantevo / minife
MiniFE Finite Element Mini-Application
Home Page: http://www.mantevo.org
License: GNU Lesser General Public License v3.0
I want to build a Docker image that contains MiniFE, so I downloaded v2.1.0. However, the source code doesn't contain a makefile. How do I build MiniFE?
In the following miniFE code, the comments state that __syncthreads() is not needed within a warp. However, I think __syncthreads() is actually needed to produce correct sums: I got incorrect results when omitting it. Could you reproduce the issue? Thank you for your comments.
template<typename Vector>
__global__ void dot_kernel(const Vector x, const Vector y, typename TypeTraits<typename Vector::ScalarType>::magnitude_type *d) {
  typedef typename TypeTraits<typename Vector::ScalarType>::magnitude_type magnitude;
  const int BLOCK_SIZE=512;
  magnitude sum=0;
  for(int idx=blockIdx.x*blockDim.x+threadIdx.x;idx<x.n;idx+=gridDim.x*blockDim.x) {
    sum+=x.coefs[idx]*y.coefs[idx];
  }
  //Do a shared memory reduction on the dot product
  __shared__ volatile magnitude red[BLOCK_SIZE];
  red[threadIdx.x]=sum;
  //__syncthreads(); if(threadIdx.x<512) {sum+=red[threadIdx.x+512]; red[threadIdx.x]=sum;}
  __syncthreads(); if(threadIdx.x<256) {sum+=red[threadIdx.x+256]; red[threadIdx.x]=sum;}
  __syncthreads(); if(threadIdx.x<128) {sum+=red[threadIdx.x+128]; red[threadIdx.x]=sum;}
  __syncthreads(); if(threadIdx.x<64) {sum+=red[threadIdx.x+64]; red[threadIdx.x]=sum;}
  __syncthreads(); if(threadIdx.x<32) {sum+=red[threadIdx.x+32]; red[threadIdx.x]=sum;}
  //the remaining ones don't need syncthreads because they are warp synchronous
  if(threadIdx.x<16) {sum+=red[threadIdx.x+16]; red[threadIdx.x]=sum;}
  if(threadIdx.x<8) {sum+=red[threadIdx.x+8]; red[threadIdx.x]=sum;}
  if(threadIdx.x<4) {sum+=red[threadIdx.x+4]; red[threadIdx.x]=sum;}
  if(threadIdx.x<2) {sum+=red[threadIdx.x+2]; red[threadIdx.x]=sum;}
  if(threadIdx.x<1) {sum+=red[threadIdx.x+1];}
  //save partial dot products
  if(threadIdx.x==0) d[blockIdx.x]=sum;
}

template<typename Scalar>
__global__ void dot_final_reduce_kernel(Scalar *d) {
  const int BLOCK_SIZE=1024;
  Scalar sum=d[threadIdx.x];
  __shared__ volatile Scalar red[BLOCK_SIZE];
  red[threadIdx.x]=sum;
  __syncthreads(); if(threadIdx.x<512) {sum+=red[threadIdx.x+512]; red[threadIdx.x]=sum;}
  __syncthreads(); if(threadIdx.x<256) {sum+=red[threadIdx.x+256]; red[threadIdx.x]=sum;}
  __syncthreads(); if(threadIdx.x<128) {sum+=red[threadIdx.x+128]; red[threadIdx.x]=sum;}
  __syncthreads(); if(threadIdx.x<64) {sum+=red[threadIdx.x+64]; red[threadIdx.x]=sum;}
  __syncthreads(); if(threadIdx.x<32) {sum+=red[threadIdx.x+32]; red[threadIdx.x]=sum;}
  //the remaining ones don't need syncthreads because they are warp synchronous
  if(threadIdx.x<16) {sum+=red[threadIdx.x+16]; red[threadIdx.x]=sum;}
  if(threadIdx.x<8) {sum+=red[threadIdx.x+8]; red[threadIdx.x]=sum;}
  if(threadIdx.x<4) {sum+=red[threadIdx.x+4]; red[threadIdx.x]=sum;}
  if(threadIdx.x<2) {sum+=red[threadIdx.x+2]; red[threadIdx.x]=sum;}
  if(threadIdx.x<1) {sum+=red[threadIdx.x+1];}
  //save final dot product at the front
  if(threadIdx.x==0) d[0]=sum;
}
#define BLOCK_SIZE 256
#include <stdio.h>
#include <stdlib.h>   // srand, rand
#include <cuda.h>

__global__ void dot_kernel(const int n, const int* x, const int* y, int *d) {
  int sum=0;
  for(int idx=blockIdx.x*blockDim.x+threadIdx.x;idx<n;idx+=gridDim.x*blockDim.x) {
    sum+=x[idx]*y[idx];
  }
  //Do a shared memory reduction on the dot product
  __shared__ int red[BLOCK_SIZE];
  red[threadIdx.x]=sum;
  #pragma unroll
  for (int stride = BLOCK_SIZE/2; stride > 0; stride = stride/2) { // incorrect results when __syncthreads() is omitted within a warp
    __syncthreads();
    if(threadIdx.x<stride) {sum+=red[threadIdx.x+stride]; red[threadIdx.x]=sum;}
  }
  //save partial dot products
  if(threadIdx.x==0) d[blockIdx.x]=sum;
}

__global__ void final(int *d) {
  int sum=d[threadIdx.x];
  __shared__ int red[BLOCK_SIZE];
  red[threadIdx.x]=sum;
  #pragma unroll
  for (int stride = BLOCK_SIZE/2; stride > 0; stride = stride/2) {
    __syncthreads();
    if(threadIdx.x<stride) {sum+=red[threadIdx.x+stride]; red[threadIdx.x]=sum;}
  }
  //save final dot product at the front
  if(threadIdx.x==0) d[0]=sum;
}

#define LEN 1025
int main() {
  int a[LEN];
  int b[LEN];
  srand(2);
  int sum = 0;
  int d_sum = 0;
  // sum on the host
  for (int i = 0; i < LEN; i++) {
    a[i] = rand() % 3;
    b[i] = rand() % 3;
    sum += a[i]*b[i];
  }
  // sum on the device
  int *da, *db;
  int *dr;
  const int n = LEN;
  cudaMalloc((void**)&da, sizeof(int)*LEN);
  cudaMalloc((void**)&db, sizeof(int)*LEN);
  cudaMalloc((void**)&dr, sizeof(int)*BLOCK_SIZE);
  // dot_kernel writes one partial sum per block, but final reads BLOCK_SIZE
  // entries, so the unused entries must be zeroed first
  cudaMemset(dr, 0, sizeof(int)*BLOCK_SIZE);
  cudaMemcpy(da, a, sizeof(int)*LEN, cudaMemcpyHostToDevice);
  cudaMemcpy(db, b, sizeof(int)*LEN, cudaMemcpyHostToDevice);
  dot_kernel<<<(n+BLOCK_SIZE-1)/BLOCK_SIZE, BLOCK_SIZE>>>(n, da, db, dr);
  final<<<1, BLOCK_SIZE>>>(dr);
  cudaMemcpy(&d_sum, dr, sizeof(int), cudaMemcpyDeviceToHost);
  printf("%d %d\n", sum, d_sum);
  cudaFree(da);
  cudaFree(db);
  cudaFree(dr);
  return 0;
}
Hi,
I ran miniFE's ref version with Intel MPI under the message checker from ITAC (Intel Trace Analyzer and Collector). The message checker reported LOCAL:MEMORY:OVERLAP and, further, LOCAL:MEMORY:ILLEGAL_MODIFICATION in ref/src/make_local_matrix.hpp, where the same buffers are used for sending and receiving at the same time. From what I saw, all other miniFE versions should also be affected if they execute the corresponding code.
The affected code in ref/src/make_local_matrix.hpp starts at line 257:
std::vector<MPI_Request> request(num_send_neighbors);
for(int i=0; i<num_send_neighbors; ++i) {
  MPI_Irecv(&tmp_buffer[i], 1, mpi_dtype, MPI_ANY_SOURCE, MPI_MY_TAG,
            MPI_COMM_WORLD, &request[i]);
}
// send messages
for(int i=0; i<num_recv_neighbors; ++i) {
  MPI_Send(&tmp_buffer[i], 1, mpi_dtype, recv_list[i], MPI_MY_TAG,
           MPI_COMM_WORLD);
}
If both loops have a trip count > 0, then some elements of the tmp_buffer array are used for sending and receiving at the same time.
The complete output and commands for reproducing:
$ git clone https://github.com/Mantevo/miniFE.git
$ cd miniFE/ref/src
$ # loaded module for intelmpi and itac
$ make
$ mpiexec -check-mpi -n 2 ./miniFE.x
...
creating/filling mesh...0.000828028s, total time: 0.000828981
generating matrix structure...0.00868297s, total time: 0.00951195
assembling FE data...0.00850797s, total time: 0.0180199
imposing Dirichlet BC...0.00221992s, total time: 0.0202398
imposing Dirichlet BC...0.00244904s, total time: 0.0226889
making matrix indices local...
[0] WARNING: LOCAL:MEMORY:OVERLAP: warning
[0] WARNING: New send buffer overlaps with currently active receive buffer at address 0x17f0730.
[0] WARNING: Control over active buffer was transferred to MPI at:
[0] WARNING: MPI_Irecv(*buf=0x17f0730, count=1, datatype=MPI_INT, source=MPI_ANY_SOURCE, tag=99, comm=MPI_COMM_WORLD, *request=0x1c04470)
[0] WARNING: _ZN6miniFE17make_local_matrixINS_9CSRMatrixIdiiEEEEvRT_ (/home/xyz/projects/miniFE/ref/src/./make_local_matrix.hpp:259)
[0] WARNING: _ZN6miniFE6driverIdiiEEiRK3BoxRS1_RNS_10ParametersER8YAML_Doc (/home/xyz/projects/miniFE/ref/src/./driver.hpp:228)
[0] WARNING: main (/home/xyz/projects/miniFE/ref/src/main.cpp:154)
[0] WARNING: __libc_start_main (/usr/lib64/libc-2.28.so)
[0] WARNING: _start (/home/xyz/projects/miniFE/ref/src/miniFE.x)
[0] WARNING: Control over new buffer is about to be transferred to MPI at:
[0] WARNING: MPI_Send(*buf=0x17f0730, count=1, datatype=MPI_INT, dest=1, tag=99, comm=MPI_COMM_WORLD)
[0] WARNING: _ZN6miniFE17make_local_matrixINS_9CSRMatrixIdiiEEEEvRT_ (/home/xyz/projects/miniFE/ref/src/./make_local_matrix.hpp:266)
[0] WARNING: _ZN6miniFE6driverIdiiEEiRK3BoxRS1_RNS_10ParametersER8YAML_Doc (/home/xyz/projects/miniFE/ref/src/./driver.hpp:228)
[0] WARNING: main (/home/xyz/projects/miniFE/ref/src/main.cpp:154)
[0] WARNING: __libc_start_main (/usr/lib64/libc-2.28.so)
[0] WARNING: _start (/home/xyz/projects/miniFE/ref/src/miniFE.x)
[1] WARNING: LOCAL:MEMORY:OVERLAP: warning
[1] WARNING: New send buffer overlaps with currently active receive buffer at address 0x11d48a0.
[1] WARNING: Control over active buffer was transferred to MPI at:
[1] WARNING: MPI_Irecv(*buf=0x11d48a0, count=1, datatype=MPI_INT, source=MPI_ANY_SOURCE, tag=99, comm=MPI_COMM_WORLD, *request=0x1219dc0)
[1] WARNING: _ZN6miniFE17make_local_matrixINS_9CSRMatrixIdiiEEEEvRT_ (/home/xyz/projects/miniFE/ref/src/./make_local_matrix.hpp:259)
[1] WARNING: _ZN6miniFE6driverIdiiEEiRK3BoxRS1_RNS_10ParametersER8YAML_Doc (/home/xyz/projects/miniFE/ref/src/./driver.hpp:228)
[1] WARNING: main (/home/xyz/projects/miniFE/ref/src/main.cpp:154)
[1] WARNING: __libc_start_main (/usr/lib64/libc-2.28.so)
[1] WARNING: _start (/home/xyz/projects/miniFE/ref/src/miniFE.x)
[1] WARNING: Control over new buffer is about to be transferred to MPI at:
[1] WARNING: MPI_Send(*buf=0x11d48a0, count=1, datatype=MPI_INT, dest=0, tag=99, comm=MPI_COMM_WORLD)
[1] WARNING: _ZN6miniFE17make_local_matrixINS_9CSRMatrixIdiiEEEEvRT_ (/home/xyz/projects/miniFE/ref/src/./make_local_matrix.hpp:266)
[1] WARNING: _ZN6miniFE6driverIdiiEEiRK3BoxRS1_RNS_10ParametersER8YAML_Doc (/home/xyz/projects/miniFE/ref/src/./driver.hpp:228)
[1] WARNING: main (/home/xyz/projects/miniFE/ref/src/main.cpp:154)
[1] WARNING: __libc_start_main (/usr/lib64/libc-2.28.so)
[1] WARNING: _start (/home/xyz/projects/miniFE/ref/src/miniFE.x)
1.09176s, total time: 1.11445
Starting CG solver ...
Initial Residual = 11.0289
Iteration = 20 Residual = 1.23424e-08
Final Resid Norm: 2.06977e-16
[0] INFO: LOCAL:MEMORY:OVERLAP: found 2 times (0 errors + 2 warnings), 0 reports were suppressed
[0] INFO: Found 2 problems (0 errors + 2 warnings), 0 reports were suppressed.
If I use more than 2 processes, e.g. 72, then some OVERLAP warnings turn into ILLEGAL_MODIFICATION errors:
[54] ERROR: LOCAL:MEMORY:ILLEGAL_MODIFICATION: error
[54] ERROR: Read-only buffer was modified while owned by MPI.
[54] ERROR: Control over buffer was transferred to MPI at:
[54] ERROR: MPI_Send(*buf=0x9693c4, count=1, datatype=MPI_INT, dest=22, tag=99, comm=MPI_COMM_WORLD)
[54] ERROR: _ZN6miniFE17make_local_matrixINS_9CSRMatrixIdiiEEEEvRT_ (/home/xyz/projects/miniFE/ref/src/./make_local_matrix.hpp:266)
[54] ERROR: _ZN6miniFE6driverIdiiEEiRK3BoxRS1_RNS_10ParametersER8YAML_Doc (/home/xyz/projects/miniFE/ref/src/./driver.hpp:228)
[54] ERROR: main (/home/xyz/projects/miniFE/ref/src/main.cpp:154)
[54] ERROR: __libc_start_main (/usr/lib64/libc-2.28.so)
[54] ERROR: _start (/home/xyz/projects/miniFE/ref/src/miniFE.x)
[54] ERROR: Modified buffer detected at:
[54] ERROR: MPI_Send(*buf=0x9693c4, count=1, datatype=MPI_INT, dest=22, tag=99, comm=MPI_COMM_WORLD)
[54] ERROR: _ZN6miniFE17make_local_matrixINS_9CSRMatrixIdiiEEEEvRT_ (/home/xyz/projects/miniFE/ref/src/./make_local_matrix.hpp:266)
[54] ERROR: _ZN6miniFE6driverIdiiEEiRK3BoxRS1_RNS_10ParametersER8YAML_Doc (/home/xyz/projects/miniFE/ref/src/./driver.hpp:228)
[54] ERROR: main (/home/xyz/projects/miniFE/ref/src/main.cpp:154)
[54] ERROR: __libc_start_main (/usr/lib64/libc-2.28.so)
[54] ERROR: _start (/home/xyz/projects/miniFE/ref/src/miniFE.x)
Hi, what is the difference between miniFE and HPCG?
I'm trying to merge the posit implementation found here into the ref/src implementation of the miniFE application. I'm kind of new to editing makefiles and was hoping for some tips.
Running on a 192 GB dual socket machine. Using the MPI + OMP version in miniFE_openmp_opt
export OMP_NUM_THREADS=11
mpirun -n 4 -ppn 4 ./miniFE.x nx=682 ny=682 nz=682
throws an exception because nrows_max in CSRMatrix.hpp turns negative due to int overflow.
packed_cols.reserve(nrows_max); doesn't like negative numbers ;-)
mpirun -n 4 -ppn 4 ./miniFE.x nx=680 ny=680 nz=680 # works
Unfortunately making MINIFE_GLOBAL_ORDINAL a long is not sufficient to address the issue.
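The threshold between the working and failing sizes can be checked with a little arithmetic. The sketch below is not miniFE code. It assumes that the mesh has (nx+1)(ny+1)(nz+1) nodes, that rows are split evenly across the 4 ranks, and that 27 column entries are reserved per row; the function names and the constant are hypothetical. Under those assumptions the per-rank entry count is about 2.13 billion for 680 and about 2.15 billion for 682, which straddles the 32-bit limit of 2,147,483,647.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Assumption (not taken from the miniFE source): 27 reserved column entries per row.
constexpr std::int64_t kEntriesPerRow = 27;

// Number of column entries one rank would reserve, computed in 64-bit arithmetic.
// Assumes one matrix row per mesh node and an even split of rows across ranks.
std::int64_t reserved_entries_per_rank(std::int64_t nx, std::int64_t ny,
                                       std::int64_t nz, std::int64_t ranks) {
  assert(ranks > 0);
  const std::int64_t rows = (nx + 1) * (ny + 1) * (nz + 1);
  const std::int64_t rows_per_rank = (rows + ranks - 1) / ranks;  // ceiling division
  return rows_per_rank * kEntriesPerRow;
}

// True if the value can be stored in a signed 32-bit integer without overflow.
bool fits_in_int32(std::int64_t value) {
  return value <= std::numeric_limits<std::int32_t>::max();
}
```

Calling fits_in_int32(reserved_entries_per_rank(n, n, n, 4)) for n = 680 and n = 682 gives true and false respectively, which matches the behavior reported above. If that is indeed the cause, every intermediate product involved in the reservation would need a 64-bit type, not only the global ordinal.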
I am trying to build miniFE with clang-16, but I am getting errors of various kinds. For example: "#include nested too deeply", or, for all of the math functions: error: no member named 'acos' in the global namespace
using ::acos;
Has anybody had luck with clang-16 and can share tips on how to build?
Thanks in advance, Gabriele
Assuming we are only interested in OpenMP (miniFE/openmp45/src), not MPI, how can we estimate the memory required for a given set of input sizes (nx, ny and nz values)?
miniFE/openmp45/src]./miniFE.x -nx 512 -ny 512 -nz 512
MiniFE Mini-App, OpenMP Peer Implementation
Creating OpenMP Thread Pool...
Counted: 160 threads.
Running MiniFE Mini-App...
creating/filling mesh...1.08418s, total time: 1.08418
generating matrix structure...Segmentation fault ****
Using 1024x1024x1024 generates "running out of memory"
./miniFE.x -nx 1024 -ny 1024 -nz 1024
MiniFE Mini-App, OpenMP Peer Implementation
Creating OpenMP Thread Pool...
Counted: 160 threads.
Running MiniFE Mini-App...
creating/filling mesh...6.70102s, total time: 6.70103
generating matrix structure...proc 0 threw an exception in generate_matrix_structure, probably due to running out of memory.
2.3012s, total time: 9.00222
assembling FE data...
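A rough lower bound for the matrix storage can be sketched as follows. This is an estimate under stated assumptions, not a description of miniFE's actual allocation: one matrix row per mesh node, up to 27 nonzeros per row, 8-byte values, 4-byte column indices and 4-byte row offsets. The struct and function names are hypothetical, and vectors and temporary structures built during matrix generation are not counted.

```cpp
#include <cassert>
#include <cstdint>

// Result of the estimate; all fields follow from the assumptions stated above.
struct MemoryEstimate {
  std::int64_t rows;      // one row per mesh node
  std::int64_t nonzeros;  // assumed 27 entries per row
  double matrix_gib;      // CSR storage in GiB
};

// Estimate CSR matrix storage for an nx * ny * nz element mesh.
MemoryEstimate estimate_matrix_memory(std::int64_t nx, std::int64_t ny,
                                      std::int64_t nz) {
  assert(nx > 0 && ny > 0 && nz > 0);
  const std::int64_t rows = (nx + 1) * (ny + 1) * (nz + 1);
  const std::int64_t nonzeros = rows * 27;
  const double bytes =
      static_cast<double>(nonzeros) * (8 + 4)  // values + column indices
      + static_cast<double>(rows) * 4;         // row offsets
  return {rows, nonzeros, bytes / (1024.0 * 1024.0 * 1024.0)};
}
```

With these assumptions, 512x512x512 comes to about 41 GiB and 1024x1024x1024 to about 329 GiB for the matrix alone. The 512 case also yields roughly 3.6 billion nonzeros, which exceeds what a signed 32-bit index can represent, so an index overflow may be worth checking alongside available memory.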
During Trinity bring-up, it appears that a path with nx or ny or nz in it (actually in argv[0]) can be parsed as an argument. This has created a significant number of issues for full workload performance evaluation.
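One way to avoid this class of problem is to skip argv[0] and require each token to start with the key. The helper below is a minimal sketch, not the miniFE parser: parse_int_option is a hypothetical name, and it handles only the key=value form (such as nx=682), not the "-nx 512" form.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: return the integer value of "key=value" found among the
// real arguments, or fallback if the key is absent. The last occurrence wins.
int parse_int_option(int argc, const char* const argv[],
                     const std::string& key, int fallback) {
  assert(argc >= 0);
  const std::string prefix = key + "=";
  int result = fallback;
  for (int i = 1; i < argc; ++i) {  // start at 1: argv[0] is the program path
    const std::string token(argv[i]);
    // The token must begin with "key=", so "nx=" inside a longer string is ignored.
    if (token.compare(0, prefix.size(), prefix) == 0) {
      result = std::stoi(token.substr(prefix.size()));
    }
  }
  return result;
}
```

With this approach, a program path such as /scratch/nx=5/miniFE.x has no effect on the parsed value, because the path is never inspected and partial matches inside other tokens are rejected.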
From the commit history, I can tell that one major difference between the openmp-opt and openmp versions is the use of MPI_THREAD_MULTIPLE in openmp-opt. Are there any other differences between the two versions? If so, is there a range of commits I can look at to find these differences?
Could you provide some tips on how to run this project in the CLion IDE?
The analytic solution defined in fem/analytic_soln.hpp corresponds to a case where the internal heat source term Q is zero. In the miniapp code, however, the source term is set to 1 (Hex8::sourceVector: Scalar Q = 1.0;). Consequently, model error does not decrease as resolution is increased (e.g., absolute error at (0.5, 0.5, 0.5) is always between 0.05 and 0.06).
Fig. Comparison of analytic and numerical solutions on z==0.5 plane.