meteoswiss-apn / comm_overlap_bench
Communication/Computation Overlap Benchmark
License: BSD 2-Clause "Simplified" License
@mmxcscs is trying to use MUST (an MPI runtime checker) on COSMO (only the non-GPU build, compiled with GNU for now). MUST detects a conflict between active buffers in MPI_Irecv from mpe_io2.f90.
Unfortunately, COSMO crashes with the following error:
At line 2178 of file /scratch-shared/meteoswiss/scratch/maximem/mpi1sided/
cosmo-pompa_debug/cosmo/src/mpe_io2.f90
Fortran runtime error: Index '999999' of dimension 1 of array
'write_buffer%completed' above upper bound of 7
Both the crash and the MPI_Irecv issue seem to be related. Have you ever seen
this crash before? It could also be a glitch of the tool.
On Kesch:
In the past we always observed good results with G2G=2 (using the GDR driver), but we could never put that into production due to the bug with the CCE environment.
The bug will presumably be fixed with the upgrade of Kesch, so we would like to switch to G2G=2.
However, the recent experiments of @pspoerri
https://docs.google.com/spreadsheets/d/1xmL4-qsGpeDdb7qi-85fq2sl-7dCdECztR9Y-YaOX04/edit#gid=0
show a large performance degradation once we cross the QPI (i.e. for jobs using the full node).
Trying to explain #6 (comment)
module load craype-haswell
module load craype-network-infiniband
module load MVAPICH2/2.2-GCC-4.9.3-binutils-2.25
Currently Loaded Modulefiles:
1) craype-haswell 3) binutils/.2.25 5) cudatoolkit/7.0.28
2) craype-network-infiniband 4) GCC/4.9.3-binutils-2.25 6) MVAPICH2/2.2-GCC-4.9.3-binutils-2.25
export MV2_USE_CUDA=1
srun -p debug --gres=gpu:1 -N1 -n1 --ntasks-per-node=1 -t15 \
nvprof --analysis-metrics -o my.nv \
./GNU4.9.3_MVAPICH22.2_CUDAV7.0.27.keschln-0002+setdev
module load craype-haswell
module load craype-network-infiniband
module load MVAPICH2/2.2-GCC-4.9.3-binutils-2.25-cuda75
Currently Loaded Modulefiles:
1) craype-haswell 3) binutils/.2.25 5) cudatoolkit/7.5.18
2) craype-network-infiniband 4) GCC/4.9.3-binutils-2.25 6) MVAPICH2/2.2-GCC-4.9.3-binutils-2.25-cuda75
export MV2_USE_CUDA=1
srun -p debug --gres=gpu:1 -N1 -n1 --ntasks-per-node=1 -t15 \
nvprof --analysis-metrics -o my.nv \
./GNU4.9.3_MVAPICH22.2_CUDAV7.5.17.keschln-0001+setdev
Hi,
This is my first attempt to run StandaloneStencilsCUDA.
export G2G=2
srun -p debug --gres=gpu:2 -n2 -t10 ./StandaloneStencilsCUDA
Unfortunately, I get a segfault at the very end of the job (in MPI_Finalize):
Domain : [128,128,60]
Sync? : 0
NoComm? : 0
NoComp? : 0
Number Halo Exchanges : 2
Number benchmark repetitions : 1000
In Order halo exchanges? : 0
Device ID :0
SLURM_PROCID :0
Compiled for mvapich2
ELAPSED TIME : ...
[keschcn-0001:mpi_rank_0][error_sighandler]
Caught error: Segmentation fault (signal 11)
ELAPSED TIME : 3.1743 +- + 0.00574198
@pspoerri do you get the same segfault? If not, could you point to two successful jobs (one with and one without asynchronous mode)?
I don't get the segfault with cuda/75:
ELAPSED TIME : 8.4395 +- + 0.00154042
module load daint-gpu
module load Score-P/3.0-CrayGNU-2016.11-cuda-8.0.54
Currently Loaded Modulefiles:
modules/3.2.10.5
eswrap/2.0.11-2.2
cray-mpich/7.5.0
slurm/16.05.8-1
ddt/7.0
xalt/daint-2016.11
daint-gpu
gcc/5.3.0
craype-haswell
craype-network-aries
craype/2.5.8
cray-libsci/16.11.1
udreg/2.3.2-4.14
ugni/6.0.13-2.8
pmi/5.0.10-1.0000.11050.0.0.ari
dmapp/7.1.0-16.18
gni-headers/5.0.7-4.11
xpmem/2.0.3_geb8008a-2.11
job/2.0.2_g98a4850-2.43
dvs/2.5_2.0.70_g1ddb68c-2.144
alps/6.2.5-20.1
rca/2.0.10_g66b76b7-2.51
atp/2.0.4
PrgEnv-gnu/6.0.3
CrayGNU/2016.11
libiberty/.5.3.0-scorep
libunwind/.1.1-scorep
Cube/.4.3.4
OPARI2/.2.0.1
SIONlib/.1.7.1
PDT/.3.23
papi/5.5.0.2
vampir/9.2.0
cray-libsci_acc/16.11.1
cudatoolkit/8.0.54_2.2.8_ga620558-2.1
craype-accel-nvidia60
Score-P/3.0-CrayGNU-2016.11-cuda-8.0.54
make PREP= CXX=CC \
ARCH=-arch=sm_60 \
MVAPICH2GDR_VERSION=$CRAY_MPICH_DIR \
BOOST_VERSION=$BB \
LIBMPI=
export LD_LIBRARY_PATH=$BB/lib:$LD_LIBRARY_PATH
export CUDA_AUTOBOOST=1
export GCLOCK=875
export MPICH_RDMA_ENABLED_CUDA=1
ELAPSED TIME : 6.78014 +- + 0.000217501
*** Error in `GNU5.3.0-CUDAV8.0.53-Stencils': free(): invalid next size (fast): 0x0000000001605220 ***
Do you get the same ?
module load daint-gpu
module load Score-P/3.0-CrayGNU-2016.11
That was built without CUDA; I am building a new version now.
CUDA_SEPARABLE_COMPILATION
will create additional objects that can confuse the Score-P wrapper scripts. By default, Score-P assumes that only one CUDA object file is created, and it links the NVIDIA Score-P file against this target. With separable compilation, CMake creates a separate object file for each CUDA file. This confuses the linker, because the CUDA Score-P function names then occur multiple times.
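If the project exposes separable compilation as a CMake cache variable (an assumption; check the project's CMakeLists.txt for the actual variable name), one possible workaround is to switch it off when configuring a Score-P build:

```shell
# Hypothetical workaround sketch: override the separable-compilation switch
# at configure time so that nvcc emits a single object per target again,
# which is the layout the Score-P wrapper scripts expect.
cmake -DCUDA_SEPARABLE_COMPILATION=OFF ..
make
```

This trades relocatable device code for a build layout that the wrapper can handle, so it only applies if the kernels do not rely on cross-file device linking.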
git clone https://github.com/MeteoSwiss-APN/comm_overlap_bench.git
git checkout barebones
salloc -p normal -t90 --res=update -w keschcn-0002
ssh keschcn-0002
build_kesch_cuda80.sh
module purge
module load craype-haswell craype-network-infiniband
module load Score-P/3.0-gmvapich2-17.02_cuda_7.5_gdr
Currently Loaded Modulefiles:
craype-haswell
GCC/4.9.3-binutils-2.25
cudatoolkit/7.5.18
mvapich2gdr_gnu/2.2_cuda_7.5
craype-network-infiniband
...
cd /apps/common/UES/sandbox/jgp/comm_overlap_bench.git/GNU493CUDA75gdr/
make PREP= -f Makefilejg
mpicxx -I/global/opt/nvidia/cudatoolkit/7.5.18/include -I/opt/mvapich2/gdr/2.2/cuda7.5/gnu/include -I/apps/escha/UES/RH6.7/easybuild/software/Boost/1.49.0-gmvolf-15.11-Python-2.7.10/include -I. -DENABLE_CUDA_STREAMS -DMVAPICH2 -g -w -fexceptions -fstack-protector -std=gnu++11 -c main.cpp # -o main.o
mpicxx -I/global/opt/nvidia/cudatoolkit/7.5.18/include -I/opt/mvapich2/gdr/2.2/cuda7.5/gnu/include -I/apps/escha/UES/RH6.7/easybuild/software/Boost/1.49.0-gmvolf-15.11-Python-2.7.10/include -I. -DENABLE_CUDA_STREAMS -DMVAPICH2 -g -w -fexceptions -fstack-protector -std=gnu++11 -c Repository.cpp # -o Repository.o
mpicxx -I/global/opt/nvidia/cudatoolkit/7.5.18/include -I/opt/mvapich2/gdr/2.2/cuda7.5/gnu/include -I/apps/escha/UES/RH6.7/easybuild/software/Boost/1.49.0-gmvolf-15.11-Python-2.7.10/include -I. -DENABLE_CUDA_STREAMS -DMVAPICH2 -g -w -fexceptions -fstack-protector -std=gnu++11 -c Options.cpp # -o Options.o
nvcc -I/global/opt/nvidia/cudatoolkit/7.5.18/include -I/opt/mvapich2/gdr/2.2/cuda7.5/gnu/include -I/apps/escha/UES/RH6.7/easybuild/software/Boost/1.49.0-gmvolf-15.11-Python-2.7.10/include -I. -I/global/opt/nvidia/cudatoolkit/7.5.18/include -I/opt/mvapich2/gdr/2.2/cuda7.5/gnu/include -I/apps/escha/UES/RH6.7/easybuild/software/Boost/1.49.0-gmvolf-15.11-Python-2.7.10/include -I. -ccbin g++ -m64 -DMVAPICH2 -DENABLE_CUDA_STREAMS -DNVCC -Xcompiler ,\"-g\" -arch=sm_37 -std=c++11 -c Kernel.cu # -o Kernel.o
nvcc -I/global/opt/nvidia/cudatoolkit/7.5.18/include -I/opt/mvapich2/gdr/2.2/cuda7.5/gnu/include -I/apps/escha/UES/RH6.7/easybuild/software/Boost/1.49.0-gmvolf-15.11-Python-2.7.10/include -I. -I/global/opt/nvidia/cudatoolkit/7.5.18/include -I/opt/mvapich2/gdr/2.2/cuda7.5/gnu/include -I/apps/escha/UES/RH6.7/easybuild/software/Boost/1.49.0-gmvolf-15.11-Python-2.7.10/include -I. -ccbin g++ -m64 -DMVAPICH2 -DENABLE_CUDA_STREAMS -DNVCC -Xcompiler ,\"-g\" -arch=sm_37 -std=c++11 -c HorizontalDiffusionSA.cu # -o HorizontalDiffusionSA.o
mpicxx -g -std=gnu++11 -Wl,--enable-new-dtags -L/opt/mvapich2/gdr/2.2/cuda7.5/gnu/lib64 -L/apps/escha/UES/RH6.7/easybuild/software/Boost/1.49.0-gmvolf-15.11-Python-2.7.10/lib -L/global/opt/nvidia/cudatoolkit/7.5.18/lib64
-lmpi -lmpicxx -lcudart -lcuda
/apps/escha/UES/RH6.7/easybuild/software/Boost/1.49.0-gmvolf-15.11-Python-2.7.10/lib/libboost_system.so /apps/escha/UES/RH6.7/easybuild/software/Boost/1.49.0-gmvolf-15.11-Python-2.7.10/lib/libboost_timer.so /apps/escha/UES/RH6.7/easybuild/software/Boost/1.49.0-gmvolf-15.11-Python-2.7.10/lib/libboost_chrono.so main.o Repository.o Options.o Kernel.o HorizontalDiffusionSA.o -o GNU4.9.3-CUDAV7.5.17-Stencils
export G2G=2
export CUDA_AUTOBOOST=1
export GCLOCK=875
export LD_LIBRARY_PATH=/apps/escha/UES/RH6.7/easybuild/software/Boost/1.49.0-gmvolf-15.11-Python-2.7.10/lib:$LD_LIBRARY_PATH
module load cudatoolkit/7.5.18 GCC/4.9.3-binutils-2.25 # ...
srun -p debug --gres=gpu:2 -N1 -n2 --ntasks-per-node=2 -t10 ./GNU4.9.3-CUDAV7.5.17-Stencils
CONFIGURATION
====================================
Domain : [128,128,60]
Sync? : 0
NoComm? : 0
NoComp? : 0
Number Halo Exchanges : 2
Number benchmark repetitions : 1000
In Order halo exchanges? : 0
Device ID :1
SLURM_PROCID :1
Compiled for mvapich2
ELAPSED TIME : 7.82283 +- + 0.00356291
MV2_USE_GPUDIRECT=1
CUDA_AUTOBOOST=1
GCLOCK=875
G2G=2
Linux keschcn-0008 2.6.32-573.18.1.el6.x86_64
MV2_CUDA_IPC=1
MV2_ENABLE_AFFINITY=0
MV2_GPUDIRECT_GDRCOPY_LIB=/apps/escha/gdrcopy/20170131/libgdrapi.so
MV2_USE_CUDA=1
ssh kesch
# salloc -p normal -t15 --res=update -w keschcn-0002
ssh keschcn-0002
git clone https://github.com/MeteoSwiss-APN/comm_overlap_bench.git comm_overlap_bench.git
git checkout barebones
cd comm_overlap_bench.git/
./build_kesch_cuda80.sh
Currently Loaded Modulefiles:
craype-haswell
craype-network-infiniband
binutils/.2.25
GCC/4.9.3-binutils-2.25
cudatoolkit/.8.0.44
mvapich2gdr_gnu/.2.2_cuda_8.0
MVAPICH2/2.2-GCC-4.9.3-binutils-2.25_cuda_8.0_gdr
gmvapich2/17.01_cuda_8.0_gdr
/opt/mvapich2/gdr/2.2/cuda8.0/gnu/bin/mpicxx
-O3 -DNDEBUG -lpthread
CMakeFiles/comm_overlap_benchmark.dir/main.cpp.o
CMakeFiles/comm_overlap_benchmark.dir/MPIHelper.cpp.o
CMakeFiles/comm_overlap_benchmark.dir/Repository.cpp.o
CMakeFiles/comm_overlap_benchmark.dir/Options.cpp.o
-o comm_overlap_benchmark
-rdynamic libStandaloneStencils.a
-lmpicxx -lmpi <--------------
/global/opt/nvidia/cudatoolkit/8.0.44/lib64/libcudart_static.a
-lpthread -ldl -lrt
/opt/mvapich2/gdr/2.2/cuda8.0/gnu/bin/mpicxx
-O3 -DNDEBUG -lpthread
CMakeFiles/comm_overlap_benchmark_cc.dir/main.cpp.o
CMakeFiles/comm_overlap_benchmark_cc.dir/MPIHelper.cpp.o
CMakeFiles/comm_overlap_benchmark_cc.dir/Repository.cpp.o
CMakeFiles/comm_overlap_benchmark_cc.dir/Options.cpp.o
comm_overlap_benchmark_cc_generated_HorizontalDiffusionSA.cu.o
comm_overlap_benchmark_cc_generated_Kernel.cu.o
-o comm_overlap_benchmark_cc
-rdynamic /global/opt/nvidia/cudatoolkit/8.0.44/lib64/libcudart_static.a
-lpthread -ldl -lrt
-lmpicxx -lmpi <--------------
module rm Cube
export G2G=2
export CUDA_AUTOBOOST=1
export GCLOCK=875
/usr/bin/time -p \
srun -p debug --gres=gpu:2 \
-N1 -n2 --ntasks-per-node=2 \
--distribution=block:block --cpu_bind=q -t5 \
--reservation=update -w keschcn-0002 \
./comm_overlap_benchmark # or ./comm_overlap_benchmark_cc
StandaloneStencilsCUDA
MV2_COMM_WORLD_LOCAL_RANK: 0, 1,
CUDA_VISIBLE_DEVICES: [0: 0,1] [1: 0,1]
Configured CUDA Devices: [0: 0] [1: 1]
SLURM_JOBID: 1471552
SLURM_PROCID: 0, 1,
SLURM_CPU_BIND: [0: quiet,mask_cpu:0x000002,0x000004] [1: quiet,mask_cpu:0x000002,0x000004]
SLURMD_NODENAME: keschcn-0002, keschcn-0002,
CONFIGURATION
====================================
Domain : [128,128,60]
Sync? : 0
NoComm? : 0
NoComp? : 0
Number Halo Exchanges : 2
Number benchmark repetitions : 1000
In Order halo exchanges? : 0
Compiled for mvapich2
Dimensions: [2, 1]
Neighbors: [0: 1,0,1,0] [1: 0,1,0,1]
Timers disabled: Enable by compiling with ENABLE_TIMER
Timers disabled: Enable by compiling with ENABLE_TIMER
real 10.31
[keschcn-0002:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
[keschcn-0002:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
srun: error: keschcn-0002: task 1: Segmentation fault (core dumped)
#0 0x00002aeba473395b in memcpy () from /lib64/libc.so.6
#1 0x00002aeba3acdfdc in MPIDI_CH3U_Buffer_copy ()
from /opt/mvapich2/gdr/2.2/cuda8.0/gnu/lib64/libmpi.so.12
#2 0x00002aeba3afdaaf in MPIDI_Isend_self ()
from /opt/mvapich2/gdr/2.2/cuda8.0/gnu/lib64/libmpi.so.12
#3 0x00002aeba3af6c80 in MPID_Isend ()
from /opt/mvapich2/gdr/2.2/cuda8.0/gnu/lib64/libmpi.so.12
#4 0x00002aeba3a80915 in PMPI_Isend ()
from /opt/mvapich2/gdr/2.2/cuda8.0/gnu/lib64/libmpi.so.12
export MV2_USE_CUDA=1
srun -p debug --gres=gpu:1 -N1 -n1 --ntasks-per-node=1 -t5 \
./GNU4.9.3_MVAPICH22.2_CUDAV7.0.27.keschln-0001+setdev+not -n 1000
CUDA_VISIBLE_DEVICES: [0: 0]
Configured CUDA Devices: [0: 0]
SLURM_JOBID: 1621679
SLURM_PROCID: 0,
SLURM_CPU_BIND: [0: quiet,mask_cpu:0x000001]
SLURMD_NODENAME: keschcn-0012,
MV2_USE_CUDA: 1
ELAPSED TIME (seconds) for rank=0 3.130876
Currently Loaded Modulefiles:
craype-haswell
craype-network-infiniband
binutils/.2.25
GCC/4.9.3-binutils-2.25
cudatoolkit/7.5.18
MVAPICH2/2.2-GCC-4.9.3-binutils-2.25-cuda75
gmvapich2/17.02_cuda_7.5
zlib/.1.2.8
libunwind/.1.1
binutils/.2.26-scorep
libunwind/.1.1-scorep
Qt/.4.8.6
Cube/.4.3.3
papi/5.4.3.2
vampir/9.2.0
slurm/16.05.10-2
Score-P/3.0-gmvapich2-17.02_cuda_7.5
export MV2_USE_CUDA=1
srun -p debug --gres=gpu:1 -N1 -n1 --ntasks-per-node=1 -t5 \
./GNU4.9.3_MVAPICH22.2_CUDAV7.5.17.keschln-0001+setdev+not -n 1000
CUDA_VISIBLE_DEVICES: [0: 0]
Configured CUDA Devices: [0: 0]
SLURM_JOBID: 1621637
SLURM_PROCID: 0,
SLURM_CPU_BIND: [0: quiet,mask_cpu:0x000001]
SLURMD_NODENAME: keschcn-0002,
MV2_USE_CUDA: 1
ELAPSED TIME (seconds) for rank=0 7.186971
export SCOREP_WRAPPER=on
export SCOREP_TOTAL_MEMORY=1G
export SCOREP_ENABLE_TRACING=true
export SCOREP_CUDA_ENABLE=1
CUDA_VISIBLE_DEVICES: [0: 0]
Configured CUDA Devices: [0: 0]
SLURM_JOBID: 1621683
SLURM_PROCID: 0,
SLURM_CPU_BIND: [0: quiet,mask_cpu:0x000001]
SLURMD_NODENAME: keschcn-0004,
MV2_USE_CUDA: 1
ELAPSED TIME (seconds) for rank=0 4.114013
CUDA_VISIBLE_DEVICES: [0: 0]
Configured CUDA Devices: [0: 0]
SLURM_JOBID: 1621686
SLURM_PROCID: 0,
SLURM_CPU_BIND: [0: quiet,mask_cpu:0x000001]
SLURMD_NODENAME: keschcn-0005,
MV2_USE_CUDA: 1
ELAPSED TIME (seconds) for rank=0 10.737817
module use /apps/escha/UES/easybuild/modulefiles
module load Score-P/3.0-gmvapich2-17.02_cuda_7.5_gdr
Currently Loaded Modulefiles:
craype-haswell
binutils/.2.25
GCC/4.9.3-binutils-2.25
cudatoolkit/7.5.18 <--------------
mvapich2gdr_gnu/2.2_cuda_7.5 <--------------
MVAPICH2/2.2_cuda_7.5_gdr
craype-network-infiniband
gmvapich2/17.02_cuda_7.5_gdr
zlib/.1.2.8
libunwind/.1.1
binutils/.2.26-scorep
libunwind/.1.1-scorep
Qt/.4.8.6
Cube/.4.3.3
papi/5.4.3.2
vampir/9.2.0
Let me know if it fails.
The Score-P instrumenter takes care of all the necessary instrumentation
of user and MPI functions.
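For reference, a minimal way to apply the instrumenter by hand is to prefix the compile and link commands with `scorep` (a sketch; the `--cuda` and `--mpp=mpi` flags are usually auto-detected and are shown here only for explicitness):

```shell
# The scorep command wraps the compiler and inserts the measurement hooks
# for user functions, MPI calls and (with --cuda) CUDA activity.
scorep --cuda --mpp=mpi mpicxx -std=gnu++11 -c main.cpp
scorep --cuda --mpp=mpi mpicxx main.o -o benchmark -lcudart
```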
Would it be possible to compile without Boost?
#ifndef _NOBOOST
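A minimal sketch of how such a guard could be driven from the build (the `_NOBOOST` name matches the guard above; the guard itself would still need to wrap every Boost include and the Boost timer calls, e.g. with a std::chrono fallback):

```shell
# Sketch: compile with the Boost code paths excluded. Assumes the sources
# guard all Boost usage with #ifndef _NOBOOST, as hinted above.
mpicxx -D_NOBOOST -std=gnu++11 -c main.cpp
```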