meteoswiss-apn / comm_overlap_bench
Communication/Computation Overlap Benchmark
License: BSD 2-Clause "Simplified" License
@mmxcscs is trying to use MUST (an MPI runtime checker) on COSMO (only the non-GPU build, compiled with GNU for now). MUST detects a conflict between active buffers in MPI_Irecv from mpe_io2.f90.
Unfortunately, COSMO crashes with the following error:
At line 2178 of file /scratch-shared/meteoswiss/scratch/maximem/mpi1sided/
cosmo-pompa_debug/cosmo/src/mpe_io2.f90
Fortran runtime error: Index '999999' of dimension 1 of array
'write_buffer%completed' above upper bound of 7
Both the crash and the MPI_Irecv issue seem to be related. Have you ever seen
this crash before? It could also be a glitch of the tool.
On Kesch:
In the past we always observed good results with G2G=2 (using the GDR driver), but we could never put that into production due to the bug with the CCE environment.
The bug will presumably be fixed with the upgrade of Kesch, so we would like to switch to G2G=2.
However, the recent experiments of @pspoerri
https://docs.google.com/spreadsheets/d/1xmL4-qsGpeDdb7qi-85fq2sl-7dCdECztR9Y-YaOX04/edit#gid=0
show a large performance degradation once we cross the QPI (i.e. for jobs using the full node).
Trying to explain #6 (comment)
module load craype-haswell
module load craype-network-infiniband
module load MVAPICH2/2.2-GCC-4.9.3-binutils-2.25
Currently Loaded Modulefiles:
1) craype-haswell 3) binutils/.2.25 5) cudatoolkit/7.0.28
2) craype-network-infiniband 4) GCC/4.9.3-binutils-2.25 6) MVAPICH2/2.2-GCC-4.9.3-binutils-2.25
export MV2_USE_CUDA=1
srun -p debug --gres=gpu:1 -N1 -n1 --ntasks-per-node=1 -t15 \
nvprof --analysis-metrics -o my.nv \
./GNU4.9.3_MVAPICH22.2_CUDAV7.0.27.keschln-0002+setdev
module load craype-haswell
module load craype-network-infiniband
module load MVAPICH2/2.2-GCC-4.9.3-binutils-2.25-cuda75
Currently Loaded Modulefiles:
1) craype-haswell 3) binutils/.2.25 5) cudatoolkit/7.5.18
2) craype-network-infiniband 4) GCC/4.9.3-binutils-2.25 6) MVAPICH2/2.2-GCC-4.9.3-binutils-2.25-cuda75
export MV2_USE_CUDA=1
srun -p debug --gres=gpu:1 -N1 -n1 --ntasks-per-node=1 -t15 \
nvprof --analysis-metrics -o my.nv \
./GNU4.9.3_MVAPICH22.2_CUDAV7.5.17.keschln-0001+setdev
Hi,
This is my first attempt to run StandaloneStencilsCUDA.
export G2G=2
srun -p debug --gres=gpu:2 -n2 -t10 ./StandaloneStencilsCUDA
Unfortunately, I get a segfault at the very end of the job (in MPI_Finalize):
Domain : [128,128,60]
Sync? : 0
NoComm? : 0
NoComp? : 0
Number Halo Exchanges : 2
Number benchmark repetitions : 1000
In Order halo exchanges? : 0
Device ID :0
SLURM_PROCID :0
Compiled for mvapich2
ELAPSED TIME : ...
[keschcn-0001:mpi_rank_0][error_sighandler]
Caught error: Segmentation fault (signal 11)
ELAPSED TIME : 3.1743 +- + 0.00574198
@pspoerri do you get the same segfault? If not, could you point to two successful jobs (one with and one without asynchronous mode)?
I don't get the segfault with cuda/75:
ELAPSED TIME : 8.4395 +- + 0.00154042
module load daint-gpu
module load Score-P/3.0-CrayGNU-2016.11-cuda-8.0.54
Currently Loaded Modulefiles:
modules/3.2.10.5
eswrap/2.0.11-2.2
cray-mpich/7.5.0
slurm/16.05.8-1
ddt/7.0
xalt/daint-2016.11
daint-gpu
gcc/5.3.0
craype-haswell
craype-network-aries
craype/2.5.8
cray-libsci/16.11.1
udreg/2.3.2-4.14
ugni/6.0.13-2.8
pmi/5.0.10-1.0000.11050.0.0.ari
dmapp/7.1.0-16.18
gni-headers/5.0.7-4.11
xpmem/2.0.3_geb8008a-2.11
job/2.0.2_g98a4850-2.43
dvs/2.5_2.0.70_g1ddb68c-2.144
alps/6.2.5-20.1
rca/2.0.10_g66b76b7-2.51
atp/2.0.4
PrgEnv-gnu/6.0.3
CrayGNU/2016.11
libiberty/.5.3.0-scorep
libunwind/.1.1-scorep
Cube/.4.3.4
OPARI2/.2.0.1
SIONlib/.1.7.1
PDT/.3.23
papi/5.5.0.2
vampir/9.2.0
cray-libsci_acc/16.11.1
cudatoolkit/8.0.54_2.2.8_ga620558-2.1
craype-accel-nvidia60
Score-P/3.0-CrayGNU-2016.11-cuda-8.0.54
make PREP= CXX=CC \
ARCH=-arch=sm_60 \
MVAPICH2GDR_VERSION=$CRAY_MPICH_DIR \
BOOST_VERSION=$BB \
LIBMPI=
export LD_LIBRARY_PATH=$BB/lib:$LD_LIBRARY_PATH
export CUDA_AUTOBOOST=1
export GCLOCK=875
export MPICH_RDMA_ENABLED_CUDA=1
ELAPSED TIME : 6.78014 +- + 0.000217501
*** Error in `GNU5.3.0-CUDAV8.0.53-Stencils': free(): invalid next size (fast): 0x0000000001605220 ***
Do you get the same ?
module load daint-gpu
module load Score-P/3.0-CrayGNU-2016.11
That was built without CUDA; I am building a new version now.
CUDA_SEPARABLE_COMPILATION
will create additional objects that can confuse the Score-P wrapper scripts. By default, Score-P assumes that only one CUDA object file is created, and it links the NVIDIA Score-P file against this target. With separable compilation, CMake creates a separate object file for each CUDA file. This confuses the linker, because the CUDA Score-P function names then occur multiple times.
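If the project exposes separable compilation as a CMake cache variable (an assumption; check the project's CMakeLists.txt for the actual variable name), one possible workaround is to switch it off when configuring a Score-P build:

```shell
# Hypothetical workaround sketch: override the separable-compilation switch
# at configure time so that nvcc emits a single object per target again,
# which is the layout the Score-P wrapper scripts expect.
cmake -DCUDA_SEPARABLE_COMPILATION=OFF ..
make
```

This trades relocatable device code for a build layout that the wrapper can handle, so it only applies if the kernels do not rely on cross-file device linking.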
git clone https://github.com/MeteoSwiss-APN/comm_overlap_bench.git
git checkout barebones
salloc -p normal -t90 --res=update -w keschcn-0002
ssh keschcn-0002
build_kesch_cuda80.sh
module purge
module load craype-haswell craype-network-infiniband
module load Score-P/3.0-gmvapich2-17.02_cuda_7.5_gdr
Currently Loaded Modulefiles:
craype-haswell
GCC/4.9.3-binutils-2.25
cudatoolkit/7.5.18
mvapich2gdr_gnu/2.2_cuda_7.5
craype-network-infiniband
...
cd /apps/common/UES/sandbox/jgp/comm_overlap_bench.git/GNU493CUDA75gdr/
make PREP= -f Makefilejg
mpicxx -I/global/opt/nvidia/cudatoolkit/7.5.18/include -I/opt/mvapich2/gdr/2.2/cuda7.5/gnu/include -I/apps/escha/UES/RH6.7/easybuild/software/Boost/1.49.0-gmvolf-15.11-Python-2.7.10/include -I. -DENABLE_CUDA_STREAMS -DMVAPICH2 -g -w -fexceptions -fstack-protector -std=gnu++11 -c main.cpp # -o main.o
mpicxx -I/global/opt/nvidia/cudatoolkit/7.5.18/include -I/opt/mvapich2/gdr/2.2/cuda7.5/gnu/include -I/apps/escha/UES/RH6.7/easybuild/software/Boost/1.49.0-gmvolf-15.11-Python-2.7.10/include -I. -DENABLE_CUDA_STREAMS -DMVAPICH2 -g -w -fexceptions -fstack-protector -std=gnu++11 -c Repository.cpp # -o Repository.o
mpicxx -I/global/opt/nvidia/cudatoolkit/7.5.18/include -I/opt/mvapich2/gdr/2.2/cuda7.5/gnu/include -I/apps/escha/UES/RH6.7/easybuild/software/Boost/1.49.0-gmvolf-15.11-Python-2.7.10/include -I. -DENABLE_CUDA_STREAMS -DMVAPICH2 -g -w -fexceptions -fstack-protector -std=gnu++11 -c Options.cpp # -o Options.o
nvcc -I/global/opt/nvidia/cudatoolkit/7.5.18/include -I/opt/mvapich2/gdr/2.2/cuda7.5/gnu/include -I/apps/escha/UES/RH6.7/easybuild/software/Boost/1.49.0-gmvolf-15.11-Python-2.7.10/include -I. -I/global/opt/nvidia/cudatoolkit/7.5.18/include -I/opt/mvapich2/gdr/2.2/cuda7.5/gnu/include -I/apps/escha/UES/RH6.7/easybuild/software/Boost/1.49.0-gmvolf-15.11-Python-2.7.10/include -I. -ccbin g++ -m64 -DMVAPICH2 -DENABLE_CUDA_STREAMS -DNVCC -Xcompiler ,\"-g\" -arch=sm_37 -std=c++11 -c Kernel.cu # -o Kernel.o
nvcc -I/global/opt/nvidia/cudatoolkit/7.5.18/include -I/opt/mvapich2/gdr/2.2/cuda7.5/gnu/include -I/apps/escha/UES/RH6.7/easybuild/software/Boost/1.49.0-gmvolf-15.11-Python-2.7.10/include -I. -I/global/opt/nvidia/cudatoolkit/7.5.18/include -I/opt/mvapich2/gdr/2.2/cuda7.5/gnu/include -I/apps/escha/UES/RH6.7/easybuild/software/Boost/1.49.0-gmvolf-15.11-Python-2.7.10/include -I. -ccbin g++ -m64 -DMVAPICH2 -DENABLE_CUDA_STREAMS -DNVCC -Xcompiler ,\"-g\" -arch=sm_37 -std=c++11 -c HorizontalDiffusionSA.cu # -o HorizontalDiffusionSA.o
mpicxx -g -std=gnu++11 -Wl,--enable-new-dtags -L/opt/mvapich2/gdr/2.2/cuda7.5/gnu/lib64 -L/apps/escha/UES/RH6.7/easybuild/software/Boost/1.49.0-gmvolf-15.11-Python-2.7.10/lib -L/global/opt/nvidia/cudatoolkit/7.5.18/lib64
-lmpi -lmpicxx -lcudart -lcuda
/apps/escha/UES/RH6.7/easybuild/software/Boost/1.49.0-gmvolf-15.11-Python-2.7.10/lib/libboost_system.so /apps/escha/UES/RH6.7/easybuild/software/Boost/1.49.0-gmvolf-15.11-Python-2.7.10/lib/libboost_timer.so /apps/escha/UES/RH6.7/easybuild/software/Boost/1.49.0-gmvolf-15.11-Python-2.7.10/lib/libboost_chrono.so main.o Repository.o Options.o Kernel.o HorizontalDiffusionSA.o -o GNU4.9.3-CUDAV7.5.17-Stencils
export G2G=2
export CUDA_AUTOBOOST=1
export GCLOCK=875
export LD_LIBRARY_PATH=/apps/escha/UES/RH6.7/easybuild/software/Boost/1.49.0-gmvolf-15.11-Python-2.7.10/lib:$LD_LIBRARY_PATH
module load cudatoolkit/7.5.18 GCC/4.9.3-binutils-2.25 # ...
srun -p debug --gres=gpu:2 -N1 -n2 --ntasks-per-node=2 -t10 ./GNU4.9.3-CUDAV7.5.17-Stencils
CONFIGURATION
====================================
Domain : [128,128,60]
Sync? : 0
NoComm? : 0
NoComp? : 0
Number Halo Exchanges : 2
Number benchmark repetitions : 1000
In Order halo exchanges? : 0
Device ID :1
SLURM_PROCID :1
Compiled for mvapich2
ELAPSED TIME : 7.82283 +- + 0.00356291
MV2_USE_GPUDIRECT=1
CUDA_AUTOBOOST=1
GCLOCK=875
G2G=2
Linux keschcn-0008 2.6.32-573.18.1.el6.x86_64
MV2_CUDA_IPC=1
MV2_ENABLE_AFFINITY=0
MV2_GPUDIRECT_GDRCOPY_LIB=/apps/escha/gdrcopy/20170131/libgdrapi.so
MV2_USE_CUDA=1
ssh kesch
# salloc -p normal -t15 --res=update -w keschcn-0002
ssh keschcn-0002
git clone https://github.com/MeteoSwiss-APN/comm_overlap_bench.git comm_overlap_bench.git
git checkout barebones
cd comm_overlap_bench.git/
./build_kesch_cuda80.sh
Currently Loaded Modulefiles:
craype-haswell
craype-network-infiniband
binutils/.2.25
GCC/4.9.3-binutils-2.25
cudatoolkit/.8.0.44
mvapich2gdr_gnu/.2.2_cuda_8.0
MVAPICH2/2.2-GCC-4.9.3-binutils-2.25_cuda_8.0_gdr
gmvapich2/17.01_cuda_8.0_gdr
/opt/mvapich2/gdr/2.2/cuda8.0/gnu/bin/mpicxx
-O3 -DNDEBUG -lpthread
CMakeFiles/comm_overlap_benchmark.dir/main.cpp.o
CMakeFiles/comm_overlap_benchmark.dir/MPIHelper.cpp.o
CMakeFiles/comm_overlap_benchmark.dir/Repository.cpp.o
CMakeFiles/comm_overlap_benchmark.dir/Options.cpp.o
-o comm_overlap_benchmark
-rdynamic libStandaloneStencils.a
-lmpicxx -lmpi <--------------
/global/opt/nvidia/cudatoolkit/8.0.44/lib64/libcudart_static.a
-lpthread -ldl -lrt
/opt/mvapich2/gdr/2.2/cuda8.0/gnu/bin/mpicxx
-O3 -DNDEBUG -lpthread
CMakeFiles/comm_overlap_benchmark_cc.dir/main.cpp.o
CMakeFiles/comm_overlap_benchmark_cc.dir/MPIHelper.cpp.o
CMakeFiles/comm_overlap_benchmark_cc.dir/Repository.cpp.o
CMakeFiles/comm_overlap_benchmark_cc.dir/Options.cpp.o
comm_overlap_benchmark_cc_generated_HorizontalDiffusionSA.cu.o
comm_overlap_benchmark_cc_generated_Kernel.cu.o
-o comm_overlap_benchmark_cc
-rdynamic /global/opt/nvidia/cudatoolkit/8.0.44/lib64/libcudart_static.a
-lpthread -ldl -lrt
-lmpicxx -lmpi <--------------
module rm Cube
export G2G=2
export CUDA_AUTOBOOST=1
export GCLOCK=875
/usr/bin/time -p \
srun -p debug --gres=gpu:2 \
-N1 -n2 --ntasks-per-node=2 \
--distribution=block:block --cpu_bind=q -t5 \
--reservation=update -w keschcn-0002 \
./comm_overlap_benchmark # or ./comm_overlap_benchmark_cc
StandaloneStencilsCUDA
MV2_COMM_WORLD_LOCAL_RANK: 0, 1,
CUDA_VISIBLE_DEVICES: [0: 0,1] [1: 0,1]
Configured CUDA Devices: [0: 0] [1: 1]
SLURM_JOBID: 1471552
SLURM_PROCID: 0, 1,
SLURM_CPU_BIND: [0: quiet,mask_cpu:0x000002,0x000004] [1: quiet,mask_cpu:0x000002,0x000004]
SLURMD_NODENAME: keschcn-0002, keschcn-0002,
CONFIGURATION
====================================
Domain : [128,128,60]
Sync? : 0
NoComm? : 0
NoComp? : 0
Number Halo Exchanges : 2
Number benchmark repetitions : 1000
In Order halo exchanges? : 0
Compiled for mvapich2
Dimensions: [2, 1]
Neighbors: [0: 1,0,1,0] [1: 0,1,0,1]
Timers disabled: Enable by compiling with ENABLE_TIMER
Timers disabled: Enable by compiling with ENABLE_TIMER
real 10.31
[keschcn-0002:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
[keschcn-0002:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
srun: error: keschcn-0002: task 1: Segmentation fault (core dumped)
#0 0x00002aeba473395b in memcpy () from /lib64/libc.so.6
#1 0x00002aeba3acdfdc in MPIDI_CH3U_Buffer_copy ()
from /opt/mvapich2/gdr/2.2/cuda8.0/gnu/lib64/libmpi.so.12
#2 0x00002aeba3afdaaf in MPIDI_Isend_self ()
from /opt/mvapich2/gdr/2.2/cuda8.0/gnu/lib64/libmpi.so.12
#3 0x00002aeba3af6c80 in MPID_Isend ()
from /opt/mvapich2/gdr/2.2/cuda8.0/gnu/lib64/libmpi.so.12
#4 0x00002aeba3a80915 in PMPI_Isend ()
from /opt/mvapich2/gdr/2.2/cuda8.0/gnu/lib64/libmpi.so.12
export MV2_USE_CUDA=1
srun -p debug --gres=gpu:1 -N1 -n1 --ntasks-per-node=1 -t5 \
./GNU4.9.3_MVAPICH22.2_CUDAV7.0.27.keschln-0001+setdev+not -n 1000
CUDA_VISIBLE_DEVICES: [0: 0]
Configured CUDA Devices: [0: 0]
SLURM_JOBID: 1621679
SLURM_PROCID: 0,
SLURM_CPU_BIND: [0: quiet,mask_cpu:0x000001]
SLURMD_NODENAME: keschcn-0012,
MV2_USE_CUDA: 1
ELAPSED TIME (seconds) for rank=0 3.130876
Currently Loaded Modulefiles:
craype-haswell
craype-network-infiniband
binutils/.2.25
GCC/4.9.3-binutils-2.25
cudatoolkit/7.5.18
MVAPICH2/2.2-GCC-4.9.3-binutils-2.25-cuda75
gmvapich2/17.02_cuda_7.5
zlib/.1.2.8
libunwind/.1.1
binutils/.2.26-scorep
libunwind/.1.1-scorep
Qt/.4.8.6
Cube/.4.3.3
papi/5.4.3.2
vampir/9.2.0
slurm/16.05.10-2
Score-P/3.0-gmvapich2-17.02_cuda_7.5
export MV2_USE_CUDA=1
srun -p debug --gres=gpu:1 -N1 -n1 --ntasks-per-node=1 -t5 \
./GNU4.9.3_MVAPICH22.2_CUDAV7.5.17.keschln-0001+setdev+not -n 1000
CUDA_VISIBLE_DEVICES: [0: 0]
Configured CUDA Devices: [0: 0]
SLURM_JOBID: 1621637
SLURM_PROCID: 0,
SLURM_CPU_BIND: [0: quiet,mask_cpu:0x000001]
SLURMD_NODENAME: keschcn-0002,
MV2_USE_CUDA: 1
ELAPSED TIME (seconds) for rank=0 7.186971
export SCOREP_WRAPPER=on
export SCOREP_TOTAL_MEMORY=1G
export SCOREP_ENABLE_TRACING=true
export SCOREP_CUDA_ENABLE=1
CUDA_VISIBLE_DEVICES: [0: 0]
Configured CUDA Devices: [0: 0]
SLURM_JOBID: 1621683
SLURM_PROCID: 0,
SLURM_CPU_BIND: [0: quiet,mask_cpu:0x000001]
SLURMD_NODENAME: keschcn-0004,
MV2_USE_CUDA: 1
ELAPSED TIME (seconds) for rank=0 4.114013
CUDA_VISIBLE_DEVICES: [0: 0]
Configured CUDA Devices: [0: 0]
SLURM_JOBID: 1621686
SLURM_PROCID: 0,
SLURM_CPU_BIND: [0: quiet,mask_cpu:0x000001]
SLURMD_NODENAME: keschcn-0005,
MV2_USE_CUDA: 1
ELAPSED TIME (seconds) for rank=0 10.737817
module use /apps/escha/UES/easybuild/modulefiles
module load Score-P/3.0-gmvapich2-17.02_cuda_7.5_gdr
Currently Loaded Modulefiles:
craype-haswell
binutils/.2.25
GCC/4.9.3-binutils-2.25
cudatoolkit/7.5.18 <--------------
mvapich2gdr_gnu/2.2_cuda_7.5 <--------------
MVAPICH2/2.2_cuda_7.5_gdr
craype-network-infiniband
gmvapich2/17.02_cuda_7.5_gdr
zlib/.1.2.8
libunwind/.1.1
binutils/.2.26-scorep
libunwind/.1.1-scorep
Qt/.4.8.6
Cube/.4.3.3
papi/5.4.3.2
vampir/9.2.0
Let me know if it fails.
The Score-P instrumenter takes care of all the necessary instrumentation
of user and MPI functions.
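For reference, a minimal way to apply the instrumenter by hand is to prefix the compile and link commands with `scorep` (a sketch; the `--cuda` and `--mpp=mpi` flags are usually auto-detected and are shown here only for explicitness):

```shell
# The scorep command wraps the compiler and inserts the measurement hooks
# for user functions, MPI calls and (with --cuda) CUDA activity.
scorep --cuda --mpp=mpi mpicxx -std=gnu++11 -c main.cpp
scorep --cuda --mpp=mpi mpicxx main.o -o benchmark -lcudart
```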
Would it be possible to compile without Boost?
#ifndef _NOBOOST
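A minimal sketch of how such a guard could be driven from the build (the `_NOBOOST` name matches the guard above; the guard itself would still need to wrap every Boost include and the Boost timer calls, e.g. with a std::chrono fallback):

```shell
# Sketch: compile with the Boost code paths excluded. Assumes the sources
# guard all Boost usage with #ifndef _NOBOOST, as hinted above.
mpicxx -D_NOBOOST -std=gnu++11 -c main.cpp
```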