schnorr / akypuera

a library to trace mpi applications and generate paje trace files

License: GNU General Public License v3.0

C 59.58% CMake 12.96% Shell 1.48% Perl 24.46% Julia 1.52%

akypuera's People

Contributors

afarah1, ezibenroc, kassick, mquinson, schnorr

akypuera's Issues

Compilation of Akypuera for SMPI

I tried to use Akypuera with Simgrid version 3.20 and encountered several errors.

I used this command:

mkdir build && cd build && cmake -DSMPI=ON -DTHREADED=ON ..

At first, I got this error, despite a "standard" Simgrid installation:

-- SMPI installation was not found. Please provide SMPI_PATH:
--   - through the GUI when working with ccmake,
--   - as a command line argument when working with cmake e.g.
--     cmake .. -DSMPI_PATH:PATH=/usr/local/smpi

After applying this patch, the above command seemed to work correctly.

diff --git a/cmake/FindSMPI.cmake b/cmake/FindSMPI.cmake
index 4d590eb..0588a0f 100644
--- a/cmake/FindSMPI.cmake
+++ b/cmake/FindSMPI.cmake
@@ -13,7 +13,7 @@
 # Just set SMPI_PATH it to your specific installation directory
 #
 FIND_LIBRARY(SMPI_LIBRARY
-  NAMES smpi
+  NAMES simgrid
   PATHS /usr/lib /usr/local/lib ${SMPI_PATH}/lib)

 IF(SMPI_LIBRARY)

Then, when I try to compile Akypuera I get a bunch of errors. Here is a subset of them:

/tmp/akypuera/src/aky.c: In function ‘MPI_Type_indexed’:
/tmp/akypuera/src/aky.c:1369:12: error: argument ‘blocklens’ doesn’t match prototype
 const int *blocklens;
            ^~~~~~~~~
In file included from /usr/local/include/smpi/mpi.h:12:0,
                 from /tmp/akypuera/src/aky_private.h:21,
                 from /tmp/akypuera/src/aky.c:17:
/usr/local/include/smpi/smpi.h:431:1: error: prototype declaration
 MPI_CALL(XBT_PUBLIC int, MPI_Type_indexed,
 ^
/tmp/akypuera/src/aky.c:1370:12: error: argument ‘indices’ doesn’t match prototype
 const int *indices;
            ^~~~~~~
In file included from /usr/local/include/smpi/mpi.h:12:0,
                 from /tmp/akypuera/src/aky_private.h:21,
                 from /tmp/akypuera/src/aky.c:17:
/usr/local/include/smpi/smpi.h:431:1: error: prototype declaration
 MPI_CALL(XBT_PUBLIC int, MPI_Type_indexed,
 ^
/tmp/akypuera/src/aky.c:1375:44: warning: passing argument 2 of ‘PMPI_Type_indexed’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
   int returnVal = PMPI_Type_indexed(count, blocklens, indices, old_type,
                                            ^~~~~~~~~
In file included from /usr/local/include/smpi/mpi.h:12:0,
                 from /tmp/akypuera/src/aky_private.h:21,
                 from /tmp/akypuera/src/aky.c:17:
/usr/local/include/smpi/smpi.h:431:1: note: expected ‘int *’ but argument is of type ‘const int *’
 MPI_CALL(XBT_PUBLIC int, MPI_Type_indexed,
 ^
/tmp/akypuera/src/aky.c:1375:55: warning: passing argument 3 of ‘PMPI_Type_indexed’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
   int returnVal = PMPI_Type_indexed(count, blocklens, indices, old_type,
                                                       ^~~~~~~
In file included from /usr/local/include/smpi/mpi.h:12:0,
                 from /tmp/akypuera/src/aky_private.h:21,
                 from /tmp/akypuera/src/aky.c:17:
/usr/local/include/smpi/smpi.h:431:1: note: expected ‘int *’ but argument is of type ‘const int *’
 MPI_CALL(XBT_PUBLIC int, MPI_Type_indexed,

I tried to downgrade Simgrid to the versions 3.19 and 3.18, with no success. Did I miss something?

My system:

  • kernel 4.9.0
  • gcc 6.3.0
  • cmake 3.7.2

Build fails

You merged schnorr/master into afarah1/github-collective and pushed into my fork instead of yours, then merged back into schnorr/master, which seems to have messed things up and is causing the build to fail (see https://github.com/afarah1/akypuera/commits/github-collective).

Roll back quickly and cleanly (overwrite recent history, revert #26):

git reset 3b5d8ed2a039c2073e611de1a3b756108ae7e2aa
# git push -f origin master

Better yet, roll back and merge #26 properly:

git reset --hard 3b5d8ed2a039c2073e611de1a3b756108ae7e2aa 2>&1
git remote add afarah1 git@github.com:afarah1/akypuera 2>&1
git fetch afarah1 2>&1
git merge afarah1/github-collective-fix 2>&1
mkdir b
cd b
cmake .. -DCMAKE_INSTALL_PREFIX=`pwd` 2>&1
make -j4 install 2>&1
ctest .. 2>&1
git --no-pager log --oneline -n10 2>&1
echo
# git push -f origin master

HEAD is now at 3b5d8ed Merge pull request #25 from afarah1/gather-github
From github.com:afarah1/akypuera
 * [new branch]      bytecount             -> afarah1/bytecount
 * [new branch]      compensation          -> afarah1/compensation
 * [new branch]      compensation-mpiwait  -> afarah1/compensation-mpiwait
 * [new branch]      fortranv1             -> afarah1/fortranv1
 * [new branch]      gather-github         -> afarah1/gather-github
 * [new branch]      github-collective     -> afarah1/github-collective
 * [new branch]      github-collective-fix -> afarah1/github-collective-fix
 * [new branch]      latest                -> afarah1/latest
 * [new branch]      libpoti-submodule     -> afarah1/libpoti-submodule
 * [new branch]      libpoti-submodule-schnorr -> afarah1/libpoti-submodule-schnorr
 * [new branch]      master                -> afarah1/master
 * [new branch]      otf2links             -> afarah1/otf2links
 * [new branch]      scatter-githubfix     -> afarah1/scatter-githubfix
 * [new branch]      tests                 -> afarah1/tests
 * [new branch]      tests-githubfix       -> afarah1/tests-githubfix
Updating 3b5d8ed..1755114
Fast-forward
 examples/collective/src/coll.c            |  17 +++
 examples/collective/traces/rastro-0-0.rst | Bin 0 -> 328 bytes
 examples/collective/traces/rastro-1-0.rst | Bin 0 -> 328 bytes
 examples/collective/traces/rastro-2-0.rst | Bin 0 -> 328 bytes
 examples/collective/traces/rastro-3-0.rst | Bin 0 -> 328 bytes
 include/aky.h                             |   4 +
 src/aky.c                                 |  26 +++-
 src/aky/aky_converter.c                   | 205 ++++++++++++++++++------------
 src/aky_private.h                         |   2 +
 tests/collective.tesh                     |  87 +++++++++++++
 10 files changed, 255 insertions(+), 86 deletions(-)
 create mode 100644 examples/collective/src/coll.c
 create mode 100644 examples/collective/traces/rastro-0-0.rst
 create mode 100644 examples/collective/traces/rastro-1-0.rst
 create mode 100644 examples/collective/traces/rastro-2-0.rst
 create mode 100644 examples/collective/traces/rastro-3-0.rst
 create mode 100644 tests/collective.tesh
-- The C compiler identification is GNU 4.9.4
-- The CXX compiler identification is GNU 4.9.4
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found MPI_C: /home/afh/install/openmpi-2.0.1/b/lib/libmpi.so  
-- Found MPI_CXX: /home/afh/install/openmpi-2.0.1/b/lib/libmpi.so  
-- Looking for gettimeofday
-- Looking for gettimeofday - found
-- Looking for clock_gettime in rt
-- Looking for clock_gettime in rt - found
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/tmp/akypuera/b
Scanning dependencies of target poti
Scanning dependencies of target rastro
[  2%] Building C object libpoti/CMakeFiles/poti.dir/src/poti.c.o
[  4%] Building C object libpoti/CMakeFiles/poti.dir/src/poti_events.c.o
[  6%] Building C object libpoti/CMakeFiles/poti.dir/src/poti_header.c.o
[  8%] Building C object librastro/CMakeFiles/rastro.dir/src/rst_read.c.o
[ 10%] Building C object librastro/CMakeFiles/rastro.dir/src/rst_write.c.o
[ 12%] Building C object librastro/CMakeFiles/rastro.dir/src/rst_generate.c.o
[ 14%] Linking C shared library libpoti.so
[ 14%] Built target poti
Scanning dependencies of target main_example
Scanning dependencies of target vite-svn
Scanning dependencies of target vite-1.2
[ 16%] Building C object libpoti/examples/CMakeFiles/main_example.dir/main_example.c.o
[ 18%] Building C object libpoti/examples/CMakeFiles/vite-svn.dir/vite-svn.c.o
[ 20%] Building C object libpoti/examples/CMakeFiles/vite-1.2.dir/vite-1.2.c.o
[ 22%] Linking C shared library librastro.so
[ 25%] Linking C executable vite-svn
[ 27%] Linking C executable main_example
[ 29%] Linking C executable vite-1.2
[ 29%] Built target rastro
Scanning dependencies of target aky_converter
[ 29%] Built target vite-svn
[ 29%] Built target main_example
[ 29%] Built target vite-1.2
[ 31%] Building C object CMakeFiles/aky_converter.dir/src/aky/aky_converter.c.o
Scanning dependencies of target rastro_generate
Scanning dependencies of target aky
Scanning dependencies of target rastro_timesync
[ 33%] Building C object librastro/CMakeFiles/rastro_generate.dir/src/generate/rst_generate.c.o
[ 35%] Building C object CMakeFiles/aky.dir/src/aky.c.o
[ 37%] Building C object librastro/CMakeFiles/rastro_timesync.dir/src/timesync/rst_timesync.c.o
[ 39%] Linking C executable rastro_generate
[ 41%] Linking C executable rastro_timesync
[ 43%] Building C object CMakeFiles/aky_converter.dir/src/aky/aky_arguments.c.o
[ 43%] Built target rastro_generate
Scanning dependencies of target rastro_read
[ 45%] Building C object librastro/CMakeFiles/rastro_read.dir/src/read/rst_read.c.o
[ 45%] Built target rastro_timesync
Scanning dependencies of target write2
[ 47%] Building C object librastro/examples/CMakeFiles/write2.dir/write2.c.o
[ 50%] Building C object CMakeFiles/aky_converter.dir/src/aky_utils.c.o
[ 52%] Linking C executable rastro_read
[ 54%] Building C object CMakeFiles/aky.dir/src/aky_aux.c.o
[ 56%] Building C object librastro/examples/CMakeFiles/write2.dir/auto-generated.c.o
[ 58%] Building C object CMakeFiles/aky_converter.dir/src/aky_names.c.o
[ 58%] Built target rastro_read
Scanning dependencies of target write1
[ 60%] Building C object librastro/examples/CMakeFiles/write1.dir/write1.c.o
[ 62%] Linking C executable write2
[ 64%] Building C object CMakeFiles/aky.dir/src/aky_rastro.c.o
[ 64%] Built target write2
[ 66%] Building C object librastro/examples/CMakeFiles/write1.dir/auto-generated.c.o
[ 68%] Building C object CMakeFiles/aky_converter.dir/src/aky_keys.c.o
[ 70%] Linking C shared library libaky.so
[ 72%] Linking C executable write1
[ 72%] Built target aky
Scanning dependencies of target ring
Scanning dependencies of target prog
[ 75%] Linking C executable aky_converter
[ 77%] Building C object examples/CMakeFiles/ring.dir/ring/src/ring.c.o
[ 79%] Building C object examples/CMakeFiles/prog.dir/prog/src/prog.c.o
[ 79%] Built target write1
Scanning dependencies of target smpi_traced
[ 81%] Building C object examples/CMakeFiles/smpi_traced.dir/smpi_traced/src/smpi_traced.c.o
[ 81%] Built target aky_converter
[ 83%] Linking C executable ring
Scanning dependencies of target scatter
[ 85%] Building C object examples/CMakeFiles/scatter.dir/scatter/src/scatter.c.o
[ 87%] Linking C executable prog
[ 89%] Linking C executable smpi_traced
[ 89%] Built target ring
Scanning dependencies of target small
[ 91%] Building C object examples/CMakeFiles/small.dir/small/src/small.c.o
[ 93%] Linking C executable scatter
[ 93%] Built target prog
Scanning dependencies of target smpi_bug
[ 95%] Building C object examples/CMakeFiles/smpi_bug.dir/smpi_bug/src/smpi_bug.c.o
[ 95%] Built target smpi_traced
[ 97%] Linking C executable small
[ 97%] Built target scatter
[100%] Linking C executable smpi_bug
[100%] Built target small
[100%] Built target smpi_bug
Install the project...
-- Install configuration: ""
-- Installing: /tmp/tmp/akypuera/b/lib/libaky.so
-- Set runtime path of "/tmp/tmp/akypuera/b/lib/libaky.so" to ""
-- Installing: /tmp/tmp/akypuera/b/bin/aky_converter
-- Installing: /tmp/tmp/akypuera/b/lib/librastro.so
-- Installing: /tmp/tmp/akypuera/b/include/rastro.h
-- Installing: /tmp/tmp/akypuera/b/bin/rastro_generate
-- Installing: /tmp/tmp/akypuera/b/bin/rastro_timesync
-- Installing: /tmp/tmp/akypuera/b/bin/rastro_read
-- Installing: /tmp/tmp/akypuera/b/lib/libpoti.so.4.2
-- Installing: /tmp/tmp/akypuera/b/lib/libpoti.so.4
-- Installing: /tmp/tmp/akypuera/b/lib/libpoti.so
-- Installing: /tmp/tmp/akypuera/b/include/poti.h
-- Up-to-date: /tmp/tmp/akypuera/b/lib/libpoti.so.4.2
-- Up-to-date: /tmp/tmp/akypuera/b/lib/libpoti.so.4
-- Up-to-date: /tmp/tmp/akypuera/b/lib/libpoti.so
-- Up-to-date: /tmp/tmp/akypuera/b/include/poti.h
-- Installing: /tmp/tmp/akypuera/b/lib/pkgconfig/poti.pc
Test project /tmp/tmp/akypuera/b
    Start 1: collective
1/7 Test #1: collective .......................   Passed    0.06 sec
    Start 2: prog
2/7 Test #2: prog .............................   Passed    0.06 sec
    Start 3: ring
3/7 Test #3: ring .............................   Passed    0.06 sec
    Start 4: scatter
4/7 Test #4: scatter ..........................   Passed    0.06 sec
    Start 5: small
5/7 Test #5: small ............................   Passed    0.06 sec
    Start 6: smpi_bug
6/7 Test #6: smpi_bug .........................   Passed    0.06 sec
    Start 7: smpi_traced
7/7 Test #7: smpi_traced ......................   Passed    0.07 sec

100% tests passed, 0 tests failed out of 7

Total Test time (real) =   0.45 sec
1755114 Implemented tests for collective communications
6c92f5d Implemented MPI_Reduce and MPI_Bcast
3b5d8ed Merge pull request #25 from afarah1/gather-github
ea6792d Merge pull request #23 from afarah1/scatter-githubfix
2f8f262 update to MPI-2.0 (compiles with openmpi 2.0.1)
21fa439 move code
ebf9504 improve syntax
d8ed499 fix file descriptor
f53755b call otf2-print here also
3a8fca1 support the METRIC line when converting to Paje

Filtering some MPI operations

There are two options to implement filtering of some MPI operations:

  1. When tracing, do not register the event at all
  2. During the conversion to Paje, do not convert the MPI operations we are not interested in

For option 1, we can use a bitmask over groups of MPI calls organized by category (P2P, collective, etc.). For option 2, a bitmask-based strategy may also be sufficient.
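To make option 1 a bit more concrete, here is a minimal sketch of a category bitmask. Everything in it is hypothetical: the category names, the grouping and the `aky_filter_*` functions do not exist in Akypuera, they only illustrate the idea.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical categories; one bit per group of MPI calls. */
enum aky_category {
  AKY_CAT_P2P        = 1u << 0, /* e.g. MPI_Send, MPI_Recv */
  AKY_CAT_COLLECTIVE = 1u << 1, /* e.g. MPI_Bcast, MPI_Reduce */
  AKY_CAT_ONE_SIDED  = 1u << 2, /* e.g. MPI_Put, MPI_Get */
  AKY_CAT_OTHER      = 1u << 3  /* everything else */
};

/* Categories that are currently traced; everything is kept by default. */
static uint32_t aky_enabled_mask = UINT32_MAX;

/* Replace the set of traced categories. */
void aky_filter_set(uint32_t mask)
{
  aky_enabled_mask = mask;
}

/* True if an event of the given category should be registered. */
bool aky_filter_keep(uint32_t category)
{
  return (aky_enabled_mask & category) != 0;
}
```

Each wrapper would then test `aky_filter_keep(...)` with its own category before registering the event. The same test could be applied in the converter for option 2, assuming each event can be mapped back to a category.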

Build fails with ScoreP < 2.0.1

8f28280 breaks compilation with older versions of ScoreP (such as the one listed in the Wiki), since the prototypes changed in newer versions (the ones that commit patches Akypuera for). We should check the ScoreP (actually OTF2) version as well (see #13). Alternatively, we can revert that commit and make the patch available as a .patch file.

Aky stopped working with Fortran applications with OpenMPI 2.0.1

Aky used to work with C and Fortran applications with OpenMPI 1.6.5, but with OpenMPI 2.0.1 it only seems to work with C applications. I added breakpoints to MPI calls with gdb. With 2.0.1 the calls do not seem to be intercepted by Aky, although they should be:

Breakpoint 1, 0x00007ffff794bae0 in PMPI_Init () from /home/afh/install/openmpi-2.0.1/b/lib/libmpi.so.20.0.1
(gdb) where
#0  0x00007ffff794bae0 in PMPI_Init () from /home/afh/install/openmpi-2.0.1/b/lib/libmpi.so.20.0.1
#1  0x00007ffff729d638 in pmpi_init__ () from /home/afh/install/openmpi-2.0.1/b/lib/libmpi_mpifh.so.20
#2  0x0000000000401197 in MAIN__ ()
#3  0x000000000040210f in main ()

For the same Fortran application, but with 1.6.5:

Breakpoint 1, 0x00007ffff7bd3c34 in MPI_Init () from /home/afh/svn/akypuera/b/lib/libaky.so
(gdb) where
#0  0x00007ffff7bd3c34 in MPI_Init () from /home/afh/svn/akypuera/b/lib/libaky.so
#1  0x00007ffff763f218 in pmpi_init__ () from /usr/lib/libmpi_f77.so.1
#2  0x00000000004010d7 in MAIN__ ()
#3  0x000000000040204f in main ()

It seems that with OpenMPI 1.6.5 libmpi_f77 is used, whereas with 2.0.1 libmpi_mpifh is used and the calls are not intercepted for some reason. Any ideas?

I'm using LD_PRELOAD but I also tried linking against Aky and the results are the same. This is the command line used to execute the Fortran application (NAS EP benchmark):

mpirun -x LD_PRELOAD=/home/afh/svn/akypuera/b/lib/libaky.so:/home/afh/install/openmpi-2.0.1/b/lib/libmpi.so.20.0.1 -np 4 ./ep.S.4

The same command line works for the C application (NAS IS benchmark):

mpirun -x LD_PRELOAD=/home/afh/svn/akypuera/b/lib/libaky.so:/home/afh/install/openmpi-2.0.1/b/lib/libmpi.so.20.0.1 -np 4 ./is.S.4

Both applications can be found here.

ldd output for both applications with OpenMPI 1.6.5 (both work) and OpenMPI 2.0.1 (only the C one works):

C application, OpenMPI 1.6.5

	linux-vdso.so.1 (0x00007ffda8ad8000)
	libmpi.so.1 => /usr/lib64/libmpi.so.1 (0x00007fb76f9f9000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007fb76f7f5000)
	libhwloc.so.5 => /usr/lib64/libhwloc.so.5 (0x00007fb76f5ba000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fb76f39e000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fb76f005000)
	libutil.so.1 => /lib64/libutil.so.1 (0x00007fb76ee02000)
	libltdl.so.7 => /usr/lib64/libltdl.so.7 (0x00007fb76ebf8000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fb76fd7a000)
	libm.so.6 => /lib64/libm.so.6 (0x00007fb76e8f3000)
	libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00007fb76e6e7000)
	libudev.so.1 => /lib64/libudev.so.1 (0x00007fb76ff60000)
	libpciaccess.so.0 => /usr/lib64/libpciaccess.so.0 (0x00007fb76e4dd000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fb76e2d5000)
	libresolv.so.2 => /lib64/libresolv.so.2 (0x00007fb76e0be000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007fb76deb8000)
	libz.so.1 => /lib64/libz.so.1 (0x00007fb76dca2000)
	libattr.so.1 => /lib64/libattr.so.1 (0x00007fb76da9d000)

Fortran application, OpenMPI 1.6.5

	linux-vdso.so.1 (0x00007ffd3a18b000)
	libmpi.so.1 => /usr/lib64/libmpi.so.1 (0x00007f240c2f5000)
	libmpi_f77.so.1 => /usr/lib64/libmpi_f77.so.1 (0x00007f240c0c1000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007f240bebd000)
	libhwloc.so.5 => /usr/lib64/libhwloc.so.5 (0x00007f240bc82000)
	libgfortran.so.3 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.9.4/libgfortran.so.3 (0x00007f240b95a000)
	libm.so.6 => /lib64/libm.so.6 (0x00007f240b655000)
	libgcc_s.so.1 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.9.4/libgcc_s.so.1 (0x00007f240b43e000)
	libquadmath.so.0 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.9.4/libquadmath.so.0 (0x00007f240b200000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f240afe4000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f240ac4b000)
	libutil.so.1 => /lib64/libutil.so.1 (0x00007f240aa48000)
	libltdl.so.7 => /usr/lib64/libltdl.so.7 (0x00007f240a83e000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f240c676000)
	libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00007f240a632000)
	libudev.so.1 => /lib64/libudev.so.1 (0x00007f240c85a000)
	libpciaccess.so.0 => /usr/lib64/libpciaccess.so.0 (0x00007f240a428000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f240a220000)
	libresolv.so.2 => /lib64/libresolv.so.2 (0x00007f240a009000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007f2409e03000)
	libz.so.1 => /lib64/libz.so.1 (0x00007f2409bed000)
	libattr.so.1 => /lib64/libattr.so.1 (0x00007f24099e8000)

C application, OpenMPI 2.0.1

	linux-vdso.so.1 (0x00007ffc7a042000)
	libmpi.so.20 => /home/afh/install/openmpi-2.0.1/b/lib/libmpi.so.20 (0x00007fa4377fa000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa4375de000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fa437245000)
	libopen-rte.so.20 => /home/afh/install/openmpi-2.0.1/b/lib/libopen-rte.so.20 (0x00007fa436fc2000)
	libopen-pal.so.20 => /home/afh/install/openmpi-2.0.1/b/lib/libopen-pal.so.20 (0x00007fa436cc9000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007fa436ac5000)
	libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00007fa4368b9000)
	libudev.so.1 => /lib64/libudev.so.1 (0x00007fa437cc7000)
	libpciaccess.so.0 => /usr/lib64/libpciaccess.so.0 (0x00007fa4366af000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fa4364a7000)
	libm.so.6 => /lib64/libm.so.6 (0x00007fa4361a2000)
	libutil.so.1 => /lib64/libutil.so.1 (0x00007fa435f9f000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fa437ae1000)
	libresolv.so.2 => /lib64/libresolv.so.2 (0x00007fa435d88000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007fa435b82000)
	libz.so.1 => /lib64/libz.so.1 (0x00007fa43596c000)
	libattr.so.1 => /lib64/libattr.so.1 (0x00007fa435767000)

Fortran application, OpenMPI 2.0.1

	linux-vdso.so.1 (0x00007ffe4558c000)
	libmpi.so.20 => /home/afh/install/openmpi-2.0.1/b/lib/libmpi.so.20 (0x00007fbe2fdde000)
	libmpi_usempif08.so.20 => /home/afh/install/openmpi-2.0.1/b/lib/libmpi_usempif08.so.20 (0x00007fbe2fbaf000)
	libmpi_usempi_ignore_tkr.so.20 => /home/afh/install/openmpi-2.0.1/b/lib/libmpi_usempi_ignore_tkr.so.20 (0x00007fbe2f9a8000)
	libmpi_mpifh.so.20 => /home/afh/install/openmpi-2.0.1/b/lib/libmpi_mpifh.so.20 (0x00007fbe2f750000)
	libgfortran.so.3 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.9.4/libgfortran.so.3 (0x00007fbe2f428000)
	libm.so.6 => /lib64/libm.so.6 (0x00007fbe2f123000)
	libgcc_s.so.1 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.9.4/libgcc_s.so.1 (0x00007fbe2ef0c000)
	libquadmath.so.0 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.9.4/libquadmath.so.0 (0x00007fbe2ecce000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fbe2eab2000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fbe2e719000)
	libopen-rte.so.20 => /home/afh/install/openmpi-2.0.1/b/lib/libopen-rte.so.20 (0x00007fbe2e496000)
	libopen-pal.so.20 => /home/afh/install/openmpi-2.0.1/b/lib/libopen-pal.so.20 (0x00007fbe2e19d000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007fbe2df99000)
	libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00007fbe2dd8d000)
	libudev.so.1 => /lib64/libudev.so.1 (0x00007fbe302a9000)
	libpciaccess.so.0 => /usr/lib64/libpciaccess.so.0 (0x00007fbe2db83000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fbe2d97b000)
	libutil.so.1 => /lib64/libutil.so.1 (0x00007fbe2d778000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fbe300c5000)
	libresolv.so.2 => /lib64/libresolv.so.2 (0x00007fbe2d561000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007fbe2d35b000)
	libz.so.1 => /lib64/libz.so.1 (0x00007fbe2d145000)
	libattr.so.1 => /lib64/libattr.so.1 (0x00007fbe2cf40000)

Register the size of MPI messages

Registering the size of MPI messages might be useful in many cases. For that to happen, we need to replace all calls to rst_event with rst_event_i, passing the message size as a parameter (in the aky.c file, which holds the functions that intercept MPI calls using PMPI). Then, in the converter, change the handling of all events that contain the message size to generate a PajePushState enriched with a new field that will contain the message size.

otf22paje build fails with latest scorep

When compiling with OTF2 on, the build fails producing the following error.

/tmp/akypuera/src/otf2/otf22paje.c: In function ‘main’:
/tmp/akypuera/src/otf2/otf22paje.c:95:85: error: passing argument 2 of ‘OTF2_GlobalDefReaderCallbacks_SetSystemTreeNodePropertyCallback’ from incompatible pointer type [-Werror=incompatible-pointer-types]
     OTF2_GlobalDefReaderCallbacks_SetSystemTreeNodePropertyCallback (def_callbacks, otf22paje_global_def
                                                                                     ^
In file included from /home/afh/install/scorep-2.0.1/b/include/otf2/OTF2_GlobalDefReader.h:51:0,
                 from /home/afh/install/scorep-2.0.1/b/include/otf2/OTF2_Archive.h:171,
                 from /home/afh/install/scorep-2.0.1/b/include/otf2/OTF2_Reader.h:181,
                 from /home/afh/install/scorep-2.0.1/b/include/otf2/otf2.h:43,
                 from /tmp/akypuera/src/otf2/otf22paje.h:23,
                 from /tmp/akypuera/src/otf2/otf22paje.c:17:
/home/afh/install/scorep-2.0.1/b/include/otf2/OTF2_GlobalDefReaderCallbacks.h:1143:1: note: expected ‘OTF2_GlobalDefReaderCallback_SystemTreeNodeProperty {aka enum OTF2_CallbackCode_enum (*)(void *, unsigned int,  unsigned int,  unsigned char,  union OTF2_AttributeValue_union)}’ but argument is of type ‘OTF2_CallbackCode (*)(void *, OTF2_SystemTreeNodeRef,  OTF2_StringRef,  OTF2_StringRef) {aka enum OTF2_CallbackCode_enum (*)(void *, unsigned int,  unsigned int,  unsigned int)}’
 OTF2_GlobalDefReaderCallbacks_SetSystemTreeNodePropertyCallback(
 ^
cc1: all warnings being treated as errors
CMakeFiles/otf22paje.dir/build.make:62: recipe for target 'CMakeFiles/otf22paje.dir/src/otf2/otf22paje.c.o' failed
make[2]: *** [CMakeFiles/otf22paje.dir/src/otf2/otf22paje.c.o] Error 1
CMakeFiles/Makefile2:142: recipe for target 'CMakeFiles/otf22paje.dir/all' failed
make[1]: *** [CMakeFiles/otf22paje.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

Using scorep-2.0.1. otf2-config --version shows version 2.0.

As mentioned in the wiki, the build works with version 1.4 (and with version 1.2.2, an old one I had).

You can apply this patch to fix it: afarah1@8f28280

I didn't submit a pull request because I rewrote my fork's history to fix some commit author info (afh@localhost -> [email protected]). If you prefer I can submit a pull request.

Register number of bytes instead of number of elements sent

Currently Akypuera registers the number of elements sent instead of the number of bytes sent. E.g.

if (!rank) {
  MPI_Recv(&cbuff, 1, MPI_CHAR, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  MPI_Recv(&ibuff, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
} else {
  MPI_Send(&cbuff, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
  MPI_Send(&ibuff, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
}

is registered as (pj_dumped)

Link, root, LINK, 0.000020, 0.000060, 0.000040, PTP, rank1, rank0, 0, 1
Link, root, LINK, 0.000043, 0.000062, 0.000019, PTP, rank1, rank0, 1, 1

The last column is the number of elements sent (1 char, 1 int).

Wouldn't registering the number of bytes sent (as I thought was already done) be more informative for the user? If I'm analysing a trace generated with Akypuera, I probably don't know what data type each communication is sending, so the "number of elements" information is meaningless. I can write a patch for this if the change is desirable.
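For illustration, the conversion from elements to bytes could look roughly like the sketch below. The helper name `aky_message_bytes` is made up for this example. In a real PMPI wrapper the element size would come from the standard `MPI_Type_size()` call; here it is passed in as a plain integer so the snippet compiles without an MPI installation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes carried by `count` elements of `type_size`
 * bytes each. In a wrapper, type_size would be filled by MPI_Type_size().
 * Returns 0 for non-positive inputs and saturates at SIZE_MAX on overflow. */
size_t aky_message_bytes(int count, int type_size)
{
  if (count <= 0 || type_size <= 0)
    return 0;
  size_t c = (size_t)count;
  size_t s = (size_t)type_size;
  if (c > SIZE_MAX / s)
    return SIZE_MAX;
  return c * s;
}
```

With this, the two messages in the example above would be registered with different sizes whenever `MPI_INT` is wider than `MPI_CHAR` (one byte versus, typically, four), instead of both showing 1.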

Trace asynchronous event correlation for Time Independent Traces (SMPI)

The new TIT format (the SMPI replay trace format), introduced in Simgrid 3.20, now requires information on asynchronous operations. Each asynchronous call now includes a tag field that links MPI_Wait calls to the associated MPI_Isend or MPI_Irecv.

It would be great to have this information traced by akypuera and added to the TIT trace generated.

Note that I already wrote a small Python script, now shipped with Simgrid, to convert existing traces to the new format, but the lack of information makes this conversion unreliable.

Link to the script: https://github.com/simgrid/simgrid/blob/master/tools/simgrid_convert_TI_traces.py
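One possible way to keep the needed information at trace time is to remember the tag of every pending nonblocking operation and look it up again when the matching wait is intercepted. The sketch below is only an assumption about how that could be done: the `aky_pending_*` names, the fixed-size table and the use of `uintptr_t` as a stand-in for a request handle are all hypothetical, and it says nothing about the exact TIT syntax.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define AKY_MAX_PENDING 1024 /* arbitrary limit for this sketch */

typedef struct {
  uintptr_t request; /* stand-in for a request handle */
  int tag;           /* tag passed to the nonblocking call */
  bool active;       /* slot currently holds a pending operation */
} pending_op;

static pending_op pending[AKY_MAX_PENDING];

/* Would be called from the MPI_Isend/MPI_Irecv wrappers.
 * Returns false if the table is full. */
bool aky_pending_add(uintptr_t request, int tag)
{
  for (size_t i = 0; i < AKY_MAX_PENDING; i++) {
    if (!pending[i].active) {
      pending[i] = (pending_op){ request, tag, true };
      return true;
    }
  }
  return false;
}

/* Would be called from the MPI_Wait wrapper: fetches the tag recorded
 * for this request and releases the slot. Returns false if unknown. */
bool aky_pending_take(uintptr_t request, int *tag)
{
  for (size_t i = 0; i < AKY_MAX_PENDING; i++) {
    if (pending[i].active && pending[i].request == request) {
      *tag = pending[i].tag;
      pending[i].active = false;
      return true;
    }
  }
  return false;
}
```

The linear scan keeps the example short. A real implementation would probably want a hash table and would also have to handle MPI_Waitall and friends.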

tau2paje segfaults with a valid trace file when node ids are not sequential

tau2paje segfaults with a trace that tau2otf + otf2paje successfully convert.

Program received signal SIGSEGV, Segmentation fault.
0x0000000000403f58 in SendMessage (userData=0x0, time=3075295, sourceNodeToken=0, sourceThreadToken=0, destinationNodeToken=65535, 
    destinationThreadToken=0, messageSize=67108863, messageTag=65535, messageComm=63)
    at /home/afh/svn/akypuera/src/tau/tau2paje_handlers.c:219
219       rank_last_time[destinationNodeToken] = time_to_seconds(time);
(gdb) where
#0  0x0000000000403f58 in SendMessage (userData=0x0, time=3075295, sourceNodeToken=0, sourceThreadToken=0, destinationNodeToken=65535, 
    destinationThreadToken=0, messageSize=67108863, messageTag=65535, messageComm=63)
    at /home/afh/svn/akypuera/src/tau/tau2paje_handlers.c:219
#1  0x0000000000406049 in Ttf_ReadNumEvents (fileHandle=0x62af50, callbacks=..., numberOfEvents=<optimized out>) at TAU_tf.cpp:670
#2  0x00000000004032c0 in main (argc=3, argv=0x7fffffffe0f8) at /home/afh/svn/akypuera/src/tau/tau2paje.c:101

This has to do with the way tau2paje grows rank_last_time - it assumes node ids are sequential. From DefThread:

if (nodeid + 1 > total_number_of_ranks) {
  total_number_of_ranks = nodeid + 1;
  rank_last_time =
      realloc(rank_last_time, sizeof(double) * total_number_of_ranks);
}
rank_last_time[nodeid] = 0;

An analysis with gdb shows that 65535 is passed suddenly, that is, there is no gradual increase in source/destToken or in messageTag. So the code above leads to a segfault later. I'm not sure why this number appears in the first place, though (overflow? there are no unsigned shorts... MPI_ANY_SOURCE (-1 in OpenMPI)? there aren't any in Ondes...).

In any case, using a hash table (as is currently done for states) instead of an array should fix the issue. The problem is that a double would have to be stored in the hash table, so some memory management would be required.
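As a rough illustration, here is a small hand-rolled table that maps a node id to a double stored by value, which avoids allocating each double separately. This is not the hash table the project already uses for states, and the `rank_table` names are invented for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
  unsigned int key; /* node id */
  double value;     /* last time seen for that node */
  bool used;
} rank_slot;

typedef struct {
  rank_slot *slots;
  size_t capacity; /* power of two, or 0 while empty */
  size_t count;
} rank_table;

/* Slot holding `key`, or the empty slot where it would be inserted.
 * Requires capacity > 0 and at least one free slot. */
static rank_slot *rank_find(const rank_table *t, unsigned int key)
{
  size_t mask = t->capacity - 1;
  size_t i = (size_t)(key * 2654435761u) & mask; /* multiplicative mixing */
  while (t->slots[i].used && t->slots[i].key != key)
    i = (i + 1) & mask; /* linear probing */
  return &t->slots[i];
}

/* Double the capacity and reinsert every entry. */
static bool rank_grow(rank_table *t)
{
  size_t new_capacity = t->capacity ? t->capacity * 2 : 16;
  rank_slot *fresh = calloc(new_capacity, sizeof *fresh);
  if (!fresh)
    return false;
  rank_table bigger = { fresh, new_capacity, 0 };
  for (size_t i = 0; i < t->capacity; i++) {
    if (t->slots[i].used) {
      *rank_find(&bigger, t->slots[i].key) = t->slots[i];
      bigger.count++;
    }
  }
  free(t->slots);
  *t = bigger;
  return true;
}

/* Insert or overwrite; returns false only if memory runs out. */
bool rank_table_set(rank_table *t, unsigned int key, double value)
{
  /* Keep the table at most half full so probing always terminates. */
  if ((t->count + 1) * 2 > t->capacity && !rank_grow(t))
    return false;
  rank_slot *s = rank_find(t, key);
  if (!s->used) {
    s->used = true;
    s->key = key;
    t->count++;
  }
  s->value = value;
  return true;
}

/* Look up a node id; returns false if it was never set. */
bool rank_table_get(const rank_table *t, unsigned int key, double *out)
{
  if (t->capacity == 0)
    return false;
  rank_slot *s = rank_find(t, key);
  if (!s->used)
    return false;
  *out = s->value;
  return true;
}

void rank_table_free(rank_table *t)
{
  free(t->slots);
  t->slots = NULL;
  t->capacity = t->count = 0;
}
```

With something like this, a handler receiving node id 65535 would simply create a new entry instead of writing past the end of an array sized for sequential ids.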

otf22paje documentation (-f option)

The otf22paje documentation is out of date, in particular for the -f option, which builds a hierarchy from a file describing the hardware platform. It would also be good to provide a sample of such a file to show its syntax.

Build fails with OpenMPI > 1.6.5

Several errors are issued concerning wrong function prototypes. This is because OpenMPI added const to some parameters in newer versions, but it is not obvious for the end user (Guilherme had this issue trying to install it on some machine with an OpenMPI version different from Debian's 1.6.5). Perhaps adding version checking to CMakeLists would be sufficient? I'm not sure how to do that. Alternatively, I patched Akypuera to OpenMPI 1.10.2 in a branch of my fork (not all MPI routines are implemented).
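Besides a check in CMakeLists, the const difference could also be absorbed in the C sources with a macro. The sketch below is an assumption, not something taken from the Akypuera code: `AKY_CONST` and `aky_total_blocklen` are hypothetical names, and the `MPI_VERSION >= 3` threshold is a guess based on MPI-3.0 being the standard version that added const to the C bindings. A given implementation's headers may not follow that exactly, so a configure-time test of the actual prototype would be more robust.

```c
/* In the real wrapper, <mpi.h> would be included before this point so that
 * MPI_VERSION is defined. Without it, the fallback branch is taken. */
#if defined(MPI_VERSION) && MPI_VERSION >= 3
#  define AKY_CONST const
#else
#  define AKY_CONST
#endif

/* Stand-in for a wrapper that takes an input array, declared once for
 * both old (non-const) and new (const) prototypes. */
int aky_total_blocklen(int count, AKY_CONST int *blocklens)
{
  int total = 0;
  for (int i = 0; i < count; i++)
    total += blocklens[i];
  return total;
}
```

An actual wrapper such as the one for MPI_Type_indexed would use the macro in its parameter list in the same way, so that one source file matches whichever prototype the installed header declares.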

Akyconverter produces a wrong output

I would like to use Akypuera (with the LD_PRELOAD method) to trace the execution of an MPI application on several nodes, with one MPI process per core. I am using OpenMPI 2.0.2.

At the end of the execution, I copy all the .rst files on a single node and try to produce a paje trace:

/tmp/akypuera/build/aky_converter rastro-*rst > /tmp/foo.paje

I get these messages:

[aky_keys] at aky_get_key (no queue), there is no key available
[aky_keys] when type = ptp, src = 2, dst = 3.
[aky_converter] at main, no key to generate a pajeEndLink,
[aky_converter] got a receive at dst = 3 from src = 2
[aky_converter] but no send for this receive yet,
[aky_converter] do you synchronize your input traces?

All the .rst files combined take a total of 11GB, with individual files between 1MB and 120MB. The resulting paje file takes only 8kB, so there is obviously something wrong going on.

I also tried to use the option --sync with a file generated with rastro_timesync, but it does not change anything.
Here is the synchronization file I used (I ran rastro_timesync twice, before and after the execution of my application):

dahu-10.grenoble.grid5000.fr 1537965682064746356 dahu-26.grenoble.grid5000.fr 1537965682063629855
dahu-10.grenoble.grid5000.fr 1537965682260877043 dahu-29.grenoble.grid5000.fr 1537965682259649118
dahu-10.grenoble.grid5000.fr 1537965682476775793 dahu-32.grenoble.grid5000.fr 1537965682475192376
dahu-10.grenoble.grid5000.fr 1537965698825878006 dahu-26.grenoble.grid5000.fr 1537965698824810581
dahu-10.grenoble.grid5000.fr 1537965699029008157 dahu-29.grenoble.grid5000.fr 1537965699027823231
dahu-10.grenoble.grid5000.fr 1537965699213178410 dahu-32.grenoble.grid5000.fr 1537965699211562007

Do you have an idea of what could be wrong?
