dmtcp / dmtcp

DMTCP: Distributed MultiThreaded CheckPointing

Home Page: http://dmtcp.sourceforge.net/

License: Other

Shell 6.90% C++ 43.75% Makefile 9.95% C 33.39% Python 2.12% TeX 0.05% Assembly 0.25% Java 0.02% M4 1.35% Perl 1.53% Dockerfile 0.01% Roff 0.67%
mpi dmtcp checkpoint-restart

dmtcp's Introduction

DMTCP is a tool to transparently checkpoint the state of multiple simultaneous applications, including multi-threaded and distributed applications. It operates directly on the user binary executable, without any Linux kernel modules or other kernel modifications.

Among the applications supported by DMTCP are MPI (various implementations), OpenMP, MATLAB, Python, Perl, R, and many programming languages and shell scripting languages. DMTCP also supports GNU screen sessions, including vim/cscope and emacs. With the use of TightVNC, it can also checkpoint and restart X Window applications. For a multilib (mixture of 32- and 64-bit processes), see "./configure --enable-multilib".

DMTCP supports the commonly used OFED API for InfiniBand, as well as its integration with various implementations of MPI, and resource managers (e.g., SLURM).

To install DMTCP, see INSTALL.md.

For an overview of DMTCP, see QUICK-START.md.

For the license, see COPYING.

For more information on DMTCP, see: http://dmtcp.sourceforge.net.

For the latest version of DMTCP (both official release and git), see: http://dmtcp.sourceforge.net/downloads.html.

dmtcp's People

Contributors

amvisan, ankitcse07, artpol84, bricka, chirag-singh1, dahongli, francoisaissaoui, frifich, fyshhh, gc00, gweodoo, jaintwinkle, jansel, jiajuncao, jimenez, johnmuth81, jungan, karya0, kwharrigan, mcandress, planeta, rohgarg, sakshi-rg, shlomiya, solankip, subhasisb, tarunsmalviya, tdenniston, twinklejain, xuyao0127

dmtcp's Issues

segfault on restart, if interrupt seen prior to ckpt

In an e-mail on dmtcp-forum, a user reports that he runs an interpreter (R, Python, or ocaml),
and interrupts a function. He then checkpoints, and he sees a segfault on restart.
We need to test carefully on this. I might have seen this too, but I don't know if it was repeatable.

The user reported that rc1 works well, but not rc2 and rc3.

simple fork test doesn't work without the pid plugin

The following test program fails to restart from the second checkpoint (ckpt-rst-ckpt-rst) when the PID plugin is disabled:

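#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
  pid_t pid = fork();
  if (pid > 0) {
    /* Parent: print once per second, forever. */
    while (1) {
      printf("Parent\n"); sleep(1);
    }
  } else if (pid == 0) {
    /* Child: same loop, so both processes are live at checkpoint time. */
    while (1) {
      printf("Child\n"); sleep(1);
    }
  } else {
    return -1;  /* fork() failed */
  }
  return 0;
}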

mtcp_readfile() -- we should check its return value

The code in mtcp/mtcp_restart.c calls mtcp_readfile(). Recently, I modified the definition to follow the semantics of the read() system call (0 = end of file, -1 = error, etc.). If it detects -1 and mtcp_sys_errno is EAGAIN or EINTR, it retries, and after 10 retries, it aborts. For any other error, it aborts. Hence, currently, the caller of mtcp_readfile() mostly doesn't check the return value. We should consider checking the return value. (But if it returns -1, note that mtcp_sys_errno is available only as a local variable inside mtcp_readfile() for certain technical reasons. We don't want to use global variables in mtcp_restart, and so it's not clear how we would pass errno to the outside world.)
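One possible way to pass the error out without a global variable is to return the negated errno value, in the style of raw Linux system calls. The sketch below is only an illustration of that idea: it uses the ordinary read() and errno rather than the mtcp_sys_* macros, it returns an error after too many retries instead of aborting, and the names readfile_retry and MAX_RETRIES are hypothetical, not part of DMTCP.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_RETRIES 10  /* hypothetical limit, mirroring the 10 retries described above */

/*
 * Read up to `size` bytes from `fd`, stopping early only at end of file.
 * Returns the number of bytes read (0 means immediate end of file),
 * or -errno on failure, so the caller sees the error code without
 * any global variable.
 */
ssize_t readfile_retry(int fd, void *buf, size_t size)
{
  assert(buf != NULL || size == 0);
  char *p = buf;
  size_t done = 0;
  int retries = 0;

  while (done < size) {
    ssize_t rc = read(fd, p + done, size - done);
    if (rc > 0) {
      done += (size_t)rc;
      retries = 0;                /* progress was made: reset the counter */
    } else if (rc == 0) {
      break;                      /* end of file */
    } else if ((errno == EINTR || errno == EAGAIN) && retries < MAX_RETRIES) {
      retries++;                  /* transient error: try again */
    } else {
      return -errno;              /* hard error, or too many retries */
    }
  }
  return (ssize_t)done;
}
```

A caller can then distinguish the cases with a single check: a negative result is an error code, and a result smaller than the requested size means end of file was reached.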

build failure on syscallsreal.o

I'm seeing a build failure like

gcc -DHAVE_CONFIG_H -I. -I../include -I../include -I../jalib -fPIC -fmessage-length=0 -grecord-gcc-switches -O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector -funwind-tables -fasynchronous-unwind-tables -g -c syscallsreal.c
In file included from syscallsreal.c:46:0:
syscallwrappers.h:368:62: warning: 'struct sigvec' declared inside parameter list [enabled by default]
int _real_sigvec(int sig, const struct sigvec *vec, struct sigvec *ovec);
^
syscallwrappers.h:368:62: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default]
syscallsreal.c:691:63: warning: 'struct sigvec' declared inside parameter list [enabled by default]
int _real_sigvec(int signum, const struct sigvec *vec, struct sigvec *ovec) {
^
syscallsreal.c:691:5: error: conflicting types for '_real_sigvec'
int _real_sigvec(int signum, const struct sigvec *vec, struct sigvec *ovec) {
^
In file included from syscallsreal.c:46:0:
syscallwrappers.h:368:7: note: previous declaration of '_real_sigvec' was here
int _real_sigvec(int sig, const struct sigvec *vec, struct sigvec *ovec);
^
Makefile:818: recipe for target 'syscallsreal.o' failed
make[2]: *** [syscallsreal.o] Error 1
make[2]: *** Waiting for unfinished jobs....

(with the 2.3.1 release)

Failing on --enable-m32 in CentOS 7 and CentOS 6.6

These are probably two different bugs, but I'll start them as a single issue for now.
On CentOS 7, run ./configure --enable-m32, then make, then dmtcp_launch test/dmtcp1.
dmtcp_launch correctly execs into dmtcp1. Within DmtcpWorker, I see:
(gdb) where
#0 0xffffce30 in ?? ()
#1 0xf7e9e19e in initialize_libc_wrappers () at syscallsreal.c:265
#2 0xf7e50bfa in dmtcp_prepare_wrappers () at dmtcpworker.cpp:145
#3 0xf7ea0719 in _real_dup2 (oldfd=827, newfd=827) at syscallsreal.c:649
#4 0xf7e90a31 in jalib::dup2 (oldfd=827, newfd=827) at ../jalib/jalib.cpp:97
#5 0xf7e91669 in jassert_internal::jassert_init () at ../jalib/jassert.cpp:124
#6 0xf7e90893 in jalib_init (jalibFuncPtrs=...,
    elfInterpreter=0xf7eacb5d "/lib64/ld-linux-x86-64.so.2", stderrFd=827,
    jassertLogFd=828, dmtcp_fail_rc=99) at ../jalib/jalib.cpp:57
#7 0xf7e86a47 in initializeJalib () at jalibinterface.cpp:70
#8 0xf7e5165f in dmtcp::DmtcpWorker::DmtcpWorker (
    this=0xf7ece18d <dmtcp::DmtcpWorker::theInstance>) at dmtcpworker.cpp:313
#9 0xf7e52ce4 in __static_initialization_and_destruction_0 (__initialize_p=1,
    __priority=65535) at dmtcpworker.cpp:306
#10 0xf7e52d26 in _GLOBAL__sub_I_dmtcpworker.cpp(void) () at dmtcpworker.cpp:632
#11 0xf7fec47e in _dl_init_internal () from /lib/ld-linux.so.2
#12 0xf7fde05f in _dl_start_user () from /lib/ld-linux.so.2

NOTE:
(gdb) frame 1
#1 0xf7e9e19e in initialize_libc_wrappers () at syscallsreal.c:265

265 FOREACH_DMTCP_WRAPPER(GET_FUNC_ADDR);

Separately, in the second bug, on CentOS 6.6, with --enable-m32, it fails on restart. If address space randomization is turned off, then this bug goes away. With address space randomization turned on, and using the latest DMTCP (with the first bug fix from issue #56) and also with mtcp_check_vdso() being invoked, it still dumps core.

Define semantics of prefix-directory for remote nodes

dmtcp_launch has a flag --prefix to specify the DMTCP installation path for remote nodes. This path is useful when the DMTCP install directory on remote nodes is different than the DMTCP installation directory on the current node.

The flag is (somewhat) misleading and half broken. Currently DMTCP assumes that there is only one unique remote prefix dir. If there is more than one remote node, it will reject the processes that have a different remote prefix dir than the existing processes.

For most of our use cases, each node has the same prefix directory and thus I propose to get rid of this flag and not bother about the prefix directory at all. If we ever want to support prefix directory, we should do it per host.

DMTCP-specific environment variables on remote nodes

For remote processes created via ssh (under ckpt control), which environment variables should they inherit from the remote shell? For example, if DMTCP_HOST is defined in .bashrc, should it override the command line flags? What about flags like DMTCP_PLUGIN, and so on?

I would argue that we shouldn't rely on any environment variables for remote nodes and always supply everything through command line arguments.

However, this isn't as simple as it sounds. In our current code base, we read the command line flags and accordingly set a few environment variables. Thus at the end of command line parsing, there is no way to tell if a given environment variable was set by us or it was inherited from the shell.

A related question is how to deal with the DMTCP installation path. I think we are doing the correct thing right now by having a --prefix flag for dmtcp_launch to specify the DMTCP installation directory on remote nodes. This flag is used to calculate the absolute path of DMTCP binaries on the remote node.

Unix domain socket recvmsg bug

DMTCP causes a bug when running a simple Unix domain socket test implemented using 'sendmsg' and 'recvmsg', on Ubuntu 12.04. Under DMTCP, if recvmsg is called on the client-side before sendmsg is called on the server-side, recvmsg blocks until the send call, as it should, but will fail with a "Bad address" error when attempting to read the message.
There is no error when running without DMTCP, or if sendmsg is called first.

Calculate the appropriate Coordinator IP address

Currently, the coordinator uses a mix of hackery and tricks to compute its IP address. This involves a call to gethostname() followed by a call to getaddrinfo(). This is still unreliable and can give us 127.0.0.1 as the IP address, when we actually need the non-loopback IP.
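To make the failure mode concrete, here is a minimal sketch of the gethostname()/getaddrinfo() sequence with an explicit loopback filter. It is not the coordinator's actual code; the function names is_loopback_ipv4 and first_non_loopback_ipv4 are hypothetical, and the sketch handles IPv4 only. On a host whose name resolves only to 127.x.x.x (a common /etc/hosts setup), it reports failure rather than handing back a loopback address.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Return 1 if `sa` is an IPv4 address in 127.0.0.0/8, else 0. */
int is_loopback_ipv4(const struct sockaddr *sa)
{
  assert(sa != NULL);
  if (sa->sa_family != AF_INET) return 0;
  const struct sockaddr_in *sin = (const struct sockaddr_in *)sa;
  return (ntohl(sin->sin_addr.s_addr) >> 24) == 127;
}

/*
 * Resolve this host's name and copy the first non-loopback IPv4 address
 * into `out` as dotted-quad text. Returns 0 on success, -1 if every
 * candidate is loopback or if resolution fails.
 */
int first_non_loopback_ipv4(char *out, size_t outsz)
{
  char host[256];
  if (gethostname(host, sizeof host) != 0) return -1;
  host[sizeof host - 1] = '\0';   /* gethostname() may not terminate on truncation */

  struct addrinfo hints, *res, *ai;
  memset(&hints, 0, sizeof hints);
  hints.ai_family = AF_INET;
  hints.ai_socktype = SOCK_STREAM;
  if (getaddrinfo(host, NULL, &hints, &res) != 0) return -1;

  int rc = -1;
  for (ai = res; ai != NULL; ai = ai->ai_next) {
    if (is_loopback_ipv4(ai->ai_addr)) continue;   /* skip 127.x.x.x */
    const struct sockaddr_in *sin = (const struct sockaddr_in *)ai->ai_addr;
    if (inet_ntop(AF_INET, &sin->sin_addr, out, (socklen_t)outsz) != NULL) {
      rc = 0;
      break;
    }
  }
  freeaddrinfo(res);
  return rc;
}
```

When this returns -1, a caller would need some other source for the address, for example enumerating network interfaces; that part is left out here.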

Misc issues

Misc:

  • Ckpt filenames
  • Joshua's bugs
  • Raspberry PI
  • Better error message for Jiajun's bug.
  • Better return handling for dmtcp_process_event.
    • Force dmtcp_process_event() return type to catch user errors.
    • Update the documentation.
  • FIX bold face comments in plugin-tutorial.
  • Add --with-icc flag
  • Temporarily replace stdin/out/err and cin/out/err during checkpointing.
  • DMTCP manpages
  • Timer bug on CCIS linux - SEGV (?)
  • Remove "hijack" from plugin names.
  • --enable-fast-ckpt-restart -> --enable-mmapped-restart
    • --help should say that this turns off the gzip compression.
  • Trampolines that are x86-specific. Move them under #ifdef FRED
  • heuristics plugin
  • make check-java
  • --port-file (with --port 0)
  • --disable-tests
  • FReD for scipy
  • MAUI scheduler
  • ssh for MPICH2-hydra
  • port rev 2000 to trunk.
  • remove unused dmtcpSharedArea Files.
  • nslookup in dmtcp_restart_script.sh doesn't always work. Use gethostip instead.
  • Reverse order of execution for events related to resume/restart.
  • Versioned symbols: special cases, [email protected] (general solution?)
  • locks (recursive), and lock/unlock across ckpt/restart (tid changes)
  • Simplify Locks
  • Async-signal-safe function handling

Automate testing:

  • --m32
  • --enable-mmapped-restart
  • --enable-forked-ckpt
  • x86/x86_64/arm
  • Contrib
    • Condor
    • Torque
    • KVM, Tun/Tap
    • IB
    • Python/IPython
    • MTCP for OpenMPI
  • Different architectures under U. Wisc. testing service.

Move src/config.h.in to include/config.h.in

Apparently, config.h is required by the IPC plugin and the like. One possibility is to move it to the include/ directory. As such, plugins shouldn't include anything from the src/ directory.

shared-memory1 is flaky

It sometimes fails with the following error (observed with ./test/autotest.py -v shared-memory1):

[14211] mtcp_restart.c:1085 read_shared_memory_area_from_file:
  error 2 opening mmap file /home/kapil/dmtcp/dmtcp-shared-memory.QExEDL

Migration of java app, new machine has different paths to .jar files

BRANCH: java-migration-bugfix
Partial fix for bug in migration of java apps

* Do not push into master.  This is still a work in progress.
* There are two parts so far:
*  MTCP:  After migration, if underlying file for MAP_SHARED doesn't
          exist, then re-map as MAP_PRIVATE | MAP_ANONYMOUS
*  fileconnection:  Don't do FileConnection::refill() on fd pointing
          to underlying file, since that file doesn't exist on dest machine.
*  connection:  There's still the fd to the former underlying file.
                How do we point that fd to /dev/null now?

Coding style

Right now the DMTCP code base doesn't have a uniform coding style, which is often counterproductive. Let's settle on one of the existing open-source coding standards and enforce it in the code base.

We preferably want something that can be mechanized through git pre-commit hooks, vim/emacs plugins, etc., to avoid spending additional time in formatting the code. For the automated formatting, Clang-format (http://clang.llvm.org/docs/ClangFormat.html) is one tool that comes to mind, but I am open to other ideas.

--enable-m32 doesn't work standalone

./configure --enable-m32 is supposed to work standalone, not just as part of a multi-arch build. Currently, to make it work, one must do something like:
make -j clean && rm -rf bin lib && make -j && (cd src && make -j || true) && (cp -prf lib/dmtcp/32/bin/ bin || true) && make -j check-dmtcp1
to make --enable-m32 work with all DMTCP commands built for m32. We should modify the make files to do this automatically, and rm -rf bin lib as necessary if this is part of a multi-arch build.
[ In addition to supporting other users, it makes it much easier for the developers to test m32 mode. ]

Memory corruption with multi-threaded process

I noticed a memory corruption related segfault on my laptop (2 cores, 4 threads) that is not too hard to reproduce. It doesn't even involve creating a checkpoint image.

dmtcp_launch ./test/pthread4

Let it run for a while and eventually, you will see a segmentation fault.

To look closely, you can define env var DMTCP_SEGFAULT_HANDLER=1 and instead of creating a core file it will loop inside the segfault handler function.

The next step should be to disable JAlloc by removing #define from jalloc.h. I couldn't use libc malloc on my laptop due to a separate issue involving isspace() (I'll create a separate issue about that).

Add InfiniBand detection

When a user forgets to enable the IB plugin while running an IB application, we should print a warning message.

For MVAPICH-1.9 with ibrun, a real pid appears in user code

Seen at Stampede at TACC. This can happen for ioctl(fd, S_GETOWN) during the leader election. This returns a real pid during the first ckpt instead of the virtual pid. This happens with a shared fd in the case of more than one process per node. Luckily, the logic compares the real pid to a virtual pid, and always returns false. This is the right answer in this case, but just by accident.

make check fails for pthread2 and pthread3 with newer libc (2.21)

The failure happens after the process has been successfully restarted. It eventually segfaults right after resuming from a restart. Here is the backtrace for the segfaulting thread:

(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007f37ba7a1112 in unwind_stop (version=<optimized out>, actions=<optimized out>, 
    exc_class=<optimized out>, exc_obj=<optimized out>, context=<optimized out>, 
    stop_parameter=<optimized out>) at unwind.c:82
#2  0x00007f37b9966512 in ?? () from /lib64/libgcc_s.so.1
#3  0x00007f37b9966894 in _Unwind_ForcedUnwind () from /lib64/libgcc_s.so.1
#4  0x00007f37ba7a1180 in __GI___pthread_unwind (buf=<optimized out>) at unwind.c:126
#5  0x00007f37ba79b565 in __do_cancel () at pthreadP.h:283
#6  __pthread_exit (value=<optimized out>) at pthread_exit.c:28
#7  0x00007f37bac26ffe in _real_pthread_exit (retval=0x0) at syscallsreal.c:967
#8  0x00007f37babf43f3 in pthread_exit (retval=0x0) at threadwrappers.cpp:235
#9  0x0000000000400cee in threadMain (data=0x1b43d40) at pthread2.c:69
#10 0x00007f37babf40cc in pthread_start (arg=<optimized out>) at threadwrappers.cpp:159
#11 0x00007f37ba79a4b4 in start_thread (arg=0x7f37b7e12700) at pthread_create.c:333
#12 0x00007f37babf3f29 in clone_start (arg=0x7f37bb8de808) at threadwrappers.cpp:68
#13 0x00007f37ba4d8a4d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) 

timer_create with SIGEV_THREAD causes restart to fail

Libc handles SIGEV_THREAD by creating a helper thread and uses the tid of the helper thread along with SIGEV_THREAD_ID when creating the timer (i.e., libc emulates SIGEV_THREAD).

Thus, on timer expiration, the kernel signals the helper thread which in turn spawns a new user thread that calls sigev_notify_function as asked by the user when creating the timer (with SIGEV_THREAD).

The problem happens on restart, when libc tries to use the pre-ckpt helper tid, which is now invalid. Hence, any call to timer_create with SIGEV_THREAD fails with EINVAL.

Replace wrapperExecutionLock by finer grained locks

wrapperExecutionLock seems to be used for two different purposes. It is used at the time of checkpoint to defer checkpoint while a wrapper is in a critical section. And it is used to prevent a deadlock in which a user thread might make a call, for example, while the checkpoint thread or another DMTCP wrapper is doing something sensitive. The latter sometimes occurs in cases of malloc. We should distinguish these two cases by two distinct, finer-grained locks.

In a related issue, _exitInProgress is used in conjunction with the wrapperExecutionLock to avoid a wrapper blocking when we are trying to use it to clean up after a call to exit. A more principled approach would be to replace _exitInProgress by a new DmtcpWorker state: ZOMBIE.

These ideas were developed in discussions with @rohgarg. Any credit should be joint. Any blame should go to me alone, for analyzing and presenting the ideas badly.

Recalculate plugin paths on restart based on the current environment

Currently, LD_PRELOAD is hardcoded in the checkpoint image. When the process is restarted on a different machine where the hardcoded paths are invalid, any newly created child processes will not be under checkpoint control.

A potential fix is to record the plugin names in the checkpoint image and recalculate the plugin library paths upon restart. This information can potentially be placed in the shared-area by dmtcp_restart binary.

Processing of pre-existing shared socket

Currently, for a pre-existing socket, dmtcp treats it as a normal socket before checkpoint (i.e., dmtcp still performs leader election, draining, etc.), and as a non-shared socket on restart. This works for non-shared sockets anyway, since each socket has exactly one leader (the process creating the socket), and the process will recreate the socket on restart.

Now suppose one pre-existing socket is shared by process A and B, and at checkpoint time, leader election is performed, and process A holds the lock. On restart, process A will recreate the socket, and treat it as a non-shared socket (it won't be put into the outgoing list). However, when process B handles the socket, it doesn't hold the lock, so process B will treat it as a shared fd, and mark it as an incoming connection, and wait for another process (should be process A) to send the shared fd back.

dlopen1 failing on CentOS 7 (and others?)

Checkpoint dlopen1 hangs with the following backtrace for main thread and the ckpt thread:

(gdb) bt
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fe1e3c33d4d in _L_lock_840 () from /lib64/libpthread.so.0
#2  0x00007fe1e3c33c6a in __GI___pthread_mutex_lock (mutex=0x7fe1e5352908 <_rtld_local+2312>)
    at pthread_mutex_lock.c:85
#3  0x00007fe1e4208189 in __dlsym (handle=<optimized out>, name=<optimized out>) at dlsym.c:68
#4  0x00007fe1e4afca25 in dmtcp::FileConnList::prepareShmList (this=this@entry=0x7fe1e534a208)
    at file/fileconnlist.cpp:151
#5  0x00007fe1e4afcdc9 in dmtcp::FileConnList::preLockSaveOptions (this=0x7fe1e534a208)
    at file/fileconnlist.cpp:75
#6  0x00007fe1e4ad5766 in dmtcp::ConnectionList::eventHook (this=0x7fe1e534a208, event=128, 
    event@entry=DMTCP_EVENT_THREADS_SUSPEND, data=data@entry=0x0) at connectionlist.cpp:118
#7  0x00007fe1e4afc2b3 in dmtcp_FileConnList_EventHook (event=event@entry=DMTCP_EVENT_THREADS_SUSPEND, 
    data=data@entry=0x0) at file/fileconnlist.cpp:46
#8  0x00007fe1e4acd54a in dmtcp_event_hook (event=DMTCP_EVENT_THREADS_SUSPEND, data=0x0) at ipc.cpp:46
#9  0x00007fe1e463bc88 in dmtcp::DmtcpWorker::waitForStage2Checkpoint () at dmtcpworker.cpp:550
#10 0x00007fe1e464b3a2 in dmtcp::callbackPreCheckpoint () at mtcpinterface.cpp:79
#11 0x00007fe1e465720c in checkpointhread (dummy=<optimized out>) at threadlist.cpp:358
#12 0x00007fe1e464dafc in pthread_start (arg=<optimized out>) at threadwrappers.cpp:159
#13 0x00007fe1e3c31df5 in start_thread (arg=0x7fe1e25cd700) at pthread_create.c:308
#14 0x00007fe1e464d959 in clone_start (arg=0x7fe1e532f808) at threadwrappers.cpp:68
#15 0x00007fe1e3f3c1ad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) 

As can be seen, the main thread got suspended while in the middle of dlsym() and thus when ckpt thread tried to call dlsym(), it reached a deadlock. We should enable the dlsym wrapper and use WRAPPER_EXECUTION_LOCK inside it.

JTRACE/JNOTE doesn't print for SLURM/Intel MPI-4.0/PMI

At the U. Buffalo HPC supercomputing center, we use --enable-debug. The jassertlog is created for ckpt/resume, but not after restart. Also, JNOTE does not print to stderr after restart. This is a minor bug, which can be diagnosed and fixed later.

__reset_atfork

I found that the UID of processes after fork is not reset on the Ohio Univ. RI cluster.
As a result, parent and children have the same UIDs, and this prevents the program from restarting when using Hydra.
Here is additional information that I was able to gather:

  1. This problem is not reproduced on my laptop's SLURM installation.
  2. If I apply the following patch the problem disappears on RI cluster too:
diff --git a/src/plugin/pid/pid_miscwrappers.cpp b/src/plugin/pid/pid_miscwrappers.cpp
index d4d3de1..984d55a 100644
--- a/src/plugin/pid/pid_miscwrappers.cpp
+++ b/src/plugin/pid/pid_miscwrappers.cpp
@@ -61,6 +61,7 @@ extern void *__dso_handle __attribute__ ((__weak__,
 extern "C" int __register_atfork(void (*prepare)(void), void (*parent)(void),
                                  void (*child)(void), void *dso_handle)
 {
+  getpid(); // force process to recalculate it's PID
   if (!pthread_atfork_initialized) {
     pthread_atfork_initialized = true;
     /* If we use pthread_atfork here, it fails for Ubuntu 14.04 on ARM.

These two observations make me confident that the problem is outside of the batch-queue plugin.

  3. A comparison of backtraces on my laptop and on RI shows that the RI system calls pthread_atfork, whereas my laptop's system doesn't:
MY LAPTOP:
#2  0x00007f11d389a0c6 in __register_atfork (prepare=0x7f11d3afcf49 <pthread_atfork_prepare()>, parent=0x7f11d3afcf50 <pthread_atfork_parent()>, 
    child=0x7f11d3afcf57 <pthread_atfork_child()>, dso_handle=0x7f11d3d7a000) at pid_miscwrappers.cpp:68
#3  0x00007f11d3aeaabe in dmtcp_prepare_wrappers () at dmtcpworker.cpp:150
#4  0x00007f11d3b48042 in _real_dup2 (oldfd=827, newfd=827) at syscallsreal.c:619
#5  0x00007f11d3b3734d in jalib::dup2 (oldfd=827, newfd=827) at ../jalib/jalib.cpp:97
#6  0x00007f11d3b37fc4 in jassert_internal::jassert_init () at ../jalib/jassert.cpp:124
#7  0x00007f11d3b3711d in jalib_init (jalibFuncPtrs=..., elfInterpreter=0x7f11d3b57cde "/lib64/ld-linux-x86-64.so.2", stderrFd=827, jassertLogFd=828, dmtcp_fail_rc=99)
    at ../jalib/jalib.cpp:57
#8  0x00007f11d3b2be72 in initializeJalib () at jalibinterface.cpp:70
#9  0x00007f11d3aeb6d1 in dmtcp::DmtcpWorker::DmtcpWorker (this=0x7f11d3d7a2b5 <dmtcp::DmtcpWorker::theInstance>) at dmtcpworker.cpp:290
#10 0x00007f11d3aed9ee in __static_initialization_and_destruction_0 (__initialize_p=1, __priority=65535) at dmtcpworker.cpp:283
#11 0x00007f11d3aeda23 in _GLOBAL__sub_I_dmtcpworker.cpp(void) () at dmtcpworker.cpp:603
#12 0x00007f11d48d813a in ?? () from /lib64/ld-linux-x86-64.so.2
#13 0x00007f11d48d8223 in ?? () from /lib64/ld-linux-x86-64.so.2
#14 0x00007f11d48c930a in ?? () from /lib64/ld-linux-x86-64.so.2
#15 0x0000000000000002 in ?? ()
#16 0x00007fff9a788e12 in ?? ()
#17 0x00007fff9a788e20 in ?? ()
#18 0x0000000000000000 in ?? ()


RI cluster node:
#2  0x00007f5481fe8075 in __register_atfork (prepare=0x7f5482248547 <pthread_atfork_prepare()>, parent=0x7f548224854d <pthread_atfork_parent()>, 
    child=0x7f5482248553 <pthread_atfork_child()>, dso_handle=0x7f54824c1200) at pid_miscwrappers.cpp:74
#3  0x00007f5482235e1e in dmtcp_prepare_wrappers () at dmtcpworker.cpp:150
#4  0x00007f5481fe80b3 in __register_atfork (prepare=0, parent=0, child=0x7f5481bc7950, dso_handle=0x7f5481dcb0a8) at pid_miscwrappers.cpp:84
#5  0x00007f5481bc79c4 in cri_pthread_init () from /usr/lib64/libcr.so.0
#6  0x00007f5481bc64d5 in ?? () from /usr/lib64/libcr.so.0
#7  0x00007f5481bc9016 in ?? () from /usr/lib64/libcr.so.0
#8  0x0000000000000002 in ?? ()
#9  0x0000000000000002 in ?? ()
#10 0x00007fffc9fd9998 in ?? ()
#11 0x00007f5481bc57e3 in _init () from /usr/lib64/libcr.so.0
#12 0x00007f54832164c8 in ?? ()
#13 0x00007f548301a545 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#14 0x00007f548300cb3a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#15 0x0000000000000002 in ?? ()
#16 0x00007fffc9fda8da in ?? ()
#17 0x00007fffc9fda8e8 in ?? ()
#18 0x0000000000000000 in ?? ()

Note that on the RI cluster __register_atfork is called twice, recursively, and one of the calls happens before dmtcp_prepare_wrappers. The backtraces as a whole also differ substantially from each other.

I can't determine the root cause of the problem yet, but maybe one of you will be able to guess it from this description. I am ready to provide additional information.

Rename `is32bit` with a more precise name

Looks like is32bit is a bad variable name when dealing with 32-bit binaries on 64-bit systems. We want something that says "it's a 32-bit binary on a 64-bit system". Is there a common convention that people use? If so, we can create a new variable that is set to true only in the said case and remains false otherwise.

When is registration of new sockets allowed?

This is related to a bug we saw with the PMI library (a7556cc). (In fact, we have seen a related issue with Bionic libc). The resource manager plugin would make a libpmi call during the pre-checkpoint time which created a short-lived socket. This would cause DMTCP to hang at checkpoint time while trying to register the new socket. So, my question is: what are the stages where registration of a new socket is safe?

This leads me to a more general question:
What's the set of phases where a plugin/application/library is allowed to create a new DMTCP resource? My guess is that this is specific to each plugin, but is there a common principle?

I think it would be good to document this (at least for the core plugins) and add some assertions in the code to make debugging easier.

changing path prefixes on restart

We're seeing more and more requests for changing path prefixes on restart. It's time that we rationalize our approach, instead of a series of ad hoc solutions. A large part of this may turn out to be a plugin, but some of it will have to go into the DMTCP core (at least into the MTCP core).
I propose that we have an environment variable: DMTCP_PATH_PREFIXES
It will be set to a colon-separated list both on launch and on restart. On restart, the list should be of the same length or else the behavior is undefined. Corresponding paths in the list indicate that the original path prefix should be virtualized and replaced by the new path prefix on restart.
This will directly affect mtcp_restart.c. Currently it has some commented code as a first start that tries to do this for the cwd. This could be a default setting: DMTCP_PATH_PREFIXES=$PWD by default. We will then re-map memory with an underlying file based on the new prefix.
Other places in DMTCP (probably the file plugin) are likely to also be affected. I understand that code less well, but I could probably hack through it. If that code could be re-factored into a new plugin for path virtualization, that might be the best approach.
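The positional matching between the launch-time list and the restart-time list could look like the sketch below. This only illustrates the proposal; nothing here exists in DMTCP. The function name remap_path_prefix is hypothetical, and the sketch assumes prefixes are written without a trailing slash and matches only on path-component boundaries.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Rewrite `path` by replacing the first matching prefix from `old_list`
 * with the prefix at the same position in `new_list`. Both lists are
 * colon-separated. Returns 1 if a prefix was replaced, 0 if `path` was
 * copied unchanged, -1 if the result does not fit in `out`.
 */
int remap_path_prefix(const char *old_list, const char *new_list,
                      const char *path, char *out, size_t outsz)
{
  assert(old_list && new_list && path && out);
  const char *o = old_list, *n = new_list;

  while (*o != '\0' && *n != '\0') {
    size_t olen = strcspn(o, ":");   /* length of the current old prefix */
    size_t nlen = strcspn(n, ":");   /* length of the matching new prefix */

    /* Match whole components only: /home/a must not match /home/ab. */
    if (olen > 0 && strncmp(path, o, olen) == 0 &&
        (path[olen] == '/' || path[olen] == '\0')) {
      int w = snprintf(out, outsz, "%.*s%s", (int)nlen, n, path + olen);
      return (w >= 0 && (size_t)w < outsz) ? 1 : -1;
    }

    /* Advance both lists in lockstep to the next entry. */
    o += olen; if (*o == ':') o++;
    n += nlen; if (*n == ':') n++;
  }

  int w = snprintf(out, outsz, "%s", path);
  return (w >= 0 && (size_t)w < outsz) ? 0 : -1;
}
```

For example, with the launch-time list /home/alice:/opt/jdk7 and the restart-time list /home/bob:/usr/lib/jvm, the path /opt/jdk7/lib/rt.jar would become /usr/lib/jvm/lib/rt.jar.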

There are some difficult issues that I do not propose to solve in this first implementation. For example, as we saw with the problem with Java migration, sometimes the underlying file will be strictly different. This will happen for example if there was a directory on the source machine based on one version of software and based on a different version on the destination machine. In this case, we might want to change to MAP_ANONYMOUS. A similar issue comes up for files mapped as MAP_SHARED. As I wrote earlier, I do not propose to solve these issues now --- but we need to be aware of them so that we don't implement a dead end for the evolution of our code.

Please consider this an RFC. I welcome all comments. I could start on a solution by this summer if we are all agreed.

Memory leak when timers are deleted

Timers created with SIGEV_THREAD have a dangling object when the timer is deleted. The fix is to intercept timer_delete and free the object.

Doxygen support for DMTCP

Having doxygen support (or some other form of hyper-linking) will help new developers to better understand the DMTCP code. This is important for building up the DMTCP community. Does anybody have experience, who could easily set up doxygen or other support for DMTCP?

Shared FD not restored on restart

In a simple client-server test using Unix domain sockets, open file descriptors shared over a Unix domain socket are not restored on restart.
The server can successfully open a FD and send it to the client over the socket. The client can immediately use the open FD, and it is added to the client process FD list. The FD is preserved through checkpoint, but on restart, it is not restored in the client process. (It is no longer in the FD list, and is not a valid file descriptor.)
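For readers who want to reproduce the setup, descriptor passing over a Unix domain socket is normally done with sendmsg()/recvmsg() and an SCM_RIGHTS control message. The sketch below is not the test program from this report; send_fd and recv_fd are hypothetical helper names, and the single dummy payload byte is an assumption of this example.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send file descriptor `fd` over the Unix domain socket `sock`.
 * Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
  assert(sock >= 0 && fd >= 0);
  char byte = 'F';  /* one byte of ordinary data accompanies the control message */
  struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
  union {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;  /* forces correct alignment of buf */
  } ctl;
  struct msghdr msg;

  memset(&ctl, 0, sizeof(ctl));
  memset(&msg, 0, sizeof(msg));
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctl.buf;
  msg.msg_controllen = sizeof(ctl.buf);

  struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;
  cmsg->cmsg_len = CMSG_LEN(sizeof(int));
  memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

  return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a file descriptor sent by send_fd().
 * Returns the new descriptor in this process, or -1 on error. */
int recv_fd(int sock)
{
  assert(sock >= 0);
  char byte;
  struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
  union {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
  } ctl;
  struct msghdr msg;

  memset(&msg, 0, sizeof(msg));
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctl.buf;
  msg.msg_controllen = sizeof(ctl.buf);

  if (recvmsg(sock, &msg, 0) != 1) return -1;

  struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
  if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
      cmsg->cmsg_type != SCM_RIGHTS ||
      cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
    return -1;
  }
  int fd;
  memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
  return fd;
}
```

The descriptor returned by recv_fd() refers to the same open file description as the sender's descriptor, which is the kind of FD that goes missing after restart in this report.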

ckptfile-plugin: make it useful

This plugin can be used by a user to specify what ckpt/restart behavior he/she wants for different open files. To this end I propose the following changes:

  • The plugin should read and act according to a set of rules (format described below) specified by the user in a text file.
    • Only the files specified to be checkpointed should be saved. This implies that this plugin should override the --ckpt-open-files option.
    • The files specified to be restored on restart should overwrite any existing files.
    • The files to be restored on restart should be a subset of the checkpointed files.
       # ckptfile rules
       # list of open files to be saved at checkpoint
       ckpt: < none | all | file1 [file2 [file3 ... ]] >
       # list of files to be restored on restart
       restart: < none | all |  file1 [file2 [file3 ... ]] >
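As a rough illustration of how the plugin might read these rules, here is a sketch of a line parser for the format above. Everything in it (the struct rule layout, the MAX_RULE_FILES limit, the parse_rule_line name and its return codes) is hypothetical and only meant to show that the format is easy to parse; file names containing whitespace are not handled.

```c
#include <assert.h>
#include <string.h>

enum rule_mode { RULE_NONE, RULE_ALL, RULE_LIST };

#define MAX_RULE_FILES 16  /* arbitrary limit for this sketch */

struct rule {
  enum rule_mode mode;
  int nfiles;
  char *files[MAX_RULE_FILES];  /* point into the caller's line buffer */
};

/* Parse one line of the rules file in place (the line buffer is modified).
 * Returns 1 if the line set *ckpt, 2 if it set *restart, 0 for blank or
 * comment lines, and -1 on a syntax error. */
int parse_rule_line(char *line, struct rule *ckpt, struct rule *restart)
{
  assert(line && ckpt && restart);
  char *p = line + strspn(line, " \t");
  if (*p == '\0' || *p == '\n' || *p == '#') return 0;

  char *colon = strchr(p, ':');
  if (colon == NULL) return -1;
  *colon = '\0';

  struct rule *r;
  int which;
  if (strcmp(p, "ckpt") == 0) {
    r = ckpt;
    which = 1;
  } else if (strcmp(p, "restart") == 0) {
    r = restart;
    which = 2;
  } else {
    return -1;
  }

  /* Collect whitespace-separated tokens after the colon. */
  r->mode = RULE_LIST;
  r->nfiles = 0;
  for (char *tok = strtok(colon + 1, " \t\r\n"); tok != NULL;
       tok = strtok(NULL, " \t\r\n")) {
    if (r->nfiles == MAX_RULE_FILES) return -1;
    r->files[r->nfiles++] = tok;
  }
  if (r->nfiles == 0) return -1;

  /* A single "none" or "all" token selects the corresponding mode. */
  if (r->nfiles == 1 && strcmp(r->files[0], "none") == 0) {
    r->mode = RULE_NONE;
    r->nfiles = 0;
  } else if (r->nfiles == 1 && strcmp(r->files[0], "all") == 0) {
    r->mode = RULE_ALL;
    r->nfiles = 0;
  }
  return which;
}
```

A caller would feed each line of the rules file to parse_rule_line() and afterwards check that the restart list is contained in the ckpt list, as required by the last bullet above.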

Put Coordinator Host and Port information in SharedArea

Currently, each process has to compute the coordinator address during fork/exec and ssh. The computation relies on getpeername(), etc. which is inefficient and error-prone. The fully qualified coordinator address (IP address and port) should be placed in sharedarea and then used by subsequent child processes.

Notice however that during dmtcp_launch and dmtcp_restart, the sharedarea is not yet available and so these should be the two places that compute the coordinator address.

dmtcp_command -k: exit() or _exit()?

Currently, 'dmtcp_command -k' causes the application to execute "_exit()". (See coordinatorapi.cpp) Because we call _exit() instead of exit(), pending buffered output is not flushed (glibc:_cleanup() not called), and any user atexit() functions are not called. Shouldn't we be calling exit() to respect the user's intentions?
The user might have registered an atexit() function that goes into an infinite loop, but I believe it's not up to us to protect the user from their own mistakes. Any user seeing this will soon discover that they registered a buggy atexit function.
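The difference is easy to observe outside DMTCP. The following sketch (the helper names are made up for this example, and nothing here is DMTCP code) forks a child that writes to a buffered stdio stream, registers an atexit() handler, and then terminates with either exit() or _exit(); the parent reports how many bytes actually reached the file.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static FILE *g_out;  /* stream shared with the atexit handler */

/* atexit handler: appends one more byte to the still-buffered stream. */
static void append_marker(void)
{
  fputs("!", g_out);
}

/* Fork a child that writes "hello" to `path` through stdio and terminates
 * with exit() (use_exit != 0) or _exit() (use_exit == 0).
 * Returns the resulting file size, or -1 on error. */
long bytes_left_by_child(const char *path, int use_exit)
{
  assert(path != NULL);
  fflush(NULL);  /* do not let the child inherit unflushed parent buffers */

  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) {
    g_out = fopen(path, "w");
    if (g_out == NULL || atexit(append_marker) != 0) _exit(1);
    fputs("hello", g_out);  /* stays in the stdio buffer for now */
    if (use_exit) exit(0);  /* runs atexit handlers, then flushes stdio */
    _exit(0);               /* skips both */
  }

  int status;
  if (waitpid(pid, &status, 0) != pid) return -1;
  struct stat st;
  return stat(path, &st) == 0 ? (long)st.st_size : -1;
}
```

With exit() the file ends up containing "hello!" (6 bytes); with _exit() it stays empty, which is the data loss a user would see today after 'dmtcp_command -k'.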

BUG diagnosis: User files saved in DMTCP_TMPDIR are being blacklisted.

DMTCP_TMPDIR and TMPDIR can coincide. A user will put files in TMPDIR, and DMTCP considers that all files in DMTCP_TMPDIR are owned by DMTCP. This bug apparently affects any user who sets DMTCP_TMPDIR or who uses the --tmpdir flag of dmtcp_launch. In particular, it affects test/file1, and it affects Open MPI version 1.6.

The following fails:
DMTCP_TMPDIR=/tmp/dmtcp-gene@dekaksi make check-file1

The issue is that test/file1 creates and then unlinks a file while
continuing to use the file descriptor pointing to it.

Apparently, _isBlacklistedFile() considers this user file to be
blacklisted because it is in DMTCP_TMPDIR, and the program logic assumes
that all files in DMTCP_TMPDIR are owned by DMTCP.

Because the file is blacklisted, the file type is never set to FILE_DELETED.
So, on restart, the restart fails.
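For reference, the create-then-unlink pattern that test/file1 exercises looks roughly like the sketch below. This is not the code of test/file1; open_unlinked and fd_is_unlinked are hypothetical names, and checking st_nlink is only one possible way to recognize such a file from its descriptor.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create a temporary file from the mkstemp() template `tmpl`, remove its
 * directory entry, and return the still-open descriptor (-1 on error). */
int open_unlinked(char *tmpl)
{
  assert(tmpl != NULL);
  int fd = mkstemp(tmpl);
  if (fd < 0) return -1;
  if (unlink(tmpl) != 0) {
    close(fd);
    return -1;
  }
  return fd;
}

/* Return 1 if the file behind `fd` has no directory entries left,
 * 0 if it is still linked, and -1 on error. */
int fd_is_unlinked(int fd)
{
  struct stat st;
  if (fstat(fd, &st) != 0) return -1;
  return st.st_nlink == 0;
}
```

The descriptor remains fully usable for reads and writes after the unlink, so the checkpointed state has to record that the file was deleted regardless of which directory it lived in.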

In order to demonstrate this diagnosis, try for example:
DMTCP_TMPDIR=/tmp/dmtcp-USER@HOST make check-file1
and restart will fail.
Then go to: FileConnection::handleUnlinkedFile() and comment out the call to !_isBlacklistedFile(_path), and the bug will go away when again executing:
DMTCP_TMPDIR=/tmp/dmtcp-USER@HOST make check-file1

I'll need to think carefully about a re-design of the file plugin to fix this. Any suggestions are welcome.

Nameservice socket to coordinator invalid after restart.

Description as received from @jiajuncao

In coordinatorAPI.cpp, we have 2 sockets: _coordinatorSocket and _nsSock. When DMTCP is not in the running state, _coordinatorSocket is used for the publish/subscribe service (_nsSock and _coordinatorSocket are the same). To support publish/subscribe during the running state, an extra socket is created (_nsSock = createNewSocketToCoordinator(COORD_ANY)). The IB plugin makes use of this feature after a restart: it needs to subscribe to information about the remote memory region key while running. The first restart is OK. But when we try to checkpoint after the first restart and then restart a second time, _nsSock is a bad file descriptor: it keeps the same fd number as the one created after the last restart, but it is no longer valid. I think it's either closed or not successfully restored.

Here is a simple plugin doing just publish/subscribe. It's a stand-alone version based on example-db. You can test it by running dmtcp1.
Again, the test procedure is ckpt, restart, ckpt, restart. It'll fail after the second restart.

#include <stdint.h>   /* uint32_t */
#include <stdio.h>    /* printf */
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include "dmtcp.h"

struct keyPid {
  int key;
  pid_t pid;
} mystruct, mystruct_other;

static int is_restart = 0;
uint32_t sizeofPid;

void dmtcp_event_hook(DmtcpEvent_t event, DmtcpEventData_t *data)
{

  /* NOTE:  See warning in plugin/README about calls to printf here. */
  switch (event) {
  case DMTCP_EVENT_RESTART:
    is_restart = 1;
    break;
  case DMTCP_EVENT_REGISTER_NAME_SERVICE_DATA:
    if (is_restart) {
      mystruct.key = 1;
      mystruct.pid = getpid();
      dmtcp_send_key_val_pair_to_coordinator("ex-db",
                                             &(mystruct.key),
                                             sizeof(mystruct.key),
                                             &(mystruct.pid),
                                             sizeof(mystruct.pid));
    }
    break;
  default:
    break;
  }
  DMTCP_NEXT_EVENT_HOOK(event, data);
}

unsigned int sleep(unsigned int seconds)
{
  unsigned int result = NEXT_FNC(sleep)(seconds);
  if (is_restart) {
    sizeofPid = sizeof(mystruct_other.pid);
    mystruct_other.key = 1;
    dmtcp_send_query_to_coordinator("ex-db",
                                    &(mystruct_other.key),
                                    sizeof(mystruct_other.key),
                                    &(mystruct_other.pid),
                                    &sizeofPid);
    printf("Pid returned from coordinator is %ld.\n", (long)mystruct_other.pid);
  }

  return result;
}

Simplify DMTCP_TMPDIR environment variable

DMTCP_TMPDIR points to the directory where DMTCP places all temporary files. This variable is inherited by all child processes and is passed as a command line flag to remote (ssh) children.

Currently, DMTCP_TMPDIR is considered the root directory and we do not append anything to it. That is, if DMTCP_TMPDIR=/home/user/tmp, all temp files will end up in /home/user/tmp, even on remote nodes. This could lead to unexpected behavior with some files, such as sharedArea.*, which expect a node-private tmp dir.

A potential fix is to always append dmtcp-USER@HOSTNAME to the passed DMTCP_TMPDIR. This is the same suffix that we append to TMPDIR (or /tmp) if DMTCP_TMPDIR is not specified.
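A minimal sketch of that suffixing step, assuming the user and host names have already been obtained (for example from getenv("USER") and gethostname()). The function name build_dmtcp_tmpdir is hypothetical and does not correspond to an existing DMTCP routine.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write "<base>/dmtcp-<user>@<host>" into `out`.
 * Trailing slashes in `base` are ignored so the result never contains "//".
 * Returns 0 on success, -1 if `out` is too small. */
int build_dmtcp_tmpdir(const char *base, const char *user, const char *host,
                       char *out, size_t outlen)
{
  assert(base && user && host && out);
  size_t blen = strlen(base);
  while (blen > 1 && base[blen - 1] == '/') blen--;

  /* The root directory "/" already ends in a separator. */
  const char *sep = (blen == 1 && base[0] == '/') ? "" : "/";

  int n = snprintf(out, outlen, "%.*s%sdmtcp-%s@%s",
                   (int)blen, base, sep, user, host);
  return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Because the host name is part of the result, two nodes sharing /home/user/tmp over a network file system would each get their own subdirectory, which is what files like sharedArea.* expect.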

This also solves a problem related to restarted computation where the user might want to specify a different value for DMTCP_TMPDIR. In this case, we can put the value of DMTCP_TMPDIR in SharedData which is later used by all child processes.

Restart issues in 2.4.0rc2

I'm having issues with restarts passing on my CentOS 7.1 system; these issues don't seem to be present in 2.3.1; in both cases, I've used the default ./configure (no options), but I've included the config.log and config.status here.
Additionally the .cores and checkpoints are available.

Verifying there is enough disk space ...
== Tests ==
dmtcp1         ckpt:PASSED rstr:FAILED (first process rec'd signal 11) (core.17746 copied to DMTCP_TMPDIR:/tmp/[email protected]/) retry:
***** Copied checkpoint images to /tmp/[email protected]/dmtcp-autotest-505920597
FAILED
               root-pids: [17773] msg: restart error, 1 expected, 0 found, running=0
dmtcp2         ckpt:PASSED rstr:PASSED; ckpt:PASSED rstr:PASSED
dmtcp3         ckpt:PASSED rstr:FAILED (first process rec'd signal 11) (core.17833 copied to DMTCP_TMPDIR:/tmp/[email protected]/) retry:PASSED; ckpt:PASSED rstr:FAILED (first process rec'd signal 11) (core.17883 copied to DMTCP_TMPDIR:/tmp/[email protected]/) retry:PASSED
dmtcp4         ckpt:PASSED rstr:PASSED; ckpt:PASSED rstr:PASSED
dmtcp5         ckpt:PASSED rstr:FAILED (first process rec'd signal 11) (core.18128 copied to DMTCP_TMPDIR:/tmp/[email protected]/) retry:
***** Copied checkpoint images to /tmp/[email protected]/dmtcp-autotest-505920597
FAILED
               root-pids: [18164] msg: restart error, 2 expected, 1 found, running=0
syscall-tester ckpt:PASSED rstr:PASSED; ckpt:PASSED rstr:PASSED
file1          ckpt:PASSED rstr:PASSED; ckpt:PASSED rstr:PASSED
dmtcpaware1    ckpt:PASSED rstr:
FAILED (first process rec'd signal 11) (core.18229 copied to DMTCP_TMPDIR:/tmp/[email protected]/) retry:y



^Cdmtcpaware1    FAILED
               root-pids: [] msg: failed to write 's' to coordinator (pid: 17740)
CLEANUP ERROR: failed to write 'k' to coordinator (pid: 17740)
SHUTDOWN() failed
make: *** [check] Error 1

Jalib shouldn't use dmtcp plugin API

I just noticed that some part of Jalib is now using the plugin API. This shouldn't be the case. Whatever is needed by jalib should be supplied via jalib_init. The whole point of jalib_init was to eliminate two-way communication between jalib and rest of dmtcp.
