Comments (11)

gc00 avatar gc00 commented on September 26, 2024

Jiajun and I talked, and we have a theory of the cause of the bug. It also seems to be confirmed by an experiment.
We believe that the MVAPICH libraries have constructors that create sockets even before main(). Because DMTCP is first in library search order, it is initialized last. So, the MVAPICH constructors run before DMTCP. Later, the DMTCP constructor (dmtcpWorker) calls scanForPreexisting(), which looks for things like named pipes, UNIX domain sockets (corresponding to filenames on disk), pty's, etc. These are declared as preexisting devices. So, at the time of DMTCP launch, these sockets are declared as preexisting. Much later, during restart, DMTCP remembers that these are preexisting devices, and so the leader from leader election refuses to believe that these fd's are shared fd's. So, these fd's are removed from outgoingCons. Meanwhile, the non-leader processes list these fd's in missingCons, and block while waiting for the leader to send these shared fd's.
We tested this theory by modifying scanForPreexisting() so that it no longer marks any device whose name begins with "socket" as preexisting. With that change, these fd's are no longer considered preexisting, and the bug goes away.
So, now we need to find a more robust bug fix. Maybe some devices from /proc/*/fd truly are preexisting, even if their names are of the form "socket:[...]". What is the best way to distinguish the fd's created by the MVAPICH constructor from the truly preexisting ones?
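
For readers who want to see what such a scan observes, here is a minimal sketch, assuming Linux and a mounted /proc. It is not DMTCP's scanForPreexisting(); the function name `listSocketFds` is hypothetical. It walks /proc/self/fd and collects every descriptor whose link target starts with "socket:[", which is the kind of entry the experiment above stopped treating as preexisting.

```cpp
#include <dirent.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <climits>
#include <cstdlib>
#include <map>
#include <string>

// Hypothetical helper (not part of DMTCP): returns fd -> link target for
// every open descriptor whose /proc/self/fd entry starts with "socket:[".
std::map<int, std::string> listSocketFds() {
  std::map<int, std::string> result;
  DIR *dir = opendir("/proc/self/fd");
  if (dir == nullptr) return result;   // no /proc: report nothing
  const int dirFd = dirfd(dir);        // the scan's own descriptor
  while (struct dirent *ent = readdir(dir)) {
    char *end = nullptr;
    long fd = strtol(ent->d_name, &end, 10);
    if (end == ent->d_name || *end != '\0') continue;  // skips "." and ".."
    if (fd == dirFd) continue;
    std::string path = std::string("/proc/self/fd/") + ent->d_name;
    char target[PATH_MAX];
    ssize_t n = readlink(path.c_str(), target, sizeof(target) - 1);
    if (n < 0) continue;               // fd vanished or is unreadable
    target[n] = '\0';
    std::string name(target);
    if (name.rfind("socket:[", 0) == 0) {  // prefix match
      result[static_cast<int>(fd)] = name;
    }
  }
  closedir(dir);
  return result;
}
```

The link target alone only says that an fd is a socket. It says nothing about who created it or when, which is exactly why the name-based exclusion is not a robust fix.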

from dmtcp.

karya0 avatar karya0 commented on September 26, 2024

I think the fix should be to somehow capture the call to socket() originating from MVAPICH constructors. Can you check if the socket() calls land in our wrappers? The question of truly preexisting sockets becomes moot if we use this fix.

If we can't do that, it's much harder to figure out a way to restore the socket in a generic way. We can probably find a hack for MVAPICH but it would be nice to have a more generic solution.

from dmtcp.

jiajuncao avatar jiajuncao commented on September 26, 2024

Unfortunately, I wasn't able to capture the creation of the sockets. I wonder if there are other ways to create the sockets? I put JNOTE in all functions inside socketwrapper.cpp, but none of them was invoked.

from dmtcp.

gc00 avatar gc00 commented on September 26, 2024

That's interesting. Then it seems like the constructor inside MVAPICH
is using a library call other than socket(). Could it be
socketpair()?
Jiajun, since we have access to the MVAPICH developers, why
not ask them where file descriptors 3 and 4 come from, while explaining
that it seems to happen in a constructor before main, probably as part
of the MPI initialization?
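
As an illustration of this hypothesis only, here is a minimal sketch of how a library constructor could leave sockets open before main() without ever calling socket(). Nothing in it is taken from MVAPICH; `earlyInit`, `g_earlyFds`, and `isSocket` are hypothetical names, and the `constructor` attribute is a GCC/Clang extension.

```cpp
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cassert>

// Hypothetical stand-in for state a library might set up at load time.
static int g_earlyFds[2] = {-1, -1};

// Runs before main(). It creates a connected pair of UNIX domain sockets
// via socketpair(), so a wrapper around socket() alone never sees them.
__attribute__((constructor))
static void earlyInit() {
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, g_earlyFds) != 0) {
    g_earlyFds[0] = g_earlyFds[1] = -1;  // leave a clear failure marker
  }
}

// True if fd is open and refers to a socket.
bool isSocket(int fd) {
  struct stat st;
  return fd >= 0 && fstat(fd, &st) == 0 && S_ISSOCK(st.st_mode);
}
```

By the time main() starts, `isSocket(g_earlyFds[0])` already holds, so any scan that runs later sees the sockets as if they had always been there.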

I agree with Kapil's observation that if we can capture the original
creation of the socket, then we will have a more robust way of determining
preexisting sockets.

On Fri, Apr 10, 2015 at 12:26:57AM -0700, jiajuncao wrote:

Unfortunately, I wasn't able to capture the creation of the sockets. I wonder if there're other ways to create the sockets? I put JNOTE in all functions inside socketwrapper.cpp, but none of them was invoked.

from dmtcp.

karya0 avatar karya0 commented on September 26, 2024

One possibility is to use strace to see the order of syscalls to get an idea about the socket call.

from dmtcp.

jiajuncao avatar jiajuncao commented on September 26, 2024

Actually the sockets are there before dmtcp_prepare_wrappers() is called. Have we met this scenario before?

from dmtcp.

jiajuncao avatar jiajuncao commented on September 26, 2024

I think the sockets are inherited from the parent process. So we cannot assume that pre-existing sockets are not shared.

from dmtcp.

karya0 avatar karya0 commented on September 26, 2024

I think the sockets are inherited from the parent process. So we cannot
assume that pre-existing sockets are not shared.

Do you mean the sockets are present within the dmtcp_launch process? This
is less likely (although I have seen such situations where the shell itself
has a socket connection).

If the socket is not present in the dmtcp_launch process, then it must have
been created during the application launch. The only way to figure it out
is to use strace and do some analysis. You might also want to look for some
odd syscall that creates a socket as a side-effect.

from dmtcp.

jiajuncao avatar jiajuncao commented on September 26, 2024

I think it's the former case: the sockets are present in the dmtcp_launch process. In fact, the way we invoke dmtcp at Stampede is as follows: ibrun dmtcp_launch a.out. Here ibrun is a wrapper around mpirun_rsh. It launches an MPI spawn process on each node, which then forks the real computing processes. Only the computing processes run under dmtcp. I think the sockets are passed down from the spawn process.
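
A minimal sketch of the inheritance mechanism described above, assuming a POSIX system. The function name `childInheritsSocket` is hypothetical, and the parent here only stands in for the spawn process. The child never creates a socket, yet it finds one already open on the inherited fd.

```cpp
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cassert>

// Returns true if a forked child finds the parent's socket already open
// on the same fd number, without creating any socket itself.
bool childInheritsSocket() {
  int sv[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return false;

  pid_t pid = fork();
  if (pid < 0) {
    close(sv[0]);
    close(sv[1]);
    return false;
  }
  if (pid == 0) {
    // Child: only inspect the inherited descriptor, then exit.
    struct stat st;
    bool ok = fstat(sv[0], &st) == 0 && S_ISSOCK(st.st_mode);
    _exit(ok ? 0 : 1);
  }

  // Parent: collect the child's verdict.
  int status = 0;
  pid_t waited = waitpid(pid, &status, 0);
  close(sv[0]);
  close(sv[1]);
  return waited == pid && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Descriptors also survive exec() unless FD_CLOEXEC is set on them, so a process started as dmtcp_launch a.out can begin with such sockets already open, before any wrapper has a chance to see them being created.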

from dmtcp.

karya0 avatar karya0 commented on September 26, 2024

I think it's the former case: the sockets are present in the dmtcp_launch
process. In fact, the way we invoke dmtcp at Stampede is as follows: ibrun
dmtcp_launch a.out. Here ibrun is a wrapper around mpirun_rsh. It will
launch mpi spawn process on each node, which then will fork the real
computing processes. Only the computing processes are running under dmtcp.
I think the sockets are passed from the spawn process.

In this case, we need to figure out more information about those sockets
and maybe write a separate plugin to handle ckpt/restore.

from dmtcp.

gc00 avatar gc00 commented on September 26, 2024

Commit 6503a5f is labelled as a fix for this issue. @jiajuncao: if this issue is truly now fixed, could you close this? Thanks.

from dmtcp.
