Restart issues in 2.4.0rc2 (dmtcp, open)

dmtcp commented on June 24, 2024
Restart issues in 2.4.0rc2

from dmtcp.

Comments (16)

bbarker commented on June 24, 2024

I was a bit worried that the issue was due to my having a funny set of LDFLAGS, CPPFLAGS, etc. in my build environment; after getting rid of these and rebuilding the problem persisted, so finally I tested it on a clean CentOS 7 virtual machine (and previously, by accident, on a CentOS 6.6 or 6.5 system).

The verdict: the problem with 2.4.0rc2 persists in all cases so far on CentOS 7, but it was fine on CentOS 6.?.

gc00 commented on June 24, 2024

Hi Brandon,
Thanks very much for your report. Right now, we don't have direct access
to a CentOS 7 system. We've been testing on CentOS 6 and Red Hat 6 so far.
Is there a chance that you could provide a guest account?
In the meantime, some information that might help us is:

First, from the root directory of DMTCP, could you send us the output of:
make display-build-env
Second, could you build DMTCP as follows:
./configure CFLAGS="-g -O0" CXXFLAGS="-g -O0"
make -j
and then:
make tidy
ulimit -c unlimited
bin/dmtcp_launch -i6 test/dmtcp1
bin/dmtcp_restart ckpt_dmtcp1_*.dmtcp
[ This should generate a core dump ]
gdb test/dmtcp1 core*
(gdb) thread apply all bt full
and send us the output from GDB.

Thanks very much,
- Gene

On Mon, Apr 13, 2015 at 06:36:41AM -0700, Brandon Elam Barker wrote:

I'm having issues with restarts passing on my CentOS 7.1 system; these issues don't seem to be present in 2.3.1; in both cases, I've used the default ./configure (no options), but I've included the config.log and config.status here.
Additionally the .cores and checkpoints are available.

Verifying there is enough disk space ...
== Tests ==
dmtcp1         ckpt:PASSED rstr:FAILED (first process rec'd signal 11) (core.17746 copied to DMTCP_TMPDIR:/tmp/[email protected]/) retry:
***** Copied checkpoint images to /tmp/[email protected]/dmtcp-autotest-505920597
FAILED
               root-pids: [17773] msg: restart error, 1 expected, 0 found, running=0
dmtcp2         ckpt:PASSED rstr:PASSED; ckpt:PASSED rstr:PASSED
dmtcp3         ckpt:PASSED rstr:FAILED (first process rec'd signal 11) (core.17833 copied to DMTCP_TMPDIR:/tmp/[email protected]/) retry:PASSED; ckpt:PASSED rstr:FAILED (first process rec'd signal 11) (core.17883 copied to DMTCP_TMPDIR:/tmp/[email protected]/) retry:PASSED
dmtcp4         ckpt:PASSED rstr:PASSED; ckpt:PASSED rstr:PASSED
dmtcp5         ckpt:PASSED rstr:FAILED (first process rec'd signal 11) (core.18128 copied to DMTCP_TMPDIR:/tmp/[email protected]/) retry:
***** Copied checkpoint images to /tmp/[email protected]/dmtcp-autotest-505920597
FAILED
               root-pids: [18164] msg: restart error, 2 expected, 1 found, running=0
syscall-tester ckpt:PASSED rstr:PASSED; ckpt:PASSED rstr:PASSED
file1          ckpt:PASSED rstr:PASSED; ckpt:PASSED rstr:PASSED
dmtcpaware1    ckpt:PASSED rstr:
FAILED (first process rec'd signal 11) (core.18229 copied to DMTCP_TMPDIR:/tmp/[email protected]/) retry:y



^Cdmtcpaware1    FAILED
               root-pids: [] msg: failed to write 's' to coordinator (pid: 17740)
CLEANUP ERROR: failed to write 'k' to coordinator (pid: 17740)
SHUTDOWN() failed
make: *** [check] Error 1

Reply to this email directly or view it on GitHub:
#56

gc00 commented on June 24, 2024

Hi all,
Just a quick status report on DMTCP with CentOS 7. So far what I'm seeing
is that restart is failing when we have multiple threads or processes (via fork).
So, it's failing for the dmtcp3 and dmtcp5 tests.
Interestingly, although dmtcp_restart fails, if I run it under gdb,
it starts up fine:
gdb --args dmtcp_restart ...
Typically this means that there's something about the memory layout,
and gdb forces a different memory layout (or initializes memory differently).
I looked at the memory layout when running the dmtcp3 test. Here's what I'm
seeing: vsyscall is on the last of the pages that
are mapped.
Kapil, I think you may have said something about that being significant.
Let me know if there are some other tests that you'd like me to run.
[ I'm editing out most of the memory map below. There's a simpler analysis below. - @gc00 ]

[dmtcp@euca-128-84-11-199 dmtcp]$ cat /proc/22243/maps
00400000-00401000 r-xp 00000000 fd:02 4797973 /home/dmtcp/dmtcp/test/dmtcp3
00600000-00601000 r--p 00000000 fd:02 4797973 /home/dmtcp/dmtcp/test/dmtcp3
00601000-00602000 rw-p 00001000 fd:02 4797973 /home/dmtcp/dmtcp/test/dmtcp3
016f2000-01713000 rw-p 00000000 00:00 0 [heap]
...
7fff3aefc000-7fff3b6df000 rw-p 00000000 00:00 0 [stack]
7fff3b7fe000-7fff3b800000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
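
To compare layouts between a failing and a working run, it can help to pull the special regions out of a maps listing programmatically. The following is a minimal sketch, not part of DMTCP: the `Region` type and `parse_maps` helper are names invented here, and the sample text is a subset of the lines quoted above.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    start: int
    end: int
    perms: str
    name: str


def parse_maps(text: str) -> List[Region]:
    """Parse /proc/<pid>/maps text into a list of regions."""
    regions = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5 or "-" not in fields[0]:
            continue  # skip blank lines and elisions such as "..."
        start, end = (int(x, 16) for x in fields[0].split("-"))
        name = fields[5].strip() if len(fields) > 5 else ""
        regions.append(Region(start, end, fields[1], name))
    return regions


# Subset of the listing quoted in this comment.
SAMPLE = """\
00400000-00401000 r-xp 00000000 fd:02 4797973 /home/dmtcp/dmtcp/test/dmtcp3
016f2000-01713000 rw-p 00000000 00:00 0 [heap]
...
7fff3aefc000-7fff3b6df000 rw-p 00000000 00:00 0 [stack]
7fff3b7fe000-7fff3b800000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
"""

regions = parse_maps(SAMPLE)
# Kernel-provided regions are the ones with bracketed names.
special = {r.name: r for r in regions if r.name.startswith("[")}
for name, r in special.items():
    print(f"{name:11s} {r.start:#x}-{r.end:#x} ({(r.end - r.start) // 1024} KiB)")
```

Running the same parser on the maps of the original process and of the restarted one would show whether [heap], [stack], or [vdso] moved between the two.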

gc00 commented on June 24, 2024

And I can now report that when address randomization is turned off,
DMTCP works. I set:
sudo bash -c 'echo 0 > /proc/sys/kernel/randomize_va_space'
and that makes DMTCP work.

Probably the effect of GDB is to turn off the randomization.
Next, I have to make it work again, with address randomization.
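
For anyone reproducing this, the sysctl written above takes three documented values on Linux: 0 (off), 1 (stack, mmap base, and vDSO randomized), and 2 (additionally the brk()-managed heap). Here is a small sketch that reads and describes the current setting; the helper names are invented for illustration, and the script only reads the file, it never changes it.

```python
from pathlib import Path

# Meanings of kernel.randomize_va_space, per the Linux sysctl documentation.
ASLR_MODES = {
    0: "disabled",
    1: "stack, mmap base, and vDSO randomized",
    2: "as 1, plus the brk()-managed heap",
}


def describe_aslr(raw: str) -> str:
    """Translate the raw file contents (e.g. '2\\n') into a description."""
    value = int(raw.strip())
    return ASLR_MODES.get(value, f"unknown mode {value}")


def current_aslr(path: str = "/proc/sys/kernel/randomize_va_space") -> str:
    """Describe the running system's setting, if the file is readable."""
    try:
        return describe_aslr(Path(path).read_text())
    except (OSError, ValueError):
        return "unavailable on this system"


print(current_aslr())
```

GDB's `disable-randomization` setting is on by default, which is consistent with the guess above about why the restart succeeds under the debugger.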

Best,
- Gene

On Mon, Apr 13, 2015 at 07:49:30AM -0700, Brandon Elam Barker wrote:

I was a bit worried that the issue was due to my having a funny set of LDFLAGS, CPPFLAGS, etc. in my build environment; after getting rid of these and rebuilding the problem persisted, so finally I tested it on a clean CentOS 7 virtual machine (and previously, by accident, on a CentOS 6.6 or 6.5 system).

The verdict: the problem with 2.4.0rc2 persists in all cases so far on CentOS 7, but it was fine on CentOS 6.?.


karya0 commented on June 24, 2024

@bbarker: Can you manually checkpoint/restart the test and provide us with the output, as @gc00 suggested above?

karya0 commented on June 24, 2024

@gc00: Now that you can reproduce the bug, can you provide the output of dmtcp_restart ckpt_dmtcp1*.dmtcp?

gc00 commented on June 24, 2024

Hi all, Since dmtcp-2.3.1 was running correctly, I ran 'git bisect' on this, in the CentOS 7 distro. I found the bug, and a simple one-line bug fix. For a one line fix, I won't do a pull request, but I'd like to analyze the implications of this fix here, and together we'll decide if that is enough.
The bug was introduced in: 1a7d8db
In that commit, mtcp_check_vdso() was turned off by default. ./configure CFLAGS=-DENABLE_VDSO_CHECK turns that function on again, and then the bug goes away.
@karya0: The #ifdef was your code. What is your preference? Shall we permanently remove the #ifdef ENABLE_VDSO_CHECK, and add a comment that CentOS7 requires mtcp_check_vdso()? Or do you want to leave the #ifdef in the code, and add a #define? I would suggest removing the #ifdef and adding a comment.
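
For readers unfamiliar with the bisect step mentioned above, the idea is a binary search between a known-good and a known-bad revision. The sketch below models that search in Python under simplifying assumptions: a linear history, a deterministic test, and made-up commit names (the real endpoints here were dmtcp-2.3.1 and 2.4.0rc2). It is not how git implements bisect internally.

```python
from typing import Callable, List


def bisect_first_bad(commits: List[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit.

    commits is ordered oldest to newest; commits[0] is assumed good and
    commits[-1] is assumed bad, mirroring `git bisect good` / `git bisect bad`.
    """
    lo, hi = 0, len(commits) - 1  # lo: newest known good, hi: oldest known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Hypothetical history: everything from "c4" onward fails the restart test.
history = ["good-release", "c1", "c2", "c3", "c4", "c5", "release-candidate"]
first_bad = bisect_first_bad(history, lambda c: history.index(c) >= 4)
print(first_bad)  # c4
```

With git itself, the same search can be automated with `git bisect run` and a script that exits non-zero when the restart test fails.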

gc00 commented on June 24, 2024

P.S.: For ./configure CFLAGS=-DENABLE_VDSO_CHECK to work for you above, you will first have to do: git pull --rebase. Another one of those annoying bugs that crept in while we were preparing for the release (and this one was my fault).

gc00 commented on June 24, 2024

And there appears to be one remaining bug exposed in CentOS 7. The dlopen1 test is failing. I'll look into it.

gc00 commented on June 24, 2024

And I meant to write: do git pull --rebase before ./configure CFLAGS=-DENABLE_VDSO_CHECK. Time to get some more rest before I become too incoherent.

karya0 commented on June 24, 2024

@gc00: The bug isn't quite related to ASLR. Apparently, if a variable's type is "char*", JNOTE automatically dereferences the variable and prints what it points to. However, when restoring the heap, curBrk was declared with type "char*" and points to the current (unmapped) heap. As expected, JNOTE tried to dereference curBrk and segfaulted.

Turning off ASLR hid the bug: the current and original values of brk() were then the same, so JNOTE was never called and the invalid memory was never dereferenced.

I'll shortly push the fix to github.

I think I also understand the dlopen bug and have created a separate issue (#57) to track it.
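
The curBrk crash described above comes down to a logger treating every char* as a C string. A rough Python ctypes analogy shows the difference between formatting a pointer as a string, which reads the memory it points to, and formatting it as an address, which does not. This is not DMTCP's JNOTE code, and the buffer and helper names are made up for illustration.

```python
import ctypes


def log_as_string(ptr) -> bytes:
    """Treat the pointer as char*: follow it and read bytes up to the NUL."""
    return ctypes.cast(ptr, ctypes.c_char_p).value


def log_as_address(ptr) -> str:
    """Treat the pointer as void*: report where it points, read nothing."""
    return hex(ctypes.cast(ptr, ctypes.c_void_p).value)


# A valid, NUL-terminated buffer standing in for mapped heap memory.
buf = ctypes.create_string_buffer(b"heap data")

print(log_as_string(buf))   # b'heap data'
print(log_as_address(buf))  # address of buf, varies per run
```

With a valid buffer both calls succeed. Had the pointer referred to unmapped memory, as curBrk did on restart, only the address form would be safe. In C the analogous distinction is printing with %s versus %p.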

gc00 commented on June 24, 2024

> I'll shortly push the fix to github.
> I think I also understand the dlopen bug and ....
Very efficient! Thanks.

gc00 commented on June 24, 2024

@karya0 and @jiajuncao: The bug fix by @karya0 definitely fixes DMTCP on CentOS 7. However, @jiajuncao has also been seeing a random bug on restart at Stampede (with MVAPICH). He can take the same checkpoint image and restart many times. A bug appears about 20% of the time. From the core image, we see that memory is corrupted on restart (but only 20% of the time).
We then tested at Stampede/MVAPICH by including the function mtcp_check_vdso(). During about 15 or 20 tests, we did not observe the bug on restart.
I propose to remove the #ifdef ENABLE_VDSO_CHECK and to change the corresponding comment to: // If mtcp_check_vdso isn't called, CentOS 7 fails on dmtcp3, dmtcp5, others
Do you agree, @karya0 ? Thanks.
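
As a rough check on how much the 15 to 20 clean runs above tell us: if the failure rate had still been about 20% and the runs were independent (both are assumptions, not measurements), the chance of seeing no failure at all would be 0.8 raised to the number of runs. A short sketch of that arithmetic, with a helper name chosen here for illustration:

```python
def prob_all_pass(failure_rate: float, runs: int) -> float:
    """Chance that `runs` independent restarts all succeed by luck alone."""
    return (1.0 - failure_rate) ** runs


for runs in (15, 20):
    print(f"{runs} clean runs by chance: {prob_all_pass(0.2, runs):.1%}")
```

This gives about 3.5% for 15 runs and about 1.2% for 20, so the clean streak is unlikely to be luck, though it does not rule it out.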

gc00 commented on June 24, 2024

@karya0: I've now created pull request #60 to fix this issue.

karya0 commented on June 24, 2024

Before we enable mtcp_check_vdso, let's verify the underlying cause of the bug. Which memory addresses are causing the segfault? I am sure there is a different fix that doesn't involve vDSO. Note that vDSO and ASLR are two separate issues, and part of the reason I didn't like mtcp_check_vdso is that it tries to handle both. vDSO handling should be done strictly by the newer code, and we should create an mtcp_check_aslr to handle ASLR. But before we do any of that, let's verify the problem and the faulty addresses.

gc00 commented on June 24, 2024

@karya0: @jiajuncao and @rohgarg both have accounts at Stampede, and have both observed this bug there. They were using code that included the commit a4d67bf . They'll be the best ones to examine this newer version of the bug with you (the version that has only been observed on Stampede so far).
