Comments (16)
I was a bit worried that the issue was due to my having a funny set of LDFLAGS, CPPFLAGS, etc. in my build environment. After getting rid of these and rebuilding, the problem persisted, so finally I tested it on a clean CentOS 7 virtual machine (and previously, by accident, on a CentOS 6.6 or 6.5 system).
The verdict: the problem with 2.4.0rc2 persists in all cases so far on CentOS 7, but it was fine on CentOS 6.?.
from dmtcp.
Hi Brandon,
Thanks very much for your report. Right now, we don't have direct access
to a CentOS 7 system. We've been testing on CentOS 6 and Red Hat 6 so far.
Is there a chance that you could provide a guest account?
In the meantime, some information that might help us is:
First, from the root directory of DMTCP, could you send us the output of:
make display-build-env
Second, could you build DMTCP as follows:
./configure CFLAGS="-g -O0" CXXFLAGS="-g -O0"
make -j
and then:
make tidy
ulimit -c unlimited
bin/dmtcp_launch -i6 test/dmtcp1
bin/dmtcp_restart ckpt_dmtcp1_*.dmtcp
[ This should generate a core dump ]
gdb test/dmtcp1 core*
(gdb) thread apply all bt full
and send us the output from GDB.
Thanks very much,
- Gene
On Mon, Apr 13, 2015 at 06:36:41AM -0700, Brandon Elam Barker wrote:
I'm having issues with restarts passing on my CentOS 7.1 system; these issues don't seem to be present in 2.3.1; in both cases, I've used the default ./configure (no options), but I've included the config.log and config.status here.
Additionally the .cores and checkpoints are available.
Verifying there is enough disk space ...
== Tests ==
dmtcp1 ckpt:PASSED rstr:FAILED (first process rec'd signal 11)
(core.17746 copied to DMTCP_TMPDIR:/tmp/[email protected]/)
retry:
***** Copied checkpoint images to /tmp/[email protected]/dmtcp-autotest-505920597
FAILED
root-pids: [17773] msg: restart error, 1 expected, 0 found, running=0
dmtcp2 ckpt:PASSED rstr:PASSED; ckpt:PASSED rstr:PASSED
dmtcp3 ckpt:PASSED rstr:FAILED (first process rec'd signal 11)
(core.17833 copied to DMTCP_TMPDIR:/tmp/[email protected]/)
retry:PASSED; ckpt:PASSED rstr:FAILED (first process rec'd signal 11)
(core.17883 copied to DMTCP_TMPDIR:/tmp/[email protected]/)
retry:PASSED
dmtcp4 ckpt:PASSED rstr:PASSED; ckpt:PASSED rstr:PASSED
dmtcp5 ckpt:PASSED rstr:FAILED (first process rec'd signal 11)
(core.18128 copied to DMTCP_TMPDIR:/tmp/[email protected]/)
retry:
***** Copied checkpoint images to /tmp/[email protected]/dmtcp-autotest-505920597
FAILED
root-pids: [18164] msg: restart error, 2 expected, 1 found, running=0
syscall-tester ckpt:PASSED rstr:PASSED; ckpt:PASSED rstr:PASSED
file1 ckpt:PASSED rstr:PASSED; ckpt:PASSED rstr:PASSED
dmtcpaware1 ckpt:PASSED rstr: FAILED (first process rec'd signal 11)
(core.18229 copied to DMTCP_TMPDIR:/tmp/[email protected]/)
retry:y
^Cdmtcpaware1 FAILED
root-pids: [] msg: failed to write 's' to coordinator (pid: 17740)
CLEANUP ERROR: failed to write 'k' to coordinator (pid: 17740)
SHUTDOWN() failed
make: *** [check] Error 1
from dmtcp.
Hi all,
Just a quick status report on DMTCP with CentOS 7. So far what I'm seeing
is that restart is failing when we have multiple threads or processes (via fork).
So, it's failing for the dmtcp3 and dmtcp5 tests.
Interestingly, although dmtcp_restart fails, if I run it under gdb,
it starts up fine:
gdb --args dmtcp_restart ...
Typically this means that there's something about the memory layout,
and gdb forces a different memory layout (or initializes memory differently).
I looked at the memory layout when running under dmtcp3. Here's what I'm
seeing. So, we are seeing vsyscall on the last of the pages that
are mapped.
Kapil, I think you may have said something about that being significant.
Let me know if there are some other tests that you'd like me to run.
[ I'm editing out most of the memory map below. There's a simpler analysis below. - @gc00 ]
[dmtcp@euca-128-84-11-199 dmtcp]$ cat /proc/22243/maps
00400000-00401000 r-xp 00000000 fd:02 4797973 /home/dmtcp/dmtcp/test/dmtcp3
00600000-00601000 r--p 00000000 fd:02 4797973 /home/dmtcp/dmtcp/test/dmtcp3
00601000-00602000 rw-p 00001000 fd:02 4797973 /home/dmtcp/dmtcp/test/dmtcp3
016f2000-01713000 rw-p 00000000 00:00 0 [heap]
...
7fff3aefc000-7fff3b6df000 rw-p 00000000 00:00 0 [stack]
7fff3b7fe000-7fff3b800000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
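When comparing layouts like the one above between a failing and a working run, it can help to pull out just the special regions. The following is a minimal Python sketch, not part of DMTCP; the names Mapping, parse_maps_line, special_regions, and SAMPLE are made up for illustration, and it assumes the usual six-column /proc/<pid>/maps format shown above.

```python
from typing import Dict, NamedTuple, Optional


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    offset: int
    dev: str
    inode: int
    path: str


def parse_maps_line(line: str) -> Optional[Mapping]:
    """Parse one /proc/<pid>/maps line; return None for anything else (e.g. '...')."""
    fields = line.split(None, 5)
    if len(fields) < 5 or "-" not in fields[0]:
        return None
    start_s, end_s = fields[0].split("-")
    path = fields[5].strip() if len(fields) > 5 else ""
    return Mapping(int(start_s, 16), int(end_s, 16), fields[1],
                   int(fields[2], 16), fields[3], int(fields[4]), path)


def special_regions(text: str) -> Dict[str, Mapping]:
    """Return the bracketed pseudo-regions ([heap], [stack], [vdso], ...) by name."""
    parsed = (parse_maps_line(line) for line in text.splitlines())
    return {m.path: m for m in parsed if m and m.path.startswith("[")}


# Excerpt of the map quoted above, used here only as sample input.
SAMPLE = """\
00400000-00401000 r-xp 00000000 fd:02 4797973 /home/dmtcp/dmtcp/test/dmtcp3
016f2000-01713000 rw-p 00000000 00:00 0 [heap]
...
7fff3aefc000-7fff3b6df000 rw-p 00000000 00:00 0 [stack]
7fff3b7fe000-7fff3b800000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
"""

if __name__ == "__main__":
    for name, m in special_regions(SAMPLE).items():
        print(f"{name:12s} {m.start:#x}-{m.end:#x} {m.perms}")
```

Running the same function on the text of /proc/<pid>/maps from two separate launches gives a quick way to see which of these regions move between runs.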
from dmtcp.
And I can now report that when address randomization is turned off,
DMTCP works. I set:
sudo bash -c 'echo 0 > /proc/sys/kernel/randomize_va_space'
and that makes DMTCP work.
Probably the effect of GDB is to turn off the randomization.
Next, I have to make it work again, with address randomization.
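For anyone reproducing this, the current setting can be read back from the same file. Below is a small Python sketch, not DMTCP code; describe_aslr and current_aslr are hypothetical helper names, and the meanings of 0, 1, and 2 follow the Linux kernel sysctl documentation.

```python
from pathlib import Path

# Same file that the echo command above writes to.
ASLR_SYSCTL = Path("/proc/sys/kernel/randomize_va_space")

ASLR_MODES = {
    0: "disabled",
    1: "stack, mmap regions, and vDSO randomized",
    2: "as mode 1, plus the heap (brk) randomized",
}


def describe_aslr(raw: str) -> str:
    """Translate the raw sysctl contents (e.g. '2\\n') into a description."""
    value = int(raw.strip())
    return ASLR_MODES.get(value, f"unrecognized value {value}")


def current_aslr() -> str:
    """Read the live setting; only works on Linux."""
    return describe_aslr(ASLR_SYSCTL.read_text())


if __name__ == "__main__":
    print(current_aslr())
```

Reading the file does not need root; only writing it does.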
Best,
- Gene
On Mon, Apr 13, 2015 at 07:49:30AM -0700, Brandon Elam Barker wrote:
I was a bit worried that the issue was due to my having a funny set of LDFLAGS, CPPFLAGS, etc. in my build environment. After getting rid of these and rebuilding, the problem persisted, so finally I tested it on a clean CentOS 7 virtual machine (and previously, by accident, on a CentOS 6.6 or 6.5 system).
The verdict: the problem with 2.4.0rc2 persists in all cases so far on CentOS 7, but it was fine on CentOS 6.?.
from dmtcp.
@bbarker: Can you manually checkpoint/restart the test and provide us the output as @gc00 suggested above?
from dmtcp.
@gc00: Now that you can reproduce the bug, can you provide the output of dmtcp_restart ckpt_dmtcp1*.dmtcp?
from dmtcp.
Hi all, Since dmtcp-2.3.1 was running correctly, I ran 'git bisect' on this, in the CentOS 7 distro. I found the bug, and a simple one-line bug fix. For a one line fix, I won't do a pull request, but I'd like to analyze the implications of this fix here, and together we'll decide if that is enough.
The bug was introduced in: 1a7d8db
In that commit, mtcp_check_vdso() was turned off by default. Configuring with ./configure CFLAGS=-DENABLE_VDSO_CHECK turns that function on again, and then the bug goes away.
@karya0: The #ifdef was your code. What is your preference? Shall we permanently remove the #ifdef ENABLE_VDSO_CHECK, and add a comment that CentOS7 requires mtcp_check_vdso()? Or do you want to leave the #ifdef in the code, and add a #define? I would suggest removing the #ifdef and adding a comment.
from dmtcp.
P.S.: For ./configure CFLAGS=-DENABLE_VDSO_CHECK to work for you above, you will first have to do: git pull --rebase. Another one of those annoying bugs that crept in while we were preparing for the release (and this one was my fault).
from dmtcp.
And there appears to be one remaining bug exposed in CentOS 7. The dlopen1 test is failing. I'll look into it.
from dmtcp.
And I meant to write: do git pull --rebase before ./configure CFLAGS=-DENABLE_VDSO_CHECK. Time to get some more rest before I become too incoherent.
from dmtcp.
@gc00: The bug isn't quite related to ASLR. Apparently, if the variable type is "char*", JNOTE tries to automatically dereference the variable and print it. However, when restoring the heap, curBrk was declared with type "char*" and points to the current (unmapped) heap. So JNOTE tried to dereference curBrk and segfaulted.
Turning off ASLR hid the bug: the current and original values of brk() were then the same, so JNOTE was never called and the invalid memory area was never dereferenced.
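The JNOTE code itself is not shown in this thread. As a loose analogy only, Python's ctypes has the same distinction between a pointer that is treated as a string (and followed) and one that is treated as a plain address. The helper names show_as_string and show_as_address below are made up for illustration.

```python
import ctypes

# A valid, NUL-terminated buffer standing in for mapped heap memory.
buf = ctypes.create_string_buffer(b"heap data")
addr = ctypes.addressof(buf)


def show_as_string(address: int) -> bytes:
    """Treat the address as char*: reading .value follows the pointer.

    With an unmapped address this read would crash the process,
    which is the failure mode described above.
    """
    return ctypes.c_char_p(address).value


def show_as_address(address: int) -> str:
    """Treat the address as void*: only the number is used, nothing is read."""
    return hex(ctypes.c_void_p(address).value)


if __name__ == "__main__":
    print(show_as_string(addr))    # contents of the buffer
    print(show_as_address(addr))   # just the address
```

Logging a pointer such as curBrk as an address rather than as a string avoids touching the memory it points to.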
I'll shortly push the fix to github.
I think I also understand the dlopen bug and have created a separate issue (#57) to track it.
from dmtcp.
I'll shortly push the fix to github.
I think I also understand the dlopen bug and ....
Very efficient! Thanks.
from dmtcp.
@karya0 and @jiajuncao: The bug fix by @karya0 definitely fixes DMTCP on CentOS 7. However, @jiajuncao has also been seeing a random bug on restart at Stampede (with MVAPICH). He can take the same checkpoint image and restart many times. A bug appears about 20% of the time. From the core image, we see that memory is corrupted on restart (but only 20% of the time).
We then tested at Stampede/MVAPICH by including the function mtcp_check_vdso(). During about 15 or 20 tests, we did not observe the bug on restart.
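To gauge how much weight those clean runs carry, here is a back-of-the-envelope Python sketch. It assumes each restart fails independently with the same 20% probability, which is a simplification and not something measured in this thread; prob_no_failures is a hypothetical helper name.

```python
def prob_no_failures(p_fail: float, runs: int) -> float:
    """Chance of seeing zero failures in `runs` independent restarts
    when each restart fails with probability `p_fail`."""
    return (1.0 - p_fail) ** runs


if __name__ == "__main__":
    for runs in (15, 20):
        print(runs, round(prob_no_failures(0.2, runs), 4))
```

Under that assumption, 15 clean runs would happen by chance about 3.5% of the time and 20 clean runs about 1.2% of the time if the bug were still present, so the result is suggestive but not conclusive.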
I propose to remove the #ifdef ENABLE_VDSO_CHECK and to change the corresponding comment to:
// If mtcp_check_vdso isn't called, CentOS 7 fails on dmtcp3, dmtcp5, others
Do you agree, @karya0? Thanks.
from dmtcp.
@karya0: I've now created pull request #60 to fix this issue.
from dmtcp.
Before we enable mtcp_check_vdso, let's verify the underlying cause of the bug. What memory addresses are causing segfault? I am sure there is a different fix that doesn't involve vdso. Note that vDSO and ASLR are two separate issues and part of the reason I didn't like mtcp_check_vdso is because it tries to handle both. vDSO handling should be done strictly by the newer code, and we should create a mtcp_check_aslr to handle ASLR. But before we do any of that, let's verify the problem and faulty addresses.
from dmtcp.
@karya0: @jiajuncao and @rohgarg both have accounts at Stampede, and have both observed this bug there. They were using code that included the commit a4d67bf . They'll be the best ones to examine this newer version of the bug with you (the version that has only been observed on Stampede so far).
from dmtcp.