Comments (3)
HERE'S A FULLER DESCRIPTION OF THE BUG:
(gdb) where
#0 mtcp_abort () at mtcp_sys.h:455
#1 0x00007fcc59a56105 in open_shared_file (
filename=0x7fcc5a34ab80 "/usr/lib/jvm/java-7-openjdk-common/jre/lib/ext/pulse-java.jar") at mtcp_restart.c:1340
#2 0x00007fcc59a5449c in read_shared_memory_area_from_file (fd=3,
area=0x7fcc5a34ab30, flags=0) at mtcp_restart.c:911
#3 0x00007fcc59a53cde in readmemoryareas (fd=3) at mtcp_restart.c:800
#4 0x00007fcc59a52dc5 in restorememoryareas (rinfo_ptr=0x610060)
at mtcp_restart.c:616
#5 0x0000000000405bb2 in restart_fast_path () at mtcp_restart.c:430
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
TO PRODUCE THIS:
GENERATE checkpoint image on dekaksi:
env CLASSPATH=./test ./bin/dmtcp_launch --checkpoint-open-files -i7 java -Xmx5M java1
[ Bug happens with or without --checkpoint-open-files ]
COPY CHECKPOINT IMAGES TO CCIS Linux:
cp /tmp/ckpt_java_1d4a852a5f139a6-40000-54f8acae.dmtcp tmp.dmtcp.gz
gunzip tmp.dmtcp.gz
cp tmp.dmtcp src/mtcp
cd src/mtcp
cp tmp.dmtcp ckpt_dmtcp1_test.dmtcp
make -f Makefile.debug gdb
I FIXED THIS PART WITH MY MODS TO mtcp.
NEXT:
COPY CHECKPOINT IMAGES TO CCIS Linux AND:
bin/dmtcp_restart ckpt_java_*.dmtcp
MY FIXES TO mtcp AND fileconnection.cpp NOW GET ME TO THIS BUG:
dmtcp_coordinator starting...
Host: timepilot (127.0.1.1)
Port: 7779
Checkpoint Interval: disabled (checkpoint manually instead)
Exit on last client: 1
Backgrounding...
[40000] NOTE at fileconnection.cpp:716 in refill; REASON='Can't re-create previously open file. Will not open it.'
_path = /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/rt.jar
[40000] ERROR at connection.cpp:93 in restoreOptions; REASON='JASSERT(fcntl(_fds[0], F_SETFL, (int)_fcntlFlags) == 0) failed'
_fds[0] = 3
_fcntlFlags = 32768
(strerror((*__errno_location ()))) = Bad file descriptor
java (40000): Terminating...
(It's because _fds[0] refers to a file that no longer exists.)
HERE'S A LONG-WINDED EXPLANATION:
RESTARTING JAVA ON A NEW HOST:
There are multiple releases of Java. When migrating a Java application
from a source machine to a destination machine, there is no guarantee
that the destination machine will have the same version.
This is a problem, because Java (e.g., openjdk) maps its jar files
into memory as MAP_SHARED (and as read-only), and does not close the
corresponding file descriptors used in mmap. Note that POSIX says that
mmap adds an extra reference to the file, and so the program
may call 'close(fd)' afterward. But the openjdk jvm does not do so.
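The POSIX behavior noted above can be illustrated with a minimal C sketch. The helper name map_readonly_and_close is hypothetical (it is not openjdk or MTCP code); it only shows that a read-only MAP_SHARED mapping stays usable after the descriptor is closed.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map the first `len` bytes of `path` as read-only,
 * MAP_SHARED, then close the descriptor right away.
 * Returns the mapping, or MAP_FAILED on error. */
static void *map_readonly_and_close(const char *path, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return MAP_FAILED;

    void *addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

    /* The mapping holds its own reference to the file, so closing fd
     * here does not invalidate addr (when mmap succeeded). */
    close(fd);
    return addr;
}
```

A caller reads through the returned pointer as usual and releases it with munmap(). Had the jvm closed its descriptors this way, no dangling fd would be left in /proc/*/fd at checkpoint time.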
When we restart on a new host, the filenames may have changed (see above).
Nevertheless, the original memory area is saved in the checkpoint image,
and can still be used on restart.
This raises three issues:
- The default policy of MTCP is to re-create any shared maps on restart
and map them as shared. But the application usually does not have
privilege to open the jar files in a new directory on the destination
machine corresponding to the old directory on the source machine.
Furthermore, this would open up other problems.
So, the solution is to re-map the jar files in memory as
MAP_PRIVATE | MAP_ANONYMOUS. The memory area was mapped as
read-only in /proc/*/maps, and so this seems safe. In the case
that the memory area had been mapped as read-write, we will
notify the user with MTCP_PRINTF that we are changing the mapping
to MAP_PRIVATE, read-only. This should force an immediate error
if the application tries to write to the shared file in order to
communicate with another process.
- The jvm had mapped the jar files as MAP_SHARED and then not closed
the fd used to map them. So, the fd is still open to files
in the old directory back on the source machine.
Does MTCP or DMTCP re-open the file descriptors? Anyway,
the solution is to find out where that happens and to skip that
step if the file no longer exists. So, the application will appear
to find that the file descriptor was closed.
- If the original java had been checkpointed using --checkpoint-open-files,
then FileConnection::refill() will be called on an object
with '_checkpointed == true' and with '_fileAlreadyExists == true'.
If it was previously checkpointed (saved using --checkpoint-open-files),
then the user intends to replace the old file with the new one.
So, the solution is to detect that _path no longer points to
an existing file, and that we do not have permission to recreate
the file, and so to return immediately from FileConnection::refill(),
since MTCP will have already done the right thing.
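The re-mapping described in the first issue can be sketched in C as follows. This is only an illustration under stated assumptions, not the actual MTCP change: the function name restore_region_anonymous and the `saved` buffer (standing in for the region's bytes read from the checkpoint image) are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for MAP_ANONYMOUS under strict -std modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: restore a region of `len` bytes whose backing file
 * is gone. The region becomes MAP_PRIVATE | MAP_ANONYMOUS, is filled from
 * `saved`, and is then made read-only.
 * If addr is non-NULL it must be page-aligned, and MAP_FIXED will replace
 * whatever is currently mapped there.
 * Returns the mapping, or MAP_FAILED on error. */
static void *restore_region_anonymous(void *addr, size_t len, const void *saved)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
    if (addr != NULL)
        flags |= MAP_FIXED;

    /* Map writable first so the saved contents can be copied in. */
    void *p = mmap(addr, len, PROT_READ | PROT_WRITE, flags, -1, 0);
    if (p == MAP_FAILED)
        return MAP_FAILED;

    memcpy(p, saved, len);

    /* Drop write permission: a later write by the application faults
     * immediately instead of silently diverging from the file. */
    if (mprotect(p, len, PROT_READ) != 0) {
        munmap(p, len);
        return MAP_FAILED;
    }
    return p;
}
```

Because the final protection is read-only, an application that expected to communicate through the shared file gets an immediate fault on its first write, which matches the intent stated above.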
gene@dekaksi:~/dmtcp$ ls -l /proc/5427/fd
total 0
lrwx------ 1 gene gene 64 Mar 10 00:06 0 -> /dev/pts/1
lrwx------ 1 gene gene 64 Mar 10 00:06 1 -> /dev/pts/1
lr-x------ 1 gene gene 64 Mar 10 00:07 10 -> /usr/lib/jvm/java-7-openjdk-common/jre/lib/ext/dnsns.jar
lrwx------ 1 gene gene 64 Mar 10 00:06 2 -> /dev/pts/1
lr-x------ 1 gene gene 64 Mar 10 00:06 3 -> /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/rt.jar
lr-x------ 1 gene gene 64 Mar 10 00:07 4 -> /usr/share/java/java-atk-wrapper.jar
lr-x------ 1 gene gene 64 Mar 10 00:07 5 -> /usr/lib/jvm/java-7-openjdk-common/jre/lib/ext/localedata.jar
lr-x------ 1 gene gene 64 Mar 10 00:07 6 -> /usr/lib/jvm/java-7-openjdk-common/jre/lib/ext/zipfs.jar
lr-x------ 1 gene gene 64 Mar 10 00:07 7 -> /usr/lib/jvm/java-7-openjdk-common/jre/lib/ext/sunpkcs11.jar
lr-x------ 1 gene gene 64 Mar 10 00:07 8 -> /usr/lib/jvm/java-7-openjdk-common/jre/lib/ext/sunjce_provider.jar
lrwx------ 1 gene gene 64 Mar 10 00:06 821 -> socket:[56921781]
lrwx------ 1 gene gene 64 Mar 10 00:06 827 -> /dev/pts/1
lrwx------ 1 gene gene 64 Mar 10 00:06 831 -> /tmp/dmtcp-gene@dekaksi/dmtcpSharedArea.1d4a852a5f139a6-40000-54fe6dd2.54fe6dd31
lr-x------ 1 gene gene 64 Mar 10 00:07 9 -> /usr/lib/jvm/java-7-openjdk-common/jre/lib/ext/pulse-java.jar
from dmtcp.
I've now pushed in the MTCP part of the fix. I want to get something in before the upcoming release.
Currently, DMTCP uses FileConnection::handleUnlinkedFile() to handle unlinked files. This is called by FileConnection::drain(). Unfortunately, this means that DMTCP must decide if the file has been unlinked at the time of checkpoint. In the case of Java migration, we won't know if the .jar files have been unlinked until we restart on a new host.
*** So, how can we call the equivalent of handleUnlinkedFile() for the .jar files at restart time?
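One possible shape for such a restart-time check, as a minimal C sketch. It is not existing DMTCP code: the function name redirect_fd_if_file_missing is hypothetical, and redirecting to /dev/null is just one policy among several (closing the fd is another).

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical restart-time helper: `path` is the file that `fd` referred
 * to at checkpoint time. If it does not exist (or is not reachable) on
 * this host, point fd at /dev/null so that a later close(fd) by the
 * application still succeeds.
 * Returns 1 if fd was redirected, 0 if the file exists, -1 on error. */
static int redirect_fd_if_file_missing(int fd, const char *path)
{
    if (access(path, F_OK) == 0)
        return 0;                /* file exists; use the normal restore path */

    int devnull = open("/dev/null", O_RDONLY);
    if (devnull < 0)
        return -1;
    if (devnull == fd)
        return 1;                /* fd was closed and open() reused its number */

    int rc = dup2(devnull, fd);  /* closes fd if it was open, then replaces it */
    close(devnull);
    return rc < 0 ? -1 : 1;
}
```

The check uses only information available on the destination host, so it does not depend on what was known about the file at checkpoint time.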
from dmtcp.
I'm still thinking aloud on how to handle this. In each case, my last comment has my current thinking. (Read the comments in reverse order.) Recall that there are two remaining issues:
- The underlying file on the destination machine no longer exists. On restart, MTCP re-mapped
that memory region to MAP_ANONYMOUS so that MTCP could re-create the memory region
(which luckily is read-only). So, the refill method should not try to do anything with this file.
- The fd used to mmap() a memory region is now pointing to an underlying file that no longer exists. We should either close the fd or re-direct it to /dev/null (in case the app wants to close it later).

- REFILL: My current thinking is to create a plugin, java-mig, in the contrib directory, which will
look at up to 100 file descriptors at the time of checkpoint. We can get the filename from them
by calling jalib::Filesystem::ResolveSymlink() or by copying that logic.
As long as the jvm opened an fd to a .jar file in O_RDONLY mode, it should
be safe to unlink that file at the time of checkpoint. Then ipc/connection.cpp will not see the fd.
This is done by a plugin because the user will know if they intend to migrate the java process, and then will call this plugin only in that special case. Is there a more general way to handle this at the time of restart?
- DANGLING FD: That still leaves a problem in ipc/file/fileconnection.cpp in its refill method.
The /proc/*/maps showed that the shared area was mapped with the underlying .jar file (at least
at the time of checkpoint). At the time of restart, MTCP has re-mapped the file to MAP_ANONYMOUS with no underlying file. I have a hackish way to detect this in ipc/file/fileconnection.cpp and then decide not to refill. It would probably be better to re-read /proc/*/maps and note that MTCP has gotten rid of the underlying file for this region. Is there a clean way to accomplish this?
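The idea of re-reading the maps file at restart could look roughly like the C sketch below. The helper name region_has_backing_file is hypothetical, and the sketch assumes the usual Linux line format of /proc/self/maps (start-end perms offset dev inode [pathname]); it is not the detection currently used in fileconnection.cpp.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: look up `addr` in /proc/self/maps.
 * Returns 1 if the containing region is backed by a file (its pathname
 * starts with '/'), 0 if it is anonymous or special (e.g. [heap], [stack]),
 * and -1 if addr is not mapped or the maps file cannot be read. */
static int region_has_backing_file(const void *addr)
{
    FILE *maps = fopen("/proc/self/maps", "r");
    if (maps == NULL)
        return -1;

    char line[4352];            /* room for PATH_MAX plus the fixed fields */
    int result = -1;
    while (fgets(line, sizeof line, maps) != NULL) {
        unsigned long start, end;
        int path_off = 0;
        /* Skip perms, offset, dev and inode; %n records where the
         * optional pathname begins. */
        if (sscanf(line, "%lx-%lx %*s %*s %*s %*s %n",
                   &start, &end, &path_off) != 2 || path_off == 0)
            continue;
        if ((unsigned long)addr >= start && (unsigned long)addr < end) {
            result = (line[path_off] == '/');
            break;
        }
    }
    fclose(maps);
    return result;
}
```

A refill step could call this with the start address of the saved region and skip the file entirely when the result is 0, on the assumption that MTCP has already restored the region as anonymous memory.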
from dmtcp.