Comments (3)

gc00 commented on September 26, 2024

HERE'S A FULLER DESCRIPTION OF THE BUG:

(gdb) where
#0 mtcp_abort () at mtcp_sys.h:455
#1 0x00007fcc59a56105 in open_shared_file (
filename=0x7fcc5a34ab80 "/usr/lib/jvm/java-7-openjdk-common/jre/lib/ext/pulse-java.jar") at mtcp_restart.c:1340
#2 0x00007fcc59a5449c in read_shared_memory_area_from_file (fd=3,
area=0x7fcc5a34ab30, flags=0) at mtcp_restart.c:911
#3 0x00007fcc59a53cde in readmemoryareas (fd=3) at mtcp_restart.c:800
#4 0x00007fcc59a52dc5 in restorememoryareas (rinfo_ptr=0x610060)
at mtcp_restart.c:616
#5 0x0000000000405bb2 in restart_fast_path () at mtcp_restart.c:430
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

TO PRODUCE THIS:
GENERATE checkpoint image on dekaksi:
env CLASSPATH=./test ./bin/dmtcp_launch --checkpoint-open-files -i7 java -Xmx5M java1
[ Bug happens with or without --checkpoint-open-files ]
COPY CHECKPOINT IMAGES TO CCIS Linux:
cp /tmp/ckpt_java_1d4a852a5f139a6-40000-54f8acae.dmtcp tmp.dmtcp.gz
gunzip tmp.dmtcp.gz
cp tmp.dmtcp src/mtcp
cd src/mtcp
cp tmp.dmtcp ckpt_dmtcp1_test.dmtcp
make -f Makefile.debug gdb

I FIXED THIS PART WITH MY MODS TO mtcp.
NEXT:
COPY CHECKPOINT IMAGES TO CCIS Linux AND:
bin/dmtcp_restart ckpt_java_*.dmtcp

MY FIXES TO mtcp AND fileconnection.cpp NOW GET ME TO THIS BUG:
dmtcp_coordinator starting...
Host: timepilot (127.0.1.1)
Port: 7779
Checkpoint Interval: disabled (checkpoint manually instead)
Exit on last client: 1
Backgrounding...
[40000] NOTE at fileconnection.cpp:716 in refill; REASON='Can't re-create previously open file. Will not open it.'
_path = /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/rt.jar
[40000] ERROR at connection.cpp:93 in restoreOptions; REASON='JASSERT(fcntl(_fds[0], F_SETFL, (int)_fcntlFlags) == 0) failed'
_fds[0] = 3
_fcntlFlags = 32768
(strerror((*__errno_location ()))) = Bad file descriptor
java (40000): Terminating...

(It's because _fds[0] refers to a file that no longer exists.)
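
The failure above is fcntl() returning EBADF because the descriptor was never re-opened. As a minimal sketch of one way to guard against that, the C helper below checks that the descriptor is open before restoring its flags. The name restore_flags_if_open and its return convention are hypothetical; this is not the code in connection.cpp.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/* Hypothetical helper (not part of DMTCP): restore saved file status flags
 * on fd, but only if fd is currently an open descriptor.
 * Returns 0 if the flags were restored, 1 if fd was not open (skipped),
 * and -1 on any other fcntl() failure. */
int restore_flags_if_open(int fd, int saved_flags)
{
    assert(fd >= 0);
    /* F_GETFD fails with EBADF when fd does not refer to an open file. */
    if (fcntl(fd, F_GETFD) == -1) {
        return (errno == EBADF) ? 1 : -1;
    }
    return (fcntl(fd, F_SETFL, saved_flags) == 0) ? 0 : -1;
}
```

With a guard like this, a caller could treat a return value of 1 as "the file was not re-created, nothing to restore" instead of aborting.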

HERE'S A LONG-WINDED EXPLANATION:

RESTARTING JAVA ON A NEW HOST:
There are multiple releases of Java. When migrating a Java application
from a source machine to a destination machine, there is no guarantee
that the destination machine will have the same version.
This is a problem, because Java (e.g., openjdk) maps its jar files
into memory as MAP_SHARED (and as read-only), and fails to close the
corresponding file descriptors used in mmap. Note that POSIX says that
the mmap will add an extra reference to the file, and so the program
may call 'close(fd)' after mapping. But the openjdk jvm does not do so.
When we restart on a new host, the filenames may have changed (see above).
Nevertheless, the original memory area is saved in the checkpoint image,
and can still be used on restart.
This raises three issues:

  1. The default policy of MTCP is to re-create any shared maps on restart
    and map them as shared. But the application usually does not have
    privilege to open the jar files in a new directory on the destination
    machine corresponding to the old directory on the source machine.
    Furthermore, this would open up other problems.
    So, the solution is to re-map the jar files in memory as
    MAP_PRIVATE | MAP_ANONYMOUS. The memory area was mapped as
    read-only in /proc/*/maps, and so this seems safe. In the case
    that the memory area mapped had been mapped as read-write, we will
    notify the user with MTCP_PRINTF that we are changing the mapping
    to MAP_PRIVATE, read-only. This should force an immediate error
    if the application tries to write to the shared file in order to
    communicate with another process.
  2. The jvm had mapped the jar files as MAP_SHARED and then not closed
    the fd used to map them. So, the fd is still open to files
    in the old directory back on the source machine.
    Does MTCP or DMTCP re-open the file descriptors? Anyway,
    the solution is to find out where that happens and to skip that
    step if the file no longer exists. So, the application will appear
    to find that the file descriptor was closed.
  3. If the original java had been checkpointed using --checkpoint-open-files,
    then fileconnection::refill() will be called on an object
    with '_checkpointed == true' and with '_fileAlreadyExists == true'.
    If it was previously checkpointed (saved using --checkpoint-open-files),
    then the user intends to replace the old file with the new one.
    So, the solution is to detect that _path no longer points to
    an existing file, and that we do not have permission to recreate
    the file, and so to return immediately from fileconnection::refill(),
    since MTCP will have already done the right thing.

gene@dekaksi:~/dmtcp$ ls -l /proc/5427/fd
total 0
lrwx------ 1 gene gene 64 Mar 10 00:06 0 -> /dev/pts/1
lrwx------ 1 gene gene 64 Mar 10 00:06 1 -> /dev/pts/1
lr-x------ 1 gene gene 64 Mar 10 00:07 10 -> /usr/lib/jvm/java-7-openjdk-common/jre/lib/ext/dnsns.jar
lrwx------ 1 gene gene 64 Mar 10 00:06 2 -> /dev/pts/1
lr-x------ 1 gene gene 64 Mar 10 00:06 3 -> /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/rt.jar
lr-x------ 1 gene gene 64 Mar 10 00:07 4 -> /usr/share/java/java-atk-wrapper.jar
lr-x------ 1 gene gene 64 Mar 10 00:07 5 -> /usr/lib/jvm/java-7-openjdk-common/jre/lib/ext/localedata.jar
lr-x------ 1 gene gene 64 Mar 10 00:07 6 -> /usr/lib/jvm/java-7-openjdk-common/jre/lib/ext/zipfs.jar
lr-x------ 1 gene gene 64 Mar 10 00:07 7 -> /usr/lib/jvm/java-7-openjdk-common/jre/lib/ext/sunpkcs11.jar
lr-x------ 1 gene gene 64 Mar 10 00:07 8 -> /usr/lib/jvm/java-7-openjdk-common/jre/lib/ext/sunjce_provider.jar
lrwx------ 1 gene gene 64 Mar 10 00:06 821 -> socket:[56921781]
lrwx------ 1 gene gene 64 Mar 10 00:06 827 -> /dev/pts/1
lrwx------ 1 gene gene 64 Mar 10 00:06 831 -> /tmp/dmtcp-gene@dekaksi/dmtcpSharedArea.1d4a852a5f139a6-40000-54fe6dd2.54fe6dd31
lr-x------ 1 gene gene 64 Mar 10 00:07 9 -> /usr/lib/jvm/java-7-openjdk-common/jre/lib/ext/pulse-java.jar

from dmtcp.

gc00 commented on September 26, 2024

I've now pushed in the MTCP part of the fix. I want to get something in before the upcoming release.

Currently, DMTCP uses FileConnection::handleUnlinkedFile() to handle unlinked files. This is called by FileConnection::drain(). Unfortunately, this means that DMTCP must decide if the file has been unlinked at the time of checkpoint. In the case of Java migration, we won't know if the .jar files have been unlinked until we restart on a new host.

*** So, how can we call the equivalent of handleUnlinkedFile() for the .jar files at restart time?
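
One possible shape for such a restart-time check, sketched in C under assumptions: the function name redirect_if_missing_at_restart is hypothetical, it is not an existing DMTCP hook, and redirecting to /dev/null is just one of the options raised in this thread.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical restart-time check (not an existing DMTCP function).
 * If saved_path still names a file on this host, do nothing and return 0.
 * Otherwise point fd at /dev/null, so that the application can still
 * close() it later, and return 1.
 * Returns -1 if /dev/null could not be installed on fd. */
int redirect_if_missing_at_restart(int fd, const char *saved_path)
{
    assert(fd >= 0 && saved_path != NULL);

    struct stat st;
    if (stat(saved_path, &st) == 0) {
        return 0;  /* the file exists here; normal handling can proceed */
    }

    int nullfd = open("/dev/null", O_RDONLY);
    if (nullfd < 0) {
        return -1;
    }
    if (nullfd != fd) {
        int rc = dup2(nullfd, fd);  /* closes any old fd and reuses its number */
        close(nullfd);
        if (rc < 0) {
            return -1;
        }
    }
    return 1;
}
```

The decision is made from the state of the destination host, so nothing about the file needs to be known at checkpoint time.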

gc00 commented on September 26, 2024

I'm still thinking aloud on how to handle this. In each case, my last comment has my current thinking. (Read the comments in inverse order.) Recall that there are two remaining issues:

  1. The underlying file on the destination machine no longer exists. On restart, MTCP re-mapped
    that memory region to MAP_ANONYMOUS so that MTCP could re-create the memory region
    (which luckily is read-only). So, the refill method should not try to do anything with this file.
  2. The fd used to mmap() a memory region is now pointing to an underlying file that no longer exists. We should either close the fd or redirect it to /dev/null (in case the app wants to close it later).
  3. REFILL: My current thinking is to create a plugin, java-mig, in the contrib directory, which will
    look at up to 100 file descriptors at the time of checkpoint. We can get the filename from them
    by calling jalib::Filesystem::ResolveSymlink() or by copying that logic.
    As long as the jvm opened an fd to a .jar file in O_RDONLY mode, it should
    be safe to unlink that file at the time of checkpoint. Then ipc/connection.cpp will not see the fd.
    This is done by a plugin because the user will know if they intend to migrate the java process, and then will call this plugin only in that special case. Is there a more general way to handle this at the time of restart?
  4. DANGLING FD: That still leaves a problem in ipc/file/fileconnection.cpp in its refill method.
    The /proc/*/maps showed that the shared area was mapped with the underlying .jar file (at least
    at the time of checkpoint). At the time of restart, MTCP has re-mapped the file to MAP_ANONYMOUS with no underlying file. I have a hackish way to detect this in ipc/file/fileconnection.cpp and then decide not to refill. It would probably be better to re-read /proc/*/maps and note that MTCP has gotten rid of the underlying file for this region. Is there a clean way to accomplish this?
