pftool's Introduction

###############################################################################
          PFTool: Parallel File Tool
###############################################################################

PFTool (Parallel File Tool) can stat, copy, and compare files in parallel. 
It's optimized for an HPC workload and uses MPI for message passing.

Additional info:
Available under LANL LACC-2012-072.
see COPYRIGHT file

***************************************************************************
Dependencies
***************************************************************************

autoconf
automake
m4
libtool
mpi

***************************************************************************
Installing PFTool
***************************************************************************
From the top-level directory

./autogen

#To configure the base version:
./configure

#To configure the threaded version:
#./configure --enable-threads

#Other options:
#./configure --help

make clean
make all
make install

Note that pftool.cfg is created and installed into {install_prefix}/etc
during "make install". The settings in this file should be reviewed and
modified to match the configuration of the cluster the software is
running on.

***************************************************************************
PFTool RPM
***************************************************************************
To build an RPM, from the top-level directory:

cd package
make rpm

Rpmbuild is used by this make file to generate the RPM. It is assumed that
the directory/tree $HOME/rpmbuild exists and has the subdirectories SOURCES,
BUILD, and RPMS. The resulting RPM is written to $HOME/rpmbuild/RPMS.

***************************************************************************
Configuration
***************************************************************************
{install_prefix}/etc/pftool.cfg is read by the pftool scripts pfls, pfcm, and
pfcp.

Example config files are located in ./etc/.

***************************************************************************
Versioning
***************************************************************************
In order to change the version number for this project, modify the following
line in ./configure.ac:

AC_INIT([pftool], [2.0.5], [[email protected]])

Then rerun

./configure

Note that the maintainer information can be changed as well.

***************************************************************************
Using PFTool
***************************************************************************
PFTool can be invoked directly, but the preferred method is through helper
scripts located in {install_prefix}/scripts/

-----------

Usage: pfls [options] sourcePath

pfls --  list file(s) based on sourcePath in parallel

Options:
  -h, --help     show this help message and exit
  -R             list directories recursively
  -v             verbose result output
  -i INPUT_LIST  input file list

-----------

Usage: pfcm [options] sourcePath destinationPath

pfcm -- compare file(s) from sourcePath to destinationPath in parallel

Options:
  -h, --help  show this help message and exit
  -R          copy directories recursively
  -M          changes to Block-by-Block vs Metadata only
  -v          verbose result output

-----------

Usage: pfcp [options] sourcePath destinationPath

pfcp -- copy file(s) from sourcePath to destinationPath in parallel

Options:
  -h, --help  show this help message and exit
  -R          copy directories recursively
  -v          verbose result output
  -n          only copy files that have a different date or file size than the
              same files at the destination or not in the destination


pftool's People

Contributors

brettkettering, bringhurst, bws, cadejager, dsherril, gransom, gsparrow, jti-lanl, leicao88124, shanegoff, thewacokid, wfvining

pftool's Issues

Relative paths resolve incorrectly

I don't have an exhaustive comparison of how paths are expanded, but they do seem to resolve improperly for special dirs like "." and ".." for source and destination.

I'll put in a pull request soon for the updates to the pfcp/pfcm/pfls scripts that launch pftool. I added an absolute path resolver to them so that pftool only has to deal with absolute paths (for now). The assumption I had to make is that nobody will have "/var/lib/perceus/vnfs/" along with "/rootfs/" in their path. I believe that's fairly safe, but I'm documenting it here just in case someone uses it outside of LANL and decides to Google.
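
As a rough illustration of what such a resolver could look like in a Python launch script, here is a minimal sketch. The function name `resolve_path` and its `cwd` parameter are hypothetical and are not taken from the pfcp/pfcm/pfls scripts; the sketch also leaves out the VNFS prefix handling mentioned above. Note that `os.path.normpath` collapses ".." lexically, so it can give a different answer than the kernel would when symlinks are involved.

```python
import os


def resolve_path(path, cwd=None):
    """Return an absolute, normalized form of `path` (hypothetical helper).

    `cwd` is the directory that relative paths are resolved against;
    it defaults to the current working directory.
    """
    base = cwd if cwd is not None else os.getcwd()
    # os.path.join ignores `base` when `path` is already absolute.
    joined = os.path.join(base, path)
    # normpath removes "." components and collapses ".." lexically.
    return os.path.normpath(joined)
```

With this, a source of ".." given from /a/b is passed to pftool as /a, and "." is passed as /a/b.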

PFTool file copy retry flag

When copying large data with PFTool, either a large file or large numbers of files, PFTool often encounters non-fatal errors during the copy and not all data is copied. Since PFTool returns a non-zero exit code when this occurs, it would be nice if we could catch the exit status and retry the command a given number of times. Considerations need to be made for a low maximum, in the event that the command is incorrect, and what the default number of tries should be. This should likely be placed inside of the pfcp script.
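
A minimal sketch of the retry idea, assuming it lives in a Python launch script. The names `run_with_retries`, `max_tries`, and `runner` are hypothetical; the default of 3 tries is only a placeholder, since the right default is exactly what this issue leaves open.

```python
import subprocess


def run_with_retries(cmd, max_tries=3, runner=subprocess.call):
    """Run `cmd` until it exits 0 or `max_tries` attempts have been made.

    Returns the exit status of the last attempt. `runner` is injectable
    so the loop can be exercised without launching a real job.
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    status = 1
    for _attempt in range(max_tries):
        status = runner(cmd)
        if status == 0:
            break  # success, stop retrying
    return status
```

Keeping `max_tries` small limits the cost of repeatedly running a command that can never succeed, such as one with a mistyped path.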

new feature: provide an exclude-list, like rsync

A user of the OS Campaign is using rsync, instead of pfcp, to get a 10x reduction in the amount of data being copied. (They are probably actually losing performance, because pfcp would add ~64x threads running on 3 FTAs, but the point is that they could potentially have the 10x as well as the 64x.)

I imagine we could parse the exclude-list into a read-only vector of regex_t, somewhere in the global options. Perhaps we could compile these once in the master-task initializations, and share out the pre-compiled list as part of options, or else share the uncompiled versions and have every process compile them locally. Either way, we then have a set of per-task compiled exprs.

Some of them might eliminate entire directories, whereas others could match files in multiple directories. So, I think we'd have to run through the list of regexes before worker_readdir() calls opendir(), and then maybe we have to run through the list again, before accepting each result from readdir(), unless we can do something clever to identify those regexes that apply only to directories, and those that apply only to files.
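
The two-stage check described above (once on the directory before opening it, once on each entry it returns) can be sketched as follows. This is Python with the `re` module standing in for compiled `regex_t` objects; the function names are hypothetical and nothing here is pftool code.

```python
import re


def compile_excludes(patterns):
    """Compile the exclude-list once, so every task can reuse it."""
    return [re.compile(p) for p in patterns]


def is_excluded(path, excludes):
    """True if any exclude expression matches somewhere in `path`."""
    return any(rx.search(path) for rx in excludes)


def filter_entries(dir_path, names, excludes):
    """Apply the exclude-list to one directory listing.

    Stage 1: if the directory itself is excluded, skip it entirely.
    Stage 2: otherwise test each entry's full path individually.
    """
    if is_excluded(dir_path, excludes):
        return []
    prefix = dir_path.rstrip("/") + "/"
    return [n for n in names if not is_excluded(prefix + n, excludes)]
```

For example, with the expressions `/tmp_cache$` and `\.o$`, the directory /a/tmp_cache is never descended into, while /a/src keeps m.c and drops m.o.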

pfcm appears to use "real" stat through FUSE

Multi-host pfcm jobs appear to only compare properly on the node with FUSE mounted for MarFS:

(output shortened for brevity, the only rank on the FUSE host is RANK 5):
RANK 9: INFO DATACOMPARE compared /lustre/scratch5/dbonnie/errorlists/archive.errorlist.20150809.0 to /campaign/admins/dbonnie/errorlists/archive.errorlist.20150809.0 -- MISSING DESTINATION
RANK 7: INFO DATACOMPARE compared /lustre/scratch5/dbonnie/errorlists/archive.errorlist.20150723.0 to /campaign/admins/dbonnie/errorlists/archive.errorlist.20150723.0 -- MISSING DESTINATION
RANK 6: INFO DATACOMPARE compared /lustre/scratch5/dbonnie/errorlists/archive.errorlist.20150712.0 to /campaign/admins/dbonnie/errorlists/archive.errorlist.20150712.0 -- MISSING DESTINATION
RANK 3: INFO DATACOMPARE compared /lustre/scratch5/dbonnie/errorlists/archive.errorlist.20150714.0 to /campaign/admins/dbonnie/errorlists/archive.errorlist.20150714.0 -- MISSING DESTINATION
RANK 5: INFO DATACOMPARE compared /lustre/scratch5/dbonnie/errorlists/archive.errorlist.20151030.0 to /campaign/admins/dbonnie/errorlists/archive.errorlist.20151030.0 -- SUCCESS
INFO FOOTER ====== NONFATAL ERRORS = 167
INFO FOOTER Total Files/Links Examined: 211
INFO FOOTER Total Dirs Examined: 2
INFO FOOTER Total Files Compared: 211
INFO FOOTER Elapsed Time: 0 seconds

copy of file to non-existent destination file produces warning

It works, but you just get a "non-fatal error" ...

$ pfcp -vn /dev/shm/jti/source/1x11G/f1 /dev/shm/jti/dest/1x11G/f1

INFO HEADER Starting Path: /dev/shm/jti/source/1x11G/f1
INFO HEADER Source-type: POSIX_Path
INFO HEADER Dest-type: POSIX_Path
RANK 3: ERROR NONFATAL: Failed to stat path /dev/shm/jti/dest/1x11G/f1
[copy operations succeed ...]

If you are copying directories recursively, you don't get the warning:

$ rm -f /dev/shm/jti/dest/1x11G/f1
$ pfcp -vn /dev/shm/jti/source/1x11G/ /dev/shm/jti/dest/

copy a directory to marfs_mount_point/namespace/ fails

When copying a directory tree to marfs_mount_point/namespace, pftool goes up a directory to check permissions. This fails because pftool does not have permissions for the marfs mount point, and it causes the copy to fail immediately. In directories other than the marfs mount point, this does not cause an issue.

mpirun -np 4 path_to_pftool/pftool -r -w 0 -c /marfs/namespace/ -p /dev/shm/directory_tree

pftool hangs

pftool hangs when I copy 50 or more files using the command:
pfcp -R /gpfs/chris/testfiles/dir* /marfs/atorrez

But when I do:
pfcp -R /gpfs/chris/testfiles/ /marfs/atorrez

it works fine.

need to coordinate build with libmarfs build

When Path.h (indirectly) includes dot-h files from libmarfs, it lacks the pre-proc settings used in the libmarfs-build, for e.g. USE_MDAL, which control the expansion of those dot-h files, and the way functions in the library are implemented. The pftool build has no knowledge of the environment-variables that were provided to control the libmarfs build.

I assumed pftool was agnostic on the MDAL-ness of the libmarfs build, because that just affects how libmarfs performs internal MDFS operations. But one way this does make a difference to pftool is in the size of MarFS_FileHandle. There's a bug in which a libmarfs function doing a wipe of the file-handle is affecting memory beyond the file-handle in pftool (i.e. because they see different sizes of file-handle).

One simple solution to the immediate problem is to have the file-handle always include the components it would use for both MDAL and non-MDAL, giving the file-handle a constant size for all builds.

A more-thorough (and not much harder) solution, is to have libmarfs build distinctly-named libraries for MDAL/non-MDAL, and to have the pftool --enable-marfs configuration-option do both (a) choose the properly-built marfs library, and (b) provide defines to allow Path.h (etc) to see all the same expansions in the marfs dot-h files.

Restore parameters for various task counts

There used to be command-line parameters for readdir, stat, and copy task counts. These let the user specify how much parallelism to use for each of readdir, stat, and copy, so that the three could be tuned separately.

Comb through the code and see if there are variables for these or if there is a way to make variables for each of these separate from the mpirank or thread counts in the current pftool.cfg files.

Gary Grider feels it's very important to be able to set these three things independent of one another.

mkpath calls mkdir directly

I am working on resolving issue #46 and I have noticed that mkdir is called directly in the function pfutils.cpp:mkpath(). This function is then called in ctf.c. I don't know if this is currently an issue, but it seems like it would be cleaner to go through the Path class.

This needs further investigation.

Weirdness with the setuid ("sticky") bit (not really an issue, just documenting)

Since we have pftool installed with the setuid bit for various reasons, variables like LD_LIBRARY_PATH get stripped when a user invokes pftool. This is annoying, since we generally build and run using the modules system like standard clusters do.

The solution is to embed the OpenMPI dynamic library paths into the executable:
https://www.thanassis.space/tricks.html#smartdynamic

This way, users will always get the correct MPI library regardless of their environment, and regardless of whether or not their environment has been stripped because of the setuid security restrictions.

When retrying a pftool copy without using the -n flag PFTool fails silently

If I run a pftool copy from POSIX to MarFS with a bad environment (such as with proxy environment vars set) and cancel the job with ^C when I notice it is not succeeding, I am left with all the files in the mdfs with restart xattrs. This is the correct behavior. However, if I attempt to run the job again, PFTool reports success but the files are not copied. All the mdfs files in MarFS are left with their restart xattrs intact and cannot be read. Example output is shown below:

$ mpirun -np 4 pftool -r -w 0 -p /path/to/test_files/d1 -c /marfs/ns/test02/d1
INFO  HEADER   ========================  TestJob  ============================
INFO  HEADER   Starting Path: /path/to/test_files/d1
INFO  HEADER   Source-type: POSIX_Path
INFO  HEADER   Dest-type:   MARFS_Path
INFO ACCUM  files/chunks:   0        data:      0 B /  16.0 GB       avg BW:      0 B/s      errs: 0
INFO ACCUM  files/chunks:   0        data:      0 B /  16.0 GB       avg BW:      0 B/s      errs: 0
RANK   3: ERROR NONFATAL: (RETRY) /marfs/ns/test02/d1/uni.057, at 0+0, len 66
INFO ACCUM  files/chunks:   0        data:      0 B /  16.0 GB       avg BW:      0 B/s      errs: 1
^Cmpirun: killing job...
$ getfattr --recursive -d /path/to/mdfs/test02/d1 | grep restart | wc -l
getfattr: Removing leading '/' from absolute path names
202   # there are 202 files in the directory I am copying
$ [ unset proxies ]
$ mpirun -np 4 pftool -r -w 0 -p /path/to/test_files/d1 -c /marfs/ns/test02/d1
INFO  HEADER   ========================  TestJob  ============================
INFO  HEADER   Starting Path: /path/to/test_files/d1
INFO  HEADER   Source-type: POSIX_Path
INFO  HEADER   Dest-type:   MARFS_Path
INFO ACCUM  files/chunks:   0        data:      0 B /  16.0 GB       avg BW:      0 B/s      errs: 0
INFO ACCUM  files/chunks:   0        data:      0 B /  16.0 GB       avg BW:      0 B/s      errs: 0
INFO ACCUM  files/chunks:   0        data:      0 B /  16.0 GB       avg BW:      0 B/s      errs: 0
INFO ACCUM  files/chunks:   0        data:      0 B /  16.0 GB       avg BW:      0 B/s      errs: 0
INFO ACCUM  files/chunks:   0        data:      0 B /  16.0 GB       avg BW:      0 B/s      errs: 0
INFO ACCUM  files/chunks:   0        data:      0 B /  16.0 GB       avg BW:      0 B/s      errs: 0
INFO ACCUM  files/chunks:   0        data:      0 B /  16.0 GB       avg BW:      0 B/s      errs: 0
INFO ACCUM  files/chunks:   0        data:      0 B /  16.0 GB       avg BW:      0 B/s      errs: 0
INFO ACCUM  files/chunks: 128        data:   8.0 GB /  16.0 GB       avg BW:  90.1 MB/s      errs: 0
INFO ACCUM  files/chunks: 128        data:   8.0 GB /  16.0 GB       avg BW:  81.1 MB/s      errs: 0
INFO ACCUM  files/chunks: 128        data:   8.0 GB /  16.0 GB       avg BW:  73.7 MB/s      errs: 0
INFO ACCUM  files/chunks: 128        data:   8.0 GB /  16.0 GB       avg BW:  67.6 MB/s      errs: 0
INFO ACCUM  files/chunks: 128        data:   8.0 GB /  16.0 GB       avg BW:  62.4 MB/s      errs: 0
INFO ACCUM  files/chunks: 128        data:   8.0 GB /  16.0 GB       avg BW:  57.9 MB/s      errs: 0
INFO ACCUM  files/chunks: 128        data:   8.0 GB /  16.0 GB       avg BW:  54.1 MB/s      errs: 0
INFO ACCUM  files/chunks: 128        data:   8.0 GB /  16.0 GB       avg BW:  50.7 MB/s      errs: 0
INFO  FOOTER   ========================   NONFATAL ERRORS = 0   ================================
INFO  FOOTER   =================================================================================
INFO  FOOTER   Total Files/Links Examined: 202
INFO  FOOTER   Total Dirs Examined: 1
INFO  FOOTER   Total Buffers Written: 202
INFO  FOOTER   Total Bytes Copied: 17179882384
INFO  FOOTER   Total Megabytes Copied: 16384
INFO  FOOTER   Data Rate: 95 MB/second
INFO  FOOTER   Elapsed Time: 171 seconds
$ getfattr --recursive -d /path/to/mdfs/test02/d1 | grep restart | wc -l;
getfattr: Removing leading '/' from absolute path names
202
$ cat /marfs/ns/test02/d1/uni.007
cat: /marfs/ns/test02/d1/uni.007: Invalid argument

Group permission on directories may be incorrect:

It looks like user groups may not be set properly on output directories, though I believe this did work in the past.

[dbonnie@fta01 ~]$ pfcp -Rn /lustre/scratch/dbonnie/errorlists/ /campaign/admins/dbonnie/
"/users/dbonnie" pfcp -Rn /lustre/scratch/dbonnie/errorlists/ /campaign/admins/dbonnie/
Debugging: dest_path '/campaign/admins/dbonnie' -> dest_node '/campaign/admins/dbonnie/errorlists'
Debugging: Path subclass is 'MARFS_Path'
Debugging: created directory '/campaign/admins/dbonnie/errorlists'
manager: creating temp_path /campaign/admins
INFO HEADER ======================== dbonnie564163132016fta04.localdomain ============================
INFO HEADER Starting Path: /lustre/scratch/dbonnie/errorlists
INFO FOOTER ======================== NONFATAL ERRORS = 0 ================================
INFO FOOTER =================================================================================
INFO FOOTER Total Files/Links Examined: 210
INFO FOOTER Total Dirs Examined: 1
INFO FOOTER Total Buffers Written: 213
INFO FOOTER Total Bytes Copied: 4201097366
INFO FOOTER Total Megabytes Copied: 4006
INFO FOOTER Data Rate: 250 MB/second
INFO FOOTER Elapsed Time: 15 seconds
Launched /opt/campaign/pftool/installed/bin/pfcp from host fta04.localdomain at: Thu Mar 31 10:04:56 MDT 2016
Job finished at: Thu Mar 31 10:05:13 MDT 2016
[dbonnie@fta01 ~]$ ls -alh /lustre/scratch/dbonnie/
total 32K
drwxr-xr-x 3 dbonnie dbonnie 4.0K Mar 25 14:24 .
drwxr-xr-x 162 root root 12K Mar 28 09:08 ..
drwx------ 2 dbonnie dbonnie 20K Mar 25 14:18 errorlists
[dbonnie@fta01 ~]$ ls -alh /campaign/admins/dbonnie/
total 32K
drwxrwxrwx 3 dbonnie dbonnie 4.0K Mar 31 10:04 .
drwxr-xr-x 9 root root 4.0K Mar 23 22:30 ..
drwx------ 2 dbonnie root 16K Mar 31 10:04 errorlists

pfcp creates destination before assuring it can read source

[Dave Bonnie mentioned this.] The result is that we get an error and leave a size-zero destination file.

Should probably do an access(), or something, in process_stat_buffer(). If we can't read, then print a non-fatal error, and do not add the file to the copy list.
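
The shape of that check, sketched in Python rather than in pftool's C++: `os.access(path, os.R_OK)` wraps the same access() call. The function name `readable_sources` and the message text are hypothetical; the real check would sit inside process_stat_buffer().

```python
import os


def readable_sources(paths, report=print):
    """Return only the paths the caller can read.

    Unreadable paths are reported as non-fatal errors and left out,
    so no destination file is ever created for them.
    """
    keep = []
    for path in paths:
        if os.access(path, os.R_OK):
            keep.append(path)
        else:
            report("ERROR NONFATAL: cannot read source %s" % path)
    return keep
```

An access() check is only advisory, because permissions can change between the check and the open, so the later open-for-read still needs its own error handling.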

PFTool does not preserve directory mode when copying

Dave mentioned this yesterday, and I can confirm that pftool does not preserve directory permissions on the destination when copying. The destination directories always have the mode 0700. This is true whether copying from POSIX to POSIX or from POSIX to MarFS. For example:

$ umask
0002
$ ls -la ./perm_test
total 1056
drwxr----x  2 wfvining wfvining   4096 Oct 12 12:41 td
-rw-r-----  1 wfvining wfvining 262144 Oct 12 10:38 sf.4
-rw-r-----  1 wfvining wfvining 262144 Oct 12 10:38 sf.3
-rw-r-----  1 wfvining wfvining 262144 Oct 12 10:38 sf.2
-rw-r-----  1 wfvining wfvining 262144 Oct 12 10:38 sf.1
drwxr-xr-x 28 wfvining wfvining   8192 Oct 12 12:42 ..
drwxr-----  3 wfvining wfvining   4096 Oct 12 12:41 .
$ mpirun -np 4 pftool -w 0 -r -p `pwd`/perm_test -c `pwd`/perm_test2
[...]
$ ls -la ./perm_test2
total 1056
drwx------  3 wfvining wfvining   4096 Oct 12 12:42 .
drwxr-xr-x 28 wfvining wfvining   8192 Oct 12 12:42 ..
-rw-r-----  1 wfvining wfvining 262144 Oct 12 10:38 sf.1
-rw-r-----  1 wfvining wfvining 262144 Oct 12 10:38 sf.2
-rw-r-----  1 wfvining wfvining 262144 Oct 12 10:38 sf.3
-rw-r-----  1 wfvining wfvining 262144 Oct 12 10:38 sf.4
drwx------  2 wfvining wfvining   4096 Oct 12 12:42 td

(Note: the default behavior of cp is to preserve the directory mode on the destination.)

Make pfcp do permissions & ownership like cp

Update pfcp's permission and ownership handling functionality for applying permissions and ownership as cp does. Add cp's -p flag capability for preserving source permission and ownership on destination.

PFTool Synthetic data hangs when MarFS is enabled

When compiling PFTool with both the --enable-marfs and --enable-syndata flags, and using PFTool's synthetic data as the source, the mpirun command just hangs. If you compile without --enable-marfs, PFTool runs successfully. The relevant test in Jenkins is PFTool_create_MarFS_synthetic_multifile_prove_using_ls-. Below is a simple reproducer.

export P=/dev/syndata.L0
export C=$WORKSPACE/$BUILD_TAG.txt
export SIZE=12

mpirun -np $NP $PFTOOL -w 0 -x $SIZE -c $C -p $P
ls -l $C > file.txt
FILE_SIZE_0=`cut -d ' ' -f 5 file.txt`

if [ $FILE_SIZE_0 != $SIZE ]
then
	exit 1
fi

Path::identical() never worked

[Low priority. See tail of this item.]

Meant to fix this. The intent is to notice when source and destination are identical (like, same inode on the same FS). It's called in process_stat_buffer():

        // Are these items *identical* ? (e.g. same POSIX inode)
        if (p_work->identical(p_dest)) {
            write_count--;
            continue;
        }

p_work and p_dest are both just PathPtr.
The Path base-class has this default:

   virtual bool identical(Path& p)     { return false; } // same item, same FS
   virtual bool identical(PathPtr& p)  { return identical(*p); } // same item, same FS

The idea is that subclasses implement e.g. identical(MyType& p), which was supposed to be invoked by the base-class identical(*p). That doesn't happen.

Luckily, source and destination are never identical, in practice. If that ever happens, we would fail to detect it.

restart is conflicted for copy vs. compare of MarFS files

The pftool compare task uses CTM just like the copy task. During comparison, worker_comparelist() calls update_chunk(), which invokes worker_update_chunk(). I think the reasoning must be that long "deep" file comparisons (i.e. with '-M') ought to be able to restart, so they should maintain CTM as they go. Furthermore, if someone attempted to compare a source with an incomplete copy (which would have copy-related CTM), the surviving CTM built during the copy task would just imply parts of the file not to compare, but the other parts (not existing) could still correctly indicate a "mismatch", when the read() on them fails. (Actually, you'd get a non-fatal read-error on each such chunk.)

The problem that crops up with chunked comparisons involving MarFS destinations is that worker_update_chunk() also causes us to overwrite the chunk-info in the MDFS file. We might overwrite with correct info, but this still changes the mtime of the file, which causes MD comparisons for the remaining source/destination chunks to fail. [The destination is then re-adjusted to again have correct MD, when pftool finishes the comparison.] The result is that a chunked comparison of a source with a correctly-copied MarFS destination appears to show a "mismatch" on all but the first chunk.

The simplest solution would be to disable restarts for chunked "deep" comparisons. A better, not-much-harder approach would be to get worker_update_chunk() to know the difference between copy and compare tasks, so the call to write_chunkinfo() in libmarfs can only be called in the copy case. Another alternative would be to only do the MD comparisons when looking at the first chunk. This seems less good, because it still means messing with the MDFS file during comparisons.

Change the Default chunksize and chunk_at to 10GB

Currently the default pftool chunksize and chunk_at are set to 100GB. In the pfscripts they default to 10GB. I would like to change both of the pftool defaults to 10GB, as having chunk_at so large has caused some confusion in our testing.

The change would affect pftool.cpp:177-178. Is this an acceptable change?

BUG: Attempting to copy non-existent file with pfcp

When attempting to copy a non-existent file using pftool (pfcp), it should generate an error.

It does, but $? is still set to 0.

-bash-4.1$ pfcp a b
"/users/gellergr/src" pfcp a b
get_base_path  Failed to stat path a
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 11519 on
node cc-fta02 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
Launched /opt/campaign/pftool/installed/bin/pfcp from host cc-fta03.localdomain at: Mon Jun 20 16:22:25 MDT 2016
ERROR: /opt/campaign/pftool/installed/bin/pfcp failed
Job finished at: Mon Jun 20 16:22:27 MDT 2016
-bash-4.1$ echo $?
0
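
The cause of the zero status is not identified in this report. As a hedged sketch of the intended behavior only, a Python launcher could hand the child's status back to the shell as shown below. The names `shell_status` and `launch` are hypothetical and are not taken from the pfcp script.

```python
import subprocess
import sys


def shell_status(returncode):
    """Map a subprocess return code to the value a shell reports in $?."""
    if returncode < 0:
        # The child was killed by signal number -returncode.
        return 128 + (-returncode)
    return returncode & 0xFF


def launch(cmd, runner=subprocess.call):
    """Run `cmd`, report a failure on stderr, and return its status."""
    status = shell_status(runner(cmd))
    if status != 0:
        sys.stderr.write("ERROR: %s failed\n" % cmd[0])
    return status
```

The script's entry point would then end with `sys.exit(launch(cmd))` so that the failure reaches `$?` instead of only being printed.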

PFTool fails to list multiple POSIX files

PFTool fails to list multiple POSIX files, as in the example command
mpirun -np $NP PFTool -w 1 -p ABSOLUTE_PATH/FILE_0 ABSOLUTE_PATH/FILE_1
It appears that PFTool gives n-1 non-fatal errors, where n is the number of POSIX files being listed.
In addition, it also cannot list a MarFS file and then a POSIX file, but can list a POSIX file followed by a MarFS file.

return status of MARFS_PATH::close_fh not being checked

In pftool.cpp, around line 2942, the return status of close_fh is not being checked. This could lead to loss of data if the object stream does not close correctly and pftool then marks the files as done.

pfscripts.py assumes pftool location

In pfscripts.py, on line 11, it assumes pftool is located in the ../bin/ directory. The default behaviour should be to check for an environment variable, and failing an environment variable, check the PATH variable.

The script also assumes the location of pftool.cfg is in ../etc/
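
A minimal sketch of the proposed lookup order: an environment variable first, then PATH. The variable name `PFTOOL_BIN` and the function name `find_pftool` are assumptions made for illustration; the issue does not name either.

```python
import os
import shutil


def find_pftool(env=None, env_var="PFTOOL_BIN"):
    """Locate the pftool binary.

    Order: the (hypothetical) PFTOOL_BIN variable if it names an
    executable, otherwise a search of PATH. Returns None if neither
    lookup finds anything.
    """
    env = os.environ if env is None else env
    candidate = env.get(env_var)
    if candidate and os.access(candidate, os.X_OK):
        return candidate
    return shutil.which("pftool", path=env.get("PATH"))
```

The same pattern would apply to pftool.cfg, with the current ../etc/ location kept as the last fallback.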

PFTool defaults to -w 1 when not given a -w argument

When running PFTool with no -w argument, it appears to default to -w 1, i.e. listing the files.
mpirun -np $NP PFTool -p ABSOLUTE_PATH/FILE_0 -c ABSOLUTE_PATH/FILE_1
It works this way regardless of MarFS or POSIX, and regardless of whether or not you include the -c argument.

PFTool copy single file from filesystem to filesystem returns error

When using PFTool to copy a single file from a filesystem to the same filesystem, PFTool copies the file successfully, but returns a non-fatal error. This occurs on both POSIX and MarFS, but does not happen for Packed Files (copying a directory of small files). Below is a simple POSIX to POSIX copy, and below that is a list of the relevant Jenkins tests.

export P=$WORKSPACE/$BUILD_TAG.txt
export C=$WORKSPACE/$BUILD_TAG.2.txt

echo "Hello World" > $P

mpirun -np $NP $PFTOOL -w 0 -c $C -p $P

The failing Jenkins tests are

PFTool_copy_POSIX_unifile_to_POSIX_prove_using_ls
PFCP_POSIX_unifile_to_POSIX_prove_using_ls
PFTool_copy_POSIX_multifile_to_POSIX_prove_using_ls
PFCP_POSIX_multifile_to_POSIX_prove_using_ls
PFTool_copy_MarFS_unifile_to_MarFS_prove_using_ls
PFCP_MarFS_unifile_to_MarFS_prove_using_ls
PFTool_copy_MarFS_multifile_to_MarFS_prove_using_ls
PFCP_MarFS_multifile_to_MarFS_prove_using_ls
PFTool_copy_unifile_from_MarFS_Packed_file_to_MarFS_prove_using_ls
PFCP_MarFS_unifile_from_MarFS_Packed_file_to_MarFS_prove_using_ls

The passing Jenkins tests are

PFTool_copy_POSIX_packed_file_to_POSIX_prove_using_ls
PFCP_POSIX_packed_file_to_POSIX_prove_using_ls
PFTool_copy_MarFS_Packed_file_to_MarFS_prove_using_ls
PFCP_MarFS_Packed_file_to_MarFS_prove_using_ls

pftool include list doesn't work

PFTool does not work correctly with include lists. When listing files with an include list and the -vv option set, pftool does a DATASTAT of the files, but returns the size in bytes of the input file itself rather than the combined size of the files named in it. In Jenkins these are the tests "PFTool_list_three_MarFS_unifiles_from_POSIX_input_file" and "PFTool_list_four_Posix_unifiles_from_POSIX_input_file". Copy and compare with include lists also do not work, and result in a non-zero exit status of PFTool. These are in Jenkins as "PFTool_copy_a_unifile_from_MarFS_to_POSIX_using_an_input_list_prove_using_PFTool", "PFTool_compare_two_different_MarFS_unifiles_using_an_input_list" and "PFTool_compare_the_same_MarFS_unifile_using_an_input_list". Below is a simple reproducer of the bug with listing.

export P=/marfs/testing/test.txt
export P2=/marfs/testing/test.2.txt
export P3=/marfs/testing/test.3.txt
export INPUT_FILE=./input_file.txt

echo "Hello World" > $P
echo "Hello Worl" > $P2
echo "Hello Wor" > $P3

echo $P > $INPUT_FILE
echo $P2 >> $INPUT_FILE
echo $P3 >> $INPUT_FILE

sleep 10

mpirun -np $NP $PFTOOL -w 1 -i $INPUT_FILE

get_output_path() drops dir-component from path, for compare-task

This COPY task:

pftool -p /a/b/c[/] -c /dest -r -w 0

Correctly generates output-paths under /dest/c/, recursively. So, /a/b/c/d would go to /dest/c/d, and so forth.

However, this COMPARE task:

pftool -p /a/b/c[/] -c /dest -r -w 2

drops the "c[/]" from the source. In the case of /a/b/c/d, we'd end up comparing with /dest/d (which probably wouldn't exist), instead of /dest/c/d.

The problem seems to be that get_output_path() should bring along the last component of the dest_node, onto the tail of the new output_path (after the base_path, before adding the "slice" from src_node). This last component is the name of the directory we're descending.

I'm fuzzy on whether this should only be in the o.recursive case, and whether it should only be for the compare-task. I think it works as is for the copy-task because the copy-task computes directory-paths by a separate mechanism, in process_stat_buffers(), whereas for file-paths, it correctly strips off the tail of the source path. In the case of the compare-task, I think we don't know whether stripping off that tail-component is right or not, until we stat it to find out whether it's a directory.
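
To make the expected mapping concrete, here is a small Python sketch of the path arithmetic for the recursive directory case only. It is not a port of get_output_path(); the function and parameter names are hypothetical, and it deliberately ignores the open question of when the last component should be kept.

```python
import posixpath


def compare_output_path(src_base, src_node, dest_base):
    """Map a source node to its destination for a recursive compare.

    Keeps the last component of the source base (the directory being
    descended) between the destination base and the relative "slice".
    """
    src_base = src_base.rstrip("/")
    top = posixpath.basename(src_base)           # e.g. "c"
    rel = posixpath.relpath(src_node, src_base)  # e.g. "d"
    out = posixpath.join(dest_base, top)
    return out if rel == "." else posixpath.join(out, rel)
```

Under these assumptions, /a/b/c/d with a source of /a/b/c/ and a destination of /dest maps to /dest/c/d, which matches what the copy task already produces.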

chunking-loop needs a throttle

The "chunking-loop" in process_stat_buffer() iterates generating chunks and sending them to the manager. The manager receives these (as PROCESCMD) and pushes them onto a linked-list, where they are eventually doled out to workers that need something to do.

In some situations (e.g. gigantic file that will yield many chunks), the chunker can generate these TBD chunks faster than other workers can consume them, causing the manager to eat VM and grow indefinitely.

  • One first-cut solution would be to have workers in the chunking-loop look at the size of the linked list. If it's above some threshold (e.g. 1M entries), then sleep for e.g. 60 seconds and try again.
  • For finer control, instead of polling, there could be a lock. Worker sends message saying "unlock me when list-size falls below threshold/2", then waits on a per-worker lock. Then the manager checks the list size whenever it decreases, and when it falls below the target it unlocks all waiting chunkers.
  • A probably-unnecessarily-complicated approach would be for the chunking loop to bail when the list is full, so it could join the other workers doing the moving of chunks. Then the manager could redeploy someone to resume chunking when the list got below e.g. threshold/2. The problem here is that it implies worker_readdir() could return without having finished its job. Some state would be needed so the re-deploy could pick up where the previous guy left off. If there are plenty of tasks, then this approach seems like unnecessary complication.

This last approach could be necessary if it's possible for there to be multiple large files such that all the workers are stuck in chunking loops, waiting for room in the linked list, with nobody doing moves. We could avoid that by segregating workers (as was apparently the way things were done, once upon a time).
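The second option (block chunkers above a threshold, release them once the list drains below threshold/2) can be sketched with a condition variable. This is a single-process Python model with hypothetical names; in pftool the chunker and manager are separate MPI ranks, so the wait and the unlock would be messages rather than a shared lock.

```python
import threading
from collections import deque

class ThrottledWorkList:
    """Work list that blocks producers at `threshold` entries and releases
    them once the list drains below threshold // 2."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.items = deque()
        self.cond = threading.Condition()
        self.throttled = False

    def put(self, chunk):
        # Called by a chunker: wait while the list is in the throttled state.
        with self.cond:
            while True:
                if len(self.items) >= self.threshold:
                    self.throttled = True
                if not self.throttled:
                    break
                self.cond.wait()
            self.items.append(chunk)

    def get(self):
        # Called when handing a chunk to an idle worker; None if empty.
        with self.cond:
            if not self.items:
                return None
            chunk = self.items.popleft()
            if self.throttled and len(self.items) < self.threshold // 2:
                self.throttled = False
                self.cond.notify_all()  # unlock all waiting chunkers
            return chunk
```

Releasing at threshold/2 rather than at threshold avoids waking the chunkers for every single chunk that is handed out.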

Unable to build pftool

I'm trying to compile pftool on Debian testing, with autoreconf 2.69, but autogen and configure complain about tompi. Any ideas?

$ git clone https://github.com/pftool/pftool.git && cd pftool
$ ./autogen 
libtoolize: putting auxiliary files in '.'.
libtoolize: copying file './ltmain.sh'
libtoolize: putting macros in AC_CONFIG_MACRO_DIRS, 'm4'.
libtoolize: copying file 'm4/libtool.m4'
libtoolize: copying file 'm4/ltoptions.m4'
libtoolize: copying file 'm4/ltsugar.m4'
libtoolize: copying file 'm4/ltversion.m4'
libtoolize: copying file 'm4/lt~obsolete.m4'
configure.ac:64: installing './ar-lib'
configure.ac:40: installing './compile'
configure.ac:65: installing './config.guess'
configure.ac:65: installing './config.sub'
configure.ac:13: installing './install-sh'
configure.ac:13: installing './missing'
automake: warnings are treated as errors
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/coll/allreduce.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
automake: warning: possible forward-incompatibility.
automake: At least a source file is in a subdirectory, but the 'subdir-objects'
automake: automake option hasn't been enabled.  For now, the corresponding output
automake: object file(s) will be placed in the top-level directory.  However,
automake: this behaviour will change in future Automake versions: they will
automake: unconditionally cause object files to be placed in the same subdirectory
automake: of the corresponding sources.
automake: You are advised to start using 'subdir-objects' option throughout your
automake: project, to avoid future incompatibilities.
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/coll/barrier.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/coll/bcast.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/coll/builtin_ops.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/coll/reduce.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/comm/builtin_attribs.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/comm/comm_compare.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/comm/comm_dup.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/comm/comm_group.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/comm/comm_rank.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/comm/comm_size.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/comm/comm_world.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/comm/keyval_create.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/comm/keyval_free.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/comm/simple_fn.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/context/context.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/error/do_nothing.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/error/errhandler_create.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/error/errhandler_free.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/error/errhandler_get.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/error/errhandler_set.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/error/error.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/error/error_class.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/error/error_string.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/error/fatal_error.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/group/captain.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/group/group_compare.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/group/group_rank.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/group/group_size.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/group/group_translate_ranks.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/misc/alloc.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/misc/exit.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/misc/finalize.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/misc/g2tsd_lib.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/misc/global.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/misc/info.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/misc/init.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/misc/initialized.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/misc/me.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/misc/options.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/misc/pcontrol.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/misc/waiter.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/misc/wtime.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/pt2pt/get_count.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/pt2pt/irecv.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/pt2pt/issend.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/pt2pt/match.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/pt2pt/queue.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/pt2pt/recv.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/pt2pt/recv_init.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/pt2pt/ssend.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/pt2pt/ssend_init.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/pt2pt/start.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/pt2pt/wait.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/pt2pt/waitall.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/types/builtin_types.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am:4: warning: source file '$(top_srcdir)/libs/tompi/src/types/type_size.c' is in a subdirectory,
libs/tompi/Makefile.am:4: but option 'subdir-objects' is disabled
libs/tompi/Makefile.am: installing './depcomp'
autoreconf: automake failed with exit status: 1

The configure script is still generated, but it complains about libs/tompi:

$ ./configure --enable-threads
checking whether make supports nested variables... yes
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking for mpicc... no
checking for cc... cc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether cc accepts -g... yes
checking for cc option to accept ISO C89... none needed
checking whether cc understands -c and -o together... yes
checking for style of include used by make... GNU
checking dependency style of cc... gcc3
checking for ar... ar
checking the archiver (ar) interface... ar
checking build system type... x86_64-pc-linux-gnu
checking host system type... x86_64-pc-linux-gnu
checking how to print strings... printf
checking for a sed that does not truncate output... /bin/sed
checking for grep that handles long lines and -e... /bin/grep
checking for egrep... /bin/grep -E
checking for fgrep... /bin/grep -F
checking for ld used by cc... /usr/bin/ld
checking if the linker (/usr/bin/ld) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 1572864
checking how to convert x86_64-pc-linux-gnu file names to x86_64-pc-linux-gnu format... func_convert_file_noop
checking how to convert x86_64-pc-linux-gnu file names to toolchain format... func_convert_file_noop
checking for /usr/bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for dlltool... no
checking how to associate runtime and link libraries... printf %s\n
checking for archiver @FILE support... @
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/bin/nm -B output from cc object... ok
checking for sysroot... no
checking for a working dd... /bin/dd
checking how to truncate binary pipes... /bin/dd bs=4096 count=1
checking for mt... mt
checking if mt is a manifest tool... no
checking how to run the C preprocessor... cc -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for dlfcn.h... yes
checking for objdir... .libs
checking if cc supports -fno-rtti -fno-exceptions... no
checking for cc option to produce PIC... -fPIC -DPIC
checking if cc PIC flag -fPIC -DPIC works... yes
checking if cc static flag -static works... yes
checking if cc supports -c -o file.o... yes
checking if cc supports -c -o file.o... (cached) yes
checking whether the cc linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking whether -lc should be explicitly linked in... no
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking sys/vfs.h usability... yes
checking sys/vfs.h presence... yes
checking for sys/vfs.h... yes
checking gpfs.h usability... no
checking gpfs.h presence... no
checking for gpfs.h... no
checking gpfs_fcntl.h usability... no
checking gpfs_fcntl.h presence... no
checking for gpfs_fcntl.h... no
checking dmapi.h usability... no
checking dmapi.h presence... no
checking for dmapi.h... no
checking xattr.h usability... no
checking xattr.h presence... no
checking for xattr.h... no
checking for pid_t... yes
checking for size_t... yes
checking for stdlib.h... (cached) yes
checking for GNU libc compatible malloc... yes
checking for memset... yes
checking for strerror... yes
checking for strtoul... yes
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating Makefile
config.status: creating libs/Makefile
config.status: error: cannot find input file: `libs/tompi/Makefile.in'

pfcm failures don't set $?

Just noticed that running pfcm -M on two different files doesn't result in setting $? to non-zero even though it results in a nonfatal error.

It makes sense to me that any error (fatal or otherwise) should set $?.

Thoughts?
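One possible policy, sketched in Python: derive the exit status from the error counts that are already reported in the footer. The function name and the specific values 1 and 2 are assumptions for illustration, not existing pftool behavior.

```python
def exit_status(fatal_errors, nonfatal_errors):
    """Return a process exit status that is 0 only if nothing went wrong.

    The values 1 (nonfatal) and 2 (fatal) are illustrative assumptions.
    """
    if fatal_errors > 0:
        return 2
    if nonfatal_errors > 0:
        return 1
    return 0

# A compare that found one mismatch (a nonfatal error) would exit non-zero.
print(exit_status(fatal_errors=0, nonfatal_errors=1))  # 1
```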

Less verbose progress feedback

From Gary Grider:

One of the things we learned in Open Science was that users run pfcp non-verbose and then kill it because of lack of feedback. Then they run pfcp verbose and hate the deluge. Then they kill it and run it again without verbose.

Need a lighter-weight verbose mode. Perhaps multiple "v"s, e.g., -v, -vv, -vvv, -vvvv, with the last being the current verbose mode. Or -v with an argument: the argument is the number of files to process between status updates, and no argument gives the current super-verbose mode.

Users are interested in the progress of the copy: the number of files inspected, moved, etc., reported every 100 files, every 1000 files, and so on. Simply let the trailer proc count and write status to stderr or stdout. Something easy to grep. Trivial but nice to have.
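A rough sketch of the counting idea in Python. The mapping from the number of "v"s to an interval, the class name, and the "INFO PROGRESS" prefix are all assumptions made for illustration.

```python
import sys

def interval_for(verbosity):
    """Hypothetical mapping of -v / -vv / -vvv to a reporting interval.

    0 means no periodic status; 1 means report every file (full verbose).
    """
    return {0: 0, 1: 1000, 2: 100, 3: 10}.get(verbosity, 1)

class ProgressReporter:
    """Counts processed files and writes a status line every `interval` files."""

    def __init__(self, interval, out=sys.stderr):
        self.interval = interval
        self.out = out
        self.examined = 0

    def file_done(self):
        self.examined += 1
        if self.interval and self.examined % self.interval == 0:
            # Fixed prefix so the line is easy to grep.
            print(f"INFO PROGRESS files examined: {self.examined}", file=self.out)
```

With `ProgressReporter(interval_for(2))`, a run over 250 files would print two lines, at 100 and 200 files.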

delete file-CTM when destination doesn't exist

Here's an example of a problematic scenario:

  • run pftool on a non-existent destination
    run long enough to save CTM ( >= 4 chunks for file-CTM)
    ctl-C
  • delete the destination (e.g. through fuse)
  • run pftool again
    run not long enough to save a new CTM file (< 4 chunks)
    ctl-C

Now, the CTM file indicates the 4+ chunks originally written, though there are only the 2 new chunks actually written. When you restart pftool yet again, it will resume writing where the CTM file says the last chunk was written, which would be at e.g. chunk 4+. (Similar behavior happens if the first run writes e.g. 8, and the second run writes 4. In other words, the problem doesn't depend on the second run stopping short of saving CTM.)

It looks like the fix for this would be to add a "purgeCTM(out_node.path)" in the early part of process_stat_buffer(), where it has learned that the destination doesn't exist, but before it starts actual processing.

[I have a patch for this, but I'm testing something else, first.]
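The shape of the proposed fix, sketched in Python with plain files. `purge_stale_ctm` is a hypothetical stand-in for the purgeCTM() call mentioned above; how pftool derives the CTM location from the destination path is not modeled, so both paths are passed in by the caller.

```python
import os

def purge_stale_ctm(dest_path, ctm_path):
    """Remove a leftover CTM file when the destination does not exist.

    Returns True if a stale CTM file was removed, False otherwise.
    """
    if os.path.lexists(dest_path):
        return False  # destination exists: its CTM may describe real chunks
    try:
        os.remove(ctm_path)
    except FileNotFoundError:
        return False  # nothing to purge
    return True
```

Calling this before any chunk is processed means a restart can never resume from chunk counts that describe a destination which has since been deleted.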

Progress indicator

It would be super helpful to have a basic progress indicator short of verbose mode, which makes it waaay too hard to see any errors flying by. Even as someone familiar with the tool, you can't be quite sure at runtime that anything is getting done (from a single terminal) without being blasted with millions of lines of output.

So, I'm proposing something similar to:

dbonnie@testsystem: pfcp /inputfiledir /outputdir
[regular pftool startup output]
Status: W files stat'd, X files moved, Y MB moved, XX:YY elapsed
(repeat every few minutes or something, nothing fancy, no overwriting/animations)
[regular pftool final output]

Something simple that gives users a good idea that the program is still at least doing something for them.
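For concreteness, the proposed status line could be produced by something like the Python sketch below. The function name is hypothetical, and reading "XX:YY" as minutes:seconds is an assumption.

```python
def format_status(files_statted, files_moved, bytes_moved, elapsed_seconds):
    """Build one plain status line; no overwriting or animation."""
    megabytes = bytes_moved // (1024 * 1024)
    minutes, seconds = divmod(int(elapsed_seconds), 60)
    return (f"Status: {files_statted} files stat'd, {files_moved} files moved, "
            f"{megabytes} MB moved, {minutes:02d}:{seconds:02d} elapsed")

print(format_status(1200, 950, 10737418240, 134))
# Status: 1200 files stat'd, 950 files moved, 10240 MB moved, 02:14 elapsed
```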

Pftool uses a regular lstat in file copies where multiple inputs are provided

This causes problems if it is copying to a file system like marfs and the fuse mount is not available.

The cause of the problem is in src/pftool.cpp.
Search for "Multiple inputs and target '%s' is not a directory\n"

Reproduce with:
mpirun -n 4 /home/dejager/pftool/installed/bin/../bin/pftool -w 0 -r -c /marfs/dejager/test/ -p /home/dejager/test/

For non multi/chunked files change when pre and post process are called

The current packed implementation for marfs had to add a function to finalize packed files after all the files in the object have been written. This happens near the end of worker_copylist().

This function could simply be post_process(), except that post_process() currently gets called too soon. I think that for small files it may make sense to have the copy process do the pre_process() work; currently pre_process() is done by the readdir process. For small files, post_process() is called after the file is written but before the work is done. I would like to call post_process() after all of the current work is done, and simplify the interface to path.

Doing it this way should also simplify both the marfs and pftool code involved with packing.

[cpp branch] add new Path subclass to provide "/dev/zero.n.m/"

This is aimed at MarFS directory-sharding, and MD-insert performance testing, but is really a generic facility that would be useful for any write-performance testing.

Similarly to what we already did with /dev/null/ as an output directory (where we provide an arbitrary number of low-cost parallel data-sinks), we should provide /dev/zero.n.m/ as an input directory, providing an arbitrary number of low-cost parallel data sources.

n is the number of files to create. This means the directory-listing in pftool (Path::readdir()) would generate filenames like "test.00001", or something. (Should they be unique per run, by adding an encoding of the date?)

m is the apparent length of the test.* files, as returned by stat().

Calls to read() would return buffers full of zeros.

For MarFS MD testing, you'd run 'pftool ... -p /dev/zero.n.0/ -c /marfs/blah/' and pftool would attempt zero-length file-creates under /marfs/blah/.

It looks like this issue would eliminate the need for everything in the "GEN_SYNDATA" cases. We could clean that up afterwards.


Double-check: When copy_file() is copying a zero-length file, it looks like it will just skip the calls to read() and write(), with length 0. Make sure.

[Maybe we also want a /dev/urandom.n.m/, to fill buffers with random data? Perhaps add a "seed" argument, to make the sequence of data predictable, per file?]
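To make the proposal concrete, here is a small Python model of what such a synthetic source would report. None of these functions exist in pftool; the names are hypothetical, and the "test.NNNNN" naming follows the example given above.

```python
import re

_ZERO_DIR = re.compile(r"/dev/zero\.(\d+)\.(\d+)/?")

def parse_zero_dir(path):
    """Return (n, m) for '/dev/zero.n.m/', or None if path does not match."""
    match = _ZERO_DIR.fullmatch(path)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def list_zero_dir(path):
    """Yield (name, size) pairs, as a synthetic readdir()/stat() might report."""
    n, m = parse_zero_dir(path)
    for i in range(1, n + 1):
        yield f"test.{i:05d}", m

def read_zeros(size, offset, length):
    """Return up to `length` zero bytes, stopping at the apparent file size."""
    return bytes(max(0, min(length, size - offset)))

print(list(list_zero_dir("/dev/zero.2.16/")))
# [('test.00001', 16), ('test.00002', 16)]
```

With m = 0, every read returns an empty buffer, which matches the zero-length file-create case intended for MD testing.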

Occasionally pfls calls fail

We have occasionally been getting this failure on pfls.

********************************************************************************
Use PFTool to list multiple (two) POSIX files
manager: failed to create temp_path ��
ERROR FATAL: parent doesn't exist: ��/h��: 
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

retry of successful recursive copy fails.

[cleaned up]
[latest cpp branch, plus latest libmarfs, etc.]

Seems reproducible. Gets a '500 Internal Server Error' from Scality. That could mean a lot of different things. I saw that happen when we were failing to generate unique new object-names, for example.

[user@04 pftool]# mpirun -np 15 -H localhost installed/bin/pftool -s 16M -S 10G -C 10G -w 0 -r -p /source/1x10G/ -vv -c /dest
INFO HEADER ======================== TestJob ============================
INFO HEADER Starting Path: /source/1x10G/
RANK 3: INFO DATASTAT - drwxrwxr-x 8 19398 19398 4096 Wed Mar 02 2016 10:38:58 /source/1x10G/
RANK 4: INFO DATASTAT - -rw-rw-r-- 0 19398 19398 10737418240 Wed Mar 02 2016 10:38:58 /source/1x10G/f01
RANK 3: INFO DATACOPY Copied /source/1x10G/f01 offs 0 len 10737418240 to /dest/1x10G/f01
INFO FOOTER ======================== NONFATAL ERRORS = 0 ================================
INFO FOOTER =================================================================================
INFO FOOTER Total Files/Links Examined: 1
INFO FOOTER Total Dirs Examined: 1
INFO FOOTER Total Buffers Written: 1
INFO FOOTER Total Bytes Copied: 10737418240
INFO FOOTER Total Megabytes Copied: 10240
INFO FOOTER Data Rate: 75 MB/second
INFO FOOTER Elapsed Time: 134 seconds

[user@04 pftool]# mpirun -np 15 -H localhost installed/bin/pftool -s 16M -S 10G -C 10G -w 0 -r -p /source/1x10G/ -vv -c /dest
INFO HEADER ======================== TestJob ============================
INFO HEADER Starting Path: /source/1x10G/
RANK 3: INFO DATASTAT - drwxrwxr-x 8 19398 19398 4096 Wed Mar 02 2016 10:38:58 /source/1x10G/
RANK 4: INFO DATASTAT - -rw-rw-r-- 0 19398 19398 10737418240 Wed Mar 02 2016 10:38:58 /source/1x10G/f01
RANK 3: ERROR NONFATAL: /dest/1x10G/f01: wrote -1 bytes instead of 16777216 (Input/output error, curl: '500 Internal Server Error')
RANK 3: ERROR NONFATAL: Failed to close dest file: /dest/1x10G/f01 (Connection timed out)
INFO FOOTER ======================== NONFATAL ERRORS = 2 ================================
INFO FOOTER =================================================================================
INFO FOOTER Total Files/Links Examined: 1
INFO FOOTER Total Dirs Examined: 1
INFO FOOTER Total Buffers Written: 0
INFO FOOTER Total Bytes Copied: 0
INFO FOOTER Elapsed Time: 32 seconds

restart on partial-copy is partially-failing on cc-fta

Seeing failures when restarting a Multi copy on the cc-fta test-bed, but not on Hb's cluster. This is not the same problem as #4.

I think this is what is going on:

CTM is updated once every 4 files, though it may be that more objects are written. [This applies when using file-mode record-keeping, as opposed to xattrs, and is controlled by the #define CTF_UPDATE_STORE_LIMIT, in ctf.c.]

Thus, at restart, the first object(s) being rewritten may already exist. Trying to PUT to an object that already exists gets a 500 from Scality on cc-fta, but not on Hb's cluster. Some curl-testing shows that the different Scality versions respond differently to attempts to PUT to an object-path that already exists. (HEAD of an object returns scality-version "128" on Hb's cluster, "64" on cc-fta.)

It appears I can fix the problem by changing CTF_UPDATE_STORE_LIMIT to 0, and rebuilding. This means the CTM file gets updated after every chunk, which is probably not super-efficient.
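The trade-off can be seen in a toy model, sketched in Python. This is not how ctf.c is written: the class, its fields, and the ">=" comparison against the limit are assumptions chosen so that a limit of 0 records every chunk, as observed above.

```python
class ChunkRecord:
    """Toy model of restart bookkeeping that is persisted only periodically."""

    def __init__(self, update_store_limit):
        self.limit = update_store_limit
        self.written = 0    # chunks actually written to the destination
        self.recorded = 0   # chunks the saved record knows about
        self.unsaved = 0    # chunks written since the last save

    def chunk_written(self):
        self.written += 1
        self.unsaved += 1
        if self.unsaved >= self.limit:  # assumption: ">=" comparison
            self.recorded = self.written
            self.unsaved = 0

    def unrecorded(self):
        # Chunks a restart would write again because the record lags behind.
        return self.written - self.recorded

lazy, eager = ChunkRecord(4), ChunkRecord(0)
for _ in range(6):
    lazy.chunk_written()
    eager.chunk_written()
print(lazy.unrecorded(), eager.unrecorded())  # 2 0
```

In this model an interrupted run with limit 4 leaves up to 3 existing objects that the restart will try to PUT again, while limit 0 leaves none at the cost of one save per chunk.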
