pcp's People

Contributors

guycoates, jrandall, mjwoods

pcp's Issues

Lustre stripe error noise when resuming from checkpoint

If a pcp job with -l (preserve Lustre striping) is resumed from a checkpoint, there are a lot of error messages of the form:

error on ioctl 0x4008669a for '/lustre/file/name' (10): stripe already set

which obscure other problems. It would be nice if these messages didn't appear.

(I understand that these are printed directly by the Lustre library)
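If the messages really do come straight from the Lustre C library, redirecting Python's sys.stderr will not catch them; the file descriptor itself has to be redirected. A minimal sketch of that idea follows. The helper name silence_fd is hypothetical and not part of pcp, and it assumes the library writes to descriptor 2.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def silence_fd(fd=2):
    """Temporarily point a file descriptor (default: stderr) at /dev/null.

    Works at the OS level, so it also hides text written by C libraries,
    which replacing sys.stderr alone would not catch.
    """
    sys.stderr.flush()                 # do not lose pending Python output
    saved = os.dup(fd)                 # remember where fd currently points
    devnull = os.open(os.devnull, os.O_WRONLY)
    try:
        os.dup2(devnull, fd)           # fd now discards everything
        yield
    finally:
        os.dup2(saved, fd)             # restore the original target
        os.close(saved)
        os.close(devnull)
```

Wrapping only the call that sets the stripe in "with silence_fd():" would keep the window small. The trade-off is that any other message written to stderr inside that window is lost too.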

Chunking and verification failure

The PCP available on the farm is a slightly modified version of e4b161c (as of 2018-02-27)

PCP has been observed to chunk files incorrectly and then falsely verify those chunks, presumably because of the same underlying problem. That is, chunks are copied incorrectly, but the MD5 sum is computed from what was copied rather than from the actual source, so verification passes.

One can work around this problem by specifying the -b0 option (i.e., no chunking).
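To check independently whether a chunk was copied correctly, the same byte range can be hashed in the source and in the destination, rather than hashing the data as it passes through the copy. This is only a sketch of that check; the function names are made up for illustration and are not taken from pcp.

```python
import hashlib


def md5_of_range(path, offset, length, bufsize=1 << 20):
    """MD5 of `length` bytes of `path`, starting at byte `offset`."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        f.seek(offset)
        remaining = length
        while remaining > 0:
            block = f.read(min(bufsize, remaining))
            if not block:              # reached end of file early
                break
            digest.update(block)
            remaining -= len(block)
    return digest.hexdigest()


def chunk_matches(src, dst, offset, length):
    """True if the same byte range hashes identically in both files."""
    return md5_of_range(src, offset, length) == md5_of_range(dst, offset, length)
```

Because both digests are read back from disk, a chunk written at the wrong offset or with the wrong length shows up as a mismatch. The cost is that the source is read a second time.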

RFE: sync option

Would it be a massive change to provide a sync option that removes files from the target during update operations when they have been removed from the source (as rsync can do)?
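The core of such an option is finding what exists on the target but no longer on the source. A serial sketch of that step is below; it is not how pcp walks trees (pcp scans in parallel), and both function names are hypothetical.

```python
import os


def relative_entries(root):
    """Return the set of file and directory paths under root, relative to it."""
    entries = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            entries.add(os.path.relpath(full, root))
    return entries


def stale_on_target(source, target):
    """Paths present under target but missing from source, deepest first."""
    stale = relative_entries(target) - relative_entries(source)
    # Deepest paths first, so files are listed before their parent directory.
    return sorted(stale, key=lambda p: p.count(os.sep), reverse=True)
```

The sketch only lists candidates. Actually deleting them would probably want a dry-run mode first, as with rsync's delete options.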

Open MPI warns about fork() system call

Open MPI displays the following warning during startup of pcp:

An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

Although the warning is not fatal, it is off-putting for new users.
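Open MPI's full warning text names the MCA parameter mpi_warn_on_fork as the switch for this message, and MCA parameters can be set through environment variables with the OMPI_MCA_ prefix. A small sketch of launching with the warning turned off follows; the helper name is hypothetical, and silencing the warning is only sensible if the fork() call is known to be harmless here.

```python
import os


def env_without_fork_warning(base_env=None):
    """Return a copy of the environment with Open MPI's fork warning disabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Equivalent to passing "--mca mpi_warn_on_fork 0" to mpirun.
    env["OMPI_MCA_mpi_warn_on_fork"] = "0"
    return env
```

The returned dictionary can be passed as the env argument of subprocess.call when starting mpirun from a wrapper script.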

Update option `-u` does not work as expected

pcp -u transfers all files again, even after a successful "full" run. The source is an NFS location; the target is a Lustre system.
First run:
>mpirun -n 4 ~/pcp/pcp -v -p -K "<path_to_checkpoint>" <nfs_source> <lustre dest>

Second run:
>mpirun -n 4 ~/pcp/pcp -v -p -u -K "<path_to_checkpoint>" <nfs_source> <lustre dest>

WARNING: appears to have problems with openmpi on
large job sizes; if pcp hangs, consider using mpich
instead.
R0: All workers have reported in.
Starting 4 processes.
Will only copy files if source is newer than destination
 or destination does not exist.
Will checkpoint every 60 minutes to /lustre/scratch/zipfewo1/transfer/zipfweo1
Files larger than 500 Mbytes will be copied in parallel chunks.

Starting phase I: Scanning and copying directory structure...
Phase I done: Scanned 339 files, 3 dirs in 00 hrs 00 mins 00 secs (2842 items/sec).
 339 files will be copied.

Starting phase II: Copying files...

R2: Nov 22 22:07:06 copied *********  44.00 Mbytes (309.89 Mbytes/s)
R3: Nov 22 22:07:06 copied *********  53.39 Mbytes (302.85 Mbytes/s)
R1: Nov 22 22:07:06 copied **********  49.93 Mbytes (282.28 Mbytes/s)
R2: Nov 22 22:07:06 copied *********  59.33 Mbytes (345.61 Mbytes/s)
R3: Nov 22 22:07:06 copied *********  44.99 Mbytes (254.63 Mbytes/s)
[...]

Phase II done.
R0: Sending SHUTDOWN to workers
rank 1 shutdown
rank 2 shutdown
rank 3 shutdown
R0: Gathering results

Copy Statisics:
Rank 1 copied 5.40 Gbytes in 112 files (320.59 Mbytes/s)
Rank 2 copied 5.39 Gbytes in 114 files (322.45 Mbytes/s)
Rank 3 copied 5.37 Gbytes in 113 files (319.69 Mbytes/s)
Total data copied: 16.17 Gbytes in 339 files (951.28 Mbytes/s)
Total Time for copy: 00 hrs 00 mins 17 secs
Warnings 0

Starting phase III: Setting directory timestamps...
Phase III Done. 00 hrs 00 mins 00 secs

Running it with -u again copies the files again.
Checking the mtime of source and destination shows the same timestamp.
Any ideas?
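One thing worth ruling out is a sub-second difference: tools such as ls show whole seconds by default, so two timestamps can look equal while the source is still fractionally newer. The sketch below is a diagnostic only, not pcp's comparison code, and whether this explains the behaviour is an assumption. It needs Python 3.3 or later for st_mtime_ns.

```python
import os


def compare_mtimes(src, dst):
    """Report how src and dst modification times relate at two precisions."""
    s = os.stat(src).st_mtime_ns
    d = os.stat(dst).st_mtime_ns
    return {
        "src_ns": s,
        "dst_ns": d,
        "newer_at_ns_precision": s > d,
        "newer_at_second_precision": s // 10**9 > d // 10**9,
    }
```

If newer_at_ns_precision is True while newer_at_second_precision is False for a file that was copied again, the destination lost the fractional part of the timestamp somewhere along the way.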

How to detect limits on the number of stripes per file

The current pcp code assumes that the number of stripes per file is only limited by the number of OSTs in the filesystem. This works on the filesystems that I have used, but some versions of Lustre may have separate limits on the total number of OSTs and the number of OSTs used by a single file. If so, I do not know how to find the limits using code that is backwards compatible to older Lustre versions. Any suggestions?
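One approach that does not depend on any particular Lustre version is to probe: try a stripe count and narrow down the largest one the filesystem accepts. The sketch below shows only the search. try_stripe_count is a hypothetical callable that would attempt to create a scratch file with n stripes and report success; it is not an existing pcp or Lustre function. The search also assumes that if n stripes are accepted, any smaller count is accepted too.

```python
def max_accepted_stripes(try_stripe_count, ost_count):
    """Largest n in [1, ost_count] for which try_stripe_count(n) is True.

    Returns 0 if even a single stripe is rejected.
    """
    low, high = 0, ost_count          # low is always a known-good count (or 0)
    while low < high:
        mid = (low + high + 1) // 2   # round up so the loop always progresses
        if try_stripe_count(mid):
            low = mid
        else:
            high = mid - 1
    return low
```

The binary search needs about log2(ost_count) probes, so the result could be found once at startup and reused for every file.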

getting the trace while running with -lf flag

Hello Team

I am getting the following traceback when using pcp to copy to a Lustre filesystem.

Without the -lf flag, pcp works as expected.

Environment: Python 2.7, pcp from git.

Thank you

Tue Mar 20 17:09:19 2018:WARNING: appears to have problems with openmpi on
Tue Mar 20 17:09:19 2018:large job sizes; if pcp hangs, consider using mpich
Tue Mar 20 17:09:19 2018:instead.
Tue Mar 20 17:09:19 2018:R0: All workers have reported in.
Tue Mar 20 17:09:19 2018:Starting 32 processes.
Tue Mar 20 17:09:19 2018:Files larger than 500 Mbytes will be copied in parallel chunks.
Tue Mar 20 17:09:19 2018:Will force stripe all files.
Tue Mar 20 17:09:19 2018:Will md5 verify copies.
Tue Mar 20 17:09:19 2018:
Tue Mar 20 17:09:19 2018:Starting phase I: Scanning and copying directory structure...
Tue Mar 20 17:09:19 2018:Exception on rank 0 host node25:
Tue Mar 20 17:09:19 2018:Traceback (most recent call last):
Tue Mar 20 17:09:19 2018: File "/usr/bin/pcp", line 1533, in <module>
Tue Mar 20 17:09:19 2018: scantree(sourcedir, destdir, statedb)
Tue Mar 20 17:09:19 2018: File "/usr/bin/pcp", line 354, in scantree
Tue Mar 20 17:09:19 2018: listofpaths = walker.Execute(sourcedir)
Tue Mar 20 17:09:19 2018: File "/usr/local/lib/python2.7/site-packages/pcplib/parallelwalk.py", line 260, in Execute
Tue Mar 20 17:09:19 2018: self._ProcessNode()
Tue Mar 20 17:09:19 2018: File "/usr/local/lib/python2.7/site-packages/pcplib/parallelwalk.py", line 175, in _ProcessNode
Tue Mar 20 17:09:19 2018: self.ProcessDir(filename)
Tue Mar 20 17:09:19 2018: File "/usr/bin/pcp", line 1345, in ProcessDir
Tue Mar 20 17:09:19 2018: copyDir(directoryname, newdir)
Tue Mar 20 17:09:19 2018: File "/usr/bin/pcp", line 1042, in copyDir
Tue Mar 20 17:09:19 2018: layout = lustreapi.getstripe(sourcedir)
Tue Mar 20 17:09:19 2018: File "/usr/local/lib/python2.7/site-packages/pcplib/lustreapi.py", line 121, in getstripe
Tue Mar 20 17:09:19 2018: raise IOError(err, os.strerror(err))
Tue Mar 20 17:09:19 2018:IOError: [Errno 25] Inappropriate ioctl for device
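Errno 25 is ENOTTY on Linux, which an ioctl typically returns when the target does not support the request. The traceback shows the call is getstripe(sourcedir), so one plausible reading is that the source directory is not on Lustre and has no layout to read. A sketch of tolerating that case is below; the wrapper name is hypothetical, and getstripe is passed in so that the example runs on its own.

```python
import errno


def getstripe_or_none(getstripe, path):
    """Call getstripe(path); return None if path has no Lustre layout.

    In pcp, `getstripe` would presumably be lustreapi.getstripe, the
    function named in the traceback above.
    """
    try:
        return getstripe(path)
    except IOError as exc:
        if exc.errno == errno.ENOTTY:   # "Inappropriate ioctl for device"
            return None                 # not Lustre: nothing to preserve
        raise                           # anything else is a real error
```

A caller that gets None could fall back to the destination's default striping instead of aborting the whole scan.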

preserve stripe information option?

It seems that the stripe options for Lustre are simply stripe=1 or stripe=-1. Would it be possible to add an option that preserves the striping as it was on the source?

terminal window size change causes crash

Running pcp inside screen seems not to like the window size being changed - I can reproducibly make it die with the error "Profiling timer expired" by maximising or unmaximising the window.

I'd thought this was caused by SIGWINCH, but the default is to ignore SIGWINCH. So, unsurprisingly, adding
signal.signal(signal.SIGWINCH, signal.SIG_IGN)
after the existing signal handler setup doesn't help.

I've also seen this error when MPI's not working right, so possibly it's something lower down that's complaining.

This is probably a "wishlist" bug, because pcp can be run in a non-interactive job to avoid the problem.
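For what it is worth, "Profiling timer expired" is the standard C library description of SIGPROF, so the process may be dying from SIGPROF rather than SIGWINCH. Why a resize would deliver SIGPROF is unclear, and whether ignoring it actually cures the crash is untested, so the following is only something to try. The function name is made up for illustration.

```python
import signal


def ignore_sigprof():
    """Ignore SIGPROF and return the previous handler so it can be restored."""
    return signal.signal(signal.SIGPROF, signal.SIG_IGN)
```

This would go next to the existing signal handler setup. It must be called from the main thread, and it will also defeat any profiler that relies on SIGPROF.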
