Comments (6)
a) Atomic file creation will prevent a partial file from being present on error.
b) Unless you are out of disk space (which (a) handles), we should not be having filesystem problems. It's very reliable.
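A minimal sketch of the pattern being proposed, assuming a POSIX filesystem (the helper name is hypothetical, not jobTree's actual code): write to a temporary file in the same directory, force it to disk, then atomically rename it over the target.

```python
import os

def atomicWrite(path, data):
    # Hypothetical helper, not jobTree's actual code. Readers either see
    # the old complete file or the new complete file, never a partial one,
    # because os.rename is atomic within a single POSIX filesystem.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # data must hit disk before the rename commits
    os.rename(tmp, path)
```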
Joel Armstrong [email protected] writes:
I just had several jobTrees fail (presumably due to a filesystem problem) with
this error:

Batch system is reporting that the job (1, 298079848) /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t1/t0/job failed with exit value 256
No log file is present, despite job failing: /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t1/t0/job
Due to failure we are reducing the remaining retry count of job /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t1/t0/job to 3
We have set the default memory of the failed job to 8589934593.0 bytes
Batch system is reporting that the job (1, 298079849) /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t2/t0/job failed with exit value 256
No log file is present, despite job failing: /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t2/t0/job
Due to failure we are reducing the remaining retry count of job /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t2/t0/job to 3
We have set the default memory of the failed job to 8589934593.0 bytes
Batch system is reporting that the job (1, 298079823) /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t2/t3/t1/t2/job failed with exit value 256
There was a .new file for the job and no .updating file /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t2/t3/t1/t2/job
Traceback (most recent call last):
File "/cluster/home/jcarmstr/progressiveCactus/submodules/cactus/bin/cactus_progressive.py", line 239, in
main()
File "/cluster/home/jcarmstr/progressiveCactus/submodules/cactus/progressive/cactus_progressive.py", line 235, in main
Stack(RunCactusPreprocessorThenProgressiveDown(options, args)).startJobTree(options)
File "/cluster/home/jcarmstr/progressiveCactus/submodules/jobTree/scriptTree/stack.py", line 95, in startJobTree
return mainLoop(config, batchSystem)
File "/cluster/home/jcarmstr/progressiveCactus/submodules/jobTree/src/master.py", line 454, in mainLoop
processFinishedJob(jobID, result, updatedJobFiles, jobBatcher, childJobFileToParentJob, childCounts, config)
File "/cluster/home/jcarmstr/progressiveCactus/submodules/jobTree/src/master.py", line 270, in processFinishedJob
job = Job.read(jobFile)
File "/cluster/home/jcarmstr/progressiveCactus/submodules/jobTree/src/job.py", line 39, in read
job = _convertJsonJobToJob(pickler.load(fileHandle))
EOFError: EOF read where object expected

I haven't looked into jobTree internals in depth, but I think this is due to
the job writing a completely blank pickle file (the t2/t3/t1/t2/job file has
size 0). I'll look to see if we can be resilient to these types of errors and
just retry the job if this happens. It wouldn't have helped in this particular
case, since all jobs were failing in the same way, but presumably this could
also happen if a single job is killed in the wrong way.
Didn't mean to imply that GPFS itself was truncating the file somehow -- just that there were issues with read/writes blocking for long times that could've easily contributed to the process itself writing only a partial file and triggering a previously unnoticed race condition, or simply getting killed at an inopportune time.
Atomic file creation is a great idea for this file; I think that's the correct fix here.
We already do try to do this, see:
https://github.com/benedictpaten/jobTree/blob/master/src/job.py#L43
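For reference, the mechanism the linked code implements, as the log messages above describe it (a paraphrased sketch, not the literal source): the writer touches an ".updating" marker, writes the new state to ".new", removes the marker, then renames ".new" over the old file, so the recovery path can tell which phase was interrupted.

```python
import os

def recoverJobFile(jobFile):
    # Paraphrase of the .updating/.new recovery logic, not the literal
    # source. Writer protocol: touch job.updating -> write job.new ->
    # remove job.updating -> rename job.new over job.
    updating = jobFile + ".updating"
    new = jobFile + ".new"
    if os.path.exists(updating):
        # Writer died mid-write: job.new may be partial, so discard it and
        # fall back to the previous complete version of the job file.
        if os.path.exists(new):
            os.remove(new)
        os.remove(updating)
    elif os.path.exists(new):
        # "There was a .new file for the job and no .updating file": the
        # write finished, so job.new is complete and can be promoted.
        os.rename(new, jobFile)
    return jobFile
```

Under that scheme, a job file promoted from ".new" should never be empty unless the filesystem lost or reordered the preceding writes.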
Oh wow, hadn't looked at the code yet. That seems pretty foolproof. Could this have the same root cause as the issue we had with the cactus sequences file? I'm stumped.
Actually, the code can be simplified. The ".updating" file was there because of how we used to do it; it can be removed, and we should be able to just rely on the atomic move of the ".new" version over the previous version of the file. Doing this would be good, but I'd need to look through the other code to see how we check for the existence of the ".updating" file.
If the filesystem itself went wrong I think all bets are off.
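A sketch of what that simplification might look like (hypothetical code, assuming the writer fsyncs before renaming): the rename becomes the single commit point, so a leftover ".new" can always be discarded and the marker file is unnecessary.

```python
import os

def writeJob(jobFile, serialised):
    # Hypothetical simplified writer: no ".updating" marker. The rename is
    # the single atomic commit point of the update.
    new = jobFile + ".new"
    with open(new, "wb") as f:
        f.write(serialised)
        f.flush()
        os.fsync(f.fileno())
    os.rename(new, jobFile)

def recoverJobFile(jobFile):
    # A leftover ".new" now simply means the update never committed; the
    # previous version of the job file is still the authoritative one, so
    # the stale ".new" can always be discarded.
    new = jobFile + ".new"
    if os.path.exists(new):
        os.remove(new)
    return jobFile
```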
I have just encountered the same error in jobTree as part of a progressiveCactus run. The error message is below:
Batch system is reporting that the job (1, 4907) /n/home12/tsackton/ratite_scratch/wga/ratite_alignment/ratiteDir/jobTree/jobs/t1/t1/t1/t3/t3/t0/t0/t1/t1/job failed with exit value 1
There was a .new file for the job and no .updating file /n/home12/tsackton/ratite_scratch/wga/ratite_alignment/ratiteDir/jobTree/jobs/t1/t1/t1/t3/t3/t0/t0/t1/t1/job
Traceback (most recent call last):
File "/n/home12/tsackton/sw/progs/progressiveCactus/submodules/cactus/bin/cactus_progressive.py", line 239, in <module>
main()
File "/n/home12/tsackton/sw/progs/progressiveCactus/submodules/cactus/progressive/cactus_progressive.py", line 235, in main
Stack(RunCactusPreprocessorThenProgressiveDown(options, args)).startJobTree(options)
File "/n/home12/tsackton/sw/progs/progressiveCactus/submodules/jobTree/scriptTree/stack.py", line 95, in startJobTree
return mainLoop(config, batchSystem)
File "/n/home12/tsackton/sw/progs/progressiveCactus/submodules/jobTree/src/master.py", line 454, in mainLoop
processFinishedJob(jobID, result, updatedJobFiles, jobBatcher, childJobFileToParentJob, childCounts, config)
File "/n/home12/tsackton/sw/progs/progressiveCactus/submodules/jobTree/src/master.py", line 270, in processFinishedJob
job = Job.read(jobFile)
File "/n/home12/tsackton/sw/progs/progressiveCactus/submodules/jobTree/src/job.py", line 39, in read
job = _convertJsonJobToJob(pickler.load(fileHandle))
EOFError: EOF read where object expected
It is possible that some filesystem issues are in play here, as our main shared filesystem had some recent instability, but at least in theory that was fixed >24 hours ago and after I started the progressiveCactus run. I should also note that this is using the Harvard Informatics SLURM fork of jobTree (https://github.com/harvardinformatics/jobTree), but I think the relevant code here is unchanged.
Edited to add: the directory with the failed job identified above has a 0-length job file and a 0-length updating file in it, in case that is relevant.
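Per Joel's earlier suggestion of being resilient to this class of error, a hypothetical guard around Job.read (the wrapper name and logger are illustrative, not actual jobTree code; assumes Job is imported from jobTree's job module) that turns a truncated or 0-length pickle into a per-job failure instead of crashing the master:

```python
import logging

logger = logging.getLogger("jobTree")

def readJobSafely(jobFile):
    # Hypothetical guard around Job.read (src/job.py line 39 in the
    # traceback above). An empty or truncated pickle raises EOFError;
    # surface it as a per-job failure rather than killing the master loop.
    try:
        return Job.read(jobFile)
    except (EOFError, IOError) as e:
        logger.warning("Job file %s is unreadable (%s); treating the job "
                       "as failed so it can be retried" % (jobFile, e))
        return None  # caller reduces the retry count and reissues the job
```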