benedictpaten / jobtree

Python-based pipeline management software for clusters (but check out its successor, toil: https://github.com/BD2KGenomics/toil)

License: MIT License

Makefile 0.43% Python 99.57%

jobtree's Issues

Crash getting global temp dir when restarting jobtree

When I go to restart one of my jobTree scripts that uses the global and local temp directories, after some of its jobs have run, I get a crash from the internal jobTree code complaining that some directories do not exist.

Can the code be made to handle the lack of existence of those directories?

log.txt:    ---JOBTREE SLAVE OUTPUT LOG---
log.txt:    Traceback (most recent call last):
log.txt:      File "/cluster/home/anovak/.local/lib/python2.7/site-packages/jobTree-1.0-py2.7.egg/jobTree/src/jobTreeSlave.py", line 271, in main
log.txt:        defaultMemory=defaultMemory, defaultCpu=defaultCpu, depth=depth)
log.txt:      File "/cluster/home/anovak/.local/lib/python2.7/site-packages/jobTree-1.0-py2.7.egg/jobTree/scriptTree/stack.py", line 153, in execute
log.txt:        self.target.run()
log.txt:      File "/cluster/home/anovak/hive/sgdev/mhc/targets.py", line 244, in run
log.txt:        index_dir = sonLib.bioio.getTempFile(rootDir=self.getGlobalTempDir())
log.txt:      File "/cluster/home/anovak/.local/lib/python2.7/site-packages/jobTree-1.0-py2.7.egg/jobTree/scriptTree/target.py", line 103, in getGlobalTempDir
log.txt:        self.globalTempDir = self.stack.getGlobalTempDir()
log.txt:      File "/cluster/home/anovak/.local/lib/python2.7/site-packages/jobTree-1.0-py2.7.egg/jobTree/scriptTree/stack.py", line 129, in getGlobalTempDir
log.txt:        return getTempDirectory(rootDir=self.globalTempDir)
log.txt:      File "/cluster/home/anovak/.local/lib/python2.7/site-packages/sonLib-1.0-py2.7.egg/sonLib/bioio.py", line 457, in getTempDirectory
log.txt:        os.mkdir(rootDir)
log.txt:    OSError: [Errno 20] Not a directory: '/cluster/home/anovak/hive/sgdev/mhc/tree7/jobs/t2/t3/t1/t0/gTD0/tmp_Zss3uyl5X6/tmp_45OevDhWor/tmp_vxiVIbzGSw'
log.txt:    Exiting the slave because of a failed job on host ku-1-21.local
log.txt:    Due to failure we are reducing the remaining retry count of job /cluster/home/anovak/hive/sgdev/mhc/tree7/jobs/t2/t3/t1/t0/job to 0
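One possible mitigation, sketched here under the assumption that the missing directories can simply be recreated (the Errno 20 "Not a directory" above suggests a path component may have become a regular file, which would additionally need cleanup). The helper name is illustrative, not jobTree's actual API:

```python
import errno
import os

def ensureTempDirectory(rootDir):
    """Recreate rootDir (and any missing parents) instead of assuming it
    survived a restart. Hypothetical helper; jobTree's getTempDirectory
    currently calls os.mkdir and crashes if a parent is gone."""
    try:
        os.makedirs(rootDir)
    except OSError as e:
        # Another slave may have created it between our check and mkdir.
        if e.errno != errno.EEXIST:
            raise
    return rootDir
```

Calling this before handing the directory to getTempFile would make the restart path tolerant of temp directories that were cleaned up while the tree was stopped.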

errors in "make test"

Hi Benedict,

I got some error messages when I ran "make test":


Starting to create the job tree setup for the first time
Traceback (most recent call last):
  File "/projects/gec/tool/assemblathon1/jobTree/bin/scriptTreeTest_Sort.py", line 119, in <module>
    main()
  File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/test/sort/scriptTreeTest_Sort.py", line 112, in main
    i = Stack(Setup(options.fileToSort, int(options.N))).startJobTree(options)
  File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/scriptTree/stack.py", line 112, in startJobTree
    config, batchSystem = createJobTree(options)
  File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/src/jobTreeRun.py", line 221, in createJobTree
    batchSystem = loadTheBatchSystem(config)
  File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/src/jobTreeRun.py", line 157, in loadTheBatchSystem
    batchSystem = GridengineBatchSystem(config)
  File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/batchSystems/gridengine.py", line 136, in __init__
    self.obtainSystemConstants()
  File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/batchSystems/gridengine.py", line 267, in obtainSystemConstants
    p = subprocess.Popen(["qhost"], stdout = subprocess.PIPE,stderr = subprocess.STDOUT)
  File "/usr/lib64/python2.6/subprocess.py", line 595, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1106, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
E
======================================================================
ERROR: Uses the jobTreeTest code to test the scriptTree Target wrapper.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/test/scriptTree/scriptTreeTest.py", line 31, in testScriptTree_Example
    system(command)
  File "/gpfs2/projects/gec/tool/assemblathon1/sonLib/bioio.py", line 160, in system
    raise RuntimeError("Command: %s exited with non-zero status %i" % (command, i))
RuntimeError: Command: scriptTreeTest_Wrapper.py --jobTree /gpfs2/projects/gec/tool/assemblathon1/jobTree/tmp_KEwg5sdOhb/jobTree --logLevel=INFO --retryCount=10 exited with non-zero status 256

======================================================================
ERROR: Tests that the global and local temp dirs of a job behave as expected.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/test/scriptTree/scriptTreeTest.py", line 39, in testScriptTree_Example2
    system(command)
  File "/gpfs2/projects/gec/tool/assemblathon1/sonLib/bioio.py", line 160, in system
    raise RuntimeError("Command: %s exited with non-zero status %i" % (command, i))
RuntimeError: Command: scriptTreeTest_Wrapper2.py --jobTree /gpfs2/projects/gec/tool/assemblathon1/jobTree/tmp_GKficygzKn/jobTree --logLevel=INFO --retryCount=0 exited with non-zero status 256

======================================================================
ERROR: Tests the jobTreeStats utility using the scriptTree_sort example.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/test/utilities/statsTest.py", line 35, in testJobTreeStats_SortSimple
    system(command)
  File "/gpfs2/projects/gec/tool/assemblathon1/sonLib/bioio.py", line 160, in system
    raise RuntimeError("Command: %s exited with non-zero status %i" % (command, i))
RuntimeError: Command: scriptTreeTest_Sort.py --jobTree /gpfs2/projects/gec/tool/assemblathon1/jobTree/tmp_rZqK92206Q/jobTree --logLevel=DEBUG --fileToSort=/gpfs2/projects/gec/tool/assemblathon1/jobTree/tmp_rZqK92206Q/tmp_wj73HRVMZx --N 1000 --stats --jobTime 0.5 exited with non-zero status 256

----------------------------------------------------------------------
Ran 14 tests in 318.742s

FAILED (errors=3)

Could I ask what happened?
Thank you.

Regards,

Yun
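
The first traceback bottoms out in subprocess.Popen(["qhost"]): OSError: [Errno 2] means the Grid Engine client tools are not on this machine's PATH, so the gridengine batch system cannot start. A hedged sketch of a guard jobTree could run before constructing GridengineBatchSystem (the helper name is hypothetical):

```python
import subprocess

def commandAvailable(command):
    """Return True if the argv list `command` can be executed.
    GridengineBatchSystem could check commandAvailable(["qhost"]) and
    raise a clear error instead of dying with OSError: [Errno 2]."""
    try:
        p = subprocess.Popen(command, stdout=subprocess.PIPE,
                             stderr=subprocess.STDOUT)
        p.communicate()
        return True
    except OSError:
        return False
```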

Should make package pip installable

Obviously, though some consideration needs to be given to the command-line binaries; these could possibly be managed via the Python path, removing the need for separate command-line binaries.
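
A sketch of what a pip-installable layout could declare: expose the bin/ scripts as setuptools console_scripts entry points so pip puts them on PATH. The package list and module:function targets are assumptions, not the repository's actual layout:

```python
# Hypothetical setup.py fragment; one would call setup(**SETUP_KWARGS).
SETUP_KWARGS = dict(
    name="jobTree",
    version="1.0",
    packages=["jobTree", "jobTree.src", "jobTree.scriptTree",
              "jobTree.batchSystems"],
    entry_points={
        "console_scripts": [
            # "name = module:function" — assumed locations of main()
            "jobTreeRun = jobTree.src.jobTreeRun:main",
            "jobTreeStatus = jobTree.src.jobTreeStatus:main",
            "jobTreeSlave = jobTree.src.jobTreeSlave:main",
        ],
    },
)
```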

entire jobTree crashes when a job's pickle file isn't written correctly

I just had several jobTrees fail (presumably due to a filesystem problem) with this error:

Batch system is reporting that the job (1, 298079848) /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t1/t0/job failed with exit value 256
No log file is present, despite job failing: /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t1/t0/job
Due to failure we are reducing the remaining retry count of job /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t1/t0/job to 3
We have set the default memory of the failed job to 8589934593.0 bytes
Batch system is reporting that the job (1, 298079849) /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t2/t0/job failed with exit value 256
No log file is present, despite job failing: /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t2/t0/job
Due to failure we are reducing the remaining retry count of job /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t2/t0/job to 3
We have set the default memory of the failed job to 8589934593.0 bytes
Batch system is reporting that the job (1, 298079823) /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t2/t3/t1/t2/job failed with exit value 256
There was a .new file for the job and no .updating file /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t2/t3/t1/t2/job
Traceback (most recent call last):
  File "/cluster/home/jcarmstr/progressiveCactus/submodules/cactus/bin/cactus_progressive.py", line 239, in <module>
    main()
  File "/cluster/home/jcarmstr/progressiveCactus/submodules/cactus/progressive/cactus_progressive.py", line 235, in main
    Stack(RunCactusPreprocessorThenProgressiveDown(options, args)).startJobTree(options)
  File "/cluster/home/jcarmstr/progressiveCactus/submodules/jobTree/scriptTree/stack.py", line 95, in startJobTree
    return mainLoop(config, batchSystem)
  File "/cluster/home/jcarmstr/progressiveCactus/submodules/jobTree/src/master.py", line 454, in mainLoop
    processFinishedJob(jobID, result, updatedJobFiles, jobBatcher, childJobFileToParentJob, childCounts, config)
  File "/cluster/home/jcarmstr/progressiveCactus/submodules/jobTree/src/master.py", line 270, in processFinishedJob
    job = Job.read(jobFile)
  File "/cluster/home/jcarmstr/progressiveCactus/submodules/jobTree/src/job.py", line 39, in read
    job = _convertJsonJobToJob(pickler.load(fileHandle))
EOFError: EOF read where object expected

I haven't looked into jobTree internals in depth, but I think this is due to the job writing a completely blank pickle file (the t2/t3/t1/t2/job file has size 0). I'll look to see if we can be resilient to these types of errors and just retry the job if this happens. It wouldn't have helped in this particular case, since all jobs were failing in the same way, but presumably this could also happen if a single job is killed in the wrong way.
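
A hedged sketch of that resilience: treat an empty or truncated pickle as a job failure rather than crashing the master. The function name and the None convention are illustrative, not jobTree's actual Job.read API:

```python
import pickle

def readJobSafely(jobFile):
    """Return the unpickled job, or None if the job file is empty or
    truncated (e.g. after a filesystem problem), so the caller can
    treat the job as failed and retry it instead of raising EOFError
    out of the main loop."""
    try:
        with open(jobFile, "rb") as fileHandle:
            return pickle.load(fileHandle)
    except (EOFError, pickle.UnpicklingError):
        return None
```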

argparse functionality broken by 76a4328

Commit 76a4328 broke argparse functionality by failing to maintain the newer method of displaying default values in help strings. The old, optparse method is to use

    help="blah blah blah default=%default"

and the argparse method is to use

    help="blah blah blah default=%(default)s"

The two methods are incompatible with one another, which is why we had two code paths to handle them.

76a4328
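
The difference shows up directly in a minimal argparse parser (the option name here is chosen for illustration): argparse's help formatter interpolates only the "%(default)s" form.

```python
import argparse

parser = argparse.ArgumentParser(prog="jobTreeRun")
# argparse substitutes "%(default)s" when rendering help; the optparse
# token "%default" is not understood and breaks the "%"-interpolation.
parser.add_argument("--retryCount", type=int, default=5,
                    help="number of times to retry a failing job, "
                         "default=%(default)s")
helpText = parser.format_help()
```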

job failures not correctly reducing retry count

Courtesy of Tim Sackton:

I ran into another issue with progressiveCactus where a job that is issued with too little memory to finish will lead to a loop of that job getting restarted, failing, and getting restarted again, without the retryCount (apparently) going down at all. So basically we get stuck in a loop where one job is constantly failing.

Am I misunderstanding something? It seems like retryCount should decrement each time the job fails, but you can see from the log here (https://gist.github.com/tsackton/03b1605c4e29762376f2) that the failing job is reissued several times, but after the second and third failures the retry count is still at 5.

Is this a bug or an error in my code/understanding? It could easily be the latter....

Regardless of whether this is how retries are supposed to work, I was able to get past that error by doubling the memory the retry gets each time a job fails (see here: https://github.com/harvardinformatics/jobTree/blob/master/src/master.py#L79). Ideally I'd also be able to increase the amount of time a job requests, as that would be the other reason to get consistent failures, but I don't see how to do that, or even if it is possible.
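
For reference, the intended bookkeeping, with the memory-doubling workaround folded in, can be sketched like this (the dict fields stand in for jobTree's real job state; names are illustrative):

```python
def handleFailedJob(job):
    """Each failure should decrement remainingRetryCount and, per the
    harvardinformatics workaround, double the memory for the next
    attempt; when retries run out the job is marked failed."""
    if job["remainingRetryCount"] > 0:
        job["remainingRetryCount"] -= 1
        job["memory"] *= 2
        return "retry"
    return "failed"
```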

jobTree support for argparse

optparse is deprecated; argparse is its replacement. This can be patched by checking the class of the parser passed to the addOptions() function in jobTree.src.jobTreeRun, but the same change will also be needed in sonLib.bioio's addLoggingOptions(). I think the way jobTree and sonLib have used optparse is generic enough that the transition would be transparent to script writers.
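
A sketch of that class-check approach, assuming the real addOptions()/addLoggingOptions() signatures can dispatch the same way:

```python
import argparse
import optparse

def addLoggingOptions(parser):
    """Add a --logLevel option to either parser flavour by dispatching
    on the parser's class. Option name and default are illustrative;
    the real sonLib/jobTree helpers may add more options."""
    if isinstance(parser, argparse.ArgumentParser):
        parser.add_argument("--logLevel", default="INFO",
                            help="log level, default=%(default)s")
    elif isinstance(parser, optparse.OptionParser):
        parser.add_option("--logLevel", default="INFO",
                          help="log level, default=%default")
    else:
        raise TypeError("Unrecognised parser type: %s" % type(parser))
```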

periodically hangs entire parasol hub when listing jobs

This has been a problem for a while, but I'm just putting an issue up so I remember to fix this somehow.

When parasol has more than a million or so jobs queued, like now, the periodic "parasol -extended list jobs" command that jobTree runs hangs the entire parasol hub process for a couple of minutes while it gets a listing of every job. This sucks, because the cluster nodes start to go idle waiting for work: the hub can't issue new jobs while it's busy sending the list of queued jobs to jobTree. This gets even worse when there are a few jobTrees running; the cluster sometimes sits completely idle for several minutes.

We (read: I) should try to find some way around listing every job, maybe by looking to see if there's a way we can get the same information, but limited to just the jobTree batch rather than all batches. If there isn't a way currently, maybe modify parasol to include that functionality.

Parameter changes have no effect when restarting jobTree

When restarting a jobTree, parameter changes to the script (like changing the batch system) do not have any effect, because such parameters are overwritten when the state of the earlier jobTree is read. This is poor behaviour and should be fixed; at a minimum, warnings should be issued.
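
At minimum, the restart path could diff the freshly parsed options against the stored config and warn. A minimal sketch, using plain dicts in place of jobTree's real stored config (the function name is hypothetical):

```python
def findChangedOptions(storedConfig, newOptions):
    """Return (key, stored, new) tuples for every option the user
    changed on restart, so a warning can be logged before the stored
    values silently win over the command line."""
    return [(key, storedConfig[key], newValue)
            for key, newValue in sorted(newOptions.items())
            if key in storedConfig and storedConfig[key] != newValue]
```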

Relative imports don't always work in jobTree

Per our conversation, jobTree sometimes loses track of where imports are coming from.

Reporting file: /hive/users/dearl/alignathon/testPSAR/jobTree_flies_reg2_swarm/jobs/tmp_IqnDTgr8uv/tmp_AzbE0HYFWt/tmp_70td82ax1K/log.txt
log.txt:        Parsed arguments and set up logging
log.txt:        Traceback (most recent call last):
log.txt:          File "/cluster/home/dearl/sonTrace/jobTree/bin/jobTreeSlave", line 206, in main
log.txt:            loadStack(command).execute(job=job, stats=stats,
log.txt:          File "/cluster/home/dearl/sonTrace/jobTree/bin/jobTreeSlave", line 53, in loadStack
log.txt:            _temp = __import__(moduleName, globals(), locals(), [className], -1)
log.txt:        ImportError: No module named batchPsar
log.txt:        Exiting the slave because of a failed job on host kkr18u44.local
log.txt:        Finished running the chain of jobs on this node, we ran for a total of 5.177906 seconds

This requires explicitly setting PYTHONPATH when running on swarm; when running on kolossus, simply running from the same directory as the script is sufficient.
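
One possible fix, sketched with a hypothetical helper: have the slave put the directory containing the user's script on sys.path before the __import__ call, so PYTHONPATH does not need to be exported on every host:

```python
import sys

def importUserModule(moduleDir, moduleName):
    """Make the script's own directory importable before loading the
    user's module. Mirrors the __import__ in jobTreeSlave's loadStack,
    but the helper itself is illustrative, not jobTree's actual code."""
    if moduleDir not in sys.path:
        sys.path.insert(0, moduleDir)
    return __import__(moduleName)
```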

Eliminate the cryptic use of colors to hide status

What does the use of colors, instead of descriptive English sentences, save you in a command-line cluster-management program? Because it costs your users brain space to store your chromatic cypher.

Furthermore, "dead" is not a color.

When do the tests end?

The basic make test takes ages... On a singleMachine I see a few dozen processes, all of them doing strictly nothing. Does this mean that some kind of race condition was hit? Or that the test was designed to test me?
