
benedictpaten / jobtree


Python based pipeline management software for clusters (but check out toil: https://github.com/BD2KGenomics/toil, its successor)

License: MIT License


jobtree's Introduction

#jobTree

Python based pipeline management software for clusters that makes running recursive and dynamically scheduled computations straightforward. So far it works with gridEngine, lsf, parasol and on multi-core machines.

##Authors

Benedict Paten, Dent Earl, Daniel Zerbino, Glenn Hickey and other UCSC people.

##Requirements

  • Python 2.5 or later, but less than 3.0

##Installation

  1. Install sonLib. See https://github.com/benedictpaten/sonLib

  2. Place the directory containing jobTree in the same directory as sonLib. The directory containing both sonLib and jobTree should be on your Python path, i.e. PYTHONPATH=${PYTHONPATH}:FOO, where FOO is the directory containing jobTree (so that FOO/jobTree is the base directory of jobTree).

  3. Build the code: type 'make all' in the base directory; this just puts some stuff that is currently all Python based into the bin dir. In the future there might be some actual compilation.

  4. Test the code: run 'python allTests.py' or 'make test'.

##Running and examining a jobTree script

The following walks through running a jobTree script and using the command-line tools jobTreeStatus, jobTreeRun and jobTreeStats, which are used, respectively, to analyse the status of a run, to restart a run and to print performance statistics about a run.

Once jobTree is installed, running a jobTree script is performed by executing the script from the command-line, e.g. (using the file sorting toy example in tests/sort/scriptTreeTest_Sort.py):

[]$ scriptTreeTest_Sort.py --fileToSort foo --jobTree bar/jobTree --batchSystem parasol --logLevel INFO --stats

This example uses the parasol batch system and INFO-level logging; foo is the file to sort and bar/jobTree is the location of a directory (which should not already exist) from which the batch will be managed. Details of the jobTree options are described below; the --stats option is used to gather statistics about the jobs in a run.

The script will return a zero exit value if the jobTree system successfully runs to completion; otherwise it will raise an exception. If the script fails because a job failed, the log information of that job is reported to standard error. The jobTree directory (here 'bar/jobTree') is not automatically deleted, regardless of success or failure, and contains a record of the jobs run, which can be queried using the jobTreeStatus command, e.g.

[]$ jobTreeStatus bar/jobTree --verbose

There are 0 active jobs, 0 parent jobs with children, 0 totally failed jobs and 0 empty jobs (i.e. finished but not cleaned up) currently in job tree: jobTree
There are no failed jobs to report

If a job failed, this provides a convenient way to reprint the error. The following are the important options to jobTreeStatus:

--jobTree=JOBTREE     Directory containing the job tree. The jobTree location can also be specified as the argument to the script. default=./jobTree
--verbose             Print loads of information, particularly all the log
                    files of jobs that failed. default=False
--failIfNotComplete   Return exit value of 1 if job tree jobs not all
                    completed. default=False
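
For example, --failIfNotComplete makes jobTreeStatus usable as a completion check from another program. Below is a minimal sketch of this; it is an illustration only, not part of jobTree, and it assumes jobTreeStatus is on your PATH and reuses the bar/jobTree directory from above:

import subprocess

# jobTreeStatus exits with 0 when all jobs have completed; with
# --failIfNotComplete it exits with 1 if any jobs remain or have failed.
returnCode = subprocess.call(["jobTreeStatus", "bar/jobTree",
                              "--failIfNotComplete", "--verbose"])
if returnCode != 0:
    raise RuntimeError("The jobTree in bar/jobTree did not complete")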

If a job in the script failed or the system went down, you may wish to retry the job after fixing the error. This can be achieved with the jobTreeRun command, which restarts an existing jobTree, e.g.:

[]$ jobTreeRun --jobTree bar/jobTree --logLevel INFO

It will always attempt to restart the jobs from the previous point of failure.

If the script was run with the --stats option then jobTreeStats can be run on the pipeline to generate information about the performance of the run, in terms of how many jobs were run, how long they ran for and how much CPU time/wait time was involved, e.g.:

[]$ jobTreeStats bar/jobTree

Batch System: singleMachine
Default CPU: 1  Default Memory: 2097152K
Job Time: 0.50  Max CPUs: 9.22337e+18  Max Threads: 4
Total Clock: 0.09  Total Runtime: 7.60
Slave
    Count |                                    Time* |                                    Clock |                                     Wait |                                               Memory 
        n |      min    med*     ave     max   total |      min     med     ave     max   total |      min     med     ave     max   total |        min       med       ave       max       total 
      365 |     0.01    0.02    0.02    0.06    6.82 |     0.01    0.01    0.01    0.04    4.71 |     0.00    0.00    0.01    0.03    2.11 |   9781248K 13869056K 13799121K 14639104K 5036679168K
Target
 Slave Jobs   |     min    med    ave    max
              |       2      2      2      2
    Count |                                    Time* |                                    Clock |                                     Wait |                                               Memory 
        n |      min    med*     ave     max   total |      min     med     ave     max   total |      min     med     ave     max   total |        min       med       ave       max       total 
      367 |     0.00    0.00    0.00    0.03    0.68 |     0.00    0.00    0.00    0.01    0.42 |     0.00    0.00    0.00    0.03    0.26 |   9461760K 13869056K 13787694K 14639104K 5060083712K
 Cleanup
    Count |                                    Time* |                                    Clock |                                     Wait |                                               Memory 
        n |      min    med*     ave     max   total |      min     med     ave     max   total |      min     med     ave     max   total |        min       med       ave       max       total 
        1 |     0.00    0.00    0.00    0.00    0.00 |     0.00    0.00    0.00    0.00    0.00 |     0.00    0.00    0.00    0.00    0.00 |  14639104K 14639104K 14639104K 14639104K   14639104K
 Up
    Count |                                    Time* |                                    Clock |                                     Wait |                                               Memory 
        n |      min    med*     ave     max   total |      min     med     ave     max   total |      min     med     ave     max   total |        min       med       ave       max       total 
      124 |     0.00    0.00    0.00    0.01    0.15 |     0.00    0.00    0.00    0.01    0.12 |     0.00    0.00    0.00    0.01    0.03 |  13713408K 14090240K 14044985K 14581760K 1741578240K
 Setup
    Count |                                    Time* |                                    Clock |                                     Wait |                                               Memory 
        n |      min    med*     ave     max   total |      min     med     ave     max   total |      min     med     ave     max   total |        min       med       ave       max       total 
        1 |     0.00    0.00    0.00    0.00    0.00 |     0.00    0.00    0.00    0.00    0.00 |     0.00    0.00    0.00    0.00    0.00 |   9551872K  9551872K  9551872K  9551872K    9551872K
 Down
    Count |                                    Time* |                                    Clock |                                     Wait |                                               Memory 
        n |      min    med*     ave     max   total |      min     med     ave     max   total |      min     med     ave     max   total |        min       med       ave       max       total 
      241 |     0.00    0.00    0.00    0.03    0.53 |     0.00    0.00    0.00    0.00    0.30 |     0.00    0.00    0.00    0.03    0.23 |   9461760K 13828096K 13669354K 14155776K 3294314496K

The breakdown is given per "slave", which is a unit of serial execution, and per "target", which corresponds to a scriptTree target (see below). Despite its simplicity, we've found this can be very useful for tracking down performance issues, particularly when trying out a pipeline on a new system.

The important arguments to jobTreeStats are:

--outputFile=OUTPUTFILE
                    File in which to write results
--raw                 output the raw xml data.
--pretty, --human     if not raw, prettify the numbers to be human readable.
--categories=CATEGORIES
                    comma separated list from [time, clock, wait, memory]
--sortCategory=SORTCATEGORY
                    how to sort Target list. may be from [alpha, time,
                    clock, wait, memory, count]. default=%(default)s
--sortField=SORTFIELD
                    how to sort Target list. may be from [min, med, ave,
                    max, total]. default=%(default)s
--sortReverse, --reverseSort
                    reverse sort order.
--cache               stores a cache to speed up data display.

##jobTree options

A jobTree script will have the following command-line options.

Options that control logging.

--logOff            Turn off logging. (default is CRITICAL)
--logInfo           Turn on logging at INFO level. (default is CRITICAL)
--logDebug          Turn on logging at DEBUG level. (default is CRITICAL)
--logLevel=LOGLEVEL
                    Log at level (may be either OFF/INFO/DEBUG/CRITICAL).
                    (default is CRITICAL)
--logFile=LOGFILE   File to log in
--rotatingLogging   Turn on rotating logging, which prevents log files
                    getting too big.

Options to specify the location of the jobTree and turn on stats collation about the performance of jobs.

--jobTree=JOBTREE   Directory in which to place job management files and
                    the globally accessed temporary file directories (this
                    needs to be globally accessible by all machines
                    running jobs). If you pass an existing directory it
                    will check if it's a valid existing job tree, then try
                    and restart the jobs in it. The default=./jobTree
--stats             Records statistics about the job-tree to be used by
                    jobTreeStats. default=False

Options for specifying the batch system, and arguments to the batch system/big batch system (see below).

--batchSystem=BATCHSYSTEM
                    The type of batch system to run the job(s) with,
                    currently can be
                    'singleMachine'/'parasol'/'acidTest'/'gridEngine'/'lsf'.
                    default=singleMachine
--maxThreads=MAXTHREADS
                    The maximum number of threads (technically processes
                    at this point) to use when running in single machine
                    mode. Increasing this will allow more jobs to run
                    concurrently when running on a single machine.
                    default=4
--parasolCommand=PARASOLCOMMAND
                    The command to run the parasol program default=parasol

Options to specify default cpu/memory requirements (if not
specified by the jobs themselves), and to limit the total amount of
memory/cpu requested from the batch system.

--defaultMemory=DEFAULTMEMORY
                    The default amount of memory to request for a job (in
                    bytes), by default is 2^31 = 2 gigabytes,
                    default=2147483648
--defaultCpu=DEFAULTCPU
                    The default number of cpus to dedicate to a job.
                    default=1
--maxCpus=MAXCPUS   The maximum number of cpus to request from the batch
                    system at any one time. default=9223372036854775807
--maxMemory=MAXMEMORY
                    The maximum amount of memory to request from the batch
                    system at any one time. default=9223372036854775807

Options for rescuing/killing/restarting jobs, includes options for jobs that either run too long/fail or get lost (some batch systems have issues!).

--retryCount=RETRYCOUNT
                    Number of times to retry a failing job before giving
                    up and labeling job failed. default=0
--maxJobDuration=MAXJOBDURATION
                    Maximum runtime of a job (in seconds) before we kill
                    it (this is a lower bound, and the actual time before
                    killing the job may be longer).
                    default=9223372036854775807
--rescueJobsFrequency=RESCUEJOBSFREQUENCY
                    Period of time to wait (in seconds) between checking
                    for missing/overlong jobs, that is jobs which get lost
                    by the batch system. Expert parameter. (default is set
                    by the batch system)

jobTree big batch system options; jobTree can employ a secondary batch system for running large memory/cpu jobs using the following arguments.

--bigBatchSystem=BIGBATCHSYSTEM
                    The batch system to run for jobs with larger
                    memory/cpus requests, currently can be
                    'singleMachine'/'parasol'/'acidTest'/'gridEngine'.
                    default=none
--bigMemoryThreshold=BIGMEMORYTHRESHOLD
                    The memory threshold above which to submit to the big
                    queue. default=9223372036854775807
--bigCpuThreshold=BIGCPUTHRESHOLD
                    The cpu threshold above which to submit to the big
                    queue. default=9223372036854775807
--bigMaxCpus=BIGMAXCPUS
                    The maximum number of big batch system cpus to allow
                    at one time on the big queue.
                    default=9223372036854775807
--bigMaxMemory=BIGMAXMEMORY
                    The maximum amount of memory to request from the big
                    batch system at any one time.
                    default=9223372036854775807

Miscellaneous options.

--jobTime=JOBTIME   The approximate time (in seconds) that you'd like a
                    list of child jobs to be run serially before being
                    parallelized. This parameter allows one to avoid over
                    parallelizing tiny jobs, and therefore paying
                    significant scheduling overhead, by running tiny jobs
                    in series on a single node/core of the cluster.
                    default=30
--maxLogFileSize=MAXLOGFILESIZE
                    The maximum size of a job log file to keep (in bytes),
                    log files larger than this will be truncated to the
                    last X bytes. Default is 50 kilobytes, default=50120
--command=COMMAND   The command to run (which will generate subsequent
                    jobs). This is deprecated

##Overview of jobTree

The following sections are for people creating jobTree scripts and as general information. The presentation docs/jobTreeSlides.pdf is also quite a useful, albeit slightly out of date, guide to using jobTree.

Most batch systems (such as LSF, Parasol, etc.) do not allow jobs to spawn other jobs in a simple way.

The basic pattern provided by jobTree is as follows (a minimal code sketch of the pattern is given after the list):

  1. You have a job running on your cluster which requires further parallelisation.
  2. You create a list of jobs to perform this parallelisation. These are the 'child' jobs of your process; we call them collectively the 'children'.
  3. You create a 'follow-on' job, to be performed after all the children have successfully completed. This job is responsible for cleaning up the input files created for the children and doing any further processing. Children should not clean up files created by parents, in case a batch system failure requires the child to be re-run (see 'Atomicity' below).
  4. You end your current job successfully.
  5. The batch system runs the children. These jobs may in turn have children and follow-on jobs.
  6. Upon completion of all the children (and children's children and follow-ons, collectively descendants) the follow-on job is run. The follow-on job may create more children.
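
To make the pattern concrete, here is a minimal sketch of a parent job written with scriptTree (described in the next section); ProcessChunk and MergeResults are invented names used purely for illustration:

from jobTree.scriptTree.target import Target

class SplitWork(Target):
    """A job that parallelises its work (steps 1-4 of the pattern above)."""
    def __init__(self, chunks):
        Target.__init__(self)
        self.chunks = chunks

    def run(self):
        # Step 2: create a child job for each piece of work.
        for chunk in self.chunks:
            self.addChildTarget(ProcessChunk(chunk))
        # Step 3: a follow-on job, run only after every child (and all of
        # the children's descendants) has completed successfully.
        self.setFollowOnTarget(MergeResults(self.chunks))
        # Step 4: returning from run() ends this job successfully; the
        # batch system then runs the children (steps 5 and 6).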

##scriptTree

ScriptTree provides a Python interface to jobTree, and is now the only way to interface with jobTree (previously you could manipulate XML files, but I've removed that functionality as I improved the underlying system).

Aside from being the interface to jobTree, scriptTree was designed to remediate some of the pain of writing wrapper scripts for cluster jobs, via the extension of a simple Python wrapper class (called a 'Target' to avoid confusion with the more general use of the word 'job'), which does much of the work for you. Using scriptTree, you can describe your script as a series of these classes that link together, with all the arguments and options specified in one place. The script then, using the magic of Python pickles, generates all the wrappers dynamically and cleans them up when done.

This inherited template pattern has the following advantages:

  1. You write (potentially) just one script, not a series of wrappers. It is much easier to understand, maintain, document and explain.
  2. You write less boilerplate.
  3. You can organise all the input arguments and options in one place.

The best way to learn how to use scriptTree is to look at an example. The following is taken from (an old version of) jobTree.test.sort.scriptTreeTest_Sort.py, which provides a complete script for performing a parallel merge sort.

Below is the first 'Target' of this script, inherited from the base class 'jobTree.scriptTree.Target'. Its job is to set up the merge sort.

class Setup(Target):
    """Sets up the sort.
    """
    def __init__(self, inputFile, N):
        Target.__init__(self, time=1, memory=1000000, cpu=1)
        self.inputFile = inputFile
        self.N = N
    
    def run(self):
        # Temporary file, in the chain's global temp dir, for the sorted output.
        tempOutputFile = getTempFile(rootDir=self.getGlobalTempDir())
        # Child target that recursively splits and sorts the input file.
        self.addChildTarget(Down(self.inputFile, 0, os.path.getsize(self.inputFile), self.N, tempOutputFile))
        # Follow-on target, run after all of Down's descendants have completed.
        self.setFollowOnTarget(Cleanup(tempOutputFile, self.inputFile))

The constructor (__init__()) assigns some variables to the class. When invoking the constructor of the base class (which should be the first thing the target does), you can optionally pass time (in seconds), memory (in bytes) and cpu parameters. The time parameter is your estimate of how long the target will run - UPDATE: IT IS CURRENTLY UNUSED BY THE SCHEDULER. The memory and cpu parameters allow you to guarantee resources for a target.

The run method is where the variables assigned by the constructor are used and where in general actual work is done. Aside from doing the specific work of the target (in this case creating a temporary file to hold some intermediate output), the run method is also where children and a follow-on job are created, using addChildTarget() and setFollowOnTarget(). A job may have arbitrary numbers of children, but one or zero follow-on jobs.

Targets are also provided with two temporary file directories called localTempDir and globalTempDir, which can be accessed with the methods getLocalTempDir() and getGlobalTempDir(), respectively. The localTempDir is the path to a temporary directory that is local to the machine on which the target is being executed and that will exist only for the length of the run method. It is useful for storing interim results that are computed during runtime. All files in this directory are guaranteed to be removed once the run method has finished - even if your target crashes.

A job can either be created as a follow-on, or it can be the very first job, or it can be created as a child of another job. Let a job not created as a follow-on be called a 'founder'. Each founder job may have a follow-on job; if it has a follow-on job, this follow-on job may in turn have a follow-on, and so on. Thus each founder job defines a chain of follow-ons. Let a founder job and its maximal sequence of follow-ons be called a 'chain', and let the last follow-on job in a chain be called the chain's 'closer'. For each chain of targets a temporary directory, globalTempDir, is created immediately prior to calling the founder target's run method; this directory and its contents then persist until the completion of the closer target's run method. Thus the globalTempDir is a scratch directory in which temporary results can be stored on disk between target jobs in a chain. Furthermore, files created in this directory can be passed to the children of target jobs in the chain, allowing results to be transmitted from a target job to its children.
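
As a small illustration of the two directories, here is a sketch only; CountLines and Summarise are invented names, not part of the sort example:

import os, shutil
from jobTree.scriptTree.target import Target

class CountLines(Target):
    """Counts the lines of a file and passes the count on to a child target."""
    def __init__(self, inputFile):
        Target.__init__(self)
        self.inputFile = inputFile

    def run(self):
        # localTempDir: scratch space on the machine running this target,
        # removed automatically once this run() method returns.
        scratchCopy = os.path.join(self.getLocalTempDir(), "copy.txt")
        shutil.copy(self.inputFile, scratchCopy)
        lineCount = sum(1 for line in open(scratchCopy))
        # globalTempDir: persists for the whole chain of targets, so a file
        # written here can be read by this target's children.
        countFile = os.path.join(self.getGlobalTempDir(), "lineCount.txt")
        fileHandle = open(countFile, "w")
        fileHandle.write("%i\n" % lineCount)
        fileHandle.close()
        self.addChildTarget(Summarise(countFile))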

##Making Functions into Targets

To avoid the need to create a Target class for every job, I've added the ability to wrap functions; hence the code for the Setup class described above becomes:

def setup(target, inputFile, N):
    """Sets up the sort.
    """
    tempOutputFile = getTempFile(rootDir=target.getGlobalTempDir())
    target.addChildTargetFn(down, (inputFile, 0, os.path.getsize(inputFile), N, tempOutputFile))
    target.setFollowOnFn(cleanup, (tempOutputFile, inputFile))

The code to turn this into a target uses the static method Target.makeTargetFnTarget:

Target.makeTargetFnTarget(setup, (fileToSort, N))

Notice that the child and follow-on targets have also been refactored as functions, hence the methods addChildTargetFn and setFollowOnFn, which take functions as opposed to Target objects.

Note that there are two types of functions you can wrap - target functions, whose first argument must be the wrapping target object (the setup function above is an example of a target function), and plain functions that do not have a reference to the wrapping target.
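
Here is a short sketch showing both kinds of wrapped function, using only the wrapping methods mentioned above (compressChunk, reportDone and setupChunks are invented names; check your jobTree version for the full set of wrapping methods):

import os
from jobTree.scriptTree.target import Target

def compressChunk(target, inputFile, outputFile):
    """A 'target function': its first argument is the wrapping target, so it
    can use the target API (temp dirs, further children, follow-ons)."""
    tempFile = os.path.join(target.getLocalTempDir(), "work.tmp")
    # ... do the real work using tempFile, then write outputFile ...

def reportDone(message):
    """A 'plain function': it gets no reference to the wrapping target and
    sees only the arguments it is passed."""
    print message

def setupChunks(target, inputFiles):
    """Wraps one child per input file, plus a follow-on that runs last."""
    for inputFile in inputFiles:
        target.addChildTargetFn(compressChunk, (inputFile, inputFile + ".gz"))
    target.setFollowOnFn(reportDone, ("all chunks compressed",))

# The root of such a pipeline would then be created with, for example:
# Stack(Target.makeTargetFnTarget(setupChunks, (inputFiles,))).startJobTree(options)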

##Creating a scriptTree script:

ScriptTree targets are serialized (written to and retrieved from disk) so that they can be executed in parallel on a cluster of different machines. Thankfully, this is mostly transparent to the user, except for the fact that targets must be 'pickled' (see the Python docs), which creates a few constraints upon what can and cannot be passed to and stored by a target. Currently the preferred way to run a pipeline is to create an executable Python script. For example, see tests/sort/scriptTreeTest_Sort.py.

The first line to notice is:

from jobTree.scriptTree.target import Target, Stack

This imports the Target and Stack objects (the stack object is used to run the targets).

Most of the code defines a series of targets (see above). The main() method is where the script is set up and run.

The line:

    parser = OptionParser()

Creates an options parser using the python module optparse.

The line:

    Stack.addJobTreeOptions(parser)

Adds the jobTree options to the parser. Most importantly it adds the command line options "--jobTree [path to jobTree]", which is the location in which the jobTree will be created, and which must be supplied to the script.

The subsequent lines parse the input arguments, notably the line:

    options, args = parser.parse_args()

reads in the input parameters.

The line:

    i = Stack(Setup(options.fileToSort, int(options.N))).startJobTree(options)

is where the first target is created (the Setup target shown above), where a Stack object is created (it is passed the first target as its sole construction argument), and where the jobTree is executed, using the stack's startJobTree(options) function. The 'options' argument contains a dictionary of the command-line arguments used by jobTree. The return value of this function is equal to the number of failed targets; in this case we choose to throw an exception if there are any remaining.

One final important detail, the lines:

    if __name__ == '__main__':
        from jobTree.test.sort.scriptTreeTest_Sort import *

reload the objects in the module, such that their module names will be absolute (this is necessary for the serialization that is used). Targets in other classes that are imported do not need to be reloaded in this way.
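
Putting these pieces together, a minimal executable script looks roughly like the sketch below. It reuses the Setup target shown earlier; mySortScript is a placeholder for whatever module the script actually lives in (the final import must name that module, as in the sort example):

from optparse import OptionParser
from jobTree.scriptTree.target import Target, Stack

def main():
    # Create an option parser and add both the jobTree options and any
    # options specific to this pipeline.
    parser = OptionParser()
    Stack.addJobTreeOptions(parser)
    parser.add_option("--fileToSort", dest="fileToSort")
    parser.add_option("--N", dest="N", default="10000")
    options, args = parser.parse_args()

    # Build the first target, wrap it in a Stack and run the jobTree.
    # startJobTree returns the number of targets that failed.
    failedTargets = Stack(Setup(options.fileToSort, int(options.N))).startJobTree(options)
    if failedTargets != 0:
        raise RuntimeError("%i targets failed" % failedTargets)

if __name__ == '__main__':
    # Re-import this module so that the target classes have absolute module
    # names, which the pickling of targets requires.
    from mySortScript import *
    main()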

##Atomicity

jobTree and scriptTree are designed to be robust, so that individual jobs (targets) can fail and be subsequently restarted. It is assumed jobs can fail at any point. Thus, until jobTree knows your children have completed okay, you cannot assume that your Target has completed. To ensure that your pipeline can be restarted after a failure, make sure that every job (target) obeys the following (a short sketch is given after the list):

  1. Never cleans up / alters its own input files. Instead, parents and follow-on jobs may clean up the files of children or prior jobs.

  2. Can be re-run from just its input files any number of times. A job should only depend on its input, and it should be possible to run the job as many times as desired, essentially until news of its completion is successfully transmitted to the job tree master process.

    These two properties are the key to job atomicity. Additionally, you'll find it much easier if a job:

  3. Only creates temp files in the two provided temporary file directories. This ensures we don't soil the cluster's disks.

  4. Logs sensibly, so that error messages can be transmitted back to the master and the pipeline can be successfully debugged.
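
The following sketch shows one way a target can respect properties 1-3: it never touches its input file and does all of its work against a temporary file, so it can be re-run safely if the batch system loses or kills it (FilterRecords and keepRecord are invented names for illustration; this is not part of jobTree):

import os, shutil
from jobTree.scriptTree.target import Target

class FilterRecords(Target):
    """Writes the records of inputFile that pass a filter to outputFile.
    The target never modifies its input, so it can safely be re-run any
    number of times if the batch system loses or kills it."""
    def __init__(self, inputFile, outputFile):
        Target.__init__(self)
        self.inputFile = inputFile
        self.outputFile = outputFile

    def run(self):
        # Do all the work against a file in the local temp dir (property 3):
        # if this job dies part way through, nothing outside the temp
        # directories has been altered.
        tempOutput = os.path.join(self.getLocalTempDir(), "filtered.txt")
        outHandle = open(tempOutput, "w")
        for line in open(self.inputFile):
            if keepRecord(line):
                outHandle.write(line)
        outHandle.close()
        # Only as the final step is the result copied to its destination;
        # re-running the whole job simply repeats this copy (property 2).
        shutil.copy(tempOutput, self.outputFile)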

##Environment

jobTree replicates the environment in which jobTree or scriptTree is invoked and provides this environment to all the jobs/targets. This ensures uniformity of the environment variables for every job.

##FAQ's:

  • How robust is jobTree to failures of nodes and/or the master?

    JobTree checkpoints its state on disk, so that it or the job manager can be wiped out and restarted. There is some gnarly test code to show how this works; it will keep crashing everything at random points, but eventually everything will complete okay. As a user you needn't worry about any of this, but your child jobs must be atomic (as with all batch systems), and must follow the convention regarding input files.

  • How scalable is it?

    We have tested having 1000 concurrent jobs running on our cluster. This will depend on the underlying batch system being used.

  • Can you support the XYZ batch system?

    See the abstract base class 'AbstractBatchSystem' in the code to see what functions need to be implemented. It's reasonably straightforward.

  • Is there an API for the jobTree top level commands?

    Not really - at this point please use scriptTree and the few command line utilities present in the bin directory.

  • Why am I getting the error "ImportError: No module named etree.ElementTree"?

    The version of python in your path is less than 2.5. When jobTree spawns a new job it will use the python found in your PATH. Make sure that the first python in your PATH points to a python version greater than or equal to 2.5 but less than 3.0

jobtree's People

Contributors

adamnovak, benedictpaten, cvaske, dentearl, dzerbino, glennhickey, ifiddes, joelarmstrong


jobtree's Issues

Relative imports don't always work in jobTree

Per our conversation, jobTree sometimes loses track of where imports are coming from.

Reporting file: /hive/users/dearl/alignathon/testPSAR/jobTree_flies_reg2_swarm/jobs/tmp_IqnDTgr8uv/tmp_AzbE0HYFWt/tmp_70td82ax1K/log.txt
log.txt:        Parsed arguments and set up logging
log.txt:        Traceback (most recent call last):
log.txt:          File "/cluster/home/dearl/sonTrace/jobTree/bin/jobTreeSlave", line 206, in main
log.txt:            loadStack(command).execute(job=job, stats=stats,
log.txt:          File "/cluster/home/dearl/sonTrace/jobTree/bin/jobTreeSlave", line 53, in loadStack
log.txt:            _temp = __import__(moduleName, globals(), locals(), [className], -1)
log.txt:        ImportError: No module named batchPsar
log.txt:        Exiting the slave because of a failed job on host kkr18u44.local
log.txt:        Finished running the chain of jobs on this node, we ran for a total of 5.177906 seconds

Requires explicit setting of PYTHONPATH when running on swarm; when running on kolossus, simply running from the same directory as the script is sufficient.

When do the tests end?

The basic make test takes ages... On a singleMachine I see a few dozen processes, all of them doing strictly nothing. Does this mean that some kind of race condition was hit? Or that the test was designed to test me?

jobTree support for argparse

optparse is deprecated; argparse is the replacement. This can be patched by checking the __class__ of the input in the addOptions() function contained in jobTree.src.jobTreeRun, BUT it will also need to be changed in sonLib.bioio addLoggingOptions(). I think the way jobTree and sonLib have used optparse is generic enough that the transition would be transparent to script writers.
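
A rough sketch of the kind of dispatch the issue describes, written against the interfaces of the two standard-library modules (the real addOptions() in jobTree.src.jobTreeRun may be organised differently):

import optparse
import argparse

def addOptions(parser):
    """Add the jobTree options to either an optparse or an argparse parser."""
    if isinstance(parser, argparse.ArgumentParser):
        addOption = parser.add_argument
        defaultToken = "%(default)s"   # argparse's default placeholder
    elif isinstance(parser, optparse.OptionParser):
        addOption = parser.add_option
        defaultToken = "%default"      # optparse's default placeholder
    else:
        raise RuntimeError("Unrecognised option parser: %s" % parser.__class__)
    addOption("--jobTree", dest="jobTree", default="./jobTree",
              help="Directory containing the job tree. default=" + defaultToken)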

periodically hangs entire parasol hub when listing jobs

This has been a problem for a while, but I'm just putting an issue up so I remember to fix this somehow.

When parasol has more than a million or so jobs queued, like now, the periodic "parasol -extended list jobs" command that jobTree runs hangs the entire parasol hub process for a couple of minutes while it gets a listing of every job. This sucks, since it means the cluster nodes start to go idle waiting for work: the hub can't issue new jobs while it's busy sending the list of queued jobs to jobTree. This gets even worse when there are a few jobTrees running; the cluster sometimes sits completely idle for several minutes.

We (read: I) should try to find some way around listing every job, maybe by looking to see if there's a way we can get the same information, but limited to just the jobTree batch rather than all batches. If there isn't a way currently, maybe modify parasol to include that functionality.

argparse functionality broken by 76a4328

Commit 76a4328 broke argparse functionality by failing to maintain the new method of displaying default values in help strings. The old, optparse method is to use

    help="blah blah blah default=%default"

and the argparse method is to use

    help="blah blah blah default=%(default)s"

The two methods are incompatible with one another, which was why we had two methods to handle this.


errors in "make test"

Hi Benedict,

I got some error messages when I ran "make test"


Starting to create the job tree setup for the first time
Traceback (most recent call last):
  File "/projects/gec/tool/assemblathon1/jobTree/bin/scriptTreeTest_Sort.py", line 119, in <module>
    main()
  File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/test/sort/scriptTreeTest_Sort.py", line 112, in main
    i = Stack(Setup(options.fileToSort, int(options.N))).startJobTree(options)
  File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/scriptTree/stack.py", line 112, in startJobTree
    config, batchSystem = createJobTree(options)
  File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/src/jobTreeRun.py", line 221, in createJobTree
    batchSystem = loadTheBatchSystem(config)
  File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/src/jobTreeRun.py", line 157, in loadTheBatchSystem
    batchSystem = GridengineBatchSystem(config)
  File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/batchSystems/gridengine.py", line 136, in __init__
    self.obtainSystemConstants()
  File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/batchSystems/gridengine.py", line 267, in obtainSystemConstants
    p = subprocess.Popen(["qhost"], stdout = subprocess.PIPE,stderr = subprocess.STDOUT)
  File "/usr/lib64/python2.6/subprocess.py", line 595, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1106, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
E
======================================================================
ERROR: Uses the jobTreeTest code to test the scriptTree Target wrapper.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/test/scriptTree/scriptTreeTest.py", line 31, in testScriptTree_Example
    system(command)
  File "/gpfs2/projects/gec/tool/assemblathon1/sonLib/bioio.py", line 160, in system
    raise RuntimeError("Command: %s exited with non-zero status %i" % (command, i))
RuntimeError: Command: scriptTreeTest_Wrapper.py --jobTree /gpfs2/projects/gec/tool/assemblathon1/jobTree/tmp_KEwg5sdOhb/jobTree --logLevel=INFO --retryCount=10 exited with non-zero status 256

======================================================================
ERROR: Tests that the global and local temp dirs of a job behave as expected.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/test/scriptTree/scriptTreeTest.py", line 39, in testScriptTree_Example2
    system(command)
  File "/gpfs2/projects/gec/tool/assemblathon1/sonLib/bioio.py", line 160, in system
    raise RuntimeError("Command: %s exited with non-zero status %i" % (command, i))
RuntimeError: Command: scriptTreeTest_Wrapper2.py --jobTree /gpfs2/projects/gec/tool/assemblathon1/jobTree/tmp_GKficygzKn/jobTree --logLevel=INFO --retryCount=0 exited with non-zero status 256

======================================================================
ERROR: Tests the jobTreeStats utility using the scriptTree_sort example.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/gpfs2/projects/gec/tool/assemblathon1/jobTree/test/utilities/statsTest.py", line 35, in testJobTreeStats_SortSimple
    system(command)
  File "/gpfs2/projects/gec/tool/assemblathon1/sonLib/bioio.py", line 160, in system
    raise RuntimeError("Command: %s exited with non-zero status %i" % (command, i))
RuntimeError: Command: scriptTreeTest_Sort.py --jobTree /gpfs2/projects/gec/tool/assemblathon1/jobTree/tmp_rZqK92206Q/jobTree --logLevel=DEBUG --fileToSort=/gpfs2/projects/gec/tool/assemblathon1/jobTree/tmp_rZqK92206Q/tmp_wj73HRVMZx --N 1000 --stats --jobTime 0.5 exited with non-zero status 256

----------------------------------------------------------------------
Ran 14 tests in 318.742s

FAILED (errors=3)

Could I ask what happened?
Thank you.

Regards,

Yun

Should make package pip installable

Obviously - though consideration needs to be given to the command-line binaries; this could possibly be managed via the Python path, removing the command-line binaries.

entire jobTree crashes when a job's pickle file isn't written correctly

I just had several jobTrees fail (presumably due to a filesystem problem) with this error:

Batch system is reporting that the job (1, 298079848) /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t1/t0/job failed with exit value 256
No log file is present, despite job failing: /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t1/t0/job
Due to failure we are reducing the remaining retry count of job /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t1/t0/job to 3
We have set the default memory of the failed job to 8589934593.0 bytes
Batch system is reporting that the job (1, 298079849) /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t2/t0/job failed with exit value 256
No log file is present, despite job failing: /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t2/t0/job
Due to failure we are reducing the remaining retry count of job /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t3/t1/t0/t2/t0/job to 3
We have set the default memory of the failed job to 8589934593.0 bytes
Batch system is reporting that the job (1, 298079823) /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t2/t3/t1/t2/job failed with exit value 256
There was a .new file for the job and no .updating file /hive/users/jcarmstr/cactusStuff/phylogenyTests/glires/work-noRescue2/jobTree/jobs/t2/t2/t2/t2/t3/t1/t2/job
Traceback (most recent call last):
  File "/cluster/home/jcarmstr/progressiveCactus/submodules/cactus/bin/cactus_progressive.py", line 239, in <module>
    main()
  File "/cluster/home/jcarmstr/progressiveCactus/submodules/cactus/progressive/cactus_progressive.py", line 235, in main
    Stack(RunCactusPreprocessorThenProgressiveDown(options, args)).startJobTree(options)
  File "/cluster/home/jcarmstr/progressiveCactus/submodules/jobTree/scriptTree/stack.py", line 95, in startJobTree
    return mainLoop(config, batchSystem)
  File "/cluster/home/jcarmstr/progressiveCactus/submodules/jobTree/src/master.py", line 454, in mainLoop
    processFinishedJob(jobID, result, updatedJobFiles, jobBatcher, childJobFileToParentJob, childCounts, config)
  File "/cluster/home/jcarmstr/progressiveCactus/submodules/jobTree/src/master.py", line 270, in processFinishedJob
    job = Job.read(jobFile)
  File "/cluster/home/jcarmstr/progressiveCactus/submodules/jobTree/src/job.py", line 39, in read
    job = _convertJsonJobToJob(pickler.load(fileHandle))
EOFError: EOF read where object expected

I haven't looked into jobTree internals in depth, but I think this is due to the job writing a completely blank pickle file (the t2/t3/t1/t2/job file has size 0). I'll look to see if we can be resilient to these types of errors and just retry the job if this happens. It wouldn't have helped in this particular case, since all jobs were failing in the same way, but presumably this could also happen if a single job is killed in the wrong way.
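
One possible shape for that resilience, written as a standalone sketch rather than a patch to jobTree's actual Job.read (readJobFileSafely is an invented helper name):

import cPickle as pickle

def readJobFileSafely(jobFile):
    """Returns the unpickled job, or None if the file is empty or truncated
    (for example because the slave died mid-write), so the caller can treat
    it as a failed job and let the normal retry machinery reissue it."""
    fileHandle = open(jobFile)
    try:
        try:
            return pickle.load(fileHandle)
        except EOFError:
            return None
    finally:
        fileHandle.close()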

Eliminate the cryptic use of colors to hide status

What does the use of colors, instead of descriptive English sentences, save you in a command-line cluster management program? Because it costs your users brain space to store your chromatic cypher.

Furthermore, "dead" is not a color.

job failures not correctly reducing retry count

Courtesy of Tim Sackton:

I ran into another issue with progressiveCactus where a job that is issued with too little memory to finish will lead to a loop of that job getting restarted, failing, and getting restarted again, without the retryCount (apparently) going down at all. So basically we get stuck in a loop where one job is constantly failing.

Am I misunderstanding something? It seems like retryCount should decrement each time the job fails, but you can see from the log here (https://gist.github.com/tsackton/03b1605c4e29762376f2) that the failing job is reissued several times but after the second and third failures the retry count is still at 5.

Is this a bug or an error in my code/understanding? It could easily be the latter....

Regardless of whether this is how retries are supposed to work, I was able to get past that error by doubling the memory the retry gets each time a job fails (see here: https://github.com/harvardinformatics/jobTree/blob/master/src/master.py#L79). Ideally I'd also be able to increase the amount of time a job requests, as that would be the other reason to get consistent failures, but I don't see how to do that, or even if it is possible.

Crash getting global temp dir when restarting jobtree

When I go to restart one of my jobTree scripts that uses the global and local temp directories, after some of its jobs have already run, I get a crash from the internal jobTree code complaining about some directories not existing.

Can the code be made to handle the lack of existence of those directories?

log.txt:    ---JOBTREE SLAVE OUTPUT LOG---
log.txt:    Traceback (most recent call last):
log.txt:      File "/cluster/home/anovak/.local/lib/python2.7/site-packages/jobTree-1.0-py2.7.egg/jobTree/src/jobTreeSlave.py", line 271, in main
log.txt:        defaultMemory=defaultMemory, defaultCpu=defaultCpu, depth=depth)
log.txt:      File "/cluster/home/anovak/.local/lib/python2.7/site-packages/jobTree-1.0-py2.7.egg/jobTree/scriptTree/stack.py", line 153, in execute
log.txt:        self.target.run()
log.txt:      File "/cluster/home/anovak/hive/sgdev/mhc/targets.py", line 244, in run
log.txt:        index_dir = sonLib.bioio.getTempFile(rootDir=self.getGlobalTempDir())
log.txt:      File "/cluster/home/anovak/.local/lib/python2.7/site-packages/jobTree-1.0-py2.7.egg/jobTree/scriptTree/target.py", line 103, in getGlobalTempDir
log.txt:        self.globalTempDir = self.stack.getGlobalTempDir()
log.txt:      File "/cluster/home/anovak/.local/lib/python2.7/site-packages/jobTree-1.0-py2.7.egg/jobTree/scriptTree/stack.py", line 129, in getGlobalTempDir
log.txt:        return getTempDirectory(rootDir=self.globalTempDir)
log.txt:      File "/cluster/home/anovak/.local/lib/python2.7/site-packages/sonLib-1.0-py2.7.egg/sonLib/bioio.py", line 457, in getTempDirectory
log.txt:        os.mkdir(rootDir)
log.txt:    OSError: [Errno 20] Not a directory: '/cluster/home/anovak/hive/sgdev/mhc/tree7/jobs/t2/t3/t1/t0/gTD0/tmp_Zss3uyl5X6/tmp_45OevDhWor/tmp_vxiVIbzGSw'
log.txt:    Exiting the slave because of a failed job on host ku-1-21.local
log.txt:    Due to failure we are reducing the remaining retry count of job /cluster/home/anovak/hive/sgdev/mhc/tree7/jobs/t2/t3/t1/t0/job to 0

restarting jobTree the parameter changes have no effect

When restarting a jobTree, parameter changes to the script (like changing the batch system) do not have any effect, because such parameters are overwritten when the state of the earlier jobTree is read. This is poor behaviour and should be fixed; at the least, warnings should be issued.
