Comments (4)
Can you be more clear? What software are you running? What is your environment? Your configuration? What is your precise use-case? In what way is it failing?
For Falcon, pypeflow already does what you seem to want. If any task fails, it will be re-run when pypeflow is re-invoked (fc_run/fc_unzip/etc). We could retry failed tasks without quitting, but that has never been an important use-case, since it's so easy to restart.
As for keeping partial results, you need to be very specific. We already partition the workflow. I see no reason why a given task cannot be restarted from scratch if it failed before. If a task is too large, you can control that yourself by altering `pa_daligner_option` or `pa_DBsplit_option`.
To put a runtime limit on a task, use `pwatcher_type=blocking` and specify the limits in the `submit` string. If you need different limits for different sections, you can specify your own variables (using ALL_CAPS) to be substituted into your own `submit` string. It's very flexible.
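As a sketch, a `.cfg` fragment along these lines combines both knobs. The section names follow the current pypeflow-style config layout, and the queue name, `h_rt` flag, and `MY_RUNTIME_LIMIT` variable are placeholders of mine, not defaults:

```ini
# Hypothetical fc_run.cfg fragment -- queue/limit names are placeholders.
[General]
pwatcher_type = blocking
# Smaller DB blocks mean smaller (shorter) daligner tasks:
pa_DBsplit_option = -x500 -s50

[job.defaults]
# MY_RUNTIME_LIMIT is a user-chosen ALL_CAPS variable, substituted below.
MY_RUNTIME_LIMIT = 24:00:00
submit = qsub -S /bin/bash -sync y -V -q ${JOB_QUEUE} \
         -l h_rt=${MY_RUNTIME_LIMIT} \
         -N ${JOB_ID} \
         -o "${STDOUT_FILE}" \
         -e "${STDERR_FILE}" \
         -pe smp ${NPROC} \
         "${CMD}"
```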
from pbbioconda.
We're running a variant of SGE, but that's not the point. In our environment, a 'qsub' command can fail for reasons that we do not control and which have nothing to do with PacBio's code. For a large assembly, it is impractical to restart pypeflow every time a qsub command fails. We have solved this problem for our use case by implementing a script aptly named 'qsub_with_retry'. It can detect whether a qsub command completed normally, and if not, it will try again, subject to limits that we control.
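The actual 'qsub_with_retry' script is not shown, but the idea can be sketched as a generic retry wrapper. The function name, attempt limit, and backoff policy below are assumptions, not the user's real implementation:

```shell
# Hedged sketch of a "qsub_with_retry"-style wrapper: run the given submit
# command and retry with exponential backoff until it exits 0 or the
# attempt limit is reached. Limits are controlled via environment variables.
retry_submit() {
    local max_tries=${MAX_TRIES:-5}
    local delay=${RETRY_DELAY:-10}
    local attempt=1
    while ! "$@"; do                        # run the submit command
        if [ "$attempt" -ge "$max_tries" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        sleep "$delay"
        delay=$((delay * 2))                # exponential backoff
        attempt=$((attempt + 1))
    done
}

# Usage (hypothetical): retry_submit qsub -S /bin/bash -sync y ... "${CMD}"
```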
As a hypothetical example, the task that creates the Dazzler database for the raw reads or corrected reads should first check to see if the database already exists in the directory where the database will be created. If it does, it should be deleted first before the FASTA files are loaded into it. If that task were restarted after the database was partially written, we would have two copies of the sequences loaded during the first attempt.
In the case of SMRT Link, restartability is easy to accomplish. Each task writes a small set of files in the task directory, and those files have the same name regardless of the function of the task. Thus, a task can be restarted by deleting all other files in the task directory before rerunning the 'qsub' command that accomplishes the task.
If the files written into a task directory at the beginning of a task for pb-assembly and pypeflow are the same for all tasks, then I can solve this problem for myself. Is that the case?
> As a hypothetical example, the task that creates the Dazzler database for the raw reads or corrected reads should first check to see if the database already exists in the directory where the database will be created. If it does, it should be deleted first before the FASTA files are loaded into it. If that task were restarted after the database was partially written, we would have two copies of the sequences loaded during the first attempt.
We already do that:
```
$ cat 0-rawreads/build/build_db.sh
#!/bin/bash
set -vex
echo "PBFALCON_ERRFILE=$PBFALCON_ERRFILE"
set -o pipefail
rm -f raw_reads.db .raw_reads.* # in case of re-run
...
```
All steps should be restartable on error. If you find one that is not, please let us know.
> If the files written into a task directory at the beginning of a task for pb-assembly and pypeflow are the same for all tasks, then I can solve this problem for myself. Is that the case?
Ok. I see what you're looking for. You want to delete everything yourself (since you don't trust us), aside from the files which you need to keep. So you want to know which files to keep.
You have not mentioned which `pwatcher_type` you use. (You have not even supplied your `.cfg`, which would be helpful.) I can explain how it works for `pwatcher_type=blocking`:
- pypeflow dumps some files into a run-directory:
  - `task.json`
  - `task.sh`
  - `run.sh`
  - `run-XXX.bash` (maybe, depending on `pwatcher_type`)
- pypeflow calls your "submit" command (e.g. `qsub`) on `pypeFLOW/pwatcher/mains/job_start.sh`. `job_start.sh` will be passed two environment variables by pypeflow:
  - `PYPEFLOW_JOB_START_SCRIPT` -- the generated `run-XXX.bash` script
  - `PYPEFLOW_JOB_START_TIMEOUT` -- a number

  `job_start.sh` will wait TIMEOUT seconds for the SCRIPT to exist. Then it will run that script.
  - Its purpose is to give qsub a definitely existing script, `job_start.sh`. (Generated files might be subject to filesystem latency. Many users have had latency problems with generated scripts.)
- The `run-XXX.bash` script will change to the correct run-directory and run `run.sh`.
  - Its purpose is to change to the run-directory, in case qsub/etc. did not. (Some users need this.)
- `run.sh` will run `task.sh` and touch `run.sh.done` when finished.
  - Its purpose is the creation of that sentinel file. (This indicates "success", not the finishing of qsub.)
- `task.sh` will run `python do_task task.json`.
  - Its purpose is to tell us something about the current machine on error, in case resources are being over-used.
- `do_task.py` will wait on input files in `task.json`, and on output files at the end.
  - This could be written in a different language someday.

(For `pwatcher_type=fs_based`, things are slightly different. But you can still rely on `run.sh`, notwithstanding filesystem latency.)
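The sentinel pattern described above can be sketched minimally. The wrapper function below is illustrative only (it is not pypeflow's actual code); it assumes just the `task.sh` and `run.sh.done` file names from the list:

```shell
# Illustrative sketch: run task.sh inside a run-directory and create the
# run.sh.done sentinel only if task.sh exits 0. This mirrors the run.sh
# behavior described above; it is not pypeflow's implementation.
run_with_sentinel() {
    local rundir=$1
    (
        cd "$rundir" || exit 1
        bash task.sh && touch run.sh.done   # sentinel marks success only
    )
}

# Demo with a mock run-directory:
rundir=$(mktemp -d)
echo 'exit 0' > "$rundir/task.sh"
run_with_sentinel "$rundir"
ls "$rundir"
```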
(One reason why pbsmrtpipe is simpler is that it's slow, so it tends to have fewer filesystem latency problems. Another is that it's used by a smaller set of users, so it hasn't encountered as many user problems as we have via GitHub interactions.)
Here is what we actually pass to `submit`:

```
About to submit: Node(0-rawreads/report)
Popen: '/bin/bash -C /localdisk/scratch/cdunn/repo/pypeFLOW/pwatcher/mains/job_start.sh >| /localdisk/scratch/cdunn/repo/FALCON-examples/run/synth0/0-rawreads/report/run-P0_report_19f0d0cd122fac952635bbb0e199e785.bash.stdout 2>| /localdisk/scratch/cdunn/repo/FALCON-examples/run/synth0/0-rawreads/report/run-P0_report_19f0d0cd122fac952635bbb0e199e785.bash.stderr'
```
With

```ini
#submit = bash -c ${JOB_SCRIPT} >| ${JOB_STDOUT} 2>| ${JOB_STDERR}
#submit = bash -c ${JOB_SCRIPT}
submit = qsub -S /bin/bash -sync y -V -q ${JOB_QUEUE} \
         -N ${JOB_ID} \
         -o "${STDOUT_FILE}" \
         -e "${STDERR_FILE}" \
         -pe smp ${NPROC} \
         "${CMD}"
```
we would pass something like:

```
About to submit: Node(0-rawreads/report)
Popen: 'qsub -S /bin/bash -sync y -V -q default7 \
  -N P0_report_ffbf6f8eb30c62d8c2e227e1625f5a1a \
  -o "/lustre/hpcprod/cdunn/repo/FALCON-examples/run/synth0/0-rawreads/report/run-P0_report_ffbf6f8eb30c62d8c2e227e1625f5a1a.bash.stdout" \
  -e "/lustre/hpcprod/cdunn/repo/FALCON-examples/run/synth0/0-rawreads/report/run-P0_report_ffbf6f8eb30c62d8c2e227e1625f5a1a.bash.stderr" \
  -pe smp 1 \
  "/localdisk/scratch/cdunn/repo/pypeFLOW/pwatcher/mains/job_start.sh"'
```
Yes, you need only those 4 files. If it's a problem that `run-XXX.bash` has an unpredictable name, we can change that. Or, you can actually skip that file, instead switching to the run-directory yourself and calling `run.sh`. So technically, you need only 3 files.

But you should avoid deleting `run-XXX.bash.stderr/out`. Those are actually stderr/stdout of the actual qsub call, as shown above. (pbsmrtpipe also writes both stderr/out and cluster.stderr/out into the run-directory, but the filenames are known, which helps in your case.)
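Given that file list, a manual task reset could be sketched as follows. The function name and keep-list are mine, not part of pypeflow; adjust the patterns if your version generates different files. Note that the `run.sh.done` sentinel is deliberately not kept, since it must be removed for the task to rerun:

```shell
# Hypothetical helper: wipe a run-directory back to its pre-run state,
# keeping only the generated inputs plus the qsub stdout/stderr logs
# (which should not be deleted, per the note above). Deletes run.sh.done
# and any partial outputs; does not descend into subdirectories.
reset_task_dir() {
    local rundir=$1
    find "$rundir" -maxdepth 1 -type f \
        ! -name 'task.json' ! -name 'task.sh' ! -name 'run.sh' \
        ! -name 'run-*.bash' \
        ! -name 'run-*.bash.stdout' ! -name 'run-*.bash.stderr' \
        -delete
}
```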
Just to avoid creating a negative impression -- I trust all the PacBio developers -- you all do a great job solving the assembly problem with this software. What I don't trust is our Linux cluster. It is highly reliable, but not enough for jobs that require 100,000 job submissions (or more).
I missed the code for creating a Dazzler database, and what you wrote above addresses my issue. I used a bad example -- my apologies.
I'm currently running a job that has about 700 Gbp of Sequel reads, and I'm occasionally restarting jobs -- so far, no problems. Our cluster is having some NFS issues, so I have been losing nodes with my tasks running on them, but every restart has worked (as far as not causing an error exit).