stjudecloud / workflows

Bioinformatics workflows developed for and used on the St. Jude Cloud project.

License: MIT License
black has a very good rationale for their stance on trailing commas, and I say we follow that example. The same goes for black's rule that short lists with few items should be collapsed to one line.

The common thread among these 3 rules is that they are all trivial to enforce via an auto-formatter tool (and tedious/annoying to enforce manually). Therefore, I vote we hold off on adopting these officially until that auto-formatter has reached fruition, mainly to keep our codebase compliant with our documents.
All of the other steps that use the GTF file seem to properly handle compressed input, but this step requires the file to be plain-text. It should handle compressed input appropriately.
Line 21 in f19e0d3
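One way to make the step tolerant of either form is to sniff the gzip magic bytes and decompress only when needed. Below is a hedged sketch of such a task; the task name, file names, and container image are illustrative, not the repository's actual code:

```wdl
version 1.0

# Hypothetical sketch of a compression-tolerant GTF step.
# Names (normalize_gtf, gene_model.gtf) are illustrative only.
task normalize_gtf {
    input {
        File gtf
    }
    command <<<
        # Detect gzip by its two magic bytes (1f 8b) rather than by extension
        if [ "$(head -c 2 ~{gtf} | od -An -tx1 | tr -d ' \n')" = "1f8b" ]; then
            gunzip -c ~{gtf} > gene_model.gtf
        else
            cp ~{gtf} gene_model.gtf
        fi
    >>>
    output {
        File plaintext_gtf = "gene_model.gtf"
    }
    runtime {
        docker: "ubuntu:22.04"
    }
}
```

Sniffing the magic bytes rather than trusting the .gz extension also covers files that were renamed after (de)compression.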
Finally, we plan to migrate the WDL tools to a new repository. There are several tasks required before we can make the change.
The recently released RNA-Seq Standard 2.1.0 is not backward compatible with 2.0.0, particularly with input and output naming. In my case: rnaseq_standard.strand was renamed to rnaseq_standard.strandedness; rnaseq_standard.htseq_count.memory_gb changed behavior and became rnaseq_standard.htseq_count.added_memory_gb; and the feature counts suffix was changed from .counts.txt to .feature-counts.txt.
These changes are unexpected, as the README says versioned workflows follow semantic versioning.
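For anyone updating a pipeline, the renames mean the old inputs-JSON keys rnaseq_standard.strand and rnaseq_standard.htseq_count.memory_gb must be replaced with the new names, e.g. (values illustrative, not a complete inputs file):

```json
{
    "rnaseq_standard.strandedness": "Stranded-Reverse",
    "rnaseq_standard.htseq_count.added_memory_gb": 4
}
```

Note that added_memory_gb is not a simple rename: per the behavior change above, it is additional memory on top of the task's baseline rather than the total allocation.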
Workflow: RNA-Seq Standard 2.0.0
When the input records have mates, htseq-count keeps an arbitrarily-sized buffer to match record pairs. In extreme cases, the default buffer size (--max-reads-in-buffer 30000000) is too small, causing the following error:

Error occured when processing SAM input (record #396226907 in file sample.bam):
Maximum alignment buffer size exceeded while pairing SAM alignments.

I propose either adding an input to override the value (Int max_reads_in_buffer = 30000000) or fixing the value to an infeasibly high record count, e.g., 2^63-1. The latter is then simply bounded by memory.
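A sketch of the first option, exposing the htseq-count default as an overridable input. The task shape here is illustrative, not the repository's actual htseq_count task:

```wdl
version 1.0

task htseq_count {
    input {
        File bam
        File gtf
        # htseq-count's default; callers hitting the pairing-buffer
        # error can raise this without editing the task
        Int max_reads_in_buffer = 30000000
    }
    command <<<
        htseq-count \
            --max-reads-in-buffer ~{max_reads_in_buffer} \
            -f bam ~{bam} ~{gtf} > feature-counts.txt
    >>>
    output {
        File counts = "feature-counts.txt"
    }
}
```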
Update tasks and workflows to WDL version 1.1:
- rename docker runtime keys to container
- convert sep expressions to the sep function

The analyses run should be toggle-able for users only interested in a subset of the workflow (and to keep costs down).
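The docker-to-container and sep changes look roughly like this on a toy task (illustrative, not from the repository):

```wdl
version 1.1

task echo_names {
    input {
        Array[String] names
    }
    command <<<
        # WDL 1.1: sep is a standard-library function, replacing the
        # older ~{sep=" " names} placeholder-option syntax
        echo ~{sep(" ", names)}
    >>>
    runtime {
        # WDL 1.1: the docker runtime key is renamed to container
        container: "ubuntu:22.04"
    }
}
```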
From feedback on RFC 0001, we have decided to remove duplicate marking. Since we have no ability to determine whether reads are actually duplicated, and considering the results presented in https://www.nature.com/articles/srep25533, the best course of action appears to be no longer marking duplicates, both to avoid giving downstream tools inaccurate information and to allow those tools to make their own determination.
Multiple included WDL files have incorrect runtime settings when running on Google Cloud.
For example:
https://github.com/stjudecloud/workflows/blob/92339ee9786e5eb55df80e2cba7b07c4822f2366/tools/star.wdl#L163
The setting should be 'disks' and not 'disk'. The following is the output from Cromwell when running the rnaseq-standard-fastq.wdl workflow:
2021-05-05 18:47:34,785 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk] is/are not supported by backend. Unsupported attributes will not be part of job executions.
[the same warning repeats once per task in the workflow; one instance also flags dx_instance_type:]
2021-05-05 18:47:34,789 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk, dx_instance_type] is/are not supported by backend. Unsupported attributes will not be part of job executions.
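A corrected runtime section would use the plural key with Cromwell's Google-backend value format (size and image below are illustrative):

```wdl
runtime {
    memory: "50 GB"
    # 'disks' (plural) is the key PAPIv2 recognizes; the value format is
    # Google-specific: "local-disk <size-in-GB> <HDD|SSD>"
    disks: "local-disk 500 SSD"
    docker: "ubuntu:22.04"
}
```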
I'll single out the 3 worst offenders: util.wdl, samtools.wdl, and picard.wdl have so many tasks in them that it becomes difficult to find the one you're looking for. At least that's my experience. They are each >700 lines long, which in just about every other language would be considered a behemoth for maintenance. Most languages have recommended file lengths (I'm guessing an average consensus would be around 500 lines?), so I propose we adopt something similar for WDL/this repo.
Although I don't want to base it on line count. I feel that can encourage some sloppy coding when the file in question is near the limit, e.g. collapsing lines that should be separated in order to stay below the length limit.
The indented section below is me thinking out loud, realizing all my ideas have some fatal flaw. I'm stumped as to how to solve the problem. Feel free to skip the indented section, or read it to see the thoughts I've had.
A saner approach to me is a task limit. The exact number might require some trial and error. Rough gut feeling: 5 seems too strict, 10 seems too lenient. I'd say we start looking in the 6-9 range for our task number limit.
"But we currently organize our files by tool. Are we throwing that scheme out?" No! (At least that's not my initial proposal. I'd hear someone out if they have an alternative.)
I say we start with organizing our files by tool, and then once they grow past 6-9 tasks, we make a split. What that split is will depend on some context.
For example, picard.wdl could be split into picard-qc.wdl and picard-manipulation.wdl:
picard-qc has all the Picard tasks which generate a report of some kind and don't change the BAM file.
picard-manipulation has all the Picard tasks which deal with modifying BAM files.
samtools.wdl could be split into... alright, I don't see a great way to split this file.
Let's try util.wdl: it could be split into util-python-scripts.wdl, and... gosh, this is proving more difficult than I expected.

Pivot: what about sorting our tasks? That would also accomplish the goal of making it easier to find a specific task in a long file.
Let's start with a file whose order I like: kraken2.wdl. It's ordered so well I know it off the top of my head: download_taxonomy, download_library, create_library_from_fastas, build_db, kraken. It flows in the order that the tasks would be used. It's logical. But this seems to be a special case; I don't see a way to generalize it. Unfortunate...
ngsderive.wdl is roughly in the order that each task/subcommand was created. Chronological ordering makes sense, although that knowledge is pretty esoteric. I doubt anyone besides me and Clay could rattle off the order that commands were added to ngsderive. So that works for me but is probably not an ordering we should stand by. Now that I think about it, we always (or nearly always) add new tasks to the bottom of the file, so really most of our files are already ordered this chronological way. That kinda works for helping us regular maintainers find what we're looking for, but it's not helpful for anyone else.

Alphabetize? Are there any other sorting choices? Would alphabetized tasks be an improvement? It wouldn't be of very much help to me. My brain is not great at alphabetizing. Not "can't do it" bad, but it also wouldn't be trivial for me to locate what I'm looking for in that sort. Maybe a small improvement?
At first I thought this would be a silly suggestion but I don't hate it: shorter tasks at the top of the file, longer tasks the bottom of the file.
Con: really really annoying to get that sorted and maintain that sort (assuming we don't automate it).
Pro: it's the closest to "locate by vibe" that exists.
So I'm stumped. I still think our files are too long and they should be broken up. Or sorted, although I'd prefer a scheme for splitting files into smaller chunks. But I don't like any specific implementation I can come up with.
The best thing I've thought of is alphabetizing tasks. I don't love it, because my brain is wired in a way that finding things in an alphabetic sort isn't the easiest for me. But it's probably an overall improvement (especially while we lack any viable alternatives).
So, do we want to start alphabetizing our WDL files with many tasks? All our task files? Is there a threshold under which it's not worth the effort? Would that look strange, some files sorted, some not?
Opening the floor for proposals!