stjudecloud / workflows

Bioinformatics workflows developed for and used on the St. Jude Cloud project.

License: MIT License

Dockerfile 0.46% WDL 98.09% Python 1.34% Shell 0.11%

bioinformatics bioinformatics-workflows computational-biology cromwell cwl cwl-workflow genomics next-generation-sequencing stjudecloud wdl wdl-workflow workflow workflow-engine workflows

workflows's People

Contributors

Stargazers

Watchers

Forkers

sclan pandurang-kolekar jsunny23 ngenebio-genomics-platform truwl adthrasher lqsae jpcartailler maurya-anand

workflows's Issues

Add CI for checking all tasks pull current docker images

Style Guide rules under review

All quotes that can be double quotes should be (only use single quotes where necessary, such as when nesting quotes)
- I don't like this phrasing, but I'm not laboring over the words ATM
All comma delimited lists/arrays/objects/etc. should have a trailing comma
- again, not a fan of my phrasing here. Should be workshopped before entering the guide
- I can easily find the link if requested, but from memory black has a very good rationale for their stance on trailing commas and I say we follow that example.
  - The summary of that rationale is that it makes adding/reordering/removing items from a list less error prone. They might have additional arguments, but the above is enough to convince me.
All lists/arrays/objects/etc. should have one element per line (i.e., newline-separated items). A key/value pair is considered one element.
- Is this controversial? It goes further than black which says short lists with few items should be collapsed to one line.
- I think this is justified by the fact it's easier to read and edit/rearrange IMO. Does anyone disagree?
- In #115 I complain our files have too many lines. This exacerbates that issue (but I think we already do this? I don't think officially adopting this rule would actually require changes to our code)
- Benefit: This rule is trivial to enforce with code. The alternative would be having some line width cut-off or other logic to calculate before deciding to collapse/split lines. Easier to implement via code is better IMO.

The common thread among these 3 rules is that they are all trivial to enforce via an auto-formatter tool (and tedious/annoying to enforce manually). Therefore, I vote we hold off on adopting them officially until that auto-formatter has come to fruition, mainly to keep our codebase compliant with our documents.
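To illustrate why the trailing-comma and one-element-per-line rules are cheap to enforce in code, here is a minimal Python sketch of the rendering step. The function name `format_array` and the four-space indent are assumptions made for illustration; this is not the planned auto-formatter.

```python
def format_array(items, indent="    "):
    """Render an array literal with one element per line and a trailing comma.

    `items` are already-rendered element strings (hypothetical input shape).
    """
    if not items:
        return "[]"
    lines = ["["]
    for item in items:
        # Every element, including the last, gets its own line and a comma.
        lines.append(f"{indent}{item},")
    lines.append("]")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_array(['"a"', '"b"']))
```

Note that there is no line-width calculation anywhere: the output depends only on the elements, which is the "easier to implement via code" benefit described above.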

NGSderive step assumes plain text

All of the other steps that use the GTF file seem to properly handle compressed input, but this step requires the file to be plain-text. It should handle compressed input appropriately.

workflows/tools/ngsderive.wdl

Line 21 in f19e0d3

sort -k1,1 -k4,4n -k5,5n ~{gtf} | bgzip > annotation.gtf.gz
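One way to accept both plain-text and compressed GTF input is to look at the first two bytes of the file rather than trusting the extension. The Python sketch below shows that check; the helper name `open_text_maybe_gzip` is hypothetical, and the actual fix in the task may well be done in shell instead.

```python
import gzip

# Every gzip stream (including bgzip/BGZF output) starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def open_text_maybe_gzip(path):
    """Open `path` for reading as text, decompressing if it is gzip-compressed."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt")
    return open(path, "rt")
```

With a helper like this, the sort step could read its records the same way regardless of whether the GTF was supplied compressed.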

Split tasks to a new repository

Finally, we plan to migrate the WDL tools to a new repository. There are several tasks required before we can make the change.

investigate appropriate `disk_size`s

workflows/rnaseq: 2.1.0 is not backward compatible with 2.0.0

The recently released RNA-Seq Standard 2.1.0 is not backward compatible with 2.0.0, particularly with input and output naming.

In my case, inputs were changed: rnaseq_standard.strand was renamed to rnaseq_standard.strandedness; and rnaseq_standard.htseq_count.memory_gb changed behavior to rnaseq_standard.htseq_count.added_memory_gb. The feature counts suffix was also changed from .counts.txt to .feature-counts.txt.

These changes are unexpected, as the readme says versioned workflows follow semantic versioning.
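For reference, the semantic versioning rule at issue can be written down in a few lines. This Python sketch classifies a version bump and reports whether breaking changes (such as renamed inputs or outputs) are permitted by it. The function names are illustrative, and the sketch ignores pre-release tags and the special 0.y.z rule.

```python
def parse_version(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump_kind(old, new):
    """Return 'major', 'minor', or 'patch' for the step from `old` to `new`."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v <= old_v:
        raise ValueError(f"{new} is not newer than {old}")
    if new_v[0] != old_v[0]:
        return "major"
    if new_v[1] != old_v[1]:
        return "minor"
    return "patch"


def bump_allows_breaking_changes(old, new):
    """Under semantic versioning, only a major bump may break compatibility."""
    return bump_kind(old, new) == "major"
```

By this rule, going from 2.0.0 to 2.1.0 is a minor bump, so the renamed inputs and the changed output suffix would have called for 3.0.0.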

tools/htseq: Add override or fix htseq-count max-reads-in-buffer option

Workflow: RNA-Seq Standard 2.0.0

When the input records have mates, htseq-count keeps an arbitrarily-sized buffer to match record pairs. In extreme cases, the default buffer size --max-reads-in-buffer 30000000 is too small, causing the following error:

Error occured when processing SAM input (record #396226907 in file sample.bam):
  Maximum alignment buffer size exceeded while pairing SAM alignments.

I propose either adding an input to override the value (Int max_reads_in_buffer = 30000000) or fixing the value to an infeasibly high record count, e.g., 2^63-1. The latter is then simply bounded by memory.
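The two options can be sketched as follows. This Python snippet only builds the argument list; the parameter name `max_reads_in_buffer` mirrors the proposed WDL input, `htseq_count_args` is a hypothetical helper, and whether htseq-count behaves well with a value as large as 2^63-1 is an assumption I have not verified.

```python
# 2^63 - 1, the largest signed 64-bit integer.
MAX_INT64 = 2**63 - 1

# The default quoted in the error description above.
DEFAULT_MAX_READS_IN_BUFFER = 30_000_000


def htseq_count_args(bam, gtf, max_reads_in_buffer=DEFAULT_MAX_READS_IN_BUFFER):
    """Build an htseq-count argument list with an overridable buffer size."""
    if max_reads_in_buffer <= 0:
        raise ValueError("max_reads_in_buffer must be positive")
    return [
        "htseq-count",
        "--max-reads-in-buffer",
        str(max_reads_in_buffer),
        bam,
        gtf,
    ]
```

Option one corresponds to exposing `max_reads_in_buffer` to the caller; option two corresponds to always passing `MAX_INT64`, which leaves available memory as the only bound.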

WDL/1.1

Update tasks and workflows to WDL version 1.1

Convert docker runtime keys to container
Convert sep expressions to sep function
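Both conversions are mechanical enough that a first pass could be scripted. The Python sketch below is a rough, regex-based illustration and not a WDL parser: it assumes `docker` keys sit on their own line, that `sep` options use the `~{sep="..." expr}` form with a simple expression, and it does not touch `${...}` placeholders or the `version` line. Since the WDL 1.1 `sep` function takes an array of strings, arrays of other types would still need manual attention.

```python
import re

# A `docker:` key at the start of a line (leading whitespace preserved).
DOCKER_KEY = re.compile(r"^(\s*)docker(\s*:)", re.MULTILINE)

# `~{sep="<delim>" <expr>}` where <expr> contains no braces.
SEP_OPTION = re.compile(r"""~\{\s*sep\s*=\s*("[^"]*"|'[^']*')\s+([^{}]+?)\s*\}""")


def convert_to_wdl_1_1(source):
    """Rename `docker` keys to `container` and rewrite sep options as sep() calls."""
    source = DOCKER_KEY.sub(r"\1container\2", source)
    return SEP_OPTION.sub(r"~{sep(\1, \2)}", source)
```

Any output from a script like this should be reviewed by hand and validated with a WDL 1.1 parser before merging.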

Modularize QC

The analyses run should be toggle-able for users only interested in a subset of the workflow (and to keep costs down)

Remove duplicate marking

From feedback on RFC 0001, we have decided to remove duplicate marking. We have no ability to determine whether reads are actually duplicated, and considering the results presented in https://www.nature.com/articles/srep25533, the best course of action appears to be to stop marking duplicates. This avoids giving downstream tools inaccurate information and allows the tools to make their own determination.

PAPIv2 Key [disk] not supported by GCP backend

Multiple included WDL files have incorrect runtime settings when running on Google Cloud.
For example:
https://github.com/stjudecloud/workflows/blob/92339ee9786e5eb55df80e2cba7b07c4822f2366/tools/star.wdl#L163

The setting should be 'disks' and not 'disk'. The following is the output from Cromwell when running the rnaseq-standard-fastq.wdl workflow:

2021-05-05 18:47:34,785 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk] is/are not supported by backend. Unsupported attributes will not be part of job executions.
2021-05-05 18:47:34,786 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk] is/are not supported by backend. Unsupported attributes will not be part of job executions.
2021-05-05 18:47:34,786 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk] is/are not supported by backend. Unsupported attributes will not be part of job executions.
2021-05-05 18:47:34,786 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk] is/are not supported by backend. Unsupported attributes will not be part of job executions.
2021-05-05 18:47:34,787 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk] is/are not supported by backend. Unsupported attributes will not be part of job executions.
2021-05-05 18:47:34,787 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk] is/are not supported by backend. Unsupported attributes will not be part of job executions.
2021-05-05 18:47:34,788 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk] is/are not supported by backend. Unsupported attributes will not be part of job executions.
2021-05-05 18:47:34,788 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk] is/are not supported by backend. Unsupported attributes will not be part of job executions.
2021-05-05 18:47:34,788 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk] is/are not supported by backend. Unsupported attributes will not be part of job executions.
2021-05-05 18:47:34,789 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk] is/are not supported by backend. Unsupported attributes will not be part of job executions.
2021-05-05 18:47:34,789 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk] is/are not supported by backend. Unsupported attributes will not be part of job executions.
2021-05-05 18:47:34,789 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk, dx_instance_type] is/are not supported by backend. Unsupported attributes will not be part of job executions.
2021-05-05 18:47:34,789 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk] is/are not supported by backend. Unsupported attributes will not be part of job executions.
2021-05-05 18:47:34,790 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk] is/are not supported by backend. Unsupported attributes will not be part of job executions.
2021-05-05 18:47:34,790 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk] is/are not supported by backend. Unsupported attributes will not be part of job executions.
2021-05-05 18:47:34,790 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk] is/are not supported by backend. Unsupported attributes will not be part of job executions.
2021-05-05 18:47:34,790 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk] is/are not supported by backend. Unsupported attributes will not be part of job executions.
2021-05-05 18:47:34,791 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk] is/are not supported by backend. Unsupported attributes will not be part of job executions.
2021-05-05 18:47:34,791 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk] is/are not supported by backend. Unsupported attributes will not be part of job executions.
2021-05-05 18:47:34,791 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk] is/are not supported by backend. Unsupported attributes will not be part of job executions.
2021-05-05 18:47:34,792 cromwell-system-akka.dispatchers.backend-dispatcher-58 WARN - PAPIv2 [UUID(953487a6)]: Key/s [disk] is/are not supported by backend. Unsupported attributes will not be part of job executions.
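Since the log repeats the same warning many times, a small tally makes it easier to see which runtime keys are affected. This is a minimal Python sketch that assumes the warning wording shown above; `tally_unsupported_keys` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the bracketed key list in the Cromwell warning quoted above.
UNSUPPORTED = re.compile(r"Key/s \[([^\]]+)\] is/are not supported by backend")


def tally_unsupported_keys(log_text):
    """Count how often each runtime key is reported as unsupported."""
    counts = Counter()
    for match in UNSUPPORTED.finditer(log_text):
        for key in match.group(1).split(","):
            counts[key.strip()] += 1
    return counts
```

Run over the output above, this would report `disk` on every warning line and `dx_instance_type` on one of them.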

our WDL files are too long

I'll single out the 3 worst offenders: util.wdl, samtools.wdl, and picard.wdl have so many tasks in them it becomes difficult to find the one you're looking for. At least that's my experience. They are each >700 lines long, which I think would be considered a behemoth for maintenance in just about every other language. Most languages have recommended file lengths (guessing an average consensus would be around 500 lines?); I propose we adopt something similar for WDL/this repo.
That said, I don't want to base it on line count. I feel that can encourage some sloppy coding when the file in question is near the limit, e.g. collapsing lines that should be separate in order to stay below the length limit.

The below indented section would be me thinking out loud realizing all my ideas have some fatal flaw. I'm stumped as to how to solve the problem. Feel free to skip the indented section, or read it to see the thoughts I've had.

A saner approach to me is a task limit. The exact number might require some trial and error. Rough gut feeling: 5 seems too strict, 10 seems too lenient. I'd say we start looking in the 6-9 range for our task number limit.

"But we currently organize our files by tool. Are we throwing that scheme out?" No! (At least that's not my initial proposal. I'd hear someone out if they have an alternative.)
I say we start with organizing our files by tool, and then once they grow past 6-9 tasks, we make a split. What that split is will depend on some context.
For ex. picard.wdl: could be split into picard-qc.wdl and picard-manipulation.wdl.
picard-qc has all the Picard tasks which generate a report of some kind, and don't change the BAM file.
picard-manipulation has all the Picard tasks which deal with modifying BAM files.

samtools.wdl could be split into... Alright I don't see a great way to split this file.
Let's try util.wdl: could be split into util-python-scripts.wdl, and gosh this is proving more difficult than I expected.

Pivot: what about sorting our tasks? That would also accomplish the goal of making it easier to find a specific task in a long file.

Let's start with a file whose order I like: kraken2.wdl. It's ordered so well I know it off the top of my head: download_taxonomy, download_library, create_library_from_fastas, build_db, kraken. It flows in the order that the tasks would be used. It's logical. This seems to be a special case that I don't see a way to generalize. Unfortunate...

ngsderive.wdl is roughly in the order that the task/subcommand was created. Chronological ordering makes sense, although I'd say that knowledge is pretty esoteric. I doubt anyone besides me and Clay could rattle off the order that commands were added to ngsderive. So that works for me but is probably not an ordering we should stand by. Now that I think about it, I think we always (or nearly always) add new tasks to the bottom of the file. So really most of our files are ordered in this chronological way. Kinda works for helping us regular maintainers find what we're looking for, but not helpful for anyone else.

Alphabetize? Are there any other sorting choices? Would alphabetized tasks be an improvement? It wouldn't be of very much help to me. My brain is not great at alphabetizing. Not "can't do it" bad, but also it wouldn't be trivial for me to locate what I'm looking for in that sort. I imagine the situation would be roughly the same for me. Maybe a small improvement?

At first I thought this would be a silly suggestion but I don't hate it: shorter tasks at the top of the file, longer tasks at the bottom of the file.
Con: really really annoying to get that sorted and maintain that sort (assuming we don't automate it).
Pro: it's the closest to "locate by vibe" that exists 😆

So I'm stumped. I still think our files are too long and they should be broken up. Or sorted, although I'd prefer a scheme for splitting files into smaller chunks. But I don't like any specific implementation I can come up with.

The best thing I thought of is alphabetizing tasks. I don't love it bc my brain is wired in a way that makes finding things in an alphabetic sort not the easiest for me. But it's probably an overall improvement (especially while we lack any viable alternatives).
So, do we want to start alphabetizing our WDL files with many tasks? All our task files? Is there a threshold under which it's not worth the effort? Would that look strange, some files sorted, some not?
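Whichever rule gets picked, both a task-count limit and an alphabetical ordering are easy to check automatically. Here is a rough Python sketch that does both for a single file's contents. The regex is a simplification rather than a real WDL parser, and the default limit of 9 is just the upper end of the 6-9 range floated above, not a decided value.

```python
import re

# A task declaration such as `task build_db {` at the start of a line.
TASK_DECL = re.compile(r"^\s*task\s+([A-Za-z][A-Za-z0-9_]*)\s*\{", re.MULTILINE)


def task_names(wdl_source):
    """Return task names in the order they appear in the file."""
    return TASK_DECL.findall(wdl_source)


def review_file(wdl_source, max_tasks=9):
    """Report task count, whether it exceeds the limit, and whether tasks are sorted."""
    names = task_names(wdl_source)
    return {
        "task_count": len(names),
        "over_limit": len(names) > max_tasks,
        "alphabetized": names == sorted(names),
    }
```

A check like this could answer the threshold question empirically: run it over every file under tools/ and see how many files would actually need changes at each candidate limit.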

Opening the floor for proposals!
