
elixir-no-nels / selma


This project is a fork of oskarvid/selma

3 stars, 3 watchers, 2 forks, 8.07 MB

Germline Variant Calling Pipeline built in Snakemake

Languages: Dockerfile 1.37%, Python 41.64%, Shell 56.99%
Topics: containerized, gatk4, germline-variant-calling, scatter-gather, singularity, snakemake

selma's People

Contributors: oskarvid

Forkers: kjellp, irliampa

selma's Issues

Enhance documentation of -s option

Suggestion: append "to run the workflow analysis in parallel." to the end of the help/usage text for -s. This would also make clear that omitting the "-s" flag runs the samples sequentially on a single node.

Potential issue if we officially support exome (interval file) data analysis

I did some reading about exome data analysis with the tools that we use in Selma, and apparently it would produce suboptimal results according to this discussion: https://gatkforums.broadinstitute.org/gatk/discussion/6894/gatk-best-practices-for-exome-targeted-capture-small-region
The important points are the following quotes:

  • you should not use BQSR on [exome data]
  • You are probably better off doing hard filtering for a small target region [instead of using VQSR]

This discussion also has good information about why BQSR is not advised for datasets with less than 100 million bases.

We discussed running hap.py on exome (interval file) data analyses, but based on these points that may not be a good use of our time, given that we shouldn't run the BQSR, VQSR and ApplyVQSR tools on small datasets.
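For reference, the hard-filtering alternative mentioned in the second quote is GATK's VariantFiltration tool. A minimal sketch for SNPs; the file paths are placeholders and the thresholds are GATK's generic starting values, not Selma settings:

    # Hard-filter SNPs instead of running VQSR/ApplyVQSR on a small target region.
    # Paths are hypothetical; thresholds follow GATK's generic hard-filtering
    # guidance and should be tuned per dataset.
    gatk VariantFiltration \
        -R reference.fasta \
        -V raw_snps.vcf.gz \
        --filter-name "QD2"  --filter-expression "QD < 2.0" \
        --filter-name "FS60" --filter-expression "FS > 60.0" \
        --filter-name "MQ40" --filter-expression "MQ < 40.0" \
        -O hard_filtered_snps.vcf.gz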

Inform user of which sample/s failed

If a user runs a large number of samples and one or more of them fail, it is tedious, and potentially difficult, to figure out which ones. An error message at the end of the workflow saying something like "samples 1, 2 and 3 failed" would improve the usability of the workflow.

Another issue with failing samples during multi-sample executions is that there is no message during the run saying that the remaining samples will keep running; this needs to be added too.
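A minimal Bash sketch of such end-of-run reporting, assuming each sample job writes a hypothetical <sample>.exit file containing its exit status (the status-file convention and paths are illustrative, not part of Selma today):

    # Collect per-sample exit codes and report all failures at the end of the run.
    shopt -s nullglob
    failed=()
    for f in status/*.exit; do
        sample=$(basename "$f" .exit)
        [[ "$(cat "$f")" -ne 0 ]] && failed+=("$sample")
    done
    if (( ${#failed[@]} > 0 )); then
        echo "ERROR: the following samples failed: ${failed[*]}" >&2
        echo "The remaining samples continued running to completion." >&2
        exit 1
    fi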

Figure out how gvanno can fit in to the workflow

gvanno is a workflow that does genome variant annotation and would be a very useful addition to Selma. Exactly how it could be added has not been decided. One suggestion is to offer it as an option when Selma is started; another is to recommend it to users in the documentation or in the end-of-execution message that is shown when the workflow finishes successfully. Feel free to make other suggestions.

Squeue is unstable on Colossus

A central component of Selma's status reporting is the squeue command. Unfortunately, the connection to the slurm daemon is not as stable as would be ideal, which causes the squeue command to fail and, in turn, prematurely kills the Selma workflow execution.

This is a fatal issue that has been encountered before. One possible solution is to break the status checking algorithm out into its own script and source it from the main script; that way the squeue command can fail without prematurely stopping the workflow execution.
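A sketch of such a stand-alone status script, assuming it is wired in through Snakemake's --cluster-status option, which calls the script with the slurm job ID and expects it to print success, failed or running (the retry count and sleep interval are arbitrary choices):

    #!/usr/bin/env bash
    # status.sh <jobid> - retry squeue so a transient slurm connection failure
    # does not kill the workflow; fall back to sacct for finished jobs.
    jobid=$1
    state=""
    for attempt in 1 2 3 4 5; do
        state=$(squeue -h -j "$jobid" -o %T 2>/dev/null) && break
        sleep 10
    done
    if [[ -z "$state" ]]; then
        state=$(sacct -n -X -j "$jobid" -o State 2>/dev/null | awk '{print $1}')
    fi
    case "$state" in
        COMPLETED) echo success ;;
        FAILED|CANCELLED*|TIMEOUT|NODE_FAIL|OUT_OF_MEMORY) echo failed ;;
        *) echo running ;;
    esac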

Run one sample per node

The current version runs all samples on a single execute node, which leads to potential issues that would be resolved by running one sample per node.
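A hedged sketch of per-sample submission from the run script; the runOnNode script name comes from the file staging issue below, while the samples array and the sbatch options are illustrative:

    # Submit one sbatch job per sample instead of one job for all samples.
    samples=(sample1 sample2 sample3)   # hypothetical; would come from samples.tsv
    for sample in "${samples[@]}"; do
        sbatch --nodes=1 --job-name="selma-${sample}" runOnNode "$sample"
    done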

TSV file - TSD

A TSV file is needed for the following purposes:

Read group handling

The BAM files need their read groups set properly, both for identification purposes and for proper handling by GATK, which improves the analysis.
The idea is to implement this with a TSV file that Snakemake parses into variables used to build the read group.

Input file paths

The input file paths should be relative to the directory given with -i in start-workflow.sh.
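A Bash sketch of what needs to come out of the TSV: a read group string and fastq paths resolved against -i. The column order (sample, library, platform, flowcell, lane, R1, R2) is an assumption; in the real workflow Snakemake would do the parsing:

    # Build a GATK-friendly @RG string per samples.tsv row and resolve the
    # fastq paths relative to the -i input directory.
    INPUT_DIR=$1    # the directory given with -i
    tail -n +2 samples.tsv |
    while IFS=$'\t' read -r sample library platform flowcell lane fq1 fq2; do
        rg="@RG\tID:${flowcell}.${lane}\tSM:${sample}\tLB:${library}\tPL:${platform}"
        echo "${sample}: rg=${rg} r1=${INPUT_DIR}/${fq1} r2=${INPUT_DIR}/${fq2}"
    done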

Figure out how to handle config options

To simplify configuration of the workflow in general, and slurm in particular, we need to look closer at how to let the user supply custom config files for things like slurm account name, RAM, CPUs, time, etc.
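One hedged approach: a settings file with defaults that start-workflow.sh sources, with command line flags taking precedence. The file name, variable names and flag letters below are all hypothetical:

    # settings.conf (hypothetical), sourced by start-workflow.sh:
    SLURM_ACCOUNT="p172"
    SLURM_MEM="16G"
    SLURM_CPUS=8
    SLURM_TIME="24:00:00"

    # In start-workflow.sh: load the defaults, then let flags override them.
    source settings.conf
    while getopts "a:m:c:t:" opt; do
        case "$opt" in
            a) SLURM_ACCOUNT=$OPTARG ;;
            m) SLURM_MEM=$OPTARG ;;
            c) SLURM_CPUS=$OPTARG ;;
            t) SLURM_TIME=$OPTARG ;;
        esac
    done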

Input folder requirements

Currently the code requires some fastq files to reside in the root of the input folder, while others may be organized into subfolders. We need to support a top-level input folder with multiple subfolders (separating samples, runs and flowcells in a logical way) and no fastq files at the top level.

Select specific sample from TSV file

Given that a user creates one master TSV file, it should be possible to build functionality that allows the user to select specific samples for analysis.
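A minimal sketch, assuming the sample name is in the first column of the master TSV:

    # Keep the header plus only the rows whose first column matches $SAMPLE.
    SAMPLE="sample1"    # hypothetical sample name, e.g. from a new flag
    awk -F'\t' -v s="$SAMPLE" 'NR == 1 || $1 == s' samples.tsv > selected.tsv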

Upgrade GATK to the latest version

Each GATK version comes with new features, speed improvements, bug fixes and quality improvements. We should upgrade GATK to the latest version.

Sbatch script failure handling

If the sbatch script fails, there is a risk that the start-workflow.sh script hangs forever.
There needs to be a trap with a die function that catches failures and runs contingency code to stop start-workflow.sh from running forever, while also printing helpful troubleshooting information.
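A sketch of that trap/die pattern; the messages and cleanup steps are illustrative:

    # die() prints troubleshooting info and stops start-workflow.sh instead of
    # letting it hang after an sbatch failure.
    die() {
        echo "ERROR: $*" >&2
        echo "Check the slurm logs (e.g. sacct -j <jobid>) for details." >&2
        # contingency code: scancel leftover jobs, clean staged files, etc.
        exit 1
    }
    trap 'die "command \"$BASH_COMMAND\" failed on line $LINENO"' ERR

    jobid=$(sbatch --parsable runOnNode) || die "sbatch submission failed"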

Squeue seems to be crashing the workflow

I've run two full-scale three-sample tests that failed after 2 h 55 min and 2 h 27 min respectively. I'm currently running a new test with Bash's built-in debugging enabled to try to capture the error message for better troubleshooting. This is a fatal issue and must be resolved.

Move current command line options to the settings file

Currently a set of flags is required to define the input file directory, the samples.tsv file path, the output path, etc., and all or most of these could live in the settings file.
The idea is to make it possible to set up everything in the settings file, while allowing the user to override the defaults with one or more flags as desired.

TSD start-workflow.sh issues

Hi

I've come across some issues with the start-workflow.sh script on TSD

First, wherever the script sends the fastq files on Colossus has a limit on how much can be uploaded there.

The test files I'm using are about 30 GB each, and the slurm job gives up before finishing the transfer of the R1 file.

Secondly, I've noted that when moving the files to the staging area (and presumably when sending them up to Colossus) the script takes every file in the input folder, not just the ones defined in the sample table TSV. This can make run/upload times needlessly long if there are a lot of fastqs in the input location (see the staging sketch after this message).

All the best

David
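On the second point, a hedged sketch of staging only the fastq files named in the sample table; the R1/R2 column positions and the rsync-based transfer are assumptions:

    # Stage only the fastq files listed in samples.tsv, not the whole -i folder.
    INPUT_DIR=$1    # the directory given with -i
    tail -n +2 samples.tsv | cut -f6,7 | tr '\t' '\n' > files-to-stage.txt
    rsync -a --files-from=files-to-stage.txt "$INPUT_DIR"/ \
        /cluster/projects/p172/staging/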

Starting Selma

One thing I have found is that Selma can only be started from the directory where it is installed, and nothing in the user guide directly states this.

Is this intentional?

If so, it seems a bit cumbersome for the end user that the tool cannot be started from whatever folder you are working in, given that most people would likely be in the folder where they intend the results to be written rather than in the Selma install directory.
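If the restriction is not intentional, a common fix is to have the script resolve its own install directory so it can be launched from anywhere; a minimal sketch:

    # Resolve the directory start-workflow.sh lives in, regardless of the
    # user's current working directory, and cd there before running Snakemake.
    SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
    cd "$SCRIPT_DIR"

Results could then still be written to the user's working directory through the -o option.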

Move most or all command line options to the settings file

The number of arguments to the workflow has grown, and some flags can be moved to the settings file. These flags are less likely to change between executions and can therefore safely be given default values that can be overridden with command line flags.
A definitive list of proposed flags to move will be made in time.

File staging - TSD

The following description has been adapted from the rbFlow-germline issue covering the same feature.

Colossus is not allowed to read or write to /tsd/p172sharedpXX from a compute node, therefore we need this workaround:

1. ./start-workflow.sh (run script) stages the input folder given in -i to the /cluster/projects/p172 area

2. The input data is pulled by the runOnNode sbatch script from /cluster/projects/p172 to the local scratch disk

3. Singularity is started such that the local scratch copies of the input files are used

4. Output files should be generated on local scratch first

5. ./start-workflow.sh deletes the /cluster/projects/p172 copy of the input data before terminating

6. Output files are copied to /cluster/projects/p172/job-$DATE by the sbatch script.

7. The output files are copied from the /cluster/projects/p172 area to the path given with -o on HNAS

8. The copy of the output files in the /cluster/projects/p172 area is deleted
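A condensed sketch of steps 1-8; the variables, the selma.simg image name and the sbatch/singularity options are assumptions, and the real runOnNode script may differ:

    # Hypothetical variables; real values come from the run script's flags.
    INPUT_DIR=$1; OUTPUT_DIR=$2; DATE=$(date +%F); SCRATCH=${SCRATCH:-/tmp/scratch}

    # start-workflow.sh side (steps 1 and 5):
    rsync -a "$INPUT_DIR"/ /cluster/projects/p172/staging/       # 1. stage input
    sbatch --wait runOnNode                                      # block until the job ends
    rm -rf /cluster/projects/p172/staging                        # 5. delete staged input

    # runOnNode sbatch script side (steps 2-4 and 6):
    cp -r /cluster/projects/p172/staging "$SCRATCH"/input        # 2. pull to local scratch
    singularity run -B "$SCRATCH":/data selma.simg               # 3-4. run against scratch
    cp -r "$SCRATCH"/output /cluster/projects/p172/job-"$DATE"   # 6. copy results back

    # start-workflow.sh side again (steps 7 and 8):
    cp -r /cluster/projects/p172/job-"$DATE"/. "$OUTPUT_DIR"/    # 7. copy to -o on HNAS
    rm -rf /cluster/projects/p172/job-"$DATE"                    # 8. clean up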
