
elixir-no-nels / selma


This project is a fork of oskarvid/selma

3 stars, 3 watchers, 2 forks, 8.07 MB

Germline Variant Calling Pipeline built in Snakemake

Languages: Dockerfile 1.37%, Python 41.64%, Shell 56.99%
Topics: containerized, gatk4, germline-variant-calling, scatter-gather, singularity, snakemake

selma's People

Contributors: oskarvid

Forkers: kjellp, irliampa

selma's Issues

Enhance documentation of -s option

Suggestion: append "to run the workflow analysis in parallel." to the end of the help/usage text for -s. This would also make clear that omitting the "-s" flag runs the samples sequentially on a single node.

Potential issue if we officially support exome (interval file) data analysis

I did some reading about exome data analysis with the tools that we use in Selma, and apparently it would produce suboptimal results according to this discussion: https://gatkforums.broadinstitute.org/gatk/discussion/6894/gatk-best-practices-for-exome-targeted-capture-small-region
The important points are the following quotes:

  • you should not use BQSR on [exome data]
  • You are probably better off doing hard filtering for a small target region [instead of using VQSR]

This discussion also has good information about why BQSR is not advised for datasets with less than 100 million bases.

We discussed running hap.py on exome (interval file) data analyses, but based on these points that may not be a good use of our time, given that we shouldn't run the BQSR, VQSR and ApplyVQSR tools on small datasets.
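For reference, the hard-filtering alternative mentioned in the second quote is GATK's VariantFiltration tool. A minimal sketch for SNPs; the file paths are placeholders and the thresholds are GATK's generic starting values, not Selma settings:

    # Hard-filter SNPs instead of running VQSR/ApplyVQSR on a small target region.
    # Paths are hypothetical; thresholds follow GATK's generic hard-filtering
    # guidance and should be tuned per dataset.
    gatk VariantFiltration \
        -R reference.fasta \
        -V raw_snps.vcf.gz \
        --filter-name "QD2"  --filter-expression "QD < 2.0" \
        --filter-name "FS60" --filter-expression "FS > 60.0" \
        --filter-name "MQ40" --filter-expression "MQ < 40.0" \
        -O hard_filtered_snps.vcf.gz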

Inform user of which sample/s failed

If a user runs a large number of samples and one or more of them fail, it is tedious, and potentially difficult, to figure out which ones. An error message at the end of the workflow saying something like "samples 1, 2 and 3 failed" would improve the usability of the workflow.

Another issue with failing samples during multi-sample executions is that there is no message during the run saying that the remaining samples will keep running; this needs to be added too.
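A minimal Bash sketch of such end-of-run reporting, assuming each sample job writes a hypothetical <sample>.exit file containing its exit status (the status-file convention and paths are illustrative, not part of Selma today):

    # Collect per-sample exit codes and report all failures at the end of the run.
    shopt -s nullglob
    failed=()
    for f in status/*.exit; do
        sample=$(basename "$f" .exit)
        [[ "$(cat "$f")" -ne 0 ]] && failed+=("$sample")
    done
    if (( ${#failed[@]} > 0 )); then
        echo "ERROR: the following samples failed: ${failed[*]}" >&2
        echo "The remaining samples continued running to completion." >&2
        exit 1
    fi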

Figure out how gvanno can fit in to the workflow

gvanno is a workflow that does genome variant annotation and would be a very useful addition to Selma. Exactly how it could be added has not been decided. One suggestion is to offer it as an option when Selma is started; another is to recommend it to users in the documentation or in the end-of-execution message that is shown when the workflow finishes successfully. Feel free to make other suggestions.

Squeue is unstable on Colossus

A central component of Selma's status reporting is the squeue command. Unfortunately, the connection to the slurm daemon is not as stable as would be ideal, which causes the squeue command to fail and, in turn, prematurely kills the Selma workflow execution.

This is a fatal issue that has been encountered before. One possible solution is to break the status checking algorithm out into its own script and source it from the main script; that way the squeue command can fail without prematurely stopping the workflow execution.
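A sketch of such a stand-alone status script, assuming it is wired in through Snakemake's --cluster-status option, which calls the script with the slurm job ID and expects it to print success, failed or running (the retry count and sleep interval are arbitrary choices):

    #!/usr/bin/env bash
    # status.sh <jobid> - retry squeue so a transient slurm connection failure
    # does not kill the workflow; fall back to sacct for finished jobs.
    jobid=$1
    state=""
    for attempt in 1 2 3 4 5; do
        state=$(squeue -h -j "$jobid" -o %T 2>/dev/null) && break
        sleep 10
    done
    if [[ -z "$state" ]]; then
        state=$(sacct -n -X -j "$jobid" -o State 2>/dev/null | awk '{print $1}')
    fi
    case "$state" in
        COMPLETED) echo success ;;
        FAILED|CANCELLED*|TIMEOUT|NODE_FAIL|OUT_OF_MEMORY) echo failed ;;
        *) echo running ;;
    esac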

Run one sample per node

The current version runs all samples on a single execute node, which leads to potential issues that would be resolved by running one sample per node.
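A hedged sketch of per-sample submission from the run script; the runOnNode script name comes from the file staging issue below, while the samples array and the sbatch options are illustrative:

    # Submit one sbatch job per sample instead of one job for all samples.
    samples=(sample1 sample2 sample3)   # hypothetical; would come from samples.tsv
    for sample in "${samples[@]}"; do
        sbatch --nodes=1 --job-name="selma-${sample}" runOnNode "$sample"
    done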

TSV file - TSD

A TSV file is needed for the following purposes:

Read group handling

The BAM files need their read groups set properly, both for identification purposes and for proper handling by GATK, which improves the analysis.
The idea is to implement this with a TSV file that Snakemake parses into variables used to build the read group.

Input file paths

The input file paths should be relative to the directory given with -i in start-workflow.sh.
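A Bash sketch of what needs to come out of the TSV: a read group string and fastq paths resolved against -i. The column order (sample, library, platform, flowcell, lane, R1, R2) is an assumption; in the real workflow Snakemake would do the parsing:

    # Build a GATK-friendly @RG string per samples.tsv row and resolve the
    # fastq paths relative to the -i input directory.
    INPUT_DIR=$1    # the directory given with -i
    tail -n +2 samples.tsv |
    while IFS=$'\t' read -r sample library platform flowcell lane fq1 fq2; do
        rg="@RG\tID:${flowcell}.${lane}\tSM:${sample}\tLB:${library}\tPL:${platform}"
        echo "${sample}: rg=${rg} r1=${INPUT_DIR}/${fq1} r2=${INPUT_DIR}/${fq2}"
    done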

Figure out how to handle config options

To simplify configuration of the workflow in general, and slurm in particular, we need to look closer at how to let the user supply custom config files for things like slurm account name, RAM, CPUs, time, etc.
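One hedged approach: a settings file with defaults that start-workflow.sh sources, with command line flags taking precedence. The file name, variable names and flag letters below are all hypothetical:

    # settings.conf (hypothetical), sourced by start-workflow.sh:
    SLURM_ACCOUNT="p172"
    SLURM_MEM="16G"
    SLURM_CPUS=8
    SLURM_TIME="24:00:00"

    # In start-workflow.sh: load the defaults, then let flags override them.
    source settings.conf
    while getopts "a:m:c:t:" opt; do
        case "$opt" in
            a) SLURM_ACCOUNT=$OPTARG ;;
            m) SLURM_MEM=$OPTARG ;;
            c) SLURM_CPUS=$OPTARG ;;
            t) SLURM_TIME=$OPTARG ;;
        esac
    done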

Input folder requirements

Currently the code requires some fastq files to reside in the root of the input folder, while others may be organized into subfolders. We need to support a top-level input folder with multiple subfolders (separating samples, runs and flowcells in a logical way) and no fastq files at the top level.

Select specific sample from TSV file

Given that a user creates one master TSV file, it should be possible to build functionality that allows the user to select specific samples for analysis.
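A minimal sketch, assuming the sample name is in the first column of the master TSV:

    # Keep the header plus only the rows whose first column matches $SAMPLE.
    SAMPLE="sample1"    # hypothetical sample name, e.g. from a new flag
    awk -F'\t' -v s="$SAMPLE" 'NR == 1 || $1 == s' samples.tsv > selected.tsv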

Upgrade GATK to the latest version

Each GATK version comes with new features, speed improvements, bug fixes and quality improvements. We should upgrade GATK to the latest version.

Sbatch script failure handling

If the sbatch script fails, there is a risk that the start-workflow.sh script hangs forever.
There needs to be a trap with a die function that catches failures and runs contingency code to stop start-workflow.sh from running forever, while also printing helpful troubleshooting information.
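A sketch of that trap/die pattern; the messages and cleanup steps are illustrative:

    # die() prints troubleshooting info and stops start-workflow.sh instead of
    # letting it hang after an sbatch failure.
    die() {
        echo "ERROR: $*" >&2
        echo "Check the slurm logs (e.g. sacct -j <jobid>) for details." >&2
        # contingency code: scancel leftover jobs, clean staged files, etc.
        exit 1
    }
    trap 'die "command \"$BASH_COMMAND\" failed on line $LINENO"' ERR

    jobid=$(sbatch --parsable runOnNode) || die "sbatch submission failed"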

Squeue seems to be crashing the workflow

I've run two full-scale three-sample tests that failed after 2 h 55 min and 2 h 27 min respectively. I'm currently running a new test with Bash's built-in debugging enabled to try to capture the error message for better troubleshooting. This is a fatal issue and must be resolved.

Move current command line options to the settings file

Currently a set of flags is required to define the input file directory, the samples.tsv file path, the output path, etc., and all or most of these could live in the settings file.
The idea is to make it possible to set up everything in the settings file, while allowing the user to override the defaults with one or more flags as desired.

TSD start-workflow.sh issues

Hi

I've come across some issues with the start-workflow.sh script on TSD

First, wherever the script sends the fastq files on Colossus has a limit on how much can be uploaded there.

The test files I'm using are about 30 GB each, and the slurm job gives up before finishing the transfer of the R1 file.

Secondly, I've noted that when moving the files to the staging area (and presumably when sending them up to Colossus) the script takes every file in the input folder, not just the ones defined in the sample table TSV. This can make run/upload times needlessly long if there are a lot of fastqs in the input location (see the staging sketch after this message).

All the best

David
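On the second point, a hedged sketch of staging only the fastq files named in the sample table; the R1/R2 column positions and the rsync-based transfer are assumptions:

    # Stage only the fastq files listed in samples.tsv, not the whole -i folder.
    INPUT_DIR=$1    # the directory given with -i
    tail -n +2 samples.tsv | cut -f6,7 | tr '\t' '\n' > files-to-stage.txt
    rsync -a --files-from=files-to-stage.txt "$INPUT_DIR"/ \
        /cluster/projects/p172/staging/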

Starting Selma

One thing I have found is that Selma can only be started from the directory where it is installed, and nothing in the user guide directly states this.

Is this intentional?

If so, it seems a bit cumbersome for the end user that the tool cannot be started from whatever folder you are working in, given that most people would likely be in the folder where they intend the results to be written rather than in the Selma install directory.
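If the restriction is not intentional, a common fix is to have the script resolve its own install directory so it can be launched from anywhere; a minimal sketch:

    # Resolve the directory start-workflow.sh lives in, regardless of the
    # user's current working directory, and cd there before running Snakemake.
    SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
    cd "$SCRIPT_DIR"

Results could then still be written to the user's working directory through the -o option.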

Move most or all command line options to the settings file

The number of arguments to the workflow has grown, and some flags can be moved to the settings file. These flags are less likely to change between executions and can therefore safely be given default values that can be overridden with command line flags.
A definitive list of proposed flags to move will be made in time.

File staging - TSD

The following description has been adapted from the rbFlow-germline issue covering the same feature.

Colossus is not allowed to read or write to /tsd/p172sharedpXX from a compute node, therefore we need this workaround:

1. ./start-workflow.sh (run script) stages the input folder given in -i to the /cluster/projects/p172 area

2. The input data is pulled by the runOnNode sbatch script from /cluster/projects/p172 to the local scratch disk

3. Singularity is started such that the local scratch copies of the input files are used

4. Output files should be generated on local scratch first

5. ./start-workflow.sh deletes the /cluster/projects/p172 copy of the input data before terminating

6. Output files are copied to /cluster/projects/p172/job-$DATE by the sbatch script.

7. The output files are copied from the /cluster/projects/p172 area to the path given with -o on HNAS

8. The copy of the output files in the /cluster/projects/p172 area is deleted
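A condensed sketch of steps 1-8; the variables, the selma.simg image name and the sbatch/singularity options are assumptions, and the real runOnNode script may differ:

    # Hypothetical variables; real values come from the run script's flags.
    INPUT_DIR=$1; OUTPUT_DIR=$2; DATE=$(date +%F); SCRATCH=${SCRATCH:-/tmp/scratch}

    # start-workflow.sh side (steps 1 and 5):
    rsync -a "$INPUT_DIR"/ /cluster/projects/p172/staging/       # 1. stage input
    sbatch --wait runOnNode                                      # block until the job ends
    rm -rf /cluster/projects/p172/staging                        # 5. delete staged input

    # runOnNode sbatch script side (steps 2-4 and 6):
    cp -r /cluster/projects/p172/staging "$SCRATCH"/input        # 2. pull to local scratch
    singularity run -B "$SCRATCH":/data selma.simg               # 3-4. run against scratch
    cp -r "$SCRATCH"/output /cluster/projects/p172/job-"$DATE"   # 6. copy results back

    # start-workflow.sh side again (steps 7 and 8):
    cp -r /cluster/projects/p172/job-"$DATE"/. "$OUTPUT_DIR"/    # 7. copy to -o on HNAS
    rm -rf /cluster/projects/p172/job-"$DATE"                    # 8. clean up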
