
snp_context's Introduction

SNP Context


Summary

Corey Carter, St. Paul, MN - 09 January 2018 for Morrell Lab at the University of Minnesota - Twin Cities
This bash script is designed to pull contextual sequence flanking a SNP (variant) from a reference genome.
The window size for the contextual sequence can be user-specified.
The program checks for the presence of more than one variant within the user-specified window.

Config file

SNP Context can be used with either a config file or the command line, with the config file being the preferred method. Listed below are the config file input parameters:

General input for config file

  • Project name (String): The name of your project. The working directory will hold this name.

  • Input email (String): Only used when running this script on the University of Minnesota's supercomputing cluster (MSI). PBS job updates will be sent to this address.

  • Input file (String): The input .VCF file.

  • referenceGenome (String): The complete path (pwd -P) of your reference genome.

  • Output location (String): The output location of your files. A project directory is created at this location. If no location is specified, then the directory and files will be saved at the same location as the input file.

MSI options for config file

  • MSI Mode (Boolean, true or false): If true, the script automatically submits your job to the PBS job scheduler and lets you save the output text files to MSI S3 tier two storage.

  • PBS parameters (Integer): Allows you to change the time and amount of resources allocated to the PBS job. Follow the link for more info: MSI PBS job submission

Generated PBS job script location

  • A PBS job script is generated in MSI mode; that file is saved in the SNP Context script directory.

Mutation Motif

  • Window Length (Integer, 0-99): Controls the window length expansion around the SNP location.

  • Flanks (Integer, 0-2): Choose between 0 and 2 nucleotides around the SNP location for analysis. The window length must be equal to or greater than the flank size; if the window is smaller than the flanks, the window length (parameter above) defaults to the flank size.

  • Indel maximum amount (Integer, 25-100): The maximum allowed indels in the .FASTA file, beyond the SNP location and 5 base pairs around it. Any indels found within a 2 base pair flanking of the SNP location will be automatically filtered out. Minimum indel threshold is 25%.
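
The interaction between Window Length and Flanks described above can be sketched as follows. This is a minimal illustration of the documented rule, not the code SNP Context actually runs; the function name `effective_window` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of SNP Context): prints the window length
# that would be used after applying the documented rule that the window
# can never be smaller than the flank size.
effective_window() {
    local window=$1
    local flank=$2
    if [ "$window" -lt "$flank" ]; then
        echo "$flank"    # window too small: fall back to the flank size
    else
        echo "$window"   # window already covers the flanks
    fi
}

effective_window 1 2    # prints 2
effective_window 10 2   # prints 10
```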

Saving Options

Section Notes

When using the various saving options, note that only the individual counts tables, combined counts table and the config are saved to Dropbox, S3 or GitHub.

When data is uploaded, any compromising data (usernames, passwords, etc.) is replaced with asterisks ("**").
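
As a rough sketch of what this kind of redaction looks like, the snippet below masks the values of sensitive keys in a stream of `key=value` lines. The `key=value` layout and the `dropboxAuth` key name are assumptions made for illustration; this README does not specify the real config syntax, and this is not the script's own redaction code.

```bash
#!/usr/bin/env bash
# Hypothetical filter (not part of SNP Context): replaces the value of each
# sensitive key with "**". Assumes one key=value pair per line, which may
# not match the real context.config format.
redact_config() {
    sed -E 's/^(gitUser|gitPass|dropboxAuth)=.*/\1=**/'
}

# Sensitive values are masked; other lines pass through unchanged.
printf 'gitUser=alice\ngitPass=secret\nwindow=10\n' | redact_config
```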

  • Save all data (Boolean, true or false): There are many intermediate files created during SNP Context. This parameter allows you to save all intermediate files to your output directory. The Outputs section lists every file that is saved when this parameter is true.

MSI and Amazon S3

  • Save S3 (Integer, 0 - 3): This allows you to save the mutation motif outputs files (counts tables) and config files to S3. The user can choose between these values (0 - 3):

    0. Don't save to any S3 storage (default setting).

    1. Save to MSI's S3 tier two storage.

    2. Save to Amazon's external S3 service. (Needs .aws set up: Setting up amazon S3)

    3. Save to both MSI's S3 storage and Amazon's S3 storage.

DropBox

  • DropBox (Boolean, true or false): The user can save the unique output files (counts tables and config file) to their personal DropBox app. Follow the link to set up a DropBox app: Setting up DropBox app

  • DropBox auth (String): The authentication code, needed for saving Dropbox files from the script. Follow this link for instructions on setting this up. NOTE: this data is redacted when uploaded to S3, Dropbox or Github.

GitHub

  • Github (Boolean, true or false): Do you want to save the output files (counts tables and config file) to a Github repository?

  • gitUser (String) : Your GitHub username. NOTE: this data is redacted when uploaded to S3, Dropbox or Github.

  • gitPass (String) : Your GitHub password. NOTE: this data is redacted when uploaded to S3, Dropbox or Github.

  • Myrepository (String): The name of the repository you will be saving the output files to.

  • owner (String): The owner (their username) of the repository. Leave blank if you are the owner.

Command Line equivalents to the config file (can be given in any order)

Command line parameters allow this script to be plugged into a larger pipeline or program, but the preferred method is the context.config file.

-w = Window Length (Integer, 0-99)

-em = Input email (String)

-i = Input file (String)

-o = Output location (String)

-msi = MSI Mode (Boolean, true or false)

-p = Project name (String)

-f = Flanks (integer, 0-2)

-r = referenceGenome (String, complete path with 'pwd -P' terminal command)

-n = Indel maximum amount (Integer, 25-100)

-ad = Save all data (Boolean, true or false)

-s3 = Save S3 (Integer, 0 - 3)

-dpa = DropBox (Boolean, true or false)

-dpba = DropBox auth (String) REDACTED WHEN UPLOADED

-gh = Github (Boolean, true or false)

-ghu = gitUser (String) REDACTED WHEN UPLOADED

-ghp = gitPass (String) REDACTED WHEN UPLOADED

-gho = owner (String)

-ghr = Myrepository (String)
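
A command-line call using the flags above might look like the sketch below. The paths and project name are placeholders, and the `run_snpcontext` wrapper is a hypothetical convenience that only prints the command when `./SNPcontext` is not in the current directory. The executable name `./SNPcontext` is taken from the issue discussion further down this page.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: runs ./SNPcontext with the given flags if the script
# is present and executable, otherwise prints the command as a dry run.
run_snpcontext() {
    if [ -x ./SNPcontext ]; then
        ./SNPcontext "$@"
    else
        echo "dry run: ./SNPcontext $*"
    fi
}

# Placeholder paths and names; the flags can be given in any order.
run_snpcontext -p my_project \
    -i /path/to/input.vcf \
    -r /path/to/reference.fasta \
    -o /path/to/output \
    -w 10 -f 2 -n 25 -ad false -msi false
```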

Exit Codes

If you're using SNP Context in a larger pipeline or program, the exit codes tell the calling program how the run ended:

  • exit code 0: SNP Context operations have been successfully executed.

  • exit code 1: A general error has occurred; check the error logs in the script execution directory.

  • exit code 2: A PBS job was submitted, MSI mode only.
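
A calling pipeline could branch on these codes roughly as shown below. The function name `describe_exit` is hypothetical, and a subshell stands in for the real SNP Context call so the snippet runs on its own.

```bash
#!/usr/bin/env bash
# Hypothetical helper: maps the documented SNP Context exit codes to messages.
describe_exit() {
    case "$1" in
        0) echo "success" ;;
        1) echo "error: check SNP_CONTEXT_ERROR_LOG.txt in the script directory" ;;
        2) echo "PBS job submitted (MSI mode)" ;;
        *) echo "unexpected exit code: $1" ;;
    esac
}

# Stand-in for the real call; replace "( exit 2 )" with the SNP Context command.
status=0
( exit 2 ) || status=$?
describe_exit "$status"   # prints: PBS job submitted (MSI mode)
```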

Outputs

If the parameter 'all Data' is true, then all intermediate files are saved in the output directory. If 'all Data' is false, then only the individual counts tables, combined counts table and rejected files (.VCF, .BAM, .FASTA) are saved in the output directory.

If 'all Data' is TRUE:

  • Parsed .VCF files
  • Parsed .BAM files
  • Parsed .BED files
  • .FASTA files
  • Consolidated rejected nucleotide positions from .BAM file
  • Consolidated rejected indels .FASTA file
  • Word counts tables
  • Combined counts tables

If 'all Data' is FALSE (default):

  • Consolidated rejected nucleotide positions from .BAM file
  • Consolidated rejected indels .FASTA file
  • Word counts tables
  • Combined counts tables

Misc Info

Dependencies

Error Handling

  • SNP Context catches any errors that occur during runtime and saves them to a file called SNP_CONTEXT_ERROR_LOG.txt, located in the SNP Context script directory.

snp_context's Issues

Indel maximum amount clarification

In README.md:

Indel maximum amount (Integer, 25-100): The maximum allowed indels in the .FASTA file, beyond the SNP location and 5 base pairs around it. Any indels found within a 2 base pair flanking of the SNP location will be automatically filtered out. Minimum indel threshold is 25%.

Could this be reworded so that it is clear that the "Indel maximum amount" is in units of percent, instead of base pairs or something similar? I suspect that you mean "Maximum percent of indel base pairs allowed within the full window size in base pairs." Also, is there a programmatic reason why the minimum is 25%, or a biological one?

No documentation about how to call the program with a config file

The README does not have any documentation describing how to call the program. It only lists the various options that are available in command-line mode.

For example:

To run SNP Context, use the following command:
./SNPcontext

If using a config file, use the following command:
./SNPcontext ~/path/to/your/context.config

A usage message for calling the program with a config file should also be added to the output from running ./SNPcontext with no arguments. Currently it only describes the command-line mode.

Window length maximum size discrepancy

In README.md it states that the maximum allowable window size is 99 base pairs.
However, in line 146 of SNPcontext, the code checks if the window size is above 10, and sets it to 10 if it is.
Additionally, the Zhu et al. 2017 paper uses a window size of 600 (300bp on both sides of the SNP).

We should consider what the maximum window size should be (10, 99, or larger), then implement and document the new value. It should also be clarified whether the window size extends to only one side of the SNP or covers both directions.
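
To make the one-sided versus two-sided question concrete, here is a small sketch of the two readings. It is an illustration only: the function name `window_span` and the mode names `per_side` and `total` are hypothetical, and nothing here reflects how SNPcontext currently computes coordinates.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints "start end" (1-based, inclusive) for a SNP at
# position $1 with window length $2 under one of two interpretations:
#   per_side : the window length is applied on each side of the SNP
#   total    : the window length is the full span, split evenly across sides
window_span() {
    local pos=$1
    local window=$2
    local mode=$3
    local half
    if [ "$mode" = "total" ]; then
        half=$(( window / 2 ))
    else
        half=$window
    fi
    local start=$(( pos - half ))
    if [ "$start" -lt 1 ]; then
        start=1   # do not run off the start of the sequence
    fi
    echo "$start $(( pos + half ))"
}

window_span 1000 300 per_side   # prints: 700 1300
window_span 1000 600 total      # prints: 700 1300
```

Under these definitions, a 600 bp window read as a total span covers the same region as a 300 bp window read as per-side, which is why the documentation needs to state which reading applies.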
