
phylogrok / vcfgenerator


Automated variant calling app for NextGen evolutionary genomics

License: GNU General Public License v3.0


vcfgenerator's Introduction


VCFgenerator

Automated variant calling and Shiny dashboard for NextGen evolutionary genomics.

Environment: Ubuntu 20.04 VM configured with the required software packages described in OmicsVMconfigure (https://github.com/PhyloGrok/OmicsVMconfigure).

Usage

  1. Clone the VCFgenerator repo with git clone https://github.com/PhyloGrok/VCFgenerator into your Ubuntu 20.04 user/home directory.
  2. Change into the controller directory: cd VCFgenerator/Python_Hub
  3. Run the controller: sudo python Controller.py
  4. You will be prompted for 5 inputs used to run the workflow (a consolidated example session is shown below).
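A consolidated example session (a minimal sketch; the five prompted inputs depend on the BioProject and reference genome being analyzed):

  cd ~                                                  # Ubuntu 20.04 user/home directory
  git clone https://github.com/PhyloGrok/VCFgenerator   # clone the repo
  cd VCFgenerator/Python_Hub                            # enter the controller directory
  sudo python Controller.py                             # answer the 5 prompts when asked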

Workflow-Chart

Workflow Description

  1. download.py Data Retrieval - Downloads the reference genome and BioProject-linked SRA files based on user-provided data. Currently works only with Illumina paired-end .fastq files sequenced from genomic DNA under a whole-genome sequencing strategy. Uses the NCBI EDirect, ncbi-datasets, and SRA Toolkit command-line tools.
  2. trimmomatic.py Data QC - Runs Trimmomatic and FastQC on the SRA-derived .fastq files.
  3. variants.py Assembly and Variant Calling - Aligns the .fastq sequences to the reference genome using bwa, then performs variant calling with SAMtools and BCFtools, generating variant call format (.vcf) files as output (a minimal sketch of these commands appears after this list).
  4. annotations.py VCF Annotation - Annotates the .vcf files using SnpEff and the reference genome .gff/.gtf annotation files.
  5. Shiny Dashboard (nonfunctional, in development) - Transfers the annotated .vcf data to a SQLite database, imports it into an R data frame, and plots the results in an R Shiny dashboard with a stacked barplot of mutation types by sample and a circos-style plot showing called variants from multiple genomic BioSamples (up to 5).
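The shell commands wrapped by variants.py follow the standard bwa + SAMtools + BCFtools pattern; a minimal sketch of that core (file names, sample prefix, and calling parameters are illustrative assumptions, not the repository's exact scripts):

  # Index the reference genome (assumed file name ref.fa)
  bwa index ref.fa
  # Align trimmed paired-end reads and sort the alignments
  bwa mem ref.fa SAMPLE_1.trim.fastq.gz SAMPLE_2.trim.fastq.gz > SAMPLE.sam
  samtools view -b SAMPLE.sam | samtools sort -o SAMPLE.sorted.bam
  samtools index SAMPLE.sorted.bam
  # Pile up reads against the reference and call variants into a .vcf file
  bcftools mpileup -f ref.fa SAMPLE.sorted.bam | bcftools call -mv -Ov -o SAMPLE.vcf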

Demonstration Data

  1. NCBI BioProject PRJNA541441 (15 .fastq SRA files), "Iron and Acid Adapted Strains of Halobacterium sp. NRC-1 obtained by Experimental Evolution" - used for initial testing.
  2. NCBI BioProject PRJNA844510 (67 .fastq SRA files), "Halobacterium mutation accumulation lines" - used for testing BGIseq data and scaled-up throughput.

Future Goals

  1. Incorporate a MUMmer branch into the workflow.
  2. Perform Metagenomics and Comparative Genomics.
  3. Publish plots in a Shiny web app as a Science Gateway.

Acknowledgements

  1. The Data Carpentry Genomics Workshop (https://datacarpentry.org/genomics-workshop/) was the original template for the QC, alignment, and variant calling steps. Here we focused on a command-line implementation with user specification and high-throughput automated processing on an Ubuntu-based cloud VM.
  2. Lenski Long-Term E. coli Evolution (LTEE) experiment. The analysis of genomic variants follows the concept of Tenaillon et al. 2016 and other publications and content from the LTEE (https://lenski.mmg.msu.edu/ecoli/genomicsdat.html).
  3. See Citations.md for many additional citations and resources.

Funding

This work used Jetstream2 at Indiana University (IU) through research allocation BIO220099 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.

This work used Jetstream at Indiana University/Texas Advanced Computing Center (IU/TACC) through research startup allocation BIO210100 from the Extreme Science and Engineering Discovery Environment (XSEDE), which was supported by National Science Foundation grant #1548562.

This work used Jetstream at Indiana University/Texas Advanced Computing Center (IU/TACC) through educational allocation MCB200044 from the Extreme Science and Engineering Discovery Environment (XSEDE), which was supported by National Science Foundation grant #1548562.

UMBC Translational Life Science Technology (TLST) student interns Lloyd Jones III, Nhi Luu, and Jan Le were supported by the Merck Data Science Fellowship for Observational Research Program and the UMBC College of Natural and Mathematical Sciences. Lloyd Jones III developed the variant calling workflow framework and workflow integration. Nhi Luu developed the annotation scripts and the R Shiny framework and integration. Jan Le prepared the "Iron and Acid Adaptation" analysis, developed the EDirect scripts, and troubleshot throughout the workflow. Additionally, TLST student Gina Hwang contributed to the MUMmer branch of the workflow.

Citations

Erin Alison Becker, Tracy Teal, François Michonneau, Maneesha Sane, Taylor Reiter, Jason Williams, et al. (2019, June). datacarpentry/genomics-workshop: Data Carpentry: Genomics Workshop Overview, June 2019 (Version v2019.06.1). Zenodo. http://doi.org/10.5281/zenodo.3260309

David Y. Hancock, Jeremy Fischer, John Michael Lowe, Winona Snapp-Childs, Marlon Pierce, Suresh Marru, J. Eric Coulter, Matthew Vaughn, Brian Beck, Nirav Merchant, Edwin Skidmore, and Gwen Jacobs. 2021. “Jetstream2: Accelerating cloud computing via Jetstream.” In Practice and Experience in Advanced Research Computing (PEARC ’21). Association for Computing Machinery, New York, NY, USA, Article 11, 1–8. DOI: https://doi.org/10.1145/3437359.3465565

Stewart, C.A., Cockerill, T.M., Foster, I., Hancock, D., Merchant, N., Skidmore, E., Stanzione, D., Taylor, J., Tuecke, S., Turner, G., Vaughn, M., and Gaffney, N.I., “Jetstream: a self-provisioned, scalable science and engineering cloud environment.” 2015, In Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure. St. Louis, Missouri. ACM: 2792774. p. 1-8. DOI: https://dx.doi.org/10.1145/2792745.2792774

Tenaillon O, Barrick JE, Ribeck N, et al. Tempo and mode of genome evolution in a 50,000-generation experiment. Nature. 2016;536(7615):165-170. doi:10.1038/nature18959

Towns, J, and T Cockerill, M Dahan, I Foster, K Gaither, A Grimshaw, V Hazlewood, S Lathrop, D Lifka, GD Peterson, R Roskies, JR Scott. “XSEDE: Accelerating Scientific Discovery”, Computing in Science & Engineering, vol.16, no. 5, pp. 62-74, Sept.-Oct. 2014, doi:10.1109/MCSE.2014.80

vcfgenerator's People

Contributors

phylogrok, lloydjonesiii, nluu1, critic-ism, ginaah


vcfgenerator's Issues

Annotation specified by user needs to be implemented

Currently:

  • The snpeff_annotate.sh allows user to specify:
    -d: the database
    -i: the input folder containing the files
    -o: the output folder
  • The command looks like this:
./resources/snpeff_annotate.sh -d <database> -i <input folder containing vcf files> -o <output folder>
  • The script and command have been implemented in Lloyd's workflow within annotation.py, but the database is hard-coded into the script (i.e. "HsNRC-1").

  • Eventually, we want the workflow to select the database automatically (creating a new one if it is not available) and implement this in the scripts called by annotation.py.

  • I'm working on implementing the database creation, so I'll get back with an updated shell script; a rough sketch of the intended check follows.
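A sketch of how the database check and creation could work (an assumed approach using standard SnpEff commands; the directory variables mirror the -d/-i/-o options and are hypothetical):

  DB="HsNRC-1"                              # database name normally passed via -d
  # Build the database from the reference annotation if it is not already available
  if ! snpEff databases | grep -qw "$DB"; then
      snpEff build -gff3 -v "$DB"           # assumes data/HsNRC-1/ holds genes.gff and sequences.fa
  fi
  # Annotate every .vcf in the input folder (-i) into the output folder (-o)
  for vcf in "$INPUT_DIR"/*.vcf; do
      snpEff ann -v "$DB" "$vcf" > "$OUTPUT_DIR/$(basename "$vcf" .vcf).ann.vcf"
  done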

Generate a collection of dotplots

Use the genomic data we downloaded today, representative of roughly 60 Shewanella species. Try to run MUMmer on at least 10 of the species, using the S. oneidensis genome as the reference. Make the dotplots in .png format as we were doing today in the meeting, and collect them into a single folder.
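A minimal sketch of a single run (file names are illustrative assumptions; the genomes are assumed to already be local FASTA files):

  # Align one query genome against the S. oneidensis reference and draw a dotplot
  nucmer --prefix=S_frigidimarina S_oneidensis.fa S_frigidimarina.fa
  mummerplot --png --prefix=S_frigidimarina S_frigidimarina.delta
  mv S_frigidimarina.png dotplots/          # collect all dotplots into one folder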

Annotation step for .VCF data

Here we will need to assign each mutation to a gene or intergenic region, and record the type of mutation in the data.

Multi-threading for the workflow steps

We noticed that CPU and RAM are not heavily used during the steps of the workflow. We want to use multi-threading to increase the efficiency of our runs, so that we use more CPU and RAM and the runs take less time. It appears that each individual command has its own multithreading arguments, rather than there being one setting for the entire run.
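For example, the per-command thread options could be wired through like this (a sketch; THREADS and the file names are assumptions, not the workflow's current variables):

  THREADS=8                                                                  # threads to pass to each tool
  bwa mem -t "$THREADS" ref.fa reads_1.fastq.gz reads_2.fastq.gz > aln.sam   # bwa mem: -t
  samtools sort -@ "$THREADS" -o aln.sorted.bam aln.sam                      # samtools: -@
  fastqc -t "$THREADS" *.fastq.gz                                            # fastqc: -t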

Download RefSeq genomes by strain (rather than by ncbi-datasets designated genome)

Here is a functional esearch line that retrieves the reference assembly by Assembly ID. We need one additional preceding line that will get the Assembly ID from a taxonomy ID (e.g. "txid64091").

esearch -db assembly -query GCF_000006805.1 | elink -target nucleotide -name assembly_nuccore_refseq | efetch -format fasta > GCF_000006805.1.fa
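One possible preceding step (an assumption based on standard EDirect usage, not yet part of the scripts):

  # Look up the RefSeq assembly accession for a taxonomy ID (e.g. txid64091)
  ACC=$(esearch -db assembly -query "txid64091[Organism]" |
        esummary |
        xtract -pattern DocumentSummary -element AssemblyAccession |
        head -n 1)
  # Then feed the accession into the existing line
  esearch -db assembly -query "$ACC" | elink -target nucleotide -name assembly_nuccore_refseq | efetch -format fasta > "$ACC".fa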

Mutation rate analysis

In this task, we need to generate a "moving average" of genomic mutation rates by genomic window that can be displayed as a track in the circos plots.
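A sketch of per-window variant counts with bedtools (an assumed approach; the window size and file names are illustrative), which could then be smoothed into the moving average:

  # Count called variants in 10 kb windows; the table can feed a circos track
  samtools faidx ref.fa                              # writes ref.fa.fai with contig lengths
  cut -f1,2 ref.fa.fai > genome.txt                  # contig<TAB>length, as bedtools expects
  bedtools makewindows -g genome.txt -w 10000 > windows.bed
  bedtools coverage -a windows.bed -b sample.vcf -counts > mutations_per_window.txt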

Variant calling workflow - script automation

Basically 3 parts, each part should have its own shell script file:

  1. Retrieving SRA files (.fastq format) using EDirect and SRA-toolkit.
  2. Quality control - trimmomatic and fastqc
  3. Assembly and variant calling - bwa, samtools

trimmomatic error from the script

For some reason, the trimmomatic command errors when it tries to read the first input file. It appears that the script, when referencing file locations from the user directory, creates an incorrect first argument for the trimmomatic command, where only the filepath but no filename is specified. This is a bit confusing because the filepaths are referenced correctly according to the Data Carpentry example. Somehow, trimmomatic is getting an incomplete filepath/filename when the files are referenced from remote directories. A sketch of a possible fix follows the examples below.

  1. Command working correctly outputs the following (tested with the script in the fastq/ directory; note that the first argument includes a correct filename):
     TrimmomaticPE: Started with arguments:
     SRR9025102_1.fastq.gz SRR9025102_2.fastq.gz

  2. Command working incorrectly outputs a FileNotFoundException (tested with the script in the user directory; note that the filepath prefix is duplicated in the first argument):
     Exception in thread "main" java.io.FileNotFoundException: ../../media/volume/sdb/pH2/fastq/../../media/volume/sdb/pH2/fastq/SRR9025118_1.fastq.gz (No such file or directory)
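One possible fix (a hypothetical sketch, not the repository's current script) is to cd into the fastq directory and build the Trimmomatic arguments from basenames, so an absolute path is never prepended twice:

  FASTQ_DIR=/media/volume/sdb/pH2/fastq            # assumed location of the .fastq.gz files
  cd "$FASTQ_DIR"
  for fwd in *_1.fastq.gz; do
      base=$(basename "$fwd" _1.fastq.gz)          # e.g. SRR9025118
      rev="${base}_2.fastq.gz"
      # trimming parameters are illustrative, following the Data Carpentry lesson
      trimmomatic PE -threads 4 "$fwd" "$rev" \
          "${base}_1.trim.fastq.gz" "${base}_1un.trim.fastq.gz" \
          "${base}_2.trim.fastq.gz" "${base}_2un.trim.fastq.gz" \
          SLIDINGWINDOW:4:20 MINLEN:25
  done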

Make the MUMMER4 runs into a .sh executable

Make sure the MUMMER4 commands work for the sample data (1 reference genome, Shewanella oneidensis; 1 query genome, e.g. Shewanella frigidimarina).

Run the same command, but in the .sh file context, to validate that the shell script runs the commands correctly.

Next, make a variable inside the script that can take user input to define the query genome. (user inputs the query species)
Finally, make a for-loop that can take a list of query genomes and run them automatically, with each output having a unique name.
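A sketch of such a script (a hypothetical run_mummer.sh; the reference file name and options are assumptions):

  #!/bin/bash
  # Usage: ./run_mummer.sh query1.fa [query2.fa ...]
  REF=S_oneidensis.fa                      # reference genome (assumed file name)
  for QUERY in "$@"; do                    # loop over user-supplied query genomes
      NAME=$(basename "$QUERY" .fa)        # unique output prefix per query
      nucmer --prefix="$NAME" "$REF" "$QUERY"
      mummerplot --png --prefix="$NAME" "$NAME".delta
  done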

Set up connection to the GitHub repo

We need to start implementing the git clone of our repo (assuming it's structured correctly).

If Repo is public:

Ref: https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository

  • If the repo is public, it's generally easy for the users to clone the repo to their VM using the HTTPS with the following command:
git clone https://github.com/USERNAME/REPOSITORY

If Repo is private:
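If the repo stays private, cloning generally requires authentication; one common option (a generic sketch, not repo-specific) is to clone over SSH after adding a key to the GitHub account:

  # Generate an SSH key on the VM and add the public key to the GitHub account settings,
  # then clone over SSH instead of HTTPS
  ssh-keygen -t ed25519 -C "user@example.com"
  cat ~/.ssh/id_ed25519.pub                # paste this into GitHub -> Settings -> SSH and GPG keys
  git clone git@github.com:USERNAME/REPOSITORY.git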

NCBI Datasets download configuration

  1. Make sure NCBI datasets runs properly from a user directory terminal.
  2. In the script "RetrieveReference/R01_RetrieveRefData.sh", the input variable should be the taxonomic UID, and the output .zip file should be named after that UID. For example, "2214" is the taxonomy ID for Halobacterium salinarum, so the input of the script would be "2214" and the output file would be "2214.zip" (see the sketch below).
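A sketch of the datasets call inside that script (assuming the ncbi-datasets command-line tool; the unzip step is an added assumption):

  TAXID=2214                                        # taxonomy UID supplied by the user
  # Download the reference genome for the taxon and name the archive after the UID
  datasets download genome taxon "$TAXID" --reference --filename "$TAXID".zip
  unzip "$TAXID".zip -d "$TAXID"                    # unpack for downstream steps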
