
phylogrok / vcfgenerator


Automated variant calling app for NextGen evolutionary genomics

License: GNU General Public License v3.0


vcfgenerator's Introduction


VCFgenerator

Automated variant calling and Shiny dashboard for NextGen evolutionary genomics.

Environment: Ubuntu 20.04 VM configured with the required software packages described in OmicsVMconfigure (https://github.com/PhyloGrok/OmicsVMconfigure).

Usage

  1. Clone the VCFgenerator repo with git clone https://github.com/PhyloGrok/VCFgenerator into your Ubuntu 20.04 user/home directory.
  2. Change into the controller directory: cd VCFgenerator/Python_Hub
  3. Run the controller: sudo python Controller.py
  4. You will be prompted for 5 inputs used to run the workflow (a consolidated example session is shown below).
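A consolidated example session (a minimal sketch; the five prompted inputs depend on the BioProject and reference genome being analyzed):

  cd ~                                                  # Ubuntu 20.04 user/home directory
  git clone https://github.com/PhyloGrok/VCFgenerator   # clone the repo
  cd VCFgenerator/Python_Hub                            # enter the controller directory
  sudo python Controller.py                             # answer the 5 prompts when asked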

Workflow-Chart

Workflow Description

  1. download.py Data Retrieval - Downloads the reference genome and BioProject-linked SRA files based on user-provided data. Currently works only with Illumina paired-end .fastq files sequenced from genomic DNA under a whole-genome sequencing strategy. Uses the NCBI EDirect, ncbi-datasets, and SRA Toolkit command-line tools.
  2. trimmomatic.py Data QC - Runs Trimmomatic and FastQC on the SRA-derived .fastq files.
  3. variants.py Assembly and Variant Calling - Aligns the .fastq sequences to the reference genome using bwa, then performs variant calling with SAMtools and BCFtools, generating variant call format (.vcf) files as output (a minimal sketch of these commands appears after this list).
  4. annotations.py VCF Annotation - Annotates the .vcf files using SnpEff and the reference genome .gff/.gtf annotation files.
  5. Shiny Dashboard (nonfunctional, in development) - Transfers the annotated .vcf data to a SQLite database, imports it into an R data frame, and plots the results in an R Shiny dashboard with a stacked barplot of mutation types by sample and a circos-style plot showing called variants from multiple genomic BioSamples (up to 5).
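The shell commands wrapped by variants.py follow the standard bwa + SAMtools + BCFtools pattern; a minimal sketch of that core (file names, sample prefix, and calling parameters are illustrative assumptions, not the repository's exact scripts):

  # Index the reference genome (assumed file name ref.fa)
  bwa index ref.fa
  # Align trimmed paired-end reads and sort the alignments
  bwa mem ref.fa SAMPLE_1.trim.fastq.gz SAMPLE_2.trim.fastq.gz > SAMPLE.sam
  samtools view -b SAMPLE.sam | samtools sort -o SAMPLE.sorted.bam
  samtools index SAMPLE.sorted.bam
  # Pile up reads against the reference and call variants into a .vcf file
  bcftools mpileup -f ref.fa SAMPLE.sorted.bam | bcftools call -mv -Ov -o SAMPLE.vcf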

Demonstration Data

  1. NCBI BioProject PRJNA541441 (15 .fastq SRA files), "Iron and Acid Adapted Strains of Halobacterium sp. NRC-1 obtained by Experimental Evolution" - used for initial testing.
  2. NCBI BioProject PRJNA844510 (67 .fastq SRA files), "Halobacterium mutation accumulation lines" - used for testing BGIseq data and scaled-up throughput.

Future Goals

  1. Incorporate a MUMmer branch into the workflow.
  2. Perform Metagenomics and Comparative Genomics.
  3. Publish plots in a Shiny web app as a Science Gateway.

Acknowledgements

  1. The Data Carpentry Genomics Workshop (https://datacarpentry.org/genomics-workshop/) was the original template for the QC, alignment, and variant calling steps. Here we focused on a command-line implementation with user specification and high-throughput automated processing on an Ubuntu-based cloud VM.
  2. Lenski Long-Term E. coli Evolution (LTEE) experiment. The analysis of genomic variants follows the concept of Tenaillon et al. 2016 and other publications and content from the LTEE (https://lenski.mmg.msu.edu/ecoli/genomicsdat.html).
  3. See Citations.md for many additional citations and resources.

Funding

This work used Jetstream2 at Indiana University (IU) through research allocation BIO220099 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.

This work used Jetstream at Indiana University/Texas Advanced Computing Center (IU/TACC) through research startup allocation BIO210100 from the Extreme Science and Engineering Discovery Environment (XSEDE), which was supported by National Science Foundation grant #1548562.

This work used Jetstream at Indiana University/Texas Advanced Computing Center (IU/TACC) through educational allocation MCB200044 from the Extreme Science and Engineering Discovery Environment (XSEDE), which was supported by National Science Foundation grant #1548562.

UMBC Translational Life Science Technology (TLST) student interns Lloyd Jones III, Nhi Luu, and Jan Le were supported by the Merck Data Science Fellowship for Observational Research Program and the UMBC College of Natural and Mathematical Sciences. Lloyd Jones III developed the variant calling workflow framework and workflow integration. Nhi Luu developed the annotation scripts and the R Shiny framework and integration. Jan Le prepared the "Iron and Acid Adaptation" analysis, developed the EDirect scripts, and troubleshot throughout the workflow. Additionally, TLST student Gina Hwang contributed to the MUMmer branch of the workflow.

Citations

Erin Alison Becker, Tracy Teal, François Michonneau, Maneesha Sane, Taylor Reiter, Jason Williams, et al. (2019, June). datacarpentry/genomics-workshop: Data Carpentry: Genomics Workshop Overview, June 2019 (Version v2019.06.1). Zenodo. http://doi.org/10.5281/zenodo.3260309

David Y. Hancock, Jeremy Fischer, John Michael Lowe, Winona Snapp-Childs, Marlon Pierce, Suresh Marru, J. Eric Coulter, Matthew Vaughn, Brian Beck, Nirav Merchant, Edwin Skidmore, and Gwen Jacobs. 2021. “Jetstream2: Accelerating cloud computing via Jetstream.” In Practice and Experience in Advanced Research Computing (PEARC ’21). Association for Computing Machinery, New York, NY, USA, Article 11, 1–8. DOI: https://doi.org/10.1145/3437359.3465565

Stewart, C.A., Cockerill, T.M., Foster, I., Hancock, D., Merchant, N., Skidmore, E., Stanzione, D., Taylor, J., Tuecke, S., Turner, G., Vaughn, M., and Gaffney, N.I., “Jetstream: a self-provisioned, scalable science and engineering cloud environment.” 2015, In Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure. St. Louis, Missouri. ACM: 2792774. p. 1-8. DOI: https://dx.doi.org/10.1145/2792745.2792774

Tenaillon O, Barrick JE, Ribeck N, et al. Tempo and mode of genome evolution in a 50,000-generation experiment. Nature. 2016;536(7615):165-170. doi:10.1038/nature18959

Towns, J, and T Cockerill, M Dahan, I Foster, K Gaither, A Grimshaw, V Hazlewood, S Lathrop, D Lifka, GD Peterson, R Roskies, JR Scott. “XSEDE: Accelerating Scientific Discovery”, Computing in Science & Engineering, vol.16, no. 5, pp. 62-74, Sept.-Oct. 2014, doi:10.1109/MCSE.2014.80

vcfgenerator's People

Contributors

phylogrok, lloydjonesiii, nluu1, critic-ism, ginaah


vcfgenerator's Issues

Annotation specified by user needs to be implemented

Currently:

  • The snpeff_annotate.sh allows user to specify:
    -d: the database
    -i: the input folder containing the files
    -o: the output folder
  • The command looks like this:
./resources/snpeff_annotate.sh -d <database> -i <input folder containing vcf files> -o <output folder>
  • The script and command have been implemented in Lloyd's workflow within annotation.py, but the database is hard-coded into the script (i.e. "HsNRC-1").

  • Eventually, we want the workflow to select the database automatically (creating a new one if it is not available) and implement this in the scripts called by annotation.py.

  • I'm working on implementing the database creation, so I'll get back with an updated shell script; a rough sketch of the intended check follows.
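A sketch of how the database check and creation could work (an assumed approach using standard SnpEff commands; the directory variables mirror the -d/-i/-o options and are hypothetical):

  DB="HsNRC-1"                              # database name normally passed via -d
  # Build the database from the reference annotation if it is not already available
  if ! snpEff databases | grep -qw "$DB"; then
      snpEff build -gff3 -v "$DB"           # assumes data/HsNRC-1/ holds genes.gff and sequences.fa
  fi
  # Annotate every .vcf in the input folder (-i) into the output folder (-o)
  for vcf in "$INPUT_DIR"/*.vcf; do
      snpEff ann -v "$DB" "$vcf" > "$OUTPUT_DIR/$(basename "$vcf" .vcf).ann.vcf"
  done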

Generate a collection of dotplots

Use the genomic data we downloaded today, representative of roughly 60 Shewanella species. Try to run MUMmer on at least 10 of the species, using the S. oneidensis genome as the reference. Make the dotplots in .png format as we were doing today in the meeting, and collect them into a single folder.
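A minimal sketch of a single run (file names are illustrative assumptions; the genomes are assumed to already be local FASTA files):

  # Align one query genome against the S. oneidensis reference and draw a dotplot
  nucmer --prefix=S_frigidimarina S_oneidensis.fa S_frigidimarina.fa
  mummerplot --png --prefix=S_frigidimarina S_frigidimarina.delta
  mv S_frigidimarina.png dotplots/          # collect all dotplots into one folder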

Annotation step for .VCF data

Here we will need to assign each mutation to a gene or intergenic region, and record the type of mutation in the data.

Multi-threading for the workflow steps

We noticed that CPU and RAM are not heavily used during the steps of the workflow. We want to use multi-threading to increase the efficiency of our runs, so that we use more CPU and RAM and the runs take less time. It appears that each individual command has its own multithreading arguments, rather than there being one setting for the entire run.
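For example, the per-command thread options could be wired through like this (a sketch; THREADS and the file names are assumptions, not the workflow's current variables):

  THREADS=8                                                                  # threads to pass to each tool
  bwa mem -t "$THREADS" ref.fa reads_1.fastq.gz reads_2.fastq.gz > aln.sam   # bwa mem: -t
  samtools sort -@ "$THREADS" -o aln.sorted.bam aln.sam                      # samtools: -@
  fastqc -t "$THREADS" *.fastq.gz                                            # fastqc: -t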

Download RefSeq genomes by strain (rather than by ncbi-datasets designated genome)

Here is a functional esearch line that retrieves the reference assembly by Assembly ID. We need one additional preceding line that will get the Assembly ID from a taxonomy ID (e.g. "txid64091").

esearch -db assembly -query GCF_000006805.1 | elink -target nucleotide -name assembly_nuccore_refseq | efetch -format fasta > GCF_000006805.1.fa
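One possible preceding step (an assumption based on standard EDirect usage, not yet part of the scripts):

  # Look up the RefSeq assembly accession for a taxonomy ID (e.g. txid64091)
  ACC=$(esearch -db assembly -query "txid64091[Organism]" |
        esummary |
        xtract -pattern DocumentSummary -element AssemblyAccession |
        head -n 1)
  # Then feed the accession into the existing line
  esearch -db assembly -query "$ACC" | elink -target nucleotide -name assembly_nuccore_refseq | efetch -format fasta > "$ACC".fa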

Mutation rate analysis

In this task, we need to generate a "moving average" of genomic mutation rates by genomic window that can be displayed as a track in the circos plots.
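A sketch of per-window variant counts with bedtools (an assumed approach; the window size and file names are illustrative), which could then be smoothed into the moving average:

  # Count called variants in 10 kb windows; the table can feed a circos track
  samtools faidx ref.fa                              # writes ref.fa.fai with contig lengths
  cut -f1,2 ref.fa.fai > genome.txt                  # contig<TAB>length, as bedtools expects
  bedtools makewindows -g genome.txt -w 10000 > windows.bed
  bedtools coverage -a windows.bed -b sample.vcf -counts > mutations_per_window.txt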

Variant calling workflow - script automation

Basically 3 parts, each part should have its own shell script file:

  1. Retrieving SRA files (.fastq format) using EDirect and SRA-toolkit.
  2. Quality control - trimmomatic and fastqc
  3. Assembly and variant calling - bwa, samtools

trimmomatic error from the script

For some reason, the trimmomatic command errors when it tries to read the first input file. It appears that the script, when referencing file locations from the user directory, creates an incorrect first argument for the trimmomatic command, where only the filepath but no filename is specified. This is a bit confusing because the filepaths are referenced correctly according to the Data Carpentry example. Somehow, trimmomatic is getting an incomplete filepath/filename when the files are referenced from remote directories. A sketch of a possible fix follows the examples below.

  1. Command working correctly outputs the following (tested with the script in the fastq/ directory; note that the first argument includes a correct filename):
     TrimmomaticPE: Started with arguments:
     SRR9025102_1.fastq.gz SRR9025102_2.fastq.gz

  2. Command working incorrectly outputs a FileNotFoundException (tested with the script in the user directory; note that the filepath prefix is duplicated in the first argument):
     Exception in thread "main" java.io.FileNotFoundException: ../../media/volume/sdb/pH2/fastq/../../media/volume/sdb/pH2/fastq/SRR9025118_1.fastq.gz (No such file or directory)
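One possible fix (a hypothetical sketch, not the repository's current script) is to cd into the fastq directory and build the Trimmomatic arguments from basenames, so an absolute path is never prepended twice:

  FASTQ_DIR=/media/volume/sdb/pH2/fastq            # assumed location of the .fastq.gz files
  cd "$FASTQ_DIR"
  for fwd in *_1.fastq.gz; do
      base=$(basename "$fwd" _1.fastq.gz)          # e.g. SRR9025118
      rev="${base}_2.fastq.gz"
      # trimming parameters are illustrative, following the Data Carpentry lesson
      trimmomatic PE -threads 4 "$fwd" "$rev" \
          "${base}_1.trim.fastq.gz" "${base}_1un.trim.fastq.gz" \
          "${base}_2.trim.fastq.gz" "${base}_2un.trim.fastq.gz" \
          SLIDINGWINDOW:4:20 MINLEN:25
  done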

Make the MUMMER4 runs into a .sh executable

Make sure the MUMMER4 commands work for the sample data (1 reference genome, Shewanella oneidensis; 1 query genome, e.g. Shewanella frigidimarina).

Run the same command, but in the .sh file context, to validate that the shell script runs the commands correctly.

Next, make a variable inside the script that can take user input to define the query genome. (user inputs the query species)
Finally, make a for-loop that can take a list of query genomes and run them automatically, with each output having a unique name.
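A sketch of such a script (a hypothetical run_mummer.sh; the reference file name and options are assumptions):

  #!/bin/bash
  # Usage: ./run_mummer.sh query1.fa [query2.fa ...]
  REF=S_oneidensis.fa                      # reference genome (assumed file name)
  for QUERY in "$@"; do                    # loop over user-supplied query genomes
      NAME=$(basename "$QUERY" .fa)        # unique output prefix per query
      nucmer --prefix="$NAME" "$REF" "$QUERY"
      mummerplot --png --prefix="$NAME" "$NAME".delta
  done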

Set up connection to the GitHub repo

We need to start implementing the git clone of our repo (assuming it's structured correctly).

If Repo is public:

Ref: https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository

  • If the repo is public, it's generally easy for the users to clone the repo to their VM using the HTTPS with the following command:
git clone https://github.com/USERNAME/REPOSITORY

If Repo is private:
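If the repo stays private, cloning generally requires authentication; one common option (a generic sketch, not repo-specific) is to clone over SSH after adding a key to the GitHub account:

  # Generate an SSH key on the VM and add the public key to the GitHub account settings,
  # then clone over SSH instead of HTTPS
  ssh-keygen -t ed25519 -C "user@example.com"
  cat ~/.ssh/id_ed25519.pub                # paste this into GitHub -> Settings -> SSH and GPG keys
  git clone git@github.com:USERNAME/REPOSITORY.git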

NCBI Datasets download configuration

  1. Make sure NCBI datasets runs properly from a user directory terminal.
  2. In the script "RetrieveReference/R01_RetrieveRefData.sh", the input variable should be the taxonomic UID, and the output .zip file should be named after that UID. For example, "2214" is the taxonomy ID for Halobacterium salinarum, so the input of the script would be "2214" and the output file would be "2214.zip" (see the sketch below).
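A sketch of the datasets call inside that script (assuming the ncbi-datasets command-line tool; the unzip step is an added assumption):

  TAXID=2214                                        # taxonomy UID supplied by the user
  # Download the reference genome for the taxon and name the archive after the UID
  datasets download genome taxon "$TAXID" --reference --filename "$TAXID".zip
  unzip "$TAXID".zip -d "$TAXID"                    # unpack for downstream steps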
