This repo contains code and data for improving viral annotation. It currently covers the members of the Paramyxoviridae and the Bunyavirales. The overall goal is to create a very low-tech solution for calling viral proteins across entire viral families, and to cover cases where we do not have bespoke species-specific annotations from VIGOR4.
This program is not intended to be used as a de novo protein or ORF discovery tool. It is designed to find proteins that we already know to exist. I will explain more about how it works below.
Aquaparamyxovirus
Arenaviridae
Ferlavirus
Hantaviridae
Henipavirus
Jeilongvirus
Metaavulavirus
Morbillivirus
Nairoviridae
Orthoavulavirus
Orthorubulavirus
Peribunyaviridae
Paraavulavirus
Pararubulavirus
Phenuiviridae
Respirovirus
Unless otherwise stated, the programs described in this repo are written and tested in in perl (v5.38.0).
The script(s) have the following dependencies:
External CPAN perl modules:
JSON::XS
File::Slurp
It also uses gjoseqlib.pm
which is perl module that was written by Gary Olsen at the University of Illinois. It is used for sequence manipulation. You can get the latest version of this module by downloading it from Gary's repo here: https://github.com/TheSEED/seed_gjo/
The program(s) run the blast suite of tools from NCBI. The current version requires:
blastn: 2.13.0+
tblastn: 2.13.0+
I have not tested it on other versions of BLAST
For internal ANL users, source:
/vol/patric3/cli/ubuntu-cli/user-env.sh
annotate_by_viral_pssm.pl
the perl script that runs the blasts and calls the proteins.
annotate_by_viral_pssm-GTO.pl
this perl script runs annotate_by_viral_pssm.pl and creates a GTO as output. Note that its help options are slightly different.
It is run in the following way: annotate_by_viral_pssm-GTO.pl -x [file_prefix] -i Input.gto -o Output.gto
with other options in the help menu.
Viral_PSSM.json
the file containing BLAST and ORF calling parameters per protein.
Viral-Rep-Contigs
the directory of representative contigs that guides the program to the closest set of PSSMs.
Viral-PSSMs
the directory of hand curated PSSMS per family or genus. There may be more than one PSSM per protein.
Viral-Alignments
the directory of alignments that corresponds to each PSSM. This is not used by the program, but it is useful for keeping track of the source data used to build each PSSM.
Other-Scripts
is a directory other non-essential but useful scipts and files related to the development and management of this tool. It currently contains a program called, list_annos_from_pssms.pl
which will dump the annotation for each pssm.
annotate_by_viral_pssm.pl [options] -i subject_contig(s).fasta
Options include:
-h help
-i input subject contigs in fasta format
-t declare a temp file (d = random)
-tax declare a taxonomy id (D = 11158 )
-g Genome name (D = Paramyxoviridae);
-k Keep internal stop codons (D = off) if you think that your genome will have stops
within the PSSM, but still want to make a call over that region creating a pseudo gene.
-min minimum contig length (d = 1000) # otherwise the genome is rejected
-max maximum contig length (d = 25000) # for reference Measles is 15894 and beilong is 19,212
-opt Options file in JSON format which carries data for match (D = /home/jjdavis/Viral_PSSM.json)
-l Representative contigs directory (D = /home/jjdavis/bin/Viral-Rep-Contigs)
-p Base directory of PSSMs (D = /home/jjdavis/bin/Viral-PSSMs)
Note that this is set up as a directory of pssms
right now this is hardcoded as: "virus".pssms within this directory.
I plan to eventually change -t and -g to be something more intelligent.
Hard-coded locations currently exist as the defaults for -opt, -l, and -p. Since that is annoying, you
might want to run something like perl -i -pe 's/\/home\/jjdavis\/bin/the path to your bin/g' annotate_by_viral_pssm.pl
, or you could edit lines 70-72 by hand (but note that these are the line numbers at the time I wrote this).
There is also a set of debugging parameters that I use frequently:
-tmp keep temp dir
-no no output files generated, to be used in conjunction with one of the following:
-dna print only genes to STDOUT
-aa print proteins to STDOUT
-tbl print only feature table to STDOUT
-ctbl [file name] concatenate table results to a file (for use with many genomes)
The code is currently designed to work on Paramyxoviridae and most Bunyavirales viruses, although I plan to add more. As depicted in the image below, it first performs a BLASTn against a small set of representative genomes for each genus. Then it sorts the results by bit score and chooses the best match.
For each genus, there is a directory of PSSMs corresponding to each known protein for that genus. The PSSMs are derived from a set of hand curated alignments. In the next step, it cycles through each directory of PSSMs (there may be more than one PSSM per protein), choosing the best tBLASTn match per pssm.
Note that it assumes your genome will have the same set of proteins as the nearest match. This is why it is not intended to be used as a discovery tool. In the event that a new protein is found, a new PSSM must be added to the PSSM directory.
Finally it performs any special rules on the proteins/ORFs. These rules are currently encoded in a JSON file called Viral_PSSM.json
. The following is a description fo the current JSON strucutre.
The perl code reads the JSON file into a hashref, which has this general structure: options->{virus}->{protein}->{option} = value
. So, for the F protein shown below, the options hash will have a bit score cutoff of 100 (fairly relaxed for a pssm), and a coverage cutoff of 65%. Upstream extension is turned on. This functionality extends the ORF upstream to find the closest Met (assuming it doesn't start with an AUG already). Downstream extension is also turned on. This searches for a stop codon in-frame after the PSSM match. These can be turned off by editing the JSON file and setting their values to zero.
"Metaavula": {
"F": {
"bit_cutoff": 100,
"coverage_cutoff": 0.65,
"upstream_ext": 1,
"downstream_ext": 1
},
Another simple rule, which is not shown in the example is start_to_met
: when this is set to 1, the first amino acid matching the PSSM is manually converted to Met.
There are several cases where the rules are more complex. For example, In the Paramyxoviridae, there is a phenomenon called RNA editing, where additional nucleotides can be inserted into transcripts of the phosphoprotein, which cause the RNA-polymerase to jump frames, and translation is continued in a new frame. This means that two separate blast matches are required to capture the resulting protein. The functionality is encoded using the parameters paramyxo_join
, join_partner
, new_anno
, and optionally paramyxo_insert
. paramyxo_join
is set to 1 if it is the first blast match of the pair and 2 if it is the second blast match of the pair. If paramyxo_join
is set to 1, then the parameter join_partner
must be used to denote the other pssm match that it is paired with. In the case below, P is joined with V-ORF2 and W-ORF2 to to merge their ORFs and make a full-length V protein and full-length W protein, respectively. On the second join partner, there is a parameter called new_anno
, which carries the annotation for the newly merged protein (the annotations are typically stored in the pssm title field, but this could be changed). paramyxo_insert
is a parameter that I have been tinkering with to correct the transcirpt, so that we end up with the right amino acid sequence after the merge. Notice also that the upstream_ext
parameter is set to zero in the case of V-ORF2. In this case we do not want it searching up stream for a Met codon. Note that this phospho protein region is still a work in progress. Currently, I have tested this extensively and I have the gene boundaries debuged. In all cases that I am aware of it provides the merged amino acid product from the correct frame. However, I still have several instances where the amino acid at the editing site is incorrect. I also need to split the Respiroviruses
Here is a snippet to describe the overall structure:
"Metaavula": {
"F": {
"bit_cutoff": 100,
"coverage_cutoff": 0.65,
"upstream_ext": 1,
"downstream_ext": 1
},
"P": {
"bit_cutoff": 100,
"coverage_cutoff": 0.65,
"upstream_ext": 1,
"downstream_ext": 1,
"paramyxo_join": 1,
"join_partner": ["V-ORF2", "W-ORF2"]
},
"V-ORF2": {
"bit_cutoff": 100,
"coverage_cutoff": 0.65,
"upstream_ext": 0,
"downstream_ext": 1,
"paramyxo_join": 2,
"new_anno": "V Protein"
The next and final set of rules relate to calling features that are not based on PSSMs. In this case, if you know the location of a feature, and it can be established in reference to the start or stop position of any of the pssm-based matches, then it can be called in the genome. Here is an example. In this case for the Arenaviridae there is a feature called the large segement stemloop. It ocurrs between the stop position of Z protein and the stop position of L protein. First, for L and Z we hadd a new key called non_pssm_partner
which tells the program that one of the coordinates of that protein are important for calling the "Large Segment Stemloop." Then in the object for the Large Segment Stemloop there are several new fields. anno
provides the annotation for the feature. translate
is a Boolean for whether you want the feature translated or not. min_len
and max_len
dictate the size cutoffs for calling the feature. Next, begin
tells it which pssm is the beginning coordinate with begin_pssm
, begin_pssm_loc
in this case is the "STOP" position of "Z". Finally, begin_offset
is 1 indicating that you want to start on the nucleotide after the stop. End
is essentially the mirror image. If either of these had started or stoped on the "START" position then you would simply say "START" for end or begin_pssm_loc.
"Arenaviridae": {
"L": {
"bit_cutoff": 100,
"coverage_cutoff": 0.65,
"upstream_ext": 1,
"downstream_ext": 1,
"non_pssm_partner": ["Large Segment Stemloop"]
},
"Z": {
"bit_cutoff": 100,
"coverage_cutoff": 0.65,
"upstream_ext": 1,
"downstream_ext": 1,
"non_pssm_partner": ["Large Segment Stemloop"]
},
"Large Segment Stemloop": {
"anno": "Large Segment Intergenic Stem Loop Region",
"translate": 0,
"min_len": 50,
"max_len": 200,
"begin": {
"begin_pssm": "Z",
"begin_pssm_loc": "STOP",
"begin_offset": 1
},
"end": {
"end_pssm": "L",
"end_pssm_loc": "STOP",
"end_offset": 1
}
The following pseudocode offeres an explanation of how the program works:
BEGIN
Import required modules
Define usage instructions
Define command line options and their default values
Parse command line options
If help option is provided
Display usage instructions and exit
If input subject file is not provided
Display error message and exit
Generate random temp file name if not provided
Set default values for optional parameters if not provided
Read options from JSON file and store in options variable
Open input subject file
Read sequences from file using gjoseqlib::read_fasta function
Close input file
Calculate total length of sequences
If total length is not within min_len and max_len
Display error message and exit
Create hash of contigs and track order
For each sequence in sequences
Add sequence to contigH with ID as key and sequence as value
Increment length by length of sequence
Add ID to contig_order array
End loop
Get current working directory
Create temporary directory
Copy subject file to temporary directory
Change working directory to temporary directory
Create blast database using makeblastdb command
Open directory of representative contigs
Get list of representative contigs
Close directory
Initialize variables for best_contig_bit and best_virus_match
For each representative contig
Perform blastn against subject file
Decode blastn output
If match bit is greater than or equal to best_contig_bit
Update best_contig_bit and best_virus_match
End loop
Print best_contig_bit and best_virus_match to STDERR
Open directory of PSSM directories
Get list of PSSM directories
Close directory
For each PSSM directory
Open directory of PSSM files
Get list of PSSM files
Close directory
For each PSSM file
Perform tblastn against subject file using PSSM file
Decode tblastn output
Calculate best matching result based on bit score
End loop
Print best_pssm and best_bit to STDERR
For each result in best_results
Process matching sequence and coordinates
Generate protein sequence
If required, set up Paramyxo Join
If required, set up calling non-pssm features anchored to PSSM coordinates
Add matching sequence as a tuple to all_seqs array
End loop
End loop
If join information exists
Generate tuples for joining ORFs
Add tuples to all_seqs array
End loop
If non-pssm features exist
Generate tuples for non-pssm features anchored to PSSM coordinates
Add tuples to all_seqs array
End loop
Sort all_seqs array by contig and then start position
Initialize variables for prot_seqs, gene_seqs, and count
For each contig in contig_order
Filter features for current contig
Sort features by start position
For each feature in sorted features
Generate unique protein ID
Add protein sequence and gene sequence to respective arrays
Print feature information to TBL file
End loop
End loop
If output files are not suppressed
Print gene sequences and protein sequences to respective output files
End loop
If only DNA output is requested
Print gene sequences to output
End loop
If only amino acid output is requested
Print protein sequences to output
End loop
Change working directory back to base directory
If keep_temp option is not specified
Remove temporary directory