
joseph7e / assign-taxonomy-with-blast


Assign taxonomy with blast, can be used for qiime

Python 100.00%
taxonomy-assignment qiime blast ncbi-taxonomy ncbi-blast ncbi-database

assign-taxonomy-with-blast's Introduction

Assign-Taxonomy-with-BLAST

This program was primarily designed for the taxonomic assignment of amplicon sequence variants (ASVs), but it works with any sequence data in FASTA format. Many output files are generated, several of which can be directly imported into qiime or other common workflows.

Dependencies

Python3 v3.4.3+
modules: Biopython, argparse

Compute ASVs/OTUs (optional)

If your starting data is FASTQ data from an amplicon-based experiment, you will need to compute ASVs/OTUs before running this program. Here are some great options.
dada2 - https://benjjneb.github.io/dada2/tutorial.html
qiime2 -
mothur -

Prepare sequence database and taxonomic lookup

ncbi's nt database - ftp://ftp.ncbi.nlm.nih.gov/blast/db/

To download this database, copy and paste the lines below (it will take a while):
mkdir ncbi_nt_database && cd ncbi_nt_database
wget 'ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt*.gz'
for f in nt*.gz; do tar -xvzf "$f"; done

SILVA ssuRNA sequence and taxonomy database: https://www.arb-silva.de/download/archive/qiime/

Constructing an updated taxonomy database for use with NCBI nt.

Locate and download the latest NCBI taxonomy database.
You will need the names and nodes file to construct the expanded taxonomy lookup.
Link updated December, 2020

mkdir ncbi_taxonomy/ && cd ncbi_taxonomy/
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz
tar xvzf *.tar.gz

Expand the taxonomic lineages into a simplified tsv lookup.

python3 genbank_nodes_and_names_to_taxonomy.py  names.dmp nodes.dmp
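
To illustrate what "expanding" a lineage means, here is a minimal sketch that walks each taxid up through its parents. It is not the code in genbank_nodes_and_names_to_taxonomy.py. It assumes the standard taxdump layout, with fields separated by a tab, a pipe, and a tab, and with tax_id, parent tax_id, and rank as the first three columns of nodes.dmp. The function names are hypothetical.

```python
def parse_dmp_line(line):
    """Split one taxdump line into fields (assumed separator: tab, pipe, tab)."""
    return [field.strip() for field in line.rstrip("\t|\n").split("\t|\t")]


def build_lineages(nodes_lines, names_lines):
    """Map each taxid to a root-to-leaf list of (rank, scientific name) pairs."""
    parent, rank = {}, {}
    for line in nodes_lines:
        fields = parse_dmp_line(line)
        parent[fields[0]] = fields[1]
        rank[fields[0]] = fields[2]

    name = {}
    for line in names_lines:
        fields = parse_dmp_line(line)
        # Keep only the scientific name; synonyms and common names are skipped.
        if fields[3] == "scientific name":
            name[fields[0]] = fields[1]

    lineages = {}
    for taxid in parent:
        path, current = [], taxid
        # The root node is its own parent, which ends the walk.
        while current in parent and parent[current] != current:
            path.append((rank[current], name.get(current, "NA")))
            current = parent[current]
        lineages[taxid] = list(reversed(path))
    return lineages
```

With real files, the two arguments could be the open file handles for nodes.dmp and names.dmp, since both are iterated line by line.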

How It Works

Consensus assignment

Rather than relying on a single blast hit, the program takes the top X blast hits for each sequence (user-defined with --hits_to_consider).
It then computes the classification for each of these hits and determines the taxonomy for the query sequence from the best consensus taxonomy. For example, if all ten blast hits agree on the same taxonomy, it will assign taxonomy down to the species level; if they only agree to the family level, it will stop there.
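
The consensus idea can be sketched as a shared-prefix computation over the lineages of the retained hits. This is a minimal illustration, not the program's actual implementation. The function name and the list-of-lists input are assumptions.

```python
def consensus_taxonomy(lineages):
    """Return the ranks (from the top down) on which every lineage agrees."""
    if not lineages:
        return []
    consensus = []
    # zip stops at the shortest lineage, so ragged inputs are handled safely.
    for names_at_rank in zip(*lineages):
        if len(set(names_at_rank)) != 1:
            break
        consensus.append(names_at_rank[0])
    return consensus


hits = [
    ["Eukaryota", "Nematoda", "Chromadorea", "Rhabditida", "Rhabditidae", "Caenorhabditis", "Caenorhabditis elegans"],
    ["Eukaryota", "Nematoda", "Chromadorea", "Rhabditida", "Rhabditidae", "Pellioditis", "Pellioditis marina"],
]
print(";".join(consensus_taxonomy(hits)))
```

Here the two hits disagree at the genus level, so the consensus stops at the family.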

This aspect has many options, including the maximum number of blast hits to consider and the percent sway from the best blast hit. Blast hits are not all treated the same: if your blast provided 10 hits but only 5 of them are within '--percent_sway' of the best hit, the others will not be considered.
For example, if one blast hit has a percent identity of 99% while the others are only 95%, only the top hit will be considered (unless you set the --percent_sway option to 4.0 or above). The default --percent_sway value is 0.5, which gathers very similar hits and ignores those that are less similar.
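
The sway filter described above could look roughly like the sketch below. The function name and the (subject_id, percent_identity) tuples are hypothetical. Sorting by identity before truncating to the top hits is an assumption about ordering, not a statement about the script.

```python
def filter_by_sway(hits, percent_sway=0.5, hits_to_consider=3):
    """Keep the top hits whose identity is within percent_sway of the best hit.

    hits: list of (subject_id, percent_identity) tuples.
    """
    if not hits:
        return []
    top = sorted(hits, key=lambda hit: hit[1], reverse=True)[:hits_to_consider]
    best_identity = top[0][1]
    return [hit for hit in top if hit[1] >= best_identity - percent_sway]


hits = [("hitA", 99.0), ("hitB", 95.0), ("hitC", 95.0)]
print(filter_by_sway(hits))                    # only hitA survives
print(filter_by_sway(hits, percent_sway=4.0))  # all three hits survive
```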

If you want to only consider the best blast hit, just set --hits_to_consider to '1'.

Best taxonomy based on percent identity

The program provides another level of taxonomic certainty based on the blast percent identity. For example, if the best blast hit is 90%, you probably cannot assign a species to the organism, but you may be able to assign a phylum. Right now there are three levels that you can set: --cutoff_species, --cutoff_family, and --cutoff_phylum. The phylum-level cutoff is also used as the ultimate filter on the percent identity of blast hits.

If you want to only keep sequences that are identified to the species level, just set all cutoffs to '97' or '99'.

If you want to leave it to the consensus taxonomy to decide 'best estimated taxonomy', set all three cutoffs low, maybe '80'.
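
As a rough sketch of how the three cutoffs interact, the function below trims a lineage according to percent identity. The default values match the documented defaults (97, 90, 80). The seven-rank layout, the function name, and the choice to return an empty list below the phylum cutoff are assumptions made for illustration only.

```python
# Assumed seven-level lineage; the real taxonomy file may carry more ranks.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def trim_by_identity(lineage, percent_identity,
                     cutoff_species=97.0, cutoff_family=90.0, cutoff_phylum=80.0):
    """Truncate a lineage to the deepest rank the percent identity supports."""
    if percent_identity >= cutoff_species:
        depth = len(RANKS)
    elif percent_identity >= cutoff_family:
        depth = RANKS.index("family") + 1
    elif percent_identity >= cutoff_phylum:
        depth = RANKS.index("phylum") + 1
    else:
        return []  # below the ultimate cutoff: treat the hit as unusable
    return lineage[:depth]


lineage = ["Eukaryota", "Nematoda", "Chromadorea", "Rhabditida",
           "Rhabditidae", "Caenorhabditis", "Caenorhabditis elegans"]
print(trim_by_identity(lineage, 92.0))  # trimmed to the family level
```

Setting all three cutoffs to 80 in this sketch means any hit at 80% or above keeps its full lineage, which leaves the final depth to the consensus step.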

These threshold values were arbitrarily chosen in line with diverse literature sources and investigations that were mostly based on mock communities of selected taxa (Brown et al., 2015; Holovachov 2016; Leasi et al., 2018).

Note: The consensus taxonomy usually does a pretty good job of weeding out incorrect taxonomy. Setting all these cutoff values to 80 actually provides pretty great taxonomy in a lot of cases, especially if you set a high --hits_to_consider and --percent_sway.

Masking uncultured taxonomy

By default the program will mask blast hits that have a taxonomic assignment of uncultured, or unclassified. These hits will only be considered as a last resort.
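
One way to picture the "last resort" behaviour is the sketch below. The term list, the function name, and the case-insensitive substring match are assumptions. The script's real matching rules may differ.

```python
# Hypothetical list of labels treated as uninformative.
MASKED_TERMS = ("uncultured", "unclassified")


def mask_uninformative(hits):
    """Drop hits with uninformative taxonomy unless nothing else remains.

    hits: list of (subject_id, taxonomy_string) tuples.
    """
    informative = [
        hit for hit in hits
        if not any(term in hit[1].lower() for term in MASKED_TERMS)
    ]
    # Fall back to the masked hits only when every hit is uninformative.
    return informative if informative else hits


hits = [
    ("hit1", "Bacteria;uncultured bacterium"),
    ("hit2", "Bacteria;Proteobacteria;Gammaproteobacteria"),
]
print(mask_uninformative(hits))
```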

Other blast options

You have the option to set the minimum length coverage for the blast hit (defined by 'length of query'/'length of hit'). The default is 0.8. I like to be stringent here to avoid tiny insignificant blast hits.

You already did YOUR OWN BLAST?? We will import that and save you the step. Just make sure the format is right.

use this format --> -outfmt '6 qseqid qlen sseqid pident length qstart qend sstart send evalue bitscore staxids'
If running your own blast command make sure you run it with the out_seqs.fasta file. I have not added an option to input seqs.fna blast directly into qiime yet… but will if requested.
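
Before importing your own results, it can help to check that each line has exactly the twelve tab-separated columns named in the format string above. The parser below is a sketch for that check, not the script's own reader. The function name is hypothetical, and the coverage value (alignment length divided by query length) is one possible reading of the coverage definition.

```python
FIELDS = ["qseqid", "qlen", "sseqid", "pident", "length", "qstart",
          "qend", "sstart", "send", "evalue", "bitscore", "staxids"]
INT_FIELDS = ("qlen", "length", "qstart", "qend", "sstart", "send")
FLOAT_FIELDS = ("pident", "evalue", "bitscore")


def parse_blast_line(line):
    """Parse one tab-separated line of the custom outfmt 6 layout."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(
            "expected %d columns, found %d" % (len(FIELDS), len(values)))
    hit = dict(zip(FIELDS, values))
    for key in INT_FIELDS:
        hit[key] = int(hit[key])
    for key in FLOAT_FIELDS:
        hit[key] = float(hit[key])
    # BLAST can report several taxids for one subject, separated by ';'.
    hit["staxids"] = hit["staxids"].split(";")
    # Assumed coverage definition: aligned length over query length.
    hit["coverage"] = hit["length"] / hit["qlen"]
    return hit


example = "Otu4\t94\tNC_047481.1\t100.000\t94\t1\t94\t52771\t52864\t1.52e-39\t174\t354624"
hit = parse_blast_line(example)
print(hit["sseqid"], hit["pident"], hit["coverage"], hit["staxids"])
```

A line with extra or missing columns raises a ValueError instead of being silently misread.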

USAGE

usage: taxonomy_assignment_BLAST_V2.py [-h] [-v]  
                                       [--cutoff_species CUTOFF_SPECIES]  
                                       [--cutoff_family CUTOFF_FAMILY]  
                                       [--cutoff_phylum CUTOFF_PHYLUM]  
                                       [--length_percentage LENGTH_PERCENTAGE]  
                                       [--length_cutoff LENGTH_CUTOFF]  
                                       [--hits_to_consider HITS_TO_CONSIDER]  
                                       [--percent_sway PERCENT_SWAY]  
                                       [--blast_evalue BLAST_EVALUE]  
                                       [--blast_threads BLAST_THREADS]  
                                       [--blast_flavor BLAST_FLAVOR]  
                                       [--blast_database BLAST_DATABASE]  
                                       [--blast_file BLAST_FILE] [--ncbi_nt]  
                                       [--output_dir OUTPUT_DIR]  
                                       [--config_file CONFIG_FILE]  
                                       sequence_file tax_file  

positional arguments:  
  sequence_file         seqs.fna file from qiime or any multifasta, just make  
                        sure header has unique id with a space  
  tax_file              path to silva or customized blast database taxonomy  
                        file  
  
optional arguments:  
  -h, --help            show this help message and exit  
  -v, --verbose         increase output verbosity (default: False)  
  --cutoff_species CUTOFF_SPECIES  
                        cutoff for finest taxonomic level (default: 97)  
  --cutoff_family CUTOFF_FAMILY  
                        cutoff for family taxonomic level (default: 90)  
  --cutoff_phylum CUTOFF_PHYLUM  
                        cutoff for phylum taxonomic level, also acts as  
                        ultimate cutoff value for blast (default: 80)  
  --length_percentage LENGTH_PERCENTAGE  
                        cutoff for query_hit/length_of_query i.e. query  
                        coverage (default: 0.0)  
  --length_cutoff LENGTH_CUTOFF  
                        primary cutoff for length of hit (default: 0)  
  --hits_to_consider HITS_TO_CONSIDER  
                        number of hits to consider when gathering consensus  
                        taxonomy (default: 3)  
  --percent_sway PERCENT_SWAY  
                        when comparing greater than 1 blast hit, value is the  
                        percent from best hit considered. i.e. if best blast  
                        hit is 99.5 ID, a value of 0.5 will consider  
                        everything 99 and greater when creating the consensus  
                        (default: 0.5)  
  --blast_evalue BLAST_EVALUE  
                        setting for e-value cutoff for blast, must be in form  
                        1e-X (default: 1e-10)  
  --blast_threads BLAST_THREADS  
                        set the number of threads for blast (default: 24)  
  --blast_flavor BLAST_FLAVOR  
                        select blastp, tblastn etc, must have correct database  
                        and query sequence formats (default: blastn)  
  --blast_database BLAST_DATABASE  
                        path tp blast database, be sure to run makeblastdb on  
                        the database first, type 'IGNORE' if precomputed blast  
                        is given (default: None)  
  --blast_file BLAST_FILE  
                        precomputed blast results, MUST BE MY CUSTOMIZED  
                        FORMAT, '6 qseqid qlen sseqid pident length qstart  
                        qend sstart send evalue bitscore' (default: None)  
  --ncbi_nt             REQUIRED flag for use of ncbi nt database (default:False)  
  --output_dir OUTPUT_DIR  
                        output directory name (default: Assigned_Taxonomy)  
  --config_file CONFIG_FILE  
                        NOT YET IMPLEMENTED, an optional parameters file, any  
                        arguments in this file will override other options  
                        (default: None)


assign-taxonomy-with-blast's Issues

error with length percentage cutoff using 0.7 and higher

Hello,

The taxonomy_assignment_BLAST_V2.py script works great for taxonomic assignment using default parameters. However when I use it with the filtering option "length_percentage" with values equal to or > than 0.7 I get this error:

Traceback (most recent call last):
  File "taxonomy_assignment_BLAST_V2.py", line 363, in <module>
    best_level_taxonomy, blast_percent = Assign_Taxonomy(current_query, current_best_hits)
  File "taxonomy_assignment_BLAST_V2.py", line 248, in Assign_Taxonomy
    if max(top_hits) >= args.cutoff_family:
ValueError: max() arg is an empty sequence

For some reason, it works well with values < 0.7.

Once again, thanks heaps for your time in advance.

Error when trying to run script

  File "taxonomy_assignment_BLAST.py", line 196, in <module>
    do_blast(args.sequence_file,args.blast_database,blast_file)
  File "taxonomy_assignment_BLAST.py", line 188, in do_blast
    log_and_print(err)
  File "taxonomy_assignment_BLAST.py", line 76, in log_and_print
    log_file_name_handle.writelines(statement+'\n')
TypeError: can't concat str to bytes

No idea what to do.

taxonomy_assignment_BLAST.py: error: the following arguments are required: tax_file

Hello, thanks for this useful resource!

I am trying to run the taxonomy_assignment_BLAST.py in linux as follows:

python3 taxonomy_assignment_BLAST.py --length_percentage 70 --hits_to_consider 20 --blast_file ncbi_nt --output_dir dada_seqs_ITS2.fa expanded_ncbi_taxonomy.tsv

So the dada_seqs_ITS2.fa file is my sequence_file and the expanded_ncbi_taxonomy.tsv is my tax_file.

I get this error: the following arguments are required: tax_file

So it doesn't appear to be recognising the .tsv? Have I formatted the code wrong somehow?

Many thanks,

Amy

issue with grabbing taxonomy for blast hits using blast input file

Hello! I'm extremely new to bioinformatics/python/github, so I apologize if this is a super easy fix! Thank you in advance for any help/input.

I am trying to assign taxonomy to trnL (plant chloroplast) OTUs with blast results already run and output in the required format (outfmt '6 qseqid qlen sseqid pident length qstart qend sstart send evalue bitscore staxids'). I'll provide a file example of what my blast output looks like below in case that is the issue, but it's a tab-delimited file with the 10 top hits per OTU.

The code I use when running is:
python3 taxonomy_assignment_BLAST.py TLotus.fa ./ncbi_taxonomy/expanded_ncbi_taxonomy.tsv --blast_database IGNORE --blast_file TLotus_taxhits.txt

where TLotus.fa is my file with my OTU sequences (not using since I already have a blast file)
expanded_ncbi_taxonomy.tsv is taxonomy file, built as instructed (preview of what file looks like attached)
TLotus_taxhits.txt is my blast_output with custom formatting

The command runs without error, but the returned taxonomy is the default for when there is no match in the taxonomy file. For example, this is what the output looks like for every OTU:

#BLAST LINE : Otu4 94 NC_047481.1 100 94 1 94 52771 52864 1.52E-39 174 94 354624
#BLAST LINE : Otu4 94 MN308055.1 100 94 1 94 52771 52864 1.52E-39 174 94 354624
#BLAST LINE : Otu4 94 MK105463.1 100 94 1 94 52769 52862 1.52E-39 174 94 3512

ASSIGNING TAXONOMY FOR Otu4 total hits passing initial filters = 3
NC_047481.1 100 --> CAPTURED after percent sway filter
MN308055.1 100 --> CAPTURED after percent sway filter
MK105463.1 100 --> CAPTURED after percent sway filter

Providing consensus taxonomy up to level 14 : tmp6

X100_1 superkingdom;subkingdom;sub_subkingdom;kingdom;tmp1;tmp2;phylum;class;family;genus;species;tmp3;tmp4;tmp5;tmp6
X100_2 superkingdom;subkingdom;sub_subkingdom;kingdom;tmp1;tmp2;phylum;class;family;genus;species;tmp3;tmp4;tmp5;tmp6
X100_3 superkingdom;subkingdom;sub_subkingdom;kingdom;tmp1;tmp2;phylum;class;family;genus;species;tmp3;tmp4;tmp5;tmp6

Taxonomy Assignment for Otu4 = superkingdom:subkingdom:sub_subkingdom:kingdom:tmp1:tmp2:phylum:class:family:genus:species:tmp3:tmp4:tmp5:tmp6

It looks like it's reading my blast file correctly, as it's pulling the accession number and percent ID correctly. When I search the taxonomy file for the taxids in the blast file, they are present with normal taxonomy information.

Is this some sort of basic formatting error? I've tried looking over the python code but have little to no knowledge of python coding and cannot find the issue myself!

ncbi_expanded_taxonomy_EXAMPLE.txt

TLotus_taxhits_EXAMPLE.txt

using script on custom database

Hi Joe, I am trying to use the script to assign taxonomy to some 18S sequences. For this, I have generated my own blast database from a blast search of first 10 hits for each of my sequences and tax file that connects my blast database sequences with their taxonomy.
My tax database has been curated to have 7 levels only k,p,c,o,f,g,s...

I tried to run the script and I get this message:
Traceback (most recent call last):
  File "taxonomy_assignment_BLAST_V2.py", line 340, in <module>
    best_level_taxonomy, blast_percent = Assign_Taxonomy(current_query, current_best_hits)
  File "taxonomy_assignment_BLAST_V2.py", line 288, in Assign_Taxonomy
    s.add(j[i])
IndexError: list index out of range

Could it be that the script is not recognizing the categories in my tax file?

blast_output_custom_format.txt
log_file.txt

remote blast output

I've yet to try the script myself, but curious how i could use it with output from a remote blast.
it looks like the input fasta and database files are compulsory inputs?
