Comments (8)

mtisza1 commented on July 28, 2024

Hi Roli,

Thanks for opening the issue. I think I understand your question.

Let's say you have a putative virus contig where CENOTE_NAME = mycontig1 and END_FEATURE = None. The BLASTP and taxonomy info will be in the file no_end_contigs_with_viral_domain/mycontig1.tax_guide.blastx.out. With a few special exceptions, the first line of this file holds the BLASTP information for the best hit in RefSeq, and the second line holds the hierarchical taxonomy of that hit. (If END_FEATURE = DTR, the file will be DTR_contigs_with_viral_domain/mycontig1.tax_guide.blastx.out.)
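
As a minimal sketch of the lookup described above, the helper below maps a contig name and its END_FEATURE to the expected file path and prints the first two lines. The function names `tax_guide_path` and `show_tax_guide` are hypothetical, not part of Cenote-Taker 2. Only the two END_FEATURE values mentioned here (None and DTR) are handled, and paths are assumed to be relative to the run's output directory.

```shell
# Hypothetical helper: map a contig name and END_FEATURE to its tax_guide file.
# Only the two END_FEATURE values discussed in this thread are handled.
tax_guide_path() {
    case "$2" in
        None) tg_dir=no_end_contigs_with_viral_domain ;;
        DTR)  tg_dir=DTR_contigs_with_viral_domain ;;
        *)    echo "unhandled END_FEATURE: $2" >&2; return 1 ;;
    esac
    printf '%s/%s.tax_guide.blastx.out\n' "$tg_dir" "$1"
}

# Print the best-hit line (line 1) and the taxonomy line (line 2), if the file exists.
show_tax_guide() {
    tg_file=$(tax_guide_path "$1" "$2") || return 1
    [ -f "$tg_file" ] || { echo "missing: $tg_file" >&2; return 1; }
    sed -n '1,2p' "$tg_file"
}
```

For example, `show_tax_guide mycontig1 None`, run from inside the output directory, would print the two lines if the file is present.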

I see how this can be a little confusing, as the file name has "blastx" in it, but I basically just have the script overwrite the "provisional taxonomy" blastx search file that was created earlier.
Some putative virus contigs don't encode genes that are "useful" for taxonomy, e.g. terminase, major capsid protein, RdRp. For these contigs the file keeps the taxonomic information from the "provisional taxonomy" blastx search.

More broadly, virus taxonomy is a really big challenge, and while Cenote-Taker 2 does a pretty good job, it's not perfect or particularly sophisticated. By default, only family-level taxonomy is "guessed". However, if you use the BLASTN settings with the GenBank nt database (e.g. --known_strains blast_knowns --blastn_db /path/to/nt), you can get species-level taxonomy for contigs that are closely related to viruses deposited in GenBank.

Let me know if this was helpful.

Mike

from cenote-taker2.

Roli-Wilhelm commented on July 28, 2024

mtisza1 commented on July 28, 2024

Roli,

Thanks for the follow-up. This seems concerning. I could answer your questions better if you could compress and send me your output directory (and a log file of the run, if you made one). Is this possible?

Sending to my email [email protected] would be easy, or you could include a link to a file upload. I can take a look and make sure I can answer your questions.

Mike

mtisza1 commented on July 28, 2024

Hi Roli,

I got your email in which you sent me your output files. There are 2 things going on here.

  1. Your efetch tool is not working. Since you didn't have a log file for me, I'm not totally sure of the nature of the error. Efetch queries the NCBI server to pull the taxonomy information of a BLAST hit, and this info was not present in your "tax_guide.blastx.out" files. Maybe you know the reason for this? Internet issues?

  2. Indeed, my RNA virus hallmark models failed on 2 of the 3 RdRps! I already have an update to these models on my to-do list. In short, my initial screening of putative hallmark gene HMMs was culling a lot of polyproteins (which means most RdRps) because the HMM of the whole polypeptide hits a lot of non-virus genes. I tried to go back and extract just the RdRp region for many of these, but it looks like I missed some. I am aiming to have an updated version of the HMMs available next week. Sorry about that.
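
For the first point, a rough way to see which contigs are affected is to list the tax_guide files that lack a second (taxonomy) line. This is only a sketch: the function names are hypothetical, and it assumes the two-line layout described earlier in this thread. Note that finding efetch on the PATH does not prove it can reach NCBI.

```shell
# Succeeds if an efetch executable is on the PATH (says nothing about network access).
efetch_on_path() {
    command -v efetch >/dev/null 2>&1
}

# Hypothetical check: succeed only if the file has a non-empty second line,
# which is where the taxonomy lineage is expected.
has_taxonomy_line() {
    [ -n "$(sed -n '2p' "$1" | tr -d '[:space:]')" ]
}

# List the tax_guide files in a directory that lack the taxonomy line.
list_missing_taxonomy() {
    for f in "$1"/*.tax_guide.blastx.out; do
        [ -f "$f" ] || continue
        has_taxonomy_line "$f" || printf '%s\n' "$f"
    done
}
```

Running `list_missing_taxonomy no_end_contigs_with_viral_domain` from the output directory prints one path per file with no taxonomy line, and nothing if every file has one.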

I'll let you know when the new HMMs are live.

Mike

Roli-Wilhelm commented on July 28, 2024

mtisza1 commented on July 28, 2024

OK Roli,

Thanks for checking efetch. My next hypothesis is that the Krona tools are not working, or that your Krona taxonomy database was not set up correctly.

Can you try to run this command in the directory testers/no_end_contigs_with_viral_domain/:

conda activate cenote-taker2_env
ktClassifyBLAST -o testers2.tax_guide.blastx.tab testers2.tax_guide.blastx.out
cat testers2.tax_guide.blastx.tab

It should produce this:

#queryID	taxID	Avg. log e-value
testers2	1811230	-450
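
To check the result without reading the file by eye, the sketch below pulls the taxID for a given query out of the .tab file. The function name `taxid_for_query` is hypothetical, and it assumes the tab-separated layout shown above (queryID, taxID, Avg. log e-value, with a `#` header line).

```shell
# Hypothetical helper: print the taxID assigned to a query in a ktClassifyBLAST .tab file.
# Usage: taxid_for_query FILE QUERY_ID
# Exits non-zero if the query ID is not found.
taxid_for_query() {
    awk -F '\t' -v q="$2" '
        !/^#/ && $1 == q { print $2; found = 1 }
        END { exit !found }
    ' "$1"
}
```

With a file containing exactly the output shown above, `taxid_for_query testers2.tax_guide.blastx.tab testers2` prints `1811230`.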

If ktClassifyBLAST throws an error about the taxonomy database, try this series of commands in a job with 4 or more CPUs (it will take about 20-40 minutes, if I recall):

KRONA_DIR=$( which python | sed 's/bin\/python/opt\/krona/g' )
cd ${KRONA_DIR}
sh updateTaxonomy.sh
cd ${KRONA_DIR}
sh updateAccessions.sh
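
The first command above derives the Krona directory from the location of the active python by swapping `bin/python` for `opt/krona`. The sketch below shows that substitution on an explicit path and checks that both update scripts exist before running them. The function names are hypothetical, and the substitution here is anchored to the end of the path, a small deviation from the original command.

```shell
# Hypothetical helper mirroring the sed substitution above: replace the
# trailing bin/python of an interpreter path with opt/krona.
krona_dir_from_python() {
    printf '%s\n' "$1" | sed 's/bin\/python$/opt\/krona/'
}

# Succeeds only if the directory holds both Krona update scripts.
krona_scripts_present() {
    [ -f "$1/updateTaxonomy.sh" ] && [ -f "$1/updateAccessions.sh" ]
}
```

For instance, with the environment active, `krona_scripts_present "$(krona_dir_from_python "$(which python)")"` returns success only when both scripts are where the commands above expect them.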

Roli-Wilhelm commented on July 28, 2024

mtisza1 commented on July 28, 2024

Hi Roli,

I've updated the Hallmark database to include several additional RdRp models. I was able to find your sequences with this update. Here's what to do:

conda activate cenote-taker2_env
cd Cenote-Taker2
git pull
python update_ct2_databases.py --hmm True

Let me know if you have any more questions. Thanks again for your interest and follow up.

All the best,

Mike
