Comments (8)
This is great :) Thank you very much, Iva. I don't think it is a concern -- this is not a method that will be 100% accurate for many reasons, but it will get very close to the truth (as it is the case now).
from anvio.
Lessons from testing
I've implemented the basic solution described above and have been testing it on SYNTH_META_00000000000000000120
, my synthetic metagenome with 18 Bacteria and 2 Archaea. Unfortunately, it doesn't work as expected in this case.
Currently the solution is to get rid of all hits to genes with hits from multiple HMM sources. In this sample, 455 gene calls have hits to multiple SCG domains, which results in the removal of 925 hits (out of 1954 total, or 47%), leaving 1029 hits to be used for estimating the number of populations. Most of the genes that were removed from consideration had hits to both 'Bacteria_71' and 'Archaea_76', but I also counted ~15 gene calls that had hits to all three domains ('Bacteria_71', 'Protista_83', 'Archaea_76'), and 1 gene call that had hits to both 'Protista_83' and 'Archaea_76'.
Unfortunately, so many genes have multiple domain hits in this case that most of the bacterial and archaeal SCG counts are now 0, which means we predict 0 total genomes from each domain:
{'Protista_83': {'num_genomes': 0, 'domain': 'eukarya'},
'Bacteria_71': {'num_genomes': 0, 'domain': 'bacteria'},
'Archaea_76': {'num_genomes': 0, 'domain': 'archaea'}}
I also tested this on the other 3 metagenomes in the table above, and got similar results.
- For the sample ending in
002
: 493 genes affected, 1008 hits out of 2239 were removed, and 0 total genomes predicted - For the sample ending in
004
: 500 genes affected, 1021 hits out of 2311 removed, and 0 genomes predicted - For the sample ending in
006
: 482 genes affected, 985 hits out of 2181 removed, and 0 genomes predicted
I'm willing to bet that this strategy also ruins the calculation for the metagenomes that had good estimates before. In short, this removal is too drastic, and we need a more nuanced solution.
from anvio.
I thought I understood the solution suggested, but reading these results I'm a bit confused :)
Instead of looking at hits, I thought we were going to look at models. I.e., if a given HMM is used in the SCG collections of both Bacteria and Archaea, then we do not include that model in our calculations of the mode of SCG frequencies. That should solve it without a big issue? :)
from anvio.
Ah, yes, that is a better way to do it :)
I changed the implementation of get_gene_hit_counts_per_hmm_source()
so that it accepts a flag variable called dont_include_models_with_multiple_domain_hits
. If that flag variable is True, the function doesn't count hits to any models for which genes could get hits from two different SCG collections. That is:
- we go through all genes and make a list of the HMM models that each gene has hits to
- if a given gene has a hit to models in more than one HMM source (ie, SCG collection), we take note of which models those are
- later when we go through the HMM hits list, we don't count hits to those models; ie, those models are never placed into the
gene_hit_counts
dictionary that is returned to theget_num_genomes_from_SCG_sources_dict()
function.
When I tested it on the four synthetic metagenomes, I got the following output (the 'Hello there' messages will be self.run.warning()
statements when seen from the command line:
>>> from anvio.hmmops import NumGenomesEstimator
>>> NumGenomesEstimator('SYNTH_META_00000000000000000002-contigs.db').estimates_dict
Hello there from the SequencesForHMMHits.get_gene_hit_counts_per_hmm_source() function. Just so you know, someone asked for SCG HMMs that belong to multiple sources *not* to be counted, and this will result in 60 models to be removed from our counts, more specifically: 4 from Protista_83, 27 from Bacteria_71, 29 from Archaea_76. You can run this program with the `--debug` flag if you want to see a list of the models that we will ignore from each HMM source.
{'Protista_83': {'num_genomes': 0, 'domain': 'eukarya'}, 'Bacteria_71': {'num_genomes': 19, 'domain': 'bacteria'}, 'Archaea_76': {'num_genomes': 1, 'domain': 'archaea'}}
>>> NumGenomesEstimator('SYNTH_META_00000000000000000004-contigs.db').estimates_dict
Hello there from the SequencesForHMMHits.get_gene_hit_counts_per_hmm_source() function. Just so you know, someone asked for SCG HMMs that belong to multiple sources *not* to be counted, and this will result in 62 models to be removed from our counts, more specifically: 4 from Protista_83, 29 from Bacteria_71, 29 from Archaea_76. You can run this program with the `--debug` flag if you want to see a list of the models that we will ignore from each HMM source.
{'Protista_83': {'num_genomes': 0, 'domain': 'eukarya'}, 'Bacteria_71': {'num_genomes': 17, 'domain': 'bacteria'}, 'Archaea_76': {'num_genomes': 2, 'domain': 'archaea'}}
>>> NumGenomesEstimator('SYNTH_META_00000000000000000006-contigs.db').estimates_dict
Hello there from the SequencesForHMMHits.get_gene_hit_counts_per_hmm_source() function. Just so you know, someone asked for SCG HMMs that belong to multiple sources *not* to be counted, and this will result in 60 models to be removed from our counts, more specifically: 4 from Protista_83, 28 from Bacteria_71, 28 from Archaea_76. You can run this program with the `--debug` flag if you want to see a list of the models that we will ignore from each HMM source.
{'Protista_83': {'num_genomes': 0, 'domain': 'eukarya'}, 'Bacteria_71': {'num_genomes': 16, 'domain': 'bacteria'}, 'Archaea_76': {'num_genomes': 3, 'domain': 'archaea'}}
>>> NumGenomesEstimator('SYNTH_META_00000000000000000120-contigs.db').estimates_dict
Hello there from the SequencesForHMMHits.get_gene_hit_counts_per_hmm_source() function. Just so you know, someone asked for SCG HMMs that belong to multiple sources *not* to be counted, and this will result in 59 models to be removed from our counts, more specifically: 4 from Protista_83, 27 from Bacteria_71, 28 from Archaea_76. You can run this program with the `--debug` flag if you want to see a list of the models that we will ignore from each HMM source.
{'Protista_83': {'num_genomes': 0, 'domain': 'eukarya'}, 'Bacteria_71': {'num_genomes': 16, 'domain': 'bacteria'}, 'Archaea_76': {'num_genomes': 2, 'domain': 'archaea'}}
To summarize the results:
name | Bacteria | Archaea | Protista | Total | true # Bacteria | true # Archaea |
---|---|---|---|---|---|---|
SYNTH_META_00000000000000000002 | 19 | 1 | 0 | 20.0 | 19 | 1 |
SYNTH_META_00000000000000000004 | 17 | 2 | 0 | 19.0 | 18 | 2 |
SYNTH_META_00000000000000000006 | 16 | 3 | 0 | 19.0 | 17 | 3 |
SYNTH_META_00000000000000000120 | 16 | 2 | 0 | 18.0 | 18 | 2 |
So we have now solved the overestimation problem, and one of these sample has a correct estimate now while the other 3 underestimate the number of bacteria by 1 (which is not as bad of an issue).
from anvio.
@meren, I have a question regarding how we should update anvi-summarize
and anvi-display-contigs-stats
. The data for anvi-display-contigs-stats
is coming from a call to summarizer.ContigSummarizer.get_summary_dict_for_assembly()
in summarizer.py
, which in turn calls the updated hmmops.py
functions.
Currently, any call to hmmops.get_num_genomes_from_SCG_sources_dict()
uses the new flag variable, so the estimates reflect the new way of doing things:
But the SCG histograms at the top of the page are generated via a direct call to hmmops.get_gene_hit_counts_per_hmm_source()
without the flag variable, so they still show all hits (even to those models that we are ignoring):
I could change the latter function call to use the new flag so that the histograms in anvi-display-contigs-stats
accurately reflect our num genome estimates, BUT because this data is generated via anvi-summarize
, this would force anvi-summarize
to report no SCG hits for the affected models. I don't think this is a good idea. Can I leave the histograms as-is?
from anvio.
Let's leave the histogram as is. Much less headache :)
But I think would've been awesome if we had a short paragraph that describes this workflow somewhere on the docs and a link to that from the anvi-display-contigs-stats
page.
from anvio.
Done :) There is now a section in the anvi-display-contigs-stats
help page that looks like this:
It is linked from both the interactive page (replacing the link to the old blog post) and from the contigs-db section about using the NumGenomesEstimator
class.
Please take a look, and let me know if there is anything to be changed :)
Also, I noticed while updating the docs that the predictions for the Infant Gut dataset have changed. We now predict 10 bacterial genomes in that data, rather than 9. It is because of overlap from the following models:
WARNING
===============================================
Hello there from the SequencesForHMMHits.get_gene_hit_counts_per_hmm_source()
function. Just so you know, someone asked for SCG HMMs that belong to multiple
sources *not* to be counted, and this will result in 70 models to be removed
from our counts, more specifically: 29 from Bacteria_71, 31 from Archaea_76, 10
from Protista_83. You can run this program with the `--debug` flag if you want
to see a list of the models that we will ignore from each HMM source.
* Models to be ignored for source Bacteria_71: Ribosomal_S8, Ribosomal_L13,
Ribosomal_S9, Ribosomal_L3, Ribosomal_S19, Ribosomal_L14, Ribosomal_S15,
Ribosomal_S2, Ribosomal_L4, Ribosomal_L16, SecY, Ribosom_S12_S23,
Ribosomal_S13, Ribosomal_L22, Adenylsucc_synt, Ribosomal_S7, RNA_pol_Rpb6,
Ribosomal_L2, Ribosomal_S11, Ribosomal_L29, Ribosomal_L23, tRNA-synt_1d,
Ribosomal_S17, Ribosomal_L6, eIF-1a, Ribosomal_L1, Ham1p_like, Ribosomal_L27A,
RNA_pol_L
* Models to be ignored for source Archaea_76: Ribosomal_S8, Ribosomal_L13,
Ribosomal_S9, Ribosomal_L3, Ribosomal_S19, Ribosomal_L14, RNA_pol_L_2,
Ribosomal_S15, Ribosomal_S24e, ATP-synt_F, Ribosomal_S2, Ribosomal_S8e,
Ribosomal_L4, ATP-synt_D, Ribosomal_L16, SecY, Ribosom_S12_S23, Ribosomal_S13,
Ribosomal_L22, Diphthamide_syn, Adenylsucc_synt, Ribosomal_S7, RNA_pol_Rpb6,
Ribosomal_S11, Ribosomal_L29, Ribosomal_L23, tRNA-synt_1d, Ribosomal_S17,
Ribosomal_L6, Ribosomal_L1, Ham1p_like
* Models to be ignored for source Protista_83: EPrGT00050000005732,
EPrGT00050000006117, EPrGT00050000005182, EPrGT00050000005852,
EPrGT00050000005482, EPrGT00050000005111, EPrGT00050000006007,
EPrGT00050000006107, EPrGT00300000062292, EPrGT00050000006045
If this is cause for concern, let me know! We could refine the implementation a bit more. One alternative idea that @FlorianTrigodet discussed with me is that instead of throwing away the models, we could keep the best hit (ie, best e-value) to each gene with multi-domain hits.
from anvio.
Yay! I will open a PR, then :)
from anvio.
Related Issues (20)
- [FEATURE REQUEST] Annotate which KOfams were added by our bitscore relaxation heuristic
- [FEATURE REQUEST] A conda package for anvi'o with a minimal installation option HOT 1
- [BUG] `--prodigal-single-mode` breaks metagenomic workflow HOT 5
- [BUG] anvio-cluster-contigs fails with Generate input data
- [BUG] anvi-meta-pan-genome does not allow me to use gene calls not from prodigal HOT 10
- [BUG] Missing USearch in installation instruction and workflow DAG declarations HOT 1
- [BUG] contigs.db has issues when inputing aa_sequence in the external-gene-call file HOT 9
- [BUG] MaxBin2 failing due to one missing coverage HOT 2
- [BUG] anvio installation HOT 15
- [BUG] Pandas dataframe has no attribute 'append' HOT 4
- [BUG] Pandas error in `anvi-get-codon-frequencies` HOT 6
- [BUG] ERROR in running anvi-script-gen_stats_for_single_copy_genes.R HOT 1
- [FEATURE REQUEST] Using external gene clusters for anvi-pan-genome HOT 1
- [FEATURE REQUEST] Case-sensitive search in `anvi-export-locus` HOT 6
- [BUG] The CONTIGS.db file is 0kb and slurm job is still running HOT 2
- Interactive interface, TypeError: Cannot read properties of undefined (reading 'angle') when trying to organize by length HOT 5
- [FEATURE REQUEST] adding contigs database names to deflines of exported genes/proteins fasta HOT 2
- [BUG] cannot import vbgmm in concoct
- [BUG] anvi-pan-genome - diamond BUG
- Running anvio through snakemake, no errors but the job submission stops running HOT 12
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from anvio.