[BUG] 'NumGenomesEstimator' sometimes double-counts some genomes as both Bacteria and Archaea (anvio, closed)

ivagljiva commented on August 30, 2024

meren commented on August 30, 2024

This is great :) Thank you very much, Iva. I don't think it is a concern: this is not a method that will be 100% accurate, for many reasons, but it will get very close to the truth (as is the case now).

ivagljiva commented on August 30, 2024

Lessons from testing

I've implemented the basic solution described above and have been testing it on SYNTH_META_00000000000000000120, my synthetic metagenome with 18 Bacteria and 2 Archaea. Unfortunately, it doesn't work as expected in this case.

Currently the solution is to get rid of all hits to genes with hits from multiple HMM sources. In this sample, 455 gene calls have hits to multiple SCG domains, which results in the removal of 925 hits (out of 1954 total, or 47%), leaving 1029 hits to be used for estimating the number of populations. Most of the genes that were removed from consideration had hits to both 'Bacteria_71' and 'Archaea_76', but I also counted ~15 gene calls that had hits to all three domains ('Bacteria_71', 'Protista_83', 'Archaea_76'), and 1 gene call that had hits to both 'Protista_83' and 'Archaea_76'.
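For readers following along, here is a minimal sketch of this hit-level filter. It is not the anvi'o code: the function name, the `(gene_id, source, model)` tuple layout, and the toy model names are all assumptions made for illustration.

```python
from collections import defaultdict


def drop_hits_from_multi_source_genes(hits):
    """Remove every hit that belongs to a gene with hits from more than one HMM source.

    `hits` is a list of (gene_id, source, model) tuples. This layout is
    hypothetical and is not how anvi'o stores HMM hits.
    """
    sources_per_gene = defaultdict(set)
    for gene_id, source, _model in hits:
        sources_per_gene[gene_id].add(source)

    # Genes whose hits span two or more SCG collections
    ambiguous_genes = {g for g, s in sources_per_gene.items() if len(s) > 1}

    kept = [hit for hit in hits if hit[0] not in ambiguous_genes]
    return kept, ambiguous_genes


# Toy data with made-up model names
hits = [
    (1, "Bacteria_71", "shared_model"),
    (1, "Archaea_76", "shared_model"),
    (2, "Bacteria_71", "bac_only_model"),
    (3, "Archaea_76", "arc_model"),
    (3, "Protista_83", "euk_model"),
]
kept, ambiguous_genes = drop_hits_from_multi_source_genes(hits)
print(f"{len(hits) - len(kept)} of {len(hits)} hits removed")
```

Because every hit of an affected gene is discarded, a single ambiguous gene costs at least two hits, which is why the removal adds up so quickly.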

Unfortunately, so many genes have multiple domain hits in this case that most of the bacterial and archaeal SCG counts are now 0, which means we predict 0 total genomes from each domain:

{'Protista_83': {'num_genomes': 0, 'domain': 'eukarya'}, 
 'Bacteria_71': {'num_genomes': 0, 'domain': 'bacteria'}, 
 'Archaea_76': {'num_genomes': 0, 'domain': 'archaea'}}

I also tested this on the other 3 metagenomes in the table above, and got similar results.

For the sample ending in 002: 493 genes affected, 1008 hits out of 2239 were removed, and 0 total genomes predicted
For the sample ending in 004: 500 genes affected, 1021 hits out of 2311 removed, and 0 genomes predicted
For the sample ending in 006: 482 genes affected, 985 hits out of 2181 removed, and 0 genomes predicted

I'm willing to bet that this strategy also ruins the calculation for the metagenomes that had good estimates before. In short, this removal is too drastic, and we need a more nuanced solution.

meren commented on August 30, 2024

I thought I understood the solution suggested, but reading these results I'm a bit confused :)

Instead of looking at hits, I thought we were going to look at models. I.e., if a given HMM is used in the SCG collections of both Bacteria and Archaea, then we do not include that model in our calculations of the mode of SCG frequencies. That should solve it without a big issue? :)
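A rough sketch of this model-level idea, assuming per-source dictionaries that map model names to hit counts. The helper names, the data layout, and the numbers are hypothetical and only illustrate how shared models can shift the mode.

```python
from collections import Counter


def shared_models(hit_counts_per_source):
    """Return the model names that occur in more than one SCG collection."""
    seen = Counter(
        model
        for counts in hit_counts_per_source.values()
        for model in counts
    )
    return {model for model, n in seen.items() if n > 1}


def mode_of_counts(hit_counts, skip=frozenset()):
    """Most frequent hit count across models, ignoring the models in `skip`."""
    counts = [n for model, n in hit_counts.items() if model not in skip]
    if not counts:
        return 0
    return Counter(counts).most_common(1)[0][0]


# Toy numbers: shared models collect hits from both domains
toy = {
    "Bacteria_71": {"shared_a": 20, "shared_b": 20, "shared_c": 20,
                    "bac_x": 18, "bac_y": 18},
    "Archaea_76": {"shared_a": 20, "shared_b": 20, "shared_c": 20,
                   "arc_x": 2, "arc_y": 2},
}
skip = shared_models(toy)
before = {source: mode_of_counts(counts) for source, counts in toy.items()}
after = {source: mode_of_counts(counts, skip) for source, counts in toy.items()}
print(before, after)
```

In this toy case the shared models pull both modes up to the combined genome count, and dropping them lets each collection report only its own domain.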

ivagljiva commented on August 30, 2024

Ah, yes, that is a better way to do it :)

I changed the implementation of get_gene_hit_counts_per_hmm_source() so that it accepts a flag variable called dont_include_models_with_multiple_domain_hits. If that flag variable is True, the function doesn't count hits to any models for which genes could get hits from two different SCG collections. That is:

we go through all genes and make a list of the HMM models that each gene has hits to
if a given gene has a hit to models in more than one HMM source (i.e., SCG collection), we take note of which models those are
later, when we go through the HMM hits list, we don't count hits to those models; i.e., those models are never placed into the gene_hit_counts dictionary that is returned to the get_num_genomes_from_SCG_sources_dict() function.
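The three steps above can be sketched roughly as follows. This is a standalone illustration, not the actual get_gene_hit_counts_per_hmm_source() implementation: only the flag name comes from this thread, while the function name `count_hits_per_model`, the tuple layout, and the toy data are assumptions.

```python
from collections import defaultdict


def count_hits_per_model(hits, dont_include_models_with_multiple_domain_hits=False):
    """Count hits per model for each source, optionally skipping ambiguous models.

    `hits` is a list of (gene_id, source, model) tuples (hypothetical layout).
    """
    models_to_ignore = defaultdict(set)

    if dont_include_models_with_multiple_domain_hits:
        # Step 1: collect the (source, model) pairs that each gene has hits to
        hits_per_gene = defaultdict(set)
        for gene_id, source, model in hits:
            hits_per_gene[gene_id].add((source, model))

        # Step 2: if a gene has hits in more than one source, note those models
        for gene_hits in hits_per_gene.values():
            if len({source for source, _ in gene_hits}) > 1:
                for source, model in gene_hits:
                    models_to_ignore[source].add(model)

    # Step 3: count hits, never adding the ignored models to the result
    gene_hit_counts = defaultdict(dict)
    for _gene_id, source, model in hits:
        if model in models_to_ignore.get(source, ()):
            continue
        gene_hit_counts[source][model] = gene_hit_counts[source].get(model, 0) + 1

    return dict(gene_hit_counts), dict(models_to_ignore)


hits = [
    (1, "Bacteria_71", "shared_model"),
    (1, "Archaea_76", "shared_model"),
    (2, "Bacteria_71", "shared_model"),
    (3, "Bacteria_71", "bac_only_model"),
    (4, "Archaea_76", "arc_only_model"),
]
all_counts, _ = count_hits_per_model(hits)
filtered_counts, ignored = count_hits_per_model(
    hits, dont_include_models_with_multiple_domain_hits=True
)
```

Note that gene 2 has a hit in only one source, yet that hit is still dropped, because the exclusion works per model rather than per gene.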

When I tested it on the four synthetic metagenomes, I got the following output (the 'Hello there' messages will be self.run.warning() statements when seen from the command line):

>>> from anvio.hmmops import NumGenomesEstimator
>>> NumGenomesEstimator('SYNTH_META_00000000000000000002-contigs.db').estimates_dict
Hello there from the SequencesForHMMHits.get_gene_hit_counts_per_hmm_source() function. Just so you know, someone asked for SCG HMMs that belong to multiple sources *not* to be counted, and this will result in 60 models to be removed from our counts, more specifically: 4 from Protista_83, 27 from Bacteria_71, 29 from Archaea_76. You can run this program with the `--debug` flag if you want to see a list of the models that we will ignore from each HMM source.
{'Protista_83': {'num_genomes': 0, 'domain': 'eukarya'}, 'Bacteria_71': {'num_genomes': 19, 'domain': 'bacteria'}, 'Archaea_76': {'num_genomes': 1, 'domain': 'archaea'}}
>>> NumGenomesEstimator('SYNTH_META_00000000000000000004-contigs.db').estimates_dict
Hello there from the SequencesForHMMHits.get_gene_hit_counts_per_hmm_source() function. Just so you know, someone asked for SCG HMMs that belong to multiple sources *not* to be counted, and this will result in 62 models to be removed from our counts, more specifically: 4 from Protista_83, 29 from Bacteria_71, 29 from Archaea_76. You can run this program with the `--debug` flag if you want to see a list of the models that we will ignore from each HMM source.
{'Protista_83': {'num_genomes': 0, 'domain': 'eukarya'}, 'Bacteria_71': {'num_genomes': 17, 'domain': 'bacteria'}, 'Archaea_76': {'num_genomes': 2, 'domain': 'archaea'}}
>>> NumGenomesEstimator('SYNTH_META_00000000000000000006-contigs.db').estimates_dict
Hello there from the SequencesForHMMHits.get_gene_hit_counts_per_hmm_source() function. Just so you know, someone asked for SCG HMMs that belong to multiple sources *not* to be counted, and this will result in 60 models to be removed from our counts, more specifically: 4 from Protista_83, 28 from Bacteria_71, 28 from Archaea_76. You can run this program with the `--debug` flag if you want to see a list of the models that we will ignore from each HMM source.
{'Protista_83': {'num_genomes': 0, 'domain': 'eukarya'}, 'Bacteria_71': {'num_genomes': 16, 'domain': 'bacteria'}, 'Archaea_76': {'num_genomes': 3, 'domain': 'archaea'}}
>>> NumGenomesEstimator('SYNTH_META_00000000000000000120-contigs.db').estimates_dict
Hello there from the SequencesForHMMHits.get_gene_hit_counts_per_hmm_source() function. Just so you know, someone asked for SCG HMMs that belong to multiple sources *not* to be counted, and this will result in 59 models to be removed from our counts, more specifically: 4 from Protista_83, 27 from Bacteria_71, 28 from Archaea_76. You can run this program with the `--debug` flag if you want to see a list of the models that we will ignore from each HMM source.
{'Protista_83': {'num_genomes': 0, 'domain': 'eukarya'}, 'Bacteria_71': {'num_genomes': 16, 'domain': 'bacteria'}, 'Archaea_76': {'num_genomes': 2, 'domain': 'archaea'}}

To summarize the results:

name	Bacteria	Archaea	Total	true # Bacteria	true # Archaea
SYNTH_META_00000000000000000002	19	1	20.0	19	1
SYNTH_META_00000000000000000004	17	2	19.0	18	2
SYNTH_META_00000000000000000006	16	3	19.0	17	3
SYNTH_META_00000000000000000120	16	2	18.0	18	2

So we have now solved the overestimation problem. One of these samples now has a correct estimate, while the other three underestimate the number of bacteria by 1 or 2 (which is not as bad of an issue).
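To make the per-sample differences explicit, here is a small sketch that recomputes them from the numbers in the table above. The dictionary layout and the variable names are only illustrative.

```python
# (estimated, true) genome counts per domain, copied from the summary table
results = {
    "SYNTH_META_00000000000000000002": {"bacteria": (19, 19), "archaea": (1, 1)},
    "SYNTH_META_00000000000000000004": {"bacteria": (17, 18), "archaea": (2, 2)},
    "SYNTH_META_00000000000000000006": {"bacteria": (16, 17), "archaea": (3, 3)},
    "SYNTH_META_00000000000000000120": {"bacteria": (16, 18), "archaea": (2, 2)},
}

# Negative values mean the estimate is below the true count
errors = {
    sample: {domain: est - true for domain, (est, true) in domains.items()}
    for sample, domains in results.items()
}

for sample, per_domain in errors.items():
    print(sample, per_domain)
```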

ivagljiva commented on August 30, 2024

@meren, I have a question regarding how we should update anvi-summarize and anvi-display-contigs-stats. The data for anvi-display-contigs-stats is coming from a call to summarizer.ContigSummarizer.get_summary_dict_for_assembly() in summarizer.py, which in turn calls the updated hmmops.py functions.

Currently, any call to hmmops.get_num_genomes_from_SCG_sources_dict() uses the new flag variable, so the estimates reflect the new way of doing things:

But the SCG histograms at the top of the page are generated via a direct call to hmmops.get_gene_hit_counts_per_hmm_source() without the flag variable, so they still show all hits (even to those models that we are ignoring):

I could change the latter function call to use the new flag so that the histograms in anvi-display-contigs-stats accurately reflect our num genome estimates, BUT because this data is generated via anvi-summarize, this would force anvi-summarize to report no SCG hits for the affected models. I don't think this is a good idea. Can I leave the histograms as-is?

meren commented on August 30, 2024

Let's leave the histogram as is. Much less headache :)

But I think it would be awesome if we had a short paragraph somewhere in the docs that describes this workflow, and a link to it from the anvi-display-contigs-stats page.

ivagljiva commented on August 30, 2024

Done :) There is now a section in the anvi-display-contigs-stats help page that looks like this:

It is linked from both the interactive page (replacing the link to the old blog post) and from the contigs-db section about using the NumGenomesEstimator class.

Please take a look, and let me know if there is anything to be changed :)

Also, I noticed while updating the docs that the predictions for the Infant Gut dataset have changed. We now predict 10 bacterial genomes in that data, rather than 9. It is because of overlap from the following models:

WARNING
===============================================
Hello there from the SequencesForHMMHits.get_gene_hit_counts_per_hmm_source()
function. Just so you know, someone asked for SCG HMMs that belong to multiple
sources *not* to be counted, and this will result in 70 models to be removed
from our counts, more specifically: 29 from Bacteria_71, 31 from Archaea_76, 10
from Protista_83. You can run this program with the `--debug` flag if you want
to see a list of the models that we will ignore from each HMM source.

* Models to be ignored for source Bacteria_71: Ribosomal_S8, Ribosomal_L13,
  Ribosomal_S9, Ribosomal_L3, Ribosomal_S19, Ribosomal_L14, Ribosomal_S15,
  Ribosomal_S2, Ribosomal_L4, Ribosomal_L16, SecY, Ribosom_S12_S23,
  Ribosomal_S13, Ribosomal_L22, Adenylsucc_synt, Ribosomal_S7, RNA_pol_Rpb6,
  Ribosomal_L2, Ribosomal_S11, Ribosomal_L29, Ribosomal_L23, tRNA-synt_1d,
  Ribosomal_S17, Ribosomal_L6, eIF-1a, Ribosomal_L1, Ham1p_like, Ribosomal_L27A,
  RNA_pol_L
* Models to be ignored for source Archaea_76: Ribosomal_S8, Ribosomal_L13,
  Ribosomal_S9, Ribosomal_L3, Ribosomal_S19, Ribosomal_L14, RNA_pol_L_2,
  Ribosomal_S15, Ribosomal_S24e, ATP-synt_F, Ribosomal_S2, Ribosomal_S8e,
  Ribosomal_L4, ATP-synt_D, Ribosomal_L16, SecY, Ribosom_S12_S23, Ribosomal_S13,
  Ribosomal_L22, Diphthamide_syn, Adenylsucc_synt, Ribosomal_S7, RNA_pol_Rpb6,
  Ribosomal_S11, Ribosomal_L29, Ribosomal_L23, tRNA-synt_1d, Ribosomal_S17,
  Ribosomal_L6, Ribosomal_L1, Ham1p_like
* Models to be ignored for source Protista_83: EPrGT00050000005732,
  EPrGT00050000006117, EPrGT00050000005182, EPrGT00050000005852,
  EPrGT00050000005482, EPrGT00050000005111, EPrGT00050000006007,
  EPrGT00050000006107, EPrGT00300000062292, EPrGT00050000006045

If this is cause for concern, let me know! We could refine the implementation a bit more. One alternative idea that @FlorianTrigodet discussed with me is that instead of throwing away the models, we could keep the best hit (i.e., best e-value) to each gene with multi-domain hits.
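A minimal sketch of what that alternative might look like, assuming each hit carries an e-value and that lower is better. Nothing here is implemented in anvi'o: the function name, the tuple layout, and the e-values are hypothetical.

```python
def keep_best_hit_per_gene(hits):
    """Keep only the hit with the lowest e-value for each gene.

    `hits` is a list of (gene_id, source, model, e_value) tuples
    (hypothetical layout). On ties, the first hit seen is kept.
    """
    best = {}
    for hit in hits:
        gene_id, e_value = hit[0], hit[3]
        if gene_id not in best or e_value < best[gene_id][3]:
            best[gene_id] = hit
    return list(best.values())


# Toy data: gene 1 has hits in two collections, gene 2 in only one
hits = [
    (1, "Bacteria_71", "shared_model", 1e-40),
    (1, "Archaea_76", "shared_model", 1e-25),
    (2, "Archaea_76", "arc_only_model", 1e-30),
]
best_hits = keep_best_hit_per_gene(hits)
```

With this approach, the gene with multi-domain hits still contributes one hit (here to Bacteria_71), so no model is discarded outright.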

ivagljiva commented on August 30, 2024

Yay! I will open a PR, then :)
