Comments (4)

Askarbek-orakov commented on August 23, 2024

Hallo Silas,

I ran GUNC on a collection of MAGs and wanted to find out what the difference is between the two databases, proGenomes and GTDB.

The differences are really only in the number and quality of genomes.

The first thing I saw is that more MAGs fail when using GTDB.

That must be due to the larger number of reference genomes available in GTDB, which helps a genome reach a CSS above the cutoff, i.e. gives more confidence in chimerism calls.

I also checked and saw that more genomes are evaluated at the genus level, which makes sense, as I expect GTDB to have many more genus clusters to evaluate on. But then there are also more genomes evaluated at the kingdom level, which doesn't make sense to me.

What exactly do you mean by "evaluate"? GUNC reports scores based on taxonomic distribution patterns at each level. If you mean the level at which the maximum CSS occurs, then the answer would be that there are more genomes in GTDB at all levels, which increases CSS values regardless of the level. As a result, you also get more genomes labelled as chimeric with the maximum CSS at the kingdom level.
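To make the "level at which the maximum CSS occurs" reading concrete, here is a minimal sketch. The per-level values are made up, and the function name and the 0.45 cutoff are assumptions for illustration, not taken from GUNC's code.

```python
# Taxonomic levels from broadest to most specific.
TAX_LEVELS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def max_css_level(css_by_level, cutoff=0.45):
    """Return (level, css, flagged) for the level with the highest CSS.

    `cutoff` is an assumed threshold; check it against your GUNC version.
    On ties, the broadest level wins because TAX_LEVELS is scanned in order.
    """
    level = max(TAX_LEVELS, key=lambda lvl: css_by_level.get(lvl, 0.0))
    css = css_by_level.get(level, 0.0)
    return level, css, css > cutoff


# Invented CSS values for a single hypothetical genome.
example = {
    "kingdom": 0.52, "phylum": 0.48, "class": 0.31, "order": 0.20,
    "family": 0.12, "genus": 0.05, "species": 0.0,
}

level, css, flagged = max_css_level(example)
print(level, css, flagged)
```

With these invented numbers the kingdom level carries the highest CSS, so that is the level that would be reported for the genome, even though scores exist for every other level too.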

Do you have any explanation? Is the taxonomic placement more complicated?

Hope the above helps. Otherwise, I would be happy to give a more detailed answer if you clarify what you meant by "evaluate".

What do you generally recommend, GTDB or proGenomes?

We believe that proGenomes is cleaner, as we applied harsher filtering to it, while we took GTDB as is. While GTDB has more genomes, it also potentially contains a higher proportion of chimeric genomes, the effect of which is difficult to estimate.

Hope this is helpful!

from gunc.

defleury commented on August 23, 2024

Hi Silas!

I agree with everything that @Askarbek-orakov wrote. I can just add two more points.

First, we did a rather thorough benchmark comparing GTDB and NCBI taxonomies for the original study. In the paper you'll find the results in Figures S4 & S5. Askarbek tried several things:

  • GUNC db sequences, but with GTDB taxonomy (Fig S4)
  • GTDB sequences with GTDB taxonomy (Fig S5)

As you will see from those tests, the performance on the various simulated genomes was really comparable. The biggest drawback of using GTDB is efficiency: the db is larger, so each run takes longer and is more resource-hungry, while the results are not noticeably better. Askarbek has outlined possible reasons for that above.

Second, I can maybe comment on the 'kingdom' level issue you observed. I don't know what type of data you're processing, but the default GUNC db is certainly biased against several archaeal and CPR phyla which are much better represented in the GTDB. So if you expect loads of such genomes, GUNC with default db would give you cautious results (low reference-representation scores, basically signifying that these are outside of GUNC's comfort zone), whereas GTDB may resolve them better. On the flipside, the taxonomy in those particular parts of the tree also tends to be more shaky, so I'd expect more false positive chimerism calls (inflated CSS scores with inflated confidence). But we haven't systematically explored this so far.

SilasK commented on August 23, 2024

Thank you for your answers.

If I understood it correctly, the taxonomic_level indicates the taxonomic level at which the CSS was calculated, doesn't it? That's what I mean by "evaluated". The CSS at the genus level would therefore be more precise/informative than at the kingdom level. Am I right?

As you say, there is the potential for contamination if you took GTDB as is, especially if the CSS is calculated at the genus level. But I don't understand why many genomes could only be evaluated at the kingdom level using GTDB.

defleury commented on August 23, 2024

Hi Silas!

The CSS is calculated at every taxonomic level, accessible via the --detailed_output flag. In the default output, the tax level you see is the one at which CSS went above the threshold. Could you paste an example output (ideally using the --detailed_output flag) where you are surprised by a kingdom-level chimerism call?
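For a side-by-side look at where the two databases disagree, something like the following could tally the reported tax level and pass/fail status per run and list the genomes whose call changes. It is only a sketch: the rows are invented, the helper names are hypothetical, and the column names (`genome`, `taxonomic_level`, `pass.GUNC`) are assumed to match the default GUNC output table.

```python
from collections import Counter


def summarise(rows):
    """Count genomes per (taxonomic_level, pass status) combination."""
    return Counter((r["taxonomic_level"], r["pass.GUNC"]) for r in rows)


def changed_calls(rows_a, rows_b):
    """Return genomes whose level or pass status differs between two runs."""
    by_genome_b = {r["genome"]: r for r in rows_b}
    diffs = []
    for r in rows_a:
        other = by_genome_b.get(r["genome"])
        if other is None:
            continue  # genome missing from the second run
        key_a = (r["taxonomic_level"], r["pass.GUNC"])
        key_b = (other["taxonomic_level"], other["pass.GUNC"])
        if key_a != key_b:
            diffs.append(r["genome"])
    return diffs


# Invented example rows; column names are assumptions, values are made up.
progenomes_rows = [
    {"genome": "MAG_1", "taxonomic_level": "genus", "pass.GUNC": "True"},
    {"genome": "MAG_2", "taxonomic_level": "phylum", "pass.GUNC": "True"},
    {"genome": "MAG_3", "taxonomic_level": "family", "pass.GUNC": "False"},
]
gtdb_rows = [
    {"genome": "MAG_1", "taxonomic_level": "genus", "pass.GUNC": "True"},
    {"genome": "MAG_2", "taxonomic_level": "kingdom", "pass.GUNC": "False"},
    {"genome": "MAG_3", "taxonomic_level": "family", "pass.GUNC": "False"},
]

print(summarise(gtdb_rows))
print(changed_calls(progenomes_rows, gtdb_rows))
```

With real output, each table could be loaded with `csv.DictReader(handle, delimiter="\t")`, assuming the files are tab-separated. The genomes returned by `changed_calls` are the ones worth inspecting with `--detailed_output`.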
