
Comments (4)

Askarbek-orakov commented on August 23, 2024

Dear Taylor,

When simulating chimeric genomes, I encoded the information about each genome in its file name. For example, the first file in type3a.genomes.tar.gz is: type3a.genomes/10/class/type3a_class_0000_1388475.SAMN02325599_1.0_2893517_1121425.SAMN02745218_0.1_2881643_.fa

Split on underscores, the file name contains the following fields for type3a genomes:
type3a - simulation scenario
class - divergence level between chimera sources
0000 - id for genomes with the same simulation parameters
1388475.SAMN02325599 - acceptor genome id
1.0 - acceptor genome portion contribution
2893517 - acceptor genome size in bps
1121425.SAMN02745218 - donor genome id
0.1 - donor genome portion contribution
2881643 - donor genome size in bps
The scheme varies slightly for each simulation type, so please let me know if you need more info on the others.
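The field layout above can be sketched as a small parser. This is only an illustration based on the single example file name in this comment: the function and class names are hypothetical, it assumes genome ids never contain underscores, and it covers type3a only, since the other simulation types use a slightly different layout.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class ChimeraInfo(NamedTuple):
    """Fields encoded in a type3a file name (names here are illustrative)."""
    scenario: str
    divergence_level: str
    replicate_id: str
    acceptor_id: str
    acceptor_portion: float
    acceptor_size_bp: int
    donor_id: str
    donor_portion: float
    donor_size_bp: int


def parse_type3a_filename(path: str) -> ChimeraInfo:
    """Split a type3a genome file name into its underscore-delimited fields."""
    name = PurePosixPath(path).name
    if not name.endswith(".fa"):
        raise ValueError(f"expected a .fa file, got: {name}")
    # Drop the extension and the trailing underscore that precedes it.
    stem = name[: -len(".fa")].rstrip("_")
    fields = stem.split("_")
    if len(fields) != 9:
        # Assumption: exactly nine fields, as in the example above.
        raise ValueError(f"expected 9 fields, got {len(fields)}: {name}")
    return ChimeraInfo(
        scenario=fields[0],
        divergence_level=fields[1],
        replicate_id=fields[2],
        acceptor_id=fields[3],
        acceptor_portion=float(fields[4]),
        acceptor_size_bp=int(fields[5]),
        donor_id=fields[6],
        donor_portion=float(fields[7]),
        donor_size_bp=int(fields[8]),
    )


info = parse_type3a_filename(
    "type3a.genomes/10/class/type3a_class_0000_1388475.SAMN02325599_1.0_"
    "2893517_1121425.SAMN02745218_0.1_2881643_.fa"
)
```

The acceptor and donor ids returned here are the genome ids that can then be looked up in the taxonomy table.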

So, the contributing genome ids can be derived as described above, and their taxonomy is in the attached table, which is for the proGenomes2.0 database. The proGenomes website currently provides a taxonomy table for v2.1, but that one is missing some genomes that were used for the simulations.

And finally, contig headers contain the following fields (example: >1388475.SAMN02325599.KI969747_0-51549):
1388475.SAMN02325599 - genome id
KI969747 - contig id
0-51549 - bp range of the original contig that ended up in the simulated chimeric genome.
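A contig header of this shape could be taken apart along the following lines. This is a sketch under assumptions drawn from the one example above: the genome id is taken to be the first two dot-separated tokens, the range is taken to follow the last underscore, and the names `ContigOrigin` and `parse_contig_header` are hypothetical. Whether the range is 0-based or end-inclusive is not stated here, so the numbers are returned as-is.

```python
from typing import NamedTuple


class ContigOrigin(NamedTuple):
    """Fields encoded in a simulated contig header (names are illustrative)."""
    genome_id: str
    contig_id: str
    start: int
    end: int


def parse_contig_header(header: str) -> ContigOrigin:
    """Parse a header such as '>1388475.SAMN02325599.KI969747_0-51549'."""
    text = header.strip().lstrip(">")
    # The bp range is assumed to follow the last underscore.
    prefix, sep, bp_range = text.rpartition("_")
    if not sep:
        raise ValueError(f"no underscore-delimited range in header: {header}")
    start, dash, end = bp_range.partition("-")
    if not dash:
        raise ValueError(f"range is not of the form start-end: {bp_range}")
    # Assumption: the genome id is the first two dot-separated tokens and
    # everything after that is the contig id.
    parts = prefix.split(".", 2)
    if len(parts) != 3:
        raise ValueError(f"expected genome id and contig id in: {prefix}")
    first, second, contig_id = parts
    return ContigOrigin(f"{first}.{second}", contig_id, int(start), int(end))


origin = parse_contig_header(">1388475.SAMN02325599.KI969747_0-51549")
```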

Cheers,
Askarbek

proGenomes2.genome_taxonomy.csv

from gunc.

defleury commented on August 23, 2024

Hi Chiara!

Thanks for your interest in the tool and data :-)

We didn't realize that the simulated genomes would be useful in themselves, but you're already the second person asking for them. Is there an aspect about the (different types of) simulations that you're particularly interested in, or do you mainly want the genomes with defined levels of contamination and "shorn" MAG-like contig size distribution?

We (that is, @Askarbek-orakov) are working on cleaning that data to release it asap via https://grp-bork.embl-community.io/gunc/datasets . Likewise, we plan to release the Python code that was used to generate them, but that also still requires some work first (mundane stuff like removing hard-coded file paths etc).

So watch this space for updates!

Sebastian

fullama commented on August 23, 2024

Hi,
@Askarbek-orakov has put the data together and we have made it available on https://grp-bork.embl-community.io/gunc/datasets

Let me know if you have any questions/issues!

taylorreiter commented on August 23, 2024

Thank you for making these data sets available! Is there any accompanying metadata besides the fasta headers and folder names (e.g. manifests for the genomes the contigs came from, or the dominant taxonomy in each genome)? If not, can you provide a key for how to interpret the fasta headers to back-infer this information?
