Identify contaminant contigs (gunc issue, closed)

grp-bork commented on August 23, 2024
Identify contaminant contigs

from gunc.

Comments (4)

mw55309 commented on August 23, 2024

Edit: I see there are 4 columns and the fourth is space separated! OK!

Hello!

Thank you so much for doing this!

One small thing - in the output, some lines have 5 columns and others only have four..... e.g. species here:

contig  tax_level       assignment      count_of_genes_assigned
single_ERR2027929.96_k87_3848   kingdom 2 Bacteria      4
single_ERR2027929.96_k87_3848   phylum  1239 Firmicutes 4
single_ERR2027929.96_k87_3848   family  186803 Lachnospiraceae  3
single_ERR2027929.96_k87_3848   family  31979 Clostridiaceae    1
single_ERR2027929.96_k87_3848   genus   841 Roseburia   2
single_ERR2027929.96_k87_3848   genus   1855714 Anaerobium      1
single_ERR2027929.96_k87_3848   genus   1485 Clostridium        1
single_ERR2027929.96_k87_3848   species specI_v3_07704  1
single_ERR2027929.96_k87_3848   species specI_v3_08779  1
single_ERR2027929.96_k87_3848   species specI_v3_11370  1
single_ERR2027929.96_k87_3848   species specI_v3_10000  1

Can this be fixed somehow? :-D

Cheers
Mick
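
For anyone parsing this output, here is a minimal sketch of a row parser that tolerates both row shapes shown above (an ID followed by a name, or an ID alone as in the species rows). It assumes, which the thread does not confirm, that contig names contain no whitespace and that the last field is always the gene count. `TaxRow` and `parse_row` are illustrative names and not part of gunc.

```python
from typing import NamedTuple, Optional


class TaxRow(NamedTuple):
    contig: str
    tax_level: str
    taxid: str            # e.g. "841" or "specI_v3_07704"
    name: Optional[str]   # e.g. "Roseburia"; None when the row carries no name
    count: int


def parse_row(line: str) -> TaxRow:
    """Parse one data row, splitting on any run of whitespace."""
    tokens = line.split()
    if len(tokens) < 4:
        raise ValueError(f"expected at least 4 fields, got: {line!r}")
    contig, tax_level = tokens[0], tokens[1]
    count = int(tokens[-1])          # assumption: last field is the count
    assignment = tokens[2:-1]        # ID alone, or ID followed by a name
    taxid = assignment[0]
    name = " ".join(assignment[1:]) or None
    return TaxRow(contig, tax_level, taxid, name, count)


# Two rows taken from the example above (the tab positions are assumed).
for line in (
    "single_ERR2027929.96_k87_3848\tgenus\t841 Roseburia\t2",
    "single_ERR2027929.96_k87_3848\tspecies\tspecI_v3_07704\t1",
):
    print(parse_row(line))
```

Splitting on arbitrary whitespace and reading the count from the end means the parser does not care whether the assignment field holds one token or several. The header line would need to be skipped before calling `parse_row`, since its last field is not an integer.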


defleury commented on August 23, 2024

Hi Mick!

> I am confused by the output of gunc - I thought it would be able to identify those contigs which do not match with the rest of the genome - can gunc not do that?

We are currently working on this as a feature. Our aim is to have an automated 'chopping away' of problematic contigs, but we want to do it properly and benchmark before letting loose a tool that wreaks havoc with your previous MAGs...

> Or at least I thought it would be able to tell me the taxonomic assignments of each contig so I could make the decision myself - does gunc not do this?

> It certainly looks from the visualisation (https://grp-bork.embl-community.io/gunc/_images/GUNC_PLOT_example.png) that gunc is able to label contigs - can I get those labels out as a text file?

The visualisation module uses a heuristic to limit the number of displayed contigs (to avoid cluttering). But we'll add the option to get flat files with per-contig tax labels – basically, all labels assigned to a given contig at a taxonomic level, plus frequency. This would still require some further parsing to do what you intend, but it's the least biased output we can provide. This will come as a new option to gunc run, called --contig_taxonomy_output. @fullama is working to release this feature asap :-)
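
As an illustration of the "further parsing" mentioned above, here is a rough sketch that takes per-contig label frequencies and lists contigs whose most frequent label at one taxonomic level differs from the most frequent label across the whole genome. This is a naive majority-vote heuristic written for this thread, not GUNC's own method, and it has not been benchmarked. The function names and the toy records (with made-up contig names and counts) are assumptions for demonstration only.

```python
from collections import Counter, defaultdict


def dominant_labels(records, tax_level):
    """Return (genome_label, {contig: top_label}) at one taxonomic level.

    records: iterable of (contig, tax_level, assignment, gene_count) tuples.
    """
    per_contig = defaultdict(Counter)
    for contig, level, assignment, count in records:
        if level == tax_level:
            per_contig[contig][assignment] += count

    # Sum gene counts over all contigs to get the genome-wide picture.
    genome_total = Counter()
    for counts in per_contig.values():
        genome_total.update(counts)
    if not genome_total:
        return None, {}

    genome_label = genome_total.most_common(1)[0][0]
    top = {c: counts.most_common(1)[0][0] for c, counts in per_contig.items()}
    return genome_label, top


def discordant_contigs(records, tax_level):
    """Contigs whose top label disagrees with the genome-wide top label."""
    genome_label, top = dominant_labels(records, tax_level)
    return sorted(c for c, label in top.items() if label != genome_label)


# Toy input: contig names and counts are invented for illustration.
records = [
    ("contig_a", "family", "Lachnospiraceae", 3),
    ("contig_a", "family", "Clostridiaceae", 1),
    ("contig_b", "family", "Lachnospiraceae", 5),
    ("contig_c", "family", "Caldilineaceae", 2),
]
print(discordant_contigs(records, "family"))  # ['contig_c']
```

Ties are broken by whichever label was seen first, and contigs with very few assigned genes will be flagged on thin evidence, so a real workflow would probably want a minimum gene count and a closer look at each flagged contig before removing anything.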


mw55309 commented on August 23, 2024

Thank you!

I guess in the meantime I can parse the diamond file to add taxonomic labels per protein and then use whatever summarisation algorithm I need to get per-contig taxonomies

Cheers
Mick
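
The interim approach described above could be sketched roughly as follows, once each protein has a taxonomic label. How to get those labels from the diamond file depends on the reference database annotations and is not shown; the `gene_labels` mapping here is a hypothetical input. The sketch also assumes Prodigal-style gene IDs of the form `<contig>_<number>`, which may not hold for every pipeline.

```python
from collections import Counter, defaultdict


def contig_of(gene_id: str) -> str:
    """Drop a trailing '_<number>' gene index (assumed naming scheme)."""
    contig, sep, index = gene_id.rpartition("_")
    return contig if sep and index.isdigit() else gene_id


def per_contig_counts(gene_labels):
    """Count how many genes on each contig carry each label."""
    counts = defaultdict(Counter)
    for gene_id, label in gene_labels.items():
        counts[contig_of(gene_id)][label] += 1
    return counts


# Hypothetical per-gene labels; IDs and labels are invented for illustration.
gene_labels = {
    "k87_3848_1": "Roseburia",
    "k87_3848_2": "Roseburia",
    "k87_3848_3": "Clostridium",
    "k87_12_1": "Caldilinea",
}
for contig, counts in per_contig_counts(gene_labels).items():
    print(contig, dict(counts))
```

The resulting per-contig counters can then be fed to whatever summarisation rule is preferred, for example a majority vote or a minimum-fraction threshold.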


fullama commented on August 23, 2024

Hi, I just released GUNC v1.0.4... (available via pip now and conda in a little while when it goes through)

You can now use gunc with the --contig_taxonomy_output option, which hopefully gives you the kind of output you are looking for.

It will output a TSV of the form:

contig tax_level assignment count_of_genes_assigned
k141_21019_1 kingdom 2 Bacteria 1
k141_21019_1 phylum 200795 Chloroflexi 1
k141_21019_1 family 475964 Caldilineaceae 1
k141_21019_1 genus 233191 Caldilinea 1

Any questions just let us know!
