
Scrubby updates about ctgap HOT 7 CLOSED

esteinig avatar esteinig commented on August 20, 2024 1
Scrubby updates

from ctgap.

Comments (7)

esteinig avatar esteinig commented on August 20, 2024 1

@gokeson Thanks for letting me know! Great to hear it's been useful. There are a lot of things that can be improved in terms of runtime and resource management. Without going too deep into the weeds, we found that for our deep short-read data the depletion step can still be quite slow when there is overwhelming host material.

I'd be very curious (if you don't mind communicating publicly about this, otherwise always happy to change to email) - are you trying to retrieve whole genomes, or are these very low abundance sample types you are sequencing? And are you using short or long reads?

We have been building a technically fairly complex clinical diagnostic stack (including interface for interpretation and reporting, host genome analysis if you have consent and a few other nifty things) - it is not quite ready for people to use yet, but it's been used in production on challenging samples with expected low abundance of pathogenic agents from neurological conditions (strongly depending on wet-lab protocols in our experience). Happy to share if that might be useful for you as well - but given the recent push for this across public health labs, you probably have your own system going :)

@ammaraziz yes absolutely! If you remember vaguely from last year, there is something in the works for Cerebro (which includes host indices). I think it's a good suggestion in the interim and a simple downloader with a list of links is probably not too onerous to maintain (the indices are thankfully not as large as taxonomic databases)
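A minimal sketch of what such a downloader with a list of links could look like, assuming a plain name-to-URL manifest with optional SHA-256 checksums. The manifest entry, the URL and the function names are hypothetical placeholders for illustration, not part of Scrubby or Cerebro:

```python
import hashlib
import shutil
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen

# Hypothetical manifest: index name -> (URL, expected SHA-256 or None).
# The URL is a placeholder, not a real download location.
INDEX_LINKS = {
    "example-host-index": ("https://example.org/indices/host.mmi", None),
}


def sha256sum(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file in fixed-size chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_index(name, links, outdir):
    """Download one named index into outdir and verify its checksum if given."""
    url, expected = links[name]
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    target = outdir / Path(urlparse(url).path).name
    if not target.exists():
        # Write to a temporary name first so an interrupted download
        # is never mistaken for a complete index.
        partial = target.with_name(target.name + ".part")
        with urlopen(url) as response, partial.open("wb") as handle:
            shutil.copyfileobj(response, handle)
        partial.replace(target)
    if expected is not None and sha256sum(target) != expected:
        target.unlink()
        raise ValueError(f"checksum mismatch for {name}")
    return target
```

Keeping the links in one manifest means that maintaining the downloader mostly comes down to editing that list when an index moves or is rebuilt.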


ammaraziz avatar ammaraziz commented on August 20, 2024 1

@esteinig I have to confess that @gokeson knows about Cerebro. I spilled the beans about it last year when chatting with him. He was very keen to test it but I didn't mention anything because you weren't ready to share and it was undergoing the big change at the time. I could run him through the installation and usage for Cerebro if I have your blessing. If I remember correctly the metagenomic project is related to this pipeline but not exactly the same.

I think it's worth discussing this outside of this repo, but I'm not opposed to continuing the discussion here. We could have a Zoom meeting to discuss Cerebro, and actually I wanted to pick your brain on the best approach for Chlamydia assembly; there are a few oddities we could use help with.

P.S. Sola (Gokeson) is in QLD so our timezones are very close.


esteinig avatar esteinig commented on August 20, 2024

Michael and Lachlan are publishing an updated version of their human pangenome database for depletion assessment (https://github.com/mbhall88/classification_benchmark, see preprint linked there). I am assessing it for clinical metagenomics data at the moment.

Michael's benchmark shows that, besides the obvious performance of long reads, (simulated) Illumina reads are depleted with high sensitivity and specificity using Kraken2 and the pangenome DB, and that a minimap2 alignment is a great follow-up from that. This is essentially the process that Scrubby follows to speed things up in our high-depth clinical samples. I need to assess this under more realistic conditions for low abundance pathogens, but that may not be relevant to you.
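To make the two-stage idea concrete, here is a rough sketch of the bookkeeping only, not Scrubby's actual implementation. It assumes single-end reads, Kraken2's default per-read output (status, read ID, taxid, length, k-mer map) and minimap2's PAF output (query name in column 1, mapping quality in column 12). The function names are made up for illustration, and only reads assigned directly to the host taxid are counted, not reads assigned to its descendants:

```python
HUMAN_TAXID = "9606"  # NCBI taxonomy ID for Homo sapiens


def kraken2_host_ids(lines, host_taxids=frozenset({HUMAN_TAXID})):
    """Collect read IDs that Kraken2 classified directly as a host taxid."""
    ids = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Columns: C/U status, read ID, taxid, length, LCA k-mer mapping
        if len(fields) >= 3 and fields[0] == "C" and fields[2] in host_taxids:
            ids.add(fields[1])
    return ids


def paf_host_ids(lines, min_mapq=0):
    """Collect read IDs with a host alignment at or above min_mapq."""
    ids = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # PAF has 12 mandatory columns; query name first, MAPQ twelfth
        if len(fields) >= 12 and int(fields[11]) >= min_mapq:
            ids.add(fields[0])
    return ids


def deplete_fastq(lines, host_ids):
    """Yield the lines of 4-line FASTQ records whose read ID is not a host ID."""
    record = []
    for line in lines:
        record.append(line)
        if len(record) == 4:
            read_id = record[0][1:].split()[0]
            if read_id not in host_ids:
                yield from record
            record = []
```

In a real run the alignment stage would only see the reads that survive the k-mer stage, which is where the speed-up comes from; the sketch simply removes the union of both ID sets.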

So... all this to say it looks like the approach has decent performance at least for getting rid of human reads in these simulated conditions ^^


gokeson avatar gokeson commented on August 20, 2024

@esteinig Thank you very much for developing Scrubby. I can tell you that it has greatly improved this pipeline. I was using bbsplit and kneaddata in the past for hg depletion. Neither of the two offers the efficiency that Scrubby offers. The option to extract reads at a set taxonomic level has made my life a lot easier and assembly faster, especially with clinical samples (less so with isolates, unless there's heavy contamination).

Scrubby wishlist:

  • Distribution via BioConda or at least private channel in the meantime

This is going to make CtGAP easier to use for BioConda fans, so I appreciate your help with this. Our aim is to get this pipeline ready for publishing around June (still waiting on one more ref genome to be sequenced and included). So I guess we will have plenty of time to test this pipeline with all Scrubby updates.

  • HPRG reference genome database for depletion (see next comment)

Thank you so much for including this in the next version. It will be mighty useful when we start our comparative analyses of global Chlamydia trachomatis genomes from clinical samples (mostly metagenomes) later in the year.


ammaraziz avatar ammaraziz commented on August 20, 2024

HPRG reference genome database for depletion (see next comment)

Will you distribute the HPRG with Scrubby or include a subcommand for downloading/preprocessing? That'd make it much easier for end users. But at the same time it means supporting the downloading, which can be a huge pain - see the kraken2 repo, whose issue tracker is full of problems with downloading and building the reference. Another option is to have some docs for the end user that specify best practices, e.g. download this reference, run this minimap2 command, then point Scrubby to it.

We could help there if you want!


esteinig avatar esteinig commented on August 20, 2024

Lmao no drama man! It's still not properly validated with clinical data and it's a bit of a construction site. I am a little hesitant to let people try and use it - it's absolutely gonna break for someone else and the database thing is a pain point ^^ I'm more than happy to share when it's usable of course, will let you know ASAP.

It's also very, very much focused on low abundance sample types and short reads (at the moment), simply because we don't have many other datasets for diagnostics. Doing something for ✨ metagenomics ✨, i.e. complex natural communities with diverse stuff hanging out, is not in scope for Cerebro. There's probably a better MAG-related pipeline from the ACE people at UQ.

Yeah agree, we can catch up on Zoom sometime on this! :)


gokeson avatar gokeson commented on August 20, 2024

