
Comments (8)

meren commented on July 26, 2024

Thanks for the reproducible file, @bpeacock44 (it was missing the index file for the BAM, but I generated one).
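
(For anyone who runs into the same missing-index situation, generating one for a coordinate-sorted BAM is a one-liner; the file name here is a placeholder:

samtools index mapping.bam

This writes mapping.bam.bai next to the BAM file.)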

Even though it is not related to the error you're getting, one of the first things I noticed was how big your contigs-db is. So I removed the contigs that were shorter than 1,000 nucleotides:

anvi-script-reformat-fasta contigs.fa -o contigs-filtered.fa --min-len 1000
Input ........................................: contigs.fa
Output .......................................: contigs-filtered.fa

WHAT WAS THERE
===============================================
Total num contigs ............................: 3,603,413
Total num nucleotides ........................: 3,031,559,677

WHAT WAS ASKED
===============================================
Simplify deflines? ...........................: No
Add prefix to sequence names? ................: No
Minimum length of contigs to keep ............: 1,000
Max % gaps allowed ...........................: 100.00%
Max num gaps allowed .........................: 1,000,000
Exclude specific sequences? ..................: No
Keep specific sequences? .....................: No
Enforce sequence type? .......................: No

WHAT HAPPENED
===============================================
Contigs removed ..............................: 2,961,725 (82.19% of all)
Nucleotides removed ..........................: 1,204,932,609 (39.75% of all)
Nucleotides modified .........................: 0 (0.00000% of all)
Deflines simplified ..........................: False

And it turned out that these short contigs make up a very significant fraction of your assembly (over 82% of all contigs, though only about 40% of all nucleotides). I don't think you should use every single contig your assembler reports for anything.

So, coming back to the real error: I ran your command on the anvio-dev branch rather than v7.1, and I was able to reproduce this problem. Anvi'o could perhaps give a less cryptic message, but I think this problem is not related to anvi'o but to the BAM file itself (and it is quite a thorny one to even explain).

The BAM file may be structurally 'OK' as far as samtools quickcheck is concerned, but it is certainly not OK once you start looking at the details. For instance, the error comes from a particular read in the BAM file that maps to a reference representing a 76-nucleotide-long stretch, yet the start / end positions for that alignment are reported as 2233 and 2333 in the BAM file, which implies a 100-nucleotide-long stretch. These data are either stored erroneously in the BAM file (because the mapping software screwed something up) or reported erroneously by samtools (who knows why), and anvi'o is the victim here.
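
If you want to see such inconsistencies yourself, here is a rough sketch (the BAM file name is hypothetical) that flags any alignment whose end coordinate runs past the length its reference is declared to have in the BAM header. The end coordinate is POS plus the reference-consuming CIGAR operations (M/D/N/=/X):

samtools view -h mapping.bam | awk '
  /^@SQ/ {
    # record the declared length of each reference sequence
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^SN:/) name = substr($i, 4)
      if ($i ~ /^LN:/) len  = substr($i, 4) + 0
    }
    reflen[name] = len; next
  }
  /^@/ { next }
  $3 != "*" {
    # sum the reference-consuming CIGAR operations
    span = 0; cig = $6
    while (match(cig, /^[0-9]+[MIDNSHP=X]/)) {
      n  = substr(cig, 1, RLENGTH - 1) + 0
      op = substr(cig, RLENGTH, 1)
      if (op ~ /[MDN=X]/) span += n
      cig = substr(cig, RLENGTH + 1)
    }
    end = $4 + span - 1
    if (end > reflen[$3])
      print $1, "maps to", $3, "(" reflen[$3], "nt) at", $4 "-" end
  }'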

I am not sure how this could happen, but here is my suggestion (a rough sketch of these steps follows the list):

  • Remove contigs that are shorter than 1,000 nts from your FASTA file (you can use the command I shared above).
  • Re-do your mapping from scratch.
  • Re-create your contigs-db.
  • Re-run anvi-profile.
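
Put together, the redo looks roughly like this, assuming bowtie2 as your mapper (sample and file names are placeholders; adjust them to your setup):

anvi-script-reformat-fasta contigs.fa -o contigs-filtered.fa --min-len 1000
bowtie2-build contigs-filtered.fa contigs-filtered
bowtie2 -x contigs-filtered -1 SAMPLE_R1.fastq.gz -2 SAMPLE_R2.fastq.gz --no-unal \
    | samtools sort -o SAMPLE.bam            # bowtie2 writes SAM to stdout
samtools index SAMPLE.bam
anvi-gen-contigs-database -f contigs-filtered.fa -o contigs.db -n "my project"
anvi-profile -i SAMPLE.bam -c contigs.db -S SAMPLE -o SAMPLE-PROFILE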

I'm almost certain that this problem will resolve itself after that. Please feel free to come back and re-open this issue if you go that far and still run into an issue :)

Best wishes,


bpeacock44 commented on July 26, 2024


meren commented on July 26, 2024

Where do these 24 metagenomes come from? Can you tell me more about the samples?

Normalization: it really should be avoided, IMO. Modern assemblers don't need that kind of 'help'. Quality filtering is essential, but 'trimming' is not a good idea, IMO (as it introduces length variation to short reads, which can reduce the specificity of reads during mapping) -- if a read is crappy, one should get rid of the entire paired-end sequence. Removing host contamination is a good idea, but its positive/negative impact is probably negligible apart from being kinder to memory (by reducing the k-mer space for things that will not assemble anyway) :) My 2 cents.
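
For what it's worth, that kind of pair-level filtering (discarding whole pairs rather than trimming) is one thing the illumina-utils library, which is installed alongside anvi'o, can do; a minimal sketch, where the samples file name is a placeholder:

iu-gen-configs samples.txt            # samples.txt: TAB-delimited sample / r1 / r2 columns
iu-filter-quality-minoche SAMPLE.ini  # removes entire pairs that fail QC; no trimming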


bpeacock44 commented on July 26, 2024


bpeacock44 commented on July 26, 2024


meren commented on July 26, 2024

I see -- so the eukaryotic host was removed via the NEB kit, and not post-sequencing.

I wonder if an individual assembly strategy, rather than a co-assembly, may have given you better performance. If the individual fibrous root samples are very distinct from one another (as far as microbial population structures are concerned), a co-assembly strategy may reduce the quality of the final assembly. An alternative is to assemble each sample individually (to decrease complexity and contig fragmentation), recruit reads for each assembly from all metagenomes (to generate differential coverage signal), reconstruct genomes from that sample, repeat this process for every sample, and then do a redundancy analysis across all genomes from all samples to arrive at a final catalog of genomes (a rough sketch of this loop is below). This sounds more involved, but in fact it is not that complicated with some automation. The key question is how deeply each sample was sequenced, and whether individual assemblies would have enough reads to actually do better than the co-assembly. That can't easily be answered without some preliminary assembly / characterization of contigs, since the answer is not just a function of the depth of sequencing.
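
To make that concrete, here is a rough sketch of the loop, assuming megahit as the assembler and bowtie2 for read recruitment (sample names and paths are hypothetical):

SAMPLES="s01 s02 s03"                       # one entry per metagenome

for asm in $SAMPLES; do
  # assemble this sample on its own, keep contigs >= 1,000 nts
  megahit -1 ${asm}_R1.fastq.gz -2 ${asm}_R2.fastq.gz -o ${asm}-assembly
  anvi-script-reformat-fasta ${asm}-assembly/final.contigs.fa \
      -o ${asm}-contigs.fa --min-len 1000 --simplify-names
  anvi-gen-contigs-database -f ${asm}-contigs.fa -o ${asm}-contigs.db -n ${asm}
  bowtie2-build ${asm}-contigs.fa ${asm}-contigs

  # recruit reads from every metagenome for differential coverage
  for s in $SAMPLES; do
    bowtie2 -x ${asm}-contigs -1 ${s}_R1.fastq.gz -2 ${s}_R2.fastq.gz --no-unal \
        | samtools sort -o ${asm}-${s}.bam
    samtools index ${asm}-${s}.bam
    anvi-profile -i ${asm}-${s}.bam -c ${asm}-contigs.db -S ${s} -o ${asm}-${s}-PROFILE
  done

  anvi-merge ${asm}-*-PROFILE/PROFILE.db -c ${asm}-contigs.db -o ${asm}-MERGED
done

Once genomes are binned from each merged profile, the final redundancy analysis across samples could go through a program like anvi-dereplicate-genomes.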

Good luck, and best wishes.


bpeacock44 commented on July 26, 2024


meren commented on July 26, 2024

> you mentioned that it would be good to do some preliminary characterization of the contigs. Is there a method/program that you'd recommend for this? Or would it be better to go ahead and recruit reads from all the samples, reconstruct genomes, etc. and see what I come up with? Since the samples are all fairly similar I imagine I could just do this for one or two samples without too much trouble.

One thing I would do is to compare the entire co-assembly and a couple of individual assemblies using anvi-display-contigs-stats to see if anything looks immediately obvious. But as you suggested, going through the entire workflow for one or two samples and getting insights into the gains / losses before committing a lot more time to a comprehensive analysis of all of them would be good practice, IMO.
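
For instance, something as simple as this puts the assemblies side by side (the database names are placeholders):

anvi-display-contigs-stats coassembly-contigs.db s01-contigs.db s02-contigs.db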

Best wishes,
