Comments (8)
Thanks for the reproducible file, @bpeacock44 (it was missing the index file for the BAM, but I generated one).
Even though it is not related to the error you're getting, one of the first things I noticed was how big your contigs-db is. So I removed the contigs that were shorter than 1,000 nucleotides:
anvi-script-reformat-fasta contigs.fa -o contigs-filtered.fa --min-len 1000
Input ........................................: contigs.fa
Output .......................................: contigs-filtered.fa
WHAT WAS THERE
===============================================
Total num contigs ............................: 3,603,413
Total num nucleotides ........................: 3,031,559,677
WHAT WAS ASKED
===============================================
Simplify deflines? ...........................: No
Add prefix to sequence names? ................: No
Minimum length of contigs to keep ............: 1,000
Max % gaps allowed ...........................: 100.00%
Max num gaps allowed .........................: 1,000,000
Exclude specific sequences? ..................: No
Keep specific sequences? .....................: No
Enforce sequence type? .......................: No
WHAT HAPPENED
===============================================
Contigs removed ..............................: 2,961,725 (82.19% of all)
Nucleotides removed ..........................: 1,204,932,609 (39.75% of all)
Nucleotides modified .........................: 0 (0.00000% of all)
Deflines simplified ..........................: False
And it turned out that they make up a very significant fraction of your contigs (i.e., over 80%). I think you shouldn't use every single contig that is reported by your assembler for anything.
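For reference, the percentages in the report above can be reproduced with a few lines of Python. This is only a minimal sketch of the idea, not how anvi-script-reformat-fasta is implemented; `filter_by_length` and `percent` are hypothetical helper names, and the toy sequences are made up.

```python
def filter_by_length(records, min_len=1000):
    """Keep (name, sequence) pairs whose sequence has at least min_len characters."""
    return [(name, seq) for name, seq in records if len(seq) >= min_len]


def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# Toy input: two short contigs and one long contig (made-up sequences).
records = [("c1", "A" * 300), ("c2", "C" * 1500), ("c3", "G" * 200)]
kept = filter_by_length(records, min_len=1000)

contigs_removed = len(records) - len(kept)
nts_removed = sum(len(s) for _, s in records) - sum(len(s) for _, s in kept)
print(contigs_removed, nts_removed)

# The same arithmetic applied to the totals reported above.
print(percent(2_961_725, 3_603_413))          # share of contigs removed
print(percent(1_204_932_609, 3_031_559_677))  # share of nucleotides removed
```

The last two lines are simply the "removed" counts divided by the "what was there" totals from the report.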
So, coming back to the real error: I ran your command in the anvio-dev branch rather than v7.1, and I was able to reproduce this problem. Anvi'o could perhaps give a less cryptic message, but I think this problem is not related to anvi'o, but to the BAM file itself (and it is quite a thorny one to even explain).
The BAM file may be structurally 'OK' as far as samtools quickcheck is concerned, but it is certainly not OK when you start looking at the details. For instance, the error is coming from a particular read in the BAM file that maps to a reference that represents a 76-nucleotide-long stretch, but the start / end positions for that sequence are reported as 2233 and 2333 in the BAM file, which suggests a 100-nucleotide-long stretch. These data are either erroneously stored in the BAM file (because the mapping software screwed something up) or erroneously reported by samtools (because who knows why), and anvi'o is the victim here.
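To make that kind of mismatch concrete, here is a minimal sketch of a consistency check. It assumes one possible reading of the problem: the span between the reported start and end disagrees with the number of reference positions implied by the read's CIGAR string. The record layout `(name, start, end, cigar)` and the function names are hypothetical, and the "76M" CIGAR is invented for illustration, since the actual record is not shown in this thread.

```python
import re

# CIGAR operations that consume positions on the reference (per the SAM spec).
REF_CONSUMING = set("MDN=X")


def reference_span(cigar):
    """Number of reference positions covered by a CIGAR string such as '70M2D4M'."""
    return sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in REF_CONSUMING)


def find_inconsistent(reads):
    """Yield (name, reported_length, cigar_length) for reads whose reported
    start/end span disagrees with the span implied by their CIGAR string."""
    for name, start, end, cigar in reads:
        reported = end - start
        implied = reference_span(cigar)
        if reported != implied:
            yield name, reported, implied


# Hypothetical records: (read name, start, end, CIGAR). The second one mimics
# the mismatch described above; its CIGAR string is invented for illustration.
reads = [
    ("read_ok", 100, 200, "100M"),
    ("read_bad", 2233, 2333, "76M"),
]
print(list(find_inconsistent(reads)))
```

A check like this only locates suspicious records; it does not tell you whether the mapper or the tool reading the file is at fault.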
I am not sure how this could happen. But here is my suggestion:
- Remove contigs that are shorter than 1,000 nts from your FASTA file (you can use the command I shared above).
- Re-do your mapping from scratch.
- Re-create your contigs-db.
- Re-run anvi-profile.
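The steps above could be laid out roughly as follows. This is a dry-run sketch that only prints commands. The thread does not say which mapper was used, so bowtie2 is an assumption here, as are all file names, the sample name, and the project name passed with `-n`. Only the first command is taken verbatim from this thread.

```python
import shlex


def rebuild_commands(fasta="contigs.fa", r1="R1.fastq", r2="R2.fastq", sample="sample"):
    """Return the re-analysis steps as a list of argv lists, in order.

    File names, the bowtie2 mapper, and the project name are assumptions.
    """
    filtered = "contigs-filtered.fa"
    return [
        # 1. Drop contigs shorter than 1,000 nts (command from this thread).
        ["anvi-script-reformat-fasta", fasta, "-o", filtered, "--min-len", "1000"],
        # 2. Re-do the mapping from scratch (assumed mapper: bowtie2).
        ["bowtie2-build", filtered, "contigs"],
        ["bowtie2", "-x", "contigs", "-1", r1, "-2", r2, "-S", f"{sample}.sam"],
        ["samtools", "sort", "-o", f"{sample}.bam", f"{sample}.sam"],
        ["samtools", "index", f"{sample}.bam"],
        # 3. Re-create the contigs-db from the filtered FASTA.
        ["anvi-gen-contigs-database", "-f", filtered, "-o", "contigs.db",
         "-n", "filtered_contigs"],
        # 4. Re-run anvi-profile on the new BAM file.
        ["anvi-profile", "-i", f"{sample}.bam", "-c", "contigs.db",
         "-o", f"{sample}-profile"],
    ]


# Dry run: print the commands instead of executing them.
for argv in rebuild_commands():
    print(shlex.join(argv))
```

With multiple metagenomes, the mapping and profiling steps would be repeated once per sample against the same filtered FASTA and contigs-db.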
I'm almost certain that this problem will resolve itself after that. Please feel free to come back and re-open this issue if you go that far and still run into an issue :)
Best wishes,
from anvio.
Where do these 24 metagenomes come from? Can you tell me more about the samples?
Normalization: it really should be avoided, IMO. Modern assemblers don't need that kind of 'help'. Quality filtering is essential, but 'trimming' is not a good idea, IMO (as it introduces length variation to short reads that can reduce the specificity of reads during mapping) -- if a read is crappy, one should get rid of the entire paired-end sequence. Removing host contamination is a good idea, but its positive/negative impact is probably negligible apart from being kinder to memory (by reducing the k-mer space for things that will not assemble anyway) :) My 2 cents.
I see -- the removal of eukaryotic host via the NEB kit, and not post-sequencing.
I wonder if an individual assembly strategy, rather than a co-assembly, may have given you better performance. If the individual fibrous root samples are very distinct from one another (as far as microbial population structures are concerned), a co-assembly strategy may reduce the quality of the final assembly. An alternative is to assemble each sample individually (to decrease complexity and contig fragmentation), then recruit reads for each assembly from all metagenomes (to generate differential coverage signal), then reconstruct genomes from that sample, repeat this process for every sample, then do a redundancy analysis across all genomes from all samples to have a final catalog of genomes. This sounds more involved, but in fact it is not that complicated with some automation. But the key question is how deeply each sample was sequenced and whether individual assemblies would have enough reads to actually do better than the co-assembly. That can't easily be answered without some preliminary assembly / characterization of contigs, since the answer is not just a function of the depth of sequencing.
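To give a sense of the bookkeeping involved, here is a small sketch that counts the read-recruitment runs each strategy implies for 24 metagenomes. The sample names are placeholders and `mapping_jobs` is a hypothetical helper; it only enumerates (assembly, metagenome) pairs and runs nothing.

```python
from itertools import product


def mapping_jobs(samples, co_assembly=False):
    """List (assembly, metagenome) pairs that need a read-recruitment run."""
    if co_assembly:
        # One shared assembly; every metagenome is mapped to it once.
        return [("co-assembly", s) for s in samples]
    # Individual strategy: every sample's assembly recruits reads from every metagenome.
    return list(product(samples, samples))


samples = [f"sample_{i:02d}" for i in range(1, 25)]  # 24 placeholder names
print(len(mapping_jobs(samples, co_assembly=True)))
print(len(mapping_jobs(samples)))
```

The individual strategy grows with the square of the number of samples (24 x 24 pairs here versus 24 for the co-assembly), which is where the automation becomes useful.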
Good luck, and best wishes.
You mentioned that it would be good to do some preliminary characterization of the contigs. Is there a method/program that you'd recommend for this? Or would it be better to go ahead and recruit reads from all the samples, reconstruct genomes, etc. and see what I come up with? Since the samples are all fairly similar I imagine I could just do this for one or two samples without too much trouble.
One thing I would do is to compare the entire co-assembly and a couple of individual assemblies using anvi-display-contigs-stats to see if anything immediately obvious stands out. But as you suggested, I think going through the entire workflow for one or two samples, and gaining insights into the gains / losses before committing a lot more time to a comprehensive analysis of all samples, would be a good practice IMO.
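As a rough illustration of the kind of numbers such a comparison looks at, the sketch below computes a few basic assembly statistics from contig lengths. It is not anvi'o's implementation, `contig_stats` is a hypothetical helper, and the two length lists are toy values rather than data from this thread.

```python
def contig_stats(lengths):
    """Summarize an assembly from its contig lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running, n50 = 0, 0
    # N50: length of the contig at which the running sum of lengths
    # (longest first) reaches at least half of the total assembly length.
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "num_contigs": len(ordered),
        "total_length": total,
        "longest": ordered[0] if ordered else 0,
        "n50": n50,
    }


# Toy contig lengths for two assemblies (made-up values).
co_assembly = [5000, 3000, 1200, 800, 500, 500]
individual = [9000, 4000, 1000]

for label, lengths in [("co-assembly", co_assembly), ("individual", individual)]:
    print(label, contig_stats(lengths))
```

A more fragmented assembly tends to show up as more contigs and a smaller N50 for a comparable total length.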
Best wishes,