
Comments (8)

meren commented on July 26, 2024

Thanks for the reproducible file, @bpeacock44 (it was missing the index file for the BAM, but I generated one).
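
(For anyone who runs into the same missing-index situation, generating one for a coordinate-sorted BAM is a one-liner; the file name here is a placeholder:

samtools index mapping.bam

This writes mapping.bam.bai next to the BAM file.)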

Even though it is not related to the error you're getting, one of the first things I noticed was how big your contigs-db is. So I removed the contigs that were shorter than 1,000 nucleotides:

anvi-script-reformat-fasta contigs.fa -o contigs-filtered.fa --min-len 1000
Input ........................................: contigs.fa
Output .......................................: contigs-filtered.fa

WHAT WAS THERE
===============================================
Total num contigs ............................: 3,603,413
Total num nucleotides ........................: 3,031,559,677

WHAT WAS ASKED
===============================================
Simplify deflines? ...........................: No
Add prefix to sequence names? ................: No
Minimum length of contigs to keep ............: 1,000
Max % gaps allowed ...........................: 100.00%
Max num gaps allowed .........................: 1,000,000
Exclude specific sequences? ..................: No
Keep specific sequences? .....................: No
Enforce sequence type? .......................: No

WHAT HAPPENED
===============================================
Contigs removed ..............................: 2,961,725 (82.19% of all)
Nucleotides removed ..........................: 1,204,932,609 (39.75% of all)
Nucleotides modified .........................: 0 (0.00000% of all)
Deflines simplified ..........................: False

And it turned out that these short contigs make up a very significant fraction of your assembly (over 82% of all contigs, though only about 40% of all nucleotides). I don't think you should use every single contig your assembler reports for anything.

So, coming back to the real error: I ran your command on the anvio-dev branch rather than v7.1, and I was able to reproduce this problem. Anvi'o could perhaps give a less cryptic message, but I think this problem is not related to anvi'o but to the BAM file itself (and it is quite a thorny one to even explain).

The BAM file may be structurally 'OK' as far as samtools quickcheck is concerned, but it is certainly not OK once you start looking at the details. For instance, the error comes from a particular read in the BAM file that maps to a reference representing a 76-nucleotide-long stretch, yet the start / end positions for that alignment are reported as 2233 and 2333 in the BAM file, which implies a 100-nucleotide-long stretch. These data are either stored erroneously in the BAM file (because the mapping software screwed something up) or reported erroneously by samtools (who knows why), and anvi'o is the victim here.
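
If you want to see such inconsistencies yourself, here is a rough sketch (the BAM file name is hypothetical) that flags any alignment whose end coordinate runs past the length its reference is declared to have in the BAM header. The end coordinate is POS plus the reference-consuming CIGAR operations (M/D/N/=/X):

samtools view -h mapping.bam | awk '
  /^@SQ/ {
    # record the declared length of each reference sequence
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^SN:/) name = substr($i, 4)
      if ($i ~ /^LN:/) len  = substr($i, 4) + 0
    }
    reflen[name] = len; next
  }
  /^@/ { next }
  $3 != "*" {
    # sum the reference-consuming CIGAR operations
    span = 0; cig = $6
    while (match(cig, /^[0-9]+[MIDNSHP=X]/)) {
      n  = substr(cig, 1, RLENGTH - 1) + 0
      op = substr(cig, RLENGTH, 1)
      if (op ~ /[MDN=X]/) span += n
      cig = substr(cig, RLENGTH + 1)
    }
    end = $4 + span - 1
    if (end > reflen[$3])
      print $1, "maps to", $3, "(" reflen[$3], "nt) at", $4 "-" end
  }'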

I am not sure how this could happen, but here is my suggestion (a rough sketch of these steps follows the list):

  • Remove contigs that are shorter than 1,000 nts from your FASTA file (you can use the command I shared above).
  • Re-do your mapping from scratch.
  • Re-create your contigs-db.
  • Re-run anvi-profile.
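
Put together, the redo looks roughly like this, assuming bowtie2 as your mapper (sample and file names are placeholders; adjust them to your setup):

anvi-script-reformat-fasta contigs.fa -o contigs-filtered.fa --min-len 1000
bowtie2-build contigs-filtered.fa contigs-filtered
bowtie2 -x contigs-filtered -1 SAMPLE_R1.fastq.gz -2 SAMPLE_R2.fastq.gz --no-unal \
    | samtools sort -o SAMPLE.bam            # bowtie2 writes SAM to stdout
samtools index SAMPLE.bam
anvi-gen-contigs-database -f contigs-filtered.fa -o contigs.db -n "my project"
anvi-profile -i SAMPLE.bam -c contigs.db -S SAMPLE -o SAMPLE-PROFILE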

I'm almost certain that this problem will resolve itself after that. Please feel free to come back and re-open this issue if you go that far and still run into an issue :)

Best wishes,


bpeacock44 commented on July 26, 2024


meren commented on July 26, 2024

Where do these 24 metagenomes come from? Can you tell me more about the samples?

Normalization: it really should be avoided, IMO. Modern assemblers don't need that kind of 'help'. Quality filtering is essential, but 'trimming' is not a good idea, IMO (as it introduces length variation to short reads, which can reduce the specificity of reads during mapping) -- if a read is crappy, one should get rid of the entire paired-end sequence. Removing host contamination is a good idea, but its positive/negative impact is probably negligible apart from being kinder to memory (by reducing the k-mer space for things that will not assemble anyway) :) My 2 cents.
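
For what it's worth, that kind of pair-level filtering (discarding whole pairs rather than trimming) is one thing the illumina-utils library, which is installed alongside anvi'o, can do; a minimal sketch, where the samples file name is a placeholder:

iu-gen-configs samples.txt            # samples.txt: TAB-delimited sample / r1 / r2 columns
iu-filter-quality-minoche SAMPLE.ini  # removes entire pairs that fail QC; no trimming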


bpeacock44 commented on July 26, 2024


bpeacock44 commented on July 26, 2024


meren commented on July 26, 2024

I see -- so the eukaryotic host was removed via the NEB kit, and not post-sequencing.

I wonder if an individual assembly strategy, rather than a co-assembly, may have given you better performance. If the individual fibrous root samples are very distinct from one another (as far as microbial population structures are concerned), a co-assembly strategy may reduce the quality of the final assembly. An alternative is to assemble each sample individually (to decrease complexity and contig fragmentation), recruit reads for each assembly from all metagenomes (to generate differential coverage signal), reconstruct genomes from that sample, repeat this process for every sample, and then do a redundancy analysis across all genomes from all samples to arrive at a final catalog of genomes (a rough sketch of this loop is below). This sounds more involved, but in fact it is not that complicated with some automation. The key question is how deeply each sample was sequenced, and whether individual assemblies would have enough reads to actually do better than the co-assembly. That can't easily be answered without some preliminary assembly / characterization of contigs, since the answer is not just a function of the depth of sequencing.
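
To make that concrete, here is a rough sketch of the loop, assuming megahit as the assembler and bowtie2 for read recruitment (sample names and paths are hypothetical):

SAMPLES="s01 s02 s03"                       # one entry per metagenome

for asm in $SAMPLES; do
  # assemble this sample on its own, keep contigs >= 1,000 nts
  megahit -1 ${asm}_R1.fastq.gz -2 ${asm}_R2.fastq.gz -o ${asm}-assembly
  anvi-script-reformat-fasta ${asm}-assembly/final.contigs.fa \
      -o ${asm}-contigs.fa --min-len 1000 --simplify-names
  anvi-gen-contigs-database -f ${asm}-contigs.fa -o ${asm}-contigs.db -n ${asm}
  bowtie2-build ${asm}-contigs.fa ${asm}-contigs

  # recruit reads from every metagenome for differential coverage
  for s in $SAMPLES; do
    bowtie2 -x ${asm}-contigs -1 ${s}_R1.fastq.gz -2 ${s}_R2.fastq.gz --no-unal \
        | samtools sort -o ${asm}-${s}.bam
    samtools index ${asm}-${s}.bam
    anvi-profile -i ${asm}-${s}.bam -c ${asm}-contigs.db -S ${s} -o ${asm}-${s}-PROFILE
  done

  anvi-merge ${asm}-*-PROFILE/PROFILE.db -c ${asm}-contigs.db -o ${asm}-MERGED
done

Once genomes are binned from each merged profile, the final redundancy analysis across samples could go through a program like anvi-dereplicate-genomes.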

Good luck, and best wishes.


bpeacock44 commented on July 26, 2024


meren commented on July 26, 2024

> you mentioned that it would be good to do some preliminary characterization of the contigs. Is there a method/program that you'd recommend for this? Or would it be better to go ahead and recruit reads from all the samples, reconstruct genomes, etc. and see what I come up with? Since the samples are all fairly similar I imagine I could just do this for one or two samples without too much trouble.

One thing I would do is to compare the entire co-assembly and a couple of individual assemblies using anvi-display-contigs-stats to see if anything looks immediately obvious. But as you suggested, going through the entire workflow for one or two samples and getting insights into the gains / losses before committing a lot more time to a comprehensive analysis of all of them would be good practice, IMO.
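
For instance, something as simple as this puts the assemblies side by side (the database names are placeholders):

anvi-display-contigs-stats coassembly-contigs.db s01-contigs.db s02-contigs.db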

Best wishes,
