Hello, I've problems using the pr2_version_4.11.1_dada2.fasta with t

pr2_version_4.11.1_dada2 problem with v9 seqs about pr2database HOT 8 CLOSED

soluna1 commented on June 18, 2024

pr2_version_4.11.1_dada2 problem with v9 seqs

from pr2database.

Comments (8)

vaulot commented on June 18, 2024

Hi Soluna and Peter

I have been using dada2 version 1.8.0 in R and did not have this problem (16 GB of memory). I just saw that there is a new version, 1.10.0, on Bioconductor (https://bioconductor.org/packages/release/bioc/html/dada2.html). Soluna, I do not know which version you are using (Peter is using 1.10.0). Maybe memory allocation changed between the two versions.

Soluna, I saw that you also posted on the dada2 GitHub (benjjneb/dada2#691). I think selecting planktonic organisms is OK.

You could also use mothur pcr.seqs, or another program, to extract the V9 region from the pr2 database with the primers that you are using (allowing 1 or 2 mismatches on each side). You will need to do this with the mothur format of the pr2 database, though, and then convert back to the dada2 format. This will very likely reduce the memory allocation.

Same thing for Peter, but with the V4 region?

I will be working on pr2 in the coming month and this is something I could provide: a dada2 database limited to the V4 region and another limited to the V9 region.

Cheers. Daniel
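
As a rough illustration of the primer-based trimming suggested above, here is a minimal Python sketch that cuts out the region between a forward and a reverse primer while tolerating a few mismatches per primer. It is not mothur's pcr.seqs and does not reproduce its behaviour. The function names (`find_primer`, `extract_region`) and the toy primers and sequence are invented for the example, so substitute your real V9 or V4 primers.

```python
# Hypothetical helper names and toy data; not part of mothur, dada2 or pr2.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def revcomp(seq):
    """Reverse complement, including IUPAC ambiguity codes."""
    return seq.translate(COMPLEMENT)[::-1]


def find_primer(seq, primer, max_mismatch, start=0):
    """Leftmost index >= start where primer matches seq with at most
    max_mismatch mismatches, or -1 if there is no such site."""
    for i in range(start, len(seq) - len(primer) + 1):
        mismatches = 0
        for base, code in zip(seq[i:i + len(primer)], primer):
            # An ambiguous base in the reference (e.g. N) counts as a mismatch.
            if base not in IUPAC.get(code, ""):
                mismatches += 1
                if mismatches > max_mismatch:
                    break
        else:
            return i
    return -1


def extract_region(seq, fwd, rev, max_mismatch=1):
    """Return the sequence between the two primer sites (primers removed),
    or None when either primer cannot be located."""
    seq = seq.upper().replace("U", "T")
    f = find_primer(seq, fwd.upper(), max_mismatch)
    if f < 0:
        return None
    # The reverse primer anneals to the opposite strand, so search for its
    # reverse complement downstream of the forward primer.
    r = find_primer(seq, revcomp(rev.upper()), max_mismatch, start=f + len(fwd))
    if r < 0:
        return None
    return seq[f + len(fwd):r]


# Toy example: made-up primers and reference, for illustration only.
toy_fwd = "ACGTAC"
toy_rev = "TTGGCA"  # its reverse complement is TGCCAA
toy_ref = "GGG" + "ACGTAC" + "TTTTCCCC" + "TGCCAA" + "AAA"
toy_ref_1mm = "GGG" + "ACGAAC" + "TTTTCCCC" + "TGCCAA" + "AAA"

region = extract_region(toy_ref, toy_fwd, toy_rev)
region_1mm = extract_region(toy_ref_1mm, toy_fwd, toy_rev, max_mismatch=1)
region_strict = extract_region(toy_ref_1mm, toy_fwd, toy_rev, max_mismatch=0)
```

References in which either primer site is missing return `None` and would be dropped, so a trimmed database will contain fewer sequences than the full one.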

vaulot commented on June 18, 2024

Hi Soluna

Do you know exactly where the program gets stuck? Are you using dada2 under R?

Cheers. Daniel

pdcountway commented on June 18, 2024

I'm having a similar problem with assignTaxonomy using dada2 v. 1.10, the pr2 v.4.72 reference database, and the associated data for the metabarcode tutorial described here: https://github.com/vaulot/metabarcodes_tutorials/tree/master/R_dada2. Everything worked perfectly until the assignTaxonomy step.
In my case assignTaxonomy failed after about 10 seconds with:
Error in C_assign_taxonomy(seqs, rc(seqs), refs, ref.to.genus, tax.mat.int, :
Memory allocation failed.
I'm running this in RStudio under R 3.5.3, on a Windows PC with 16 GB of memory. At least 10 GB of memory was available when the process failed.

soluna1 commented on June 18, 2024

Hi Daniel,

Yes, I'm using dada2 with the pr2_version_4.11.1_dada2.fasta database. The command is:

taxa <- assignTaxonomy(seqtab, "pr2_version_4.11.1_dada2.fasta", multithread=T, minBoot=80, tryRC=T, taxLevels = c("Kingdom","Supergroup","Division","Class","Order","Family","Genus","Species"))

The run has not finished after 10 min. It is true that the database is very large compared to the silva 16 one (rdp_train_set_16), so I'm thinking about creating a shorter version of pr2. I've noticed that some entries are repeated, that is, same taxon and same sequence, e.g.

Eukaryota;Archaeplastida;Streptophyta;Embryophyceae;Embryophyceae_X;Embryophyceae_XX;Zea;Zea_mays;

Do you think that I can reduce the size of the database by removing those repetitions and selecting mostly planktonic organisms? Would it still be valid for assigning taxonomy to 18S V9 region samples?
Thanks for your help.
best,

Soluna
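
For the duplicate removal described above, a minimal Python sketch could look like the following. It only drops records whose taxonomy header and sequence are both identical, keeping the first copy. The function names and the toy sequences are invented for illustration (only the Zea_mays header comes from the post), and whether removing duplicates changes the taxonomic assignment is a separate question that this sketch does not address.

```python
# Hypothetical helper names and toy sequences, for illustration only.
def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_exact_duplicates(records):
    """Keep the first record for each (header, sequence) pair."""
    seen = set()
    kept = []
    for header, seq in records:
        key = (header, seq.upper())
        if key not in seen:
            seen.add(key)
            kept.append((header, seq))
    return kept


ZEA = ("Eukaryota;Archaeplastida;Streptophyta;Embryophyceae;"
       "Embryophyceae_X;Embryophyceae_XX;Zea;Zea_mays;")

toy_fasta = f""">{ZEA}
ACGTACGTAC
>{ZEA}
ACGTACGTAC
>{ZEA}
ACGTACGTTT
"""

unique_records = drop_exact_duplicates(parse_fasta(toy_fasta.splitlines()))

# Write the reduced set back out in FASTA format.
reduced_fasta = "".join(f">{h}\n{s}\n" for h, s in unique_records)
```

In the toy input the first two records are exact repeats, so only one of them is kept; the third has the same taxon but a different sequence and is retained.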

soluna1 commented on June 18, 2024

Hi Daniel,

Thank you so much for your reply. I'm using dada2 version 1.10.1. So, as you said (and as I was also told on the dada2 GitHub), perhaps I was too optimistic about my computer's capabilities, or I should let it run longer than 10 min. I'm going to move the process to a cluster computer next week.

I've managed to compile a shorter version of pr2, selecting those groups which I think could appear in the samples, and I also added 16S sequences from the RDP training set (https://zenodo.org/record/801828#.XKNX4y2B3OQ), and it works fine. It took a couple of minutes to finish, so now I'm including more groups in the database (I think something close to 80000 sequences could work fine, as most of the sequences in my samples have now been assigned).

Regarding your comment about a pr2 version limited to the V4 and/or V9 region, it would be fantastic. In fact, I'm working with V9 region sequences, and at the beginning I thought that it wouldn't work with a complete 18S database, but fortunately I was wrong: the taxonomy assignment worked fine.

Yes, thanks for the idea, I'll try to use the mothur function and the primers to create a V9 database based on pr2. It will certainly be easier for my computer to manage.
I'll let you know if this works.

thanks a lot!!

Soluna

vaulot commented on June 18, 2024

Hi Soluna

In the end, I decided not to release a V4 or V9 version of PR2. I used quite universal primers to trim the pr2 sequences to V4 and then used dada2 to assign some metabarcoding datasets. To my surprise, many sequences became unassigned even though the corresponding reference sequence was in the PR2 V4 sequence dataset. So I will need to explore further before I provide such data, so that users do not encounter the same problem.

soluna1 commented on June 18, 2024

Hi Daniel,
Thank you for the information. It is not exactly the same problem, but I also found "weird" results from the dada2 assignTaxonomy function. I reported this on the dada2 GitHub: https://github.com/benjjneb/dada2/issues/750. The issue was that dada2 returned different results depending on the reference database used, assigning the same sequence to two different taxa with 100% bootstrap support in both cases.
Testing a little more on that, but using the EukDiv w2_v9 reference database, I found that the dada2 taxonomic assignment algorithm seems to depend on the number of times a particular taxon is repeated in the database. So, a particular ASV was assigned to the most frequent taxon (C. closterium or B. paxillifer) depending on whether I put in more copies of one or the other. Once all the options were balanced (no repetitions), the assignment was made to a higher taxonomic level, i.e. the correct assignment.
However, using pr2 the same ASV was correctly assigned to a higher taxonomic level on the first try, even though there are more "copies" of C. closterium than of the other options (N. fonticola, B. paxillifer, C. fusiformis, etc.). I suppose that in the case of pr2 the proportion of nucleotides matching the reference sequence played a role in the assignment, although I haven't checked this yet.

So, I really don't know which reference database is better in my case, having only v9 (very short) ASVs. I obtain a higher proportion of assigned ASVs with EukDiv w2 v9 than with pr2, but they could have errors like the one mentioned in some very common sequences. So...

Thanks again for your feedback.
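
One way to check a reference database for the imbalance described above is to group records by sequence and count how many copies carry each taxon label. The Python sketch below does only that bookkeeping; it does not reproduce dada2's classifier, and it assumes the references have already been trimmed to the same region so that identical sequences compare as equal strings. The function names, the short labels and the toy sequences are invented for the example.

```python
from collections import Counter, defaultdict


# Hypothetical helper names and toy data, for illustration only.
def labels_per_sequence(records):
    """Map each distinct sequence to a Counter of the labels attached to it."""
    table = defaultdict(Counter)
    for label, seq in records:
        table[seq.upper()][label] += 1
    return table


def conflicting_sequences(records):
    """Sequences that appear under more than one taxon label."""
    return {
        seq: counts
        for seq, counts in labels_per_sequence(records).items()
        if len(counts) > 1
    }


toy_records = [
    ("C_closterium", "ACGTACGT"),
    ("C_closterium", "ACGTACGT"),
    ("C_closterium", "ACGTACGT"),
    ("B_paxillifer", "ACGTACGT"),
    ("N_fonticola", "ACGTTTTT"),
]

conflicts = conflicting_sequences(toy_records)
for seq, counts in conflicts.items():
    # most_common() lists the labels from most to least frequent.
    print(seq, counts.most_common())
```

Here one sequence is shared by two labels with unequal copy numbers, which is the kind of case worth inspecting by hand before trusting a species-level assignment.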

vaulot commented on June 18, 2024

In the next version of PR2, there will be a link to EukRibo, which has annotation for the V9 region, so it will be possible to select only sequences that contain the full V9 region. However, such sequences represent only 10% of the full PR2 database, which means that the taxonomic sampling will be reduced compared to the full database.
