Comments (8)
Hi Soluna and Peter
I have been using dada2 version 1.8.0 in R and did not have this problem (16 GB of memory). I just saw that there is a new version, 1.10.0, on Bioconductor (https://bioconductor.org/packages/release/bioc/html/dada2.html). Soluna, which version are you using (Peter is using 1.10.0)? Maybe memory allocation changed between the two versions.
Soluna, I saw that you also posted on the dada2 GitHub (benjjneb/dada2#691). I think selecting planktonic organisms is OK.
What you could also do is use mothur pcr.seqs, or another program, to extract the V9 region from the pr2 database with the primers you are using (allowing 1 or 2 mismatches on each side). You will need to do this with the mothur format of the pr2 database and then convert back to the dada2 format, but it will very likely reduce the memory allocation.
Same thing for Peter, but with the V4 region?
I will be working on pr2 in the coming month, and this is something I could provide: a dada2 database limited to the V4 region and another limited to the V9 region.
Cheers. Daniel
from pr2database.
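The primer-trimming idea above can be illustrated with a minimal base-R sketch. It is not mothur's pcr.seqs and not part of dada2 or PR2: the function names (`find_primer`, `rev_comp`, `extract_region`) and the toy sequence are invented here for illustration. It assumes plain A/C/G/T primers (no IUPAC degenerate codes), counts substitutions only (no indels), and takes the first window that matches, so for real primers a dedicated tool is probably the safer choice.

```r
# Position of the first window of `dna` that matches `primer` with at most
# `max_mm` mismatches (substitutions only). Returns NA if there is none.
find_primer <- function(dna, primer, max_mm = 2) {
  s <- strsplit(toupper(dna), "")[[1]]
  p <- strsplit(toupper(primer), "")[[1]]
  n_win <- length(s) - length(p) + 1
  if (n_win < 1) return(NA_integer_)
  for (i in seq_len(n_win)) {
    if (sum(s[i:(i + length(p) - 1)] != p) <= max_mm) return(i)
  }
  NA_integer_
}

# Reverse complement of an A/C/G/T string.
rev_comp <- function(x) {
  bases <- strsplit(chartr("ACGT", "TGCA", toupper(x)), "")[[1]]
  paste(rev(bases), collapse = "")
}

# Sequence between the forward primer and the reverse complement of the
# reverse primer, primers excluded. Returns NA if either primer is not found.
extract_region <- function(dna, fwd_primer, rev_primer, max_mm = 2) {
  start <- find_primer(dna, fwd_primer, max_mm)
  if (is.na(start)) return(NA_character_)
  rest <- substr(dna, start + nchar(fwd_primer), nchar(dna))
  end <- find_primer(rest, rev_comp(rev_primer), max_mm)
  if (is.na(end)) return(NA_character_)
  substr(rest, 1, end - 1)
}

# Toy example (made-up sequence and primers, not real V4/V9 primers)
dna <- "TTTTACGTACGGGGCCCCGGAATCAAAA"
extract_region(dna, fwd_primer = "ACGTAC", rev_primer = "GATTCC", max_mm = 1)
# returns "GGGGCCCC"
```

Reference sequences for which `extract_region` returns NA do not span both primer sites and would be dropped from a region-specific database.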
Hi Soluna
Do you know exactly where the program gets stuck? Are you using dada2 under R?
Cheers. Daniel
I'm having a similar problem with assignTaxonomy using dada2 v. 1.10, the pr2 v.4.72 reference database, and the associated data for the metabarcode tutorial described here: https://github.com/vaulot/metabarcodes_tutorials/tree/master/R_dada2. Everything worked perfectly until the assignTaxonomy step.
In my case assignTaxonomy failed after about 10 seconds with:
Error in C_assign_taxonomy(seqs, rc(seqs), refs, ref.to.genus, tax.mat.int, :
Memory allocation failed.
I'm running this in RStudio under R 3.5.3, on a Windows PC with 16 GB of memory. At least 10 GB of memory was available when the process failed.
Hi Daniel,
Yes, I'm using dada2 with the pr2_version_4.11.1_dada2.fasta database. The command is:
taxa <- assignTaxonomy(seqtab, "pr2_version_4.11.1_dada2.fasta", multithread=TRUE, minBoot=80, tryRC=TRUE, taxLevels = c("Kingdom","Supergroup","Division","Class","Order","Family","Genus","Species"))
The run hasn't finished after 10 min. It is true that the database is very large compared to the 16S one (rdp_train_set_16), so I'm thinking about creating a shorter version of pr2. I've noticed that some sequences are repeated (same taxon, same sequence), e.g.
Eukaryota;Archaeplastida;Streptophyta;Embryophyceae;Embryophyceae_X;Embryophyceae_XX;Zea;Zea_mays;
Do you think I can reduce the size of the database by removing those repetitions and selecting mostly planktonic organisms? Would it still be consistent for assigning taxonomy to 18S V9 region samples?
Thanks for your help.
best,
Soluna
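As a rough illustration of the deduplication described above, here is a minimal base-R sketch that drops FASTA records whose taxonomy header and sequence are both identical to an earlier record. The function name `dedup_fasta` and the toy sequences are made up for this example, and the output file name in the commented usage line is a placeholder. It assumes a plain-text FASTA file small enough to read into memory and compares sequences case-insensitively.

```r
# Remove FASTA records that repeat both the header and the sequence of an
# earlier record. Input and output are character vectors of FASTA lines.
dedup_fasta <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]                 # drop blank lines
  is_header <- startsWith(lines, ">")
  record_id <- cumsum(is_header)                # record index of every line
  keep <- record_id > 0                         # ignore text before the first header
  lines <- lines[keep]
  is_header <- is_header[keep]
  record_id <- record_id[keep]

  headers <- lines[is_header]
  # Join wrapped sequence lines into one string per record
  seqs <- character(length(headers))
  chunks <- split(lines[!is_header], record_id[!is_header])
  seqs[as.integer(names(chunks))] <- vapply(chunks, paste0, character(1), collapse = "")

  dup <- duplicated(data.frame(header = headers, seq = toupper(seqs)))
  as.vector(rbind(headers[!dup], seqs[!dup]))   # interleave header, sequence
}

# Toy example: the second record repeats the first one exactly
zea <- ">Eukaryota;Archaeplastida;Streptophyta;Embryophyceae;Embryophyceae_X;Embryophyceae_XX;Zea;Zea_mays;"
toy <- c(zea, "ACGTACGT", zea, "ACGTACGT", zea, "ACGTACGA")
length(dedup_fasta(toy)) / 2
# returns 2 (two distinct records remain)

# Hypothetical usage on the real file:
# writeLines(dedup_fasta(readLines("pr2_version_4.11.1_dada2.fasta")), "pr2_dedup.fasta")
```

Records that share a taxon but differ in sequence are kept, so the taxonomic coverage of the reference is unchanged.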
Hi Daniel,
Thank you so much for your reply. I'm using dada2 version 1.10.1. So, as you said (and as was also suggested on the dada2 GitHub), perhaps I was too optimistic about my computer's capabilities, or I should let it run longer than 10 minutes. I'm going to move the process to a cluster computer next week.
I've managed to compile a shorter version of pr2, selecting the groups which I think could appear in the samples, and I also added 16S sequences from the [RDP training set](https://zenodo.org/record/801828#.XKNX4y2B3OQ), and it works fine. It finished in a couple of minutes, so now I'm including more groups in the database (I think something close to 80000 sequences could work fine, as most sequences of my samples have now been assigned).
Regarding your comment about a pr2 version limited to the V4 and/or V9 region, it would be fantastic. In fact, I'm working with V9 region sequences, and at the beginning I thought that it wouldn't work with a complete 18S database, but fortunately I was wrong: the taxonomic assignment worked fine.
Yes, thanks for the idea. I'll try to use the mothur function and the primers to create a V9 database based on pr2. It will certainly be easier for my computer to manage.
I'll let you know if this works.
thanks a lot!!
Soluna
Hi Soluna
In the end, I decided not to release a V4 or V9 version of PR2. I used quite universal primers to trim the PR2 sequences to V4 and then used dada2 to assign some metabarcoding datasets. To my surprise, many sequences became unassigned even though the corresponding reference sequence was in the PR2 V4 sequence dataset. So I will need to explore further before I provide such data, so that users do not encounter the same problem.
Hi Daniel,
Thank you for the information. It is not exactly the same problem, but I also found "weird" results from the dada2 assignTaxonomy function. I described this on the dada2 GitHub: https://github.com/benjjneb/dada2/issues/750. The issue was that dada2 returned different results depending on the reference database used, assigning the same sequence to two different taxa with 100% bootstrap support in both cases.
Testing a little more, but using the EukDiv w2_v9 reference database, I found that the dada2 taxonomic assignment algorithm seems to depend on the number of times a particular taxon is repeated in the database. A particular ASV was assigned to the most frequent taxon (C. closterium or B. paxillifer) depending on whether I put in more copies of one or the other. Once all the options were balanced (no repetitions), the assignment was made at a higher taxonomic level, i.e. the correct assignment.
However, using pr2, the same ASV was correctly assigned to a higher taxonomic level on the first try, even though there are more "copies" of C. closterium than of the other options (N. fonticola, B. paxillifer, C. fusiformis, etc.). I suppose that in the case of pr2 the proportion of nucleotides matching the reference sequence played a role in the assignment, although I haven't checked this yet.
So, I really don't know which reference database is better in my case, having only v9 (very short) ASVs. I obtain a higher proportion of assigned ASVs with EukDiv w2 v9 than with pr2, but they could have errors like the one mentioned in some very common sequences. So...
Thanks again for your feedback.
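To check how unevenly taxa are represented in a reference file, as discussed above, one could count the copies per species. The sketch below is only an assumption-laden illustration: `count_taxa` is a made-up helper, the taxon labels are toy values, and it assumes dada2-style headers (`>rank1;rank2;...;species;`) where the last non-empty field is the species.

```r
# Count how many reference records carry each species label.
# `headers` is a character vector of FASTA header lines.
count_taxa <- function(headers) {
  tax <- sub(";$", "", sub("^>", "", headers))   # strip ">" and trailing ";"
  fields <- strsplit(tax, ";", fixed = TRUE)
  species <- vapply(fields, function(x) x[length(x)], character(1))
  sort(table(species), decreasing = TRUE)
}

# Toy headers (labels are placeholders, not real database entries)
toy_headers <- c(">Group;Genus_A;Species_A;",
                 ">Group;Genus_A;Species_A;",
                 ">Group;Genus_B;Species_B;")
count_taxa(toy_headers)

# Hypothetical usage on a reference file:
# ref <- readLines("reference.fasta")
# head(count_taxa(ref[startsWith(ref, ">")]), 20)
```

A strongly skewed table would be consistent with the behaviour described above, although it does not by itself show that copy number drives the assignment.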
In the next version of PR2, there will be a link to EukRibo, which has annotation for the V9 region. So it will be possible to select only sequences that contain the full V9 region. However, such sequences represent only 10% of the full PR2 database, which means that the taxonomic sampling will be reduced compared to the full database.