pr2database / pr2database

Protist Ribosomal Reference database (PR2) - SSU rRNA gene database

License: MIT License

R 100.00%

rrna database eukaryotes metabarcoding taxonomy 18s-rrna

pr2database's Introduction

Protist Ribosomal Reference database (PR²)

SSU rRNA gene database

The PR² database was initiated in 2010 as part of the BioMarks project, building on work carried out over the previous ten years in the Plankton Group of the Station Biologique of Roscoff. Its aim is to provide a reference database of carefully annotated 18S rRNA sequences using eight unique taxonomic fields (from domain to species). At present it contains about 220,000 sequences. Metadata fields are available for many sequences, including geo-localisation, whether the sequence originates from a culture or a natural sample, host type, etc. The annotation of PR² is performed by experts for each taxonomic group. One very important project in this respect is EukRef, which has recently decided to merge its efforts with PR². EukRef has built bioinformatics pipelines that have been used during three workshops dedicated to specific taxonomic groups.

Current version

Version: 5.0.0
Released: 2023-04-06
DOI: 10.5281/zenodo.7805244

Accessing PR2

About PR2

Core Team

Daniel VAULOT, Asian School of the Environment, Nanyang Technological University, SINGAPORE
Javier del CAMPO, University of Miami, USA

Scientific committee and contributors

PR² team

Please cite

Guillou, L., Bachar, D., Audic, S., Bass, D., Berney, C., Bittner, L., Boutte, C. et al. 2013. The Protist Ribosomal Reference database (PR²): a catalog of unicellular eukaryote Small Sub-Unit rRNA sequences with curated taxonomy. Nucleic Acids Res. 41:D597–604.

Related Projects

18S rRNA primer database

The PR² primer database is a compilation of primers found in the literature, with an in silico analysis against the PR² database.

metaPR2

The metaPR² metabarcode database is a compilation of metabarcode datasets processed by the dada2 R package and assigned against PR2.

Questions ?

Report issues

Please report any issue on GitHub

pr2database's Issues

PR2database annotated species information not working properly in RStudio

setwd("H:/pr2database/pr2database-master/pr2database-master")
library(shiny)
Warning message:
package ‘shiny’ was built under R version 4.3.2
runApp("H:/pr2database/pr2database-master/pr2database-master")
Warning in loadSupport(appDir, renv = sharedEnv, globalrenv = NULL) :
Loading R/ subdirectory for Shiny application, but this directory appears to contain an R package. Sourcing files in R/ may cause unexpected behavior.
Cannot use system.file
Cannot use full path
[1] "Using cloud bucket"
[1] "global.R done"
ℹ Loading pr2database
Cannot use system.file
Cannot use full path
[1] "Using cloud bucket"
[1] "global.R done"
Error after opening the Shiny app and entering a gene sequence: Must supply .init when .x is empty.

PR2 uses a different taxonomy ID code than NCBI taxonomy?

For example, the taxo_id of the species Halteria grandinella recorded in the table pr2_version_5.0.0_merged.csv is 1512, whereas in NCBI Taxonomy 1512 corresponds to a bacterial lineage: cellular organisms; Bacteria; Terrabacteria group; Bacillota; Clostridia; Lachnospirales; Lachnospiraceae; Lachnoclostridium.

All the other IDs also differ from those in NCBI Taxonomy.

Is there an appropriate tool that will allow me to convert this to the NCBI taxonomy ID?

Correct PR2 id that are not standards

A few PR2 IDs do not follow the standard rule for constructing PR2 IDs (Genbank.start.end_X). This needs to be corrected (see sequence IDs 176963 -> 178133 in particular).
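A check of this kind could be sketched as follows. The regular expression is only a guess at the "Genbank.start.end_X" rule, inferred from identifiers quoted elsewhere in these issues (for example AB353770.1.1740_U and AY123745.1.924_UC); it is not an official specification, and the function names are hypothetical.

```python
import re

# Assumed shape of a standard PR2 id: accession, start, end, then an
# upper-case suffix such as _U or _UC. This pattern is an assumption
# inferred from ids quoted in the issues, not an official rule.
PR2_ID = re.compile(
    r"^(?P<accession>[A-Z]+_?[0-9]+)\.(?P<start>[0-9]+)\.(?P<end>[0-9]+)_(?P<suffix>[A-Z]+)$"
)


def parse_pr2_id(pr2_id):
    """Return (accession, start, end, suffix), or None if the id is non-standard."""
    match = PR2_ID.match(pr2_id)
    if match is None:
        return None
    start, end = int(match["start"]), int(match["end"])
    if start > end:  # coordinates in the wrong order are treated as non-standard
        return None
    return match["accession"], start, end, match["suffix"]


def non_standard_ids(ids):
    """Return the ids that do not follow the assumed rule, in input order."""
    return [pr2_id for pr2_id in ids if parse_pr2_id(pr2_id) is None]
```

Running non_standard_ids over the id column of a release would give a candidate list to inspect by hand; the pattern would probably need loosening if legitimate ids use other accession formats.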

removed ncbi entries

Hey ho,
for the following pr2 entries, the NCBI entries were removed:
AY745555.1.1854_U, AY745597.1.1844_U, EF209781.1.1956_U, EF209774.1.1835_U, EF209794.1.1834_U

Plastid sequences

Hi! Congrats for the great work!

From the README, we can see that plastid sequences are included in PR2. At the bottom, a reference to a plastid-only DB (PhytoREF) is provided. I guess it is not clear to what extent plastid sequences in PR2 overlap with PhytoREF, or whether PR2 is a superset of PhytoREF. In case it is not, and one wants to merge the two DBs, would dereplication be suggested?

Thank you.

makeblastdb fails with full database

Hello, I downloaded the full PR2 database as a fasta file ("pr2_version_5.0.0_SSU_taxo_long.fasta"), which according to the website is suitable for making a local BLAST database. However, when I run makeblastdb using BLAST+ version 2.6.0, it fails to generate a database.

This command

makeblastdb -in pr2_version_5.0.0_SSU_taxo_long.fasta -dbtype nucl -title "PR2 Full Database"

generates only .nin, .nhr, and .nsq files, and shows the following output:

Building a new DB, current time: 10/18/2023 14:05:45
New DB name: /Users/nastassiapatin/DBs/PR2/PR2_full-MagicBlastDB/pr2_version_5.0.0_SSU_taxo_long.fasta
New DB title: PR2 Full Database
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 221085 sequences in 7.21831 seconds.

This command (with -parse_seqids)

makeblastdb -in pr2_version_5.0.0_SSU_taxo_long.fasta -dbtype nucl -parse_seqids -title "PR2 Full Database"

also generates only .nin, .nhr, and .nsq files, but these seem to be incomplete as they have ".00" before the .n** suffixes. This command produces the following output:

Building a new DB, current time: 10/18/2023 14:04:02
New DB name: /Users/nastassiapatin/DBs/PR2/PR2_full-MagicBlastDB/pr2_version_5.0.0_SSU_taxo_long.fasta
New DB title: PR2 Full Database
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
[1] 6917 abort makeblastdb -in pr2_version_5.0.0_SSU_taxo_long.fasta -dbtype nucl -title

Do you have any ideas why the full PR2 fasta might be failing to build a BLAST database? When I build one from only a select group of taxa (order Dinophyceae) it works fine.

Thanks,
Nastassia

Typo in taxonomy

Hi,

Within the group of Colpodellida (Alveolata) there is a genus named "Aphamonas" which, if I'm correct, should be named as 'Alphamonas'.

Best,
Máté

Lengths of sequences

Hello, I am afraid this may be a naive query rather than an issue. If so, I am sure it can be closed and discarded quickly. I have been running a large dataset against SILVA and PR2 (an older version, but my question still pertains) and am trying to reconcile the two sets of results. I ran into a few cases where I could not understand how I could be getting high-identity matches to such different things in the two databases. In several of these cases, I retrieved the sequences from GenBank to align with the relevant OTUs and found that our OTU did not match the GenBank entry as well as it matched the PR2 sequence with the same GenBank accession. I traced several of our problem cases to situations where the PR2 sequence with AccessionX was longer than the sequence associated with AccessionX in GenBank. PR2 version 4.9.2 appears to have more than 1000 sequences that are more than 100 bp longer than the associated GenBank sequence. Is this expected? I apologize if this is a silly question. I am just new to this type of analysis.

Problems with version 4.9.0

Dear Daniel,

I hope you are well. I am currently analyzing a number of 18S MiSeq datasets with the software mothur (v. 1.39.5) and am using the PR2 database for the classification of sequences and OTUs. I have used PR2 version 4.7.2 for mothur in the past and found the classification much better in comparison to SILVA NR v. 132 – as was to be expected. 😊 I now tried to use the new version PR2 v. 4.9.0 for mothur but unfortunately encountered some errors that in the end cause mothur to abort. I checked the error messages I got by simply opening the fasta and tax files in Notepad and found that it was mainly formatting errors that caused the problems. I think/hope that these can easily be fixed.

FJ402948.1.1186_U & FJ402949.1.1210_U are both found twice in the tax file. As mothur requires each entry to be unique, these duplicates cause an error.

“LC054937.1.1751_U is missing the final ';', ignoring. [ERROR]: ;Plagiodinium;Plagiodinium_belizeanum; is missing the final ';', ignoring.”
The sequence LC054937.1.1751_U has an accidental hard return after the name Prorocentraceae, therefore mothur does not recognize that the ; and the final part of the taxonomy (“;Plagiodinium;Plagiodinium_belizeanum;”) belong together.

“'AB353770.1.1740_U' is in your template file and is not in your taxonomy file. Please correct.”
This error is shown for each sequence in the PR2 database and it is what causes mothur to abort. I think that this has to do with incorrect hard returns and TABs in the new version. If I look at v 4.7.2 using Notepad, the fasta file looks like

>AB353770.1.1740_U
ATGCTTGTCTCAAAGATTAAGCCATGCATGTCTCAGTATAAGC…
>KC672520.1.1801_U
TACCTGGTTGATTCTGCCCCTATTCATATGCTTGTCTCAAAGATTAAGCC…

And in the taxonomy file the lines are all like

AB353770.1.1740_U Eukaryota;Alveolata;Dinophyta;Dinophyceae;Dinophyceae_X;Dinophyceae_XX;Peridiniopsis;Peridiniopsis_kevei;
KC672520.1.1801_U Eukaryota;Opisthokonta;Fungi;Ascomycota;Pezizomycotina;Leotiomycetes;Leotiomycetes_X;Leotiomycetes_X_sp.;

In version 4.9.0 the fasta file looks like

>AB353770.1.1740_UATGCTTGTCTCAAAGATTAAGCCATGCATGTCTCAGTATAAGC…>KC672520.1.1801_UTACCTGGTTGATTCTGCCCCTATTCATATGCTTGTCTCAAAGATTAAGCC…

Similarly, the tax file looks like

AB353770.1.1740_U Eukaryota;Alveolata;Dinoflagellata;Dinophyceae;Peridiniales;Kryptoperidiniaceae;Unruhdinium;Unruhdinium_kevei;KC672520.1.1801_U Eukaryota;Opisthokonta;Fungi;Ascomycota;Pezizomycotina;Leotiomycetes;Leotiomycetes_X;Leotiomycetes_X_sp.;AB284159.1.1765_U Eukaryota;Alveolata;Dinoflagellata;Dinophyceae;Peridiniales;Protoperidiniaceae;Protoperidinium;Protoperidinium_bipes;AY123745.1.924_UC …

I assume that these formatting errors cause the mothur software to not recognize the corresponding names in the fasta and tax files.
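The three symptoms reported above (duplicated names, a lineage missing its final ';', and names present in the fasta but absent from the taxonomy) could be screened for before running mothur. The following is a minimal sketch, assuming each taxonomy line holds a name, whitespace, then a ';'-separated lineage; the function name and the returned dictionary keys are hypothetical.

```python
from collections import Counter


def check_mothur_taxonomy(tax_lines, fasta_names):
    """Report duplicated names, lineages without a final ';', and
    fasta names that have no entry in the taxonomy."""
    names = []
    unterminated = []
    for line in tax_lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        # First whitespace-delimited token is the name, the rest is the lineage.
        fields = line.split(None, 1)
        name = fields[0]
        lineage = fields[1].strip() if len(fields) > 1 else ""
        names.append(name)
        if not lineage.endswith(";"):
            unterminated.append(name)
    counts = Counter(names)
    duplicated = sorted(name for name, count in counts.items() if count > 1)
    missing = sorted(set(fasta_names) - set(names))
    return {"duplicated": duplicated, "unterminated": unterminated, "missing": missing}
```

A stray hard return inside a lineage, as described for LC054937.1.1751_U, would show up here as one unterminated entry followed by a line whose "name" starts with ';'.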

René Groben

non-ascii character in PR2 5.0

Hello,

in the entry MF423350 (Heterocapsa steinii), the 'space' in the species name is not an ASCII space. It is encoded as the two-byte sequence $C2A0 (the UTF-8 encoding of a non-breaking space), where it should be the single byte $20. That causes an error when processing this release with cutadapt:

>MF423350.1.1769_U;tax=k:Eukaryota,d:TSAR,p:Alveolata-Dinoflagellata,c:Dinophyceae,o:Peridiniales,f:Heterocapsaceae,g:Heterocapsa,s:Heterocapsa steinii

It seems to be the only non-ascii character in that release:

zgrep --color='auto' -P -n '[^\x00-\x7F]' pr2_version_5.0.0_SSU_UTAX.fasta.gz
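For environments where zgrep -P is not available, the same scan could be sketched in Python. This is an illustrative equivalent rather than part of the original report; the function name is hypothetical, and the file is assumed to be UTF-8 text, optionally gzip-compressed.

```python
import gzip


def find_non_ascii(path):
    """Yield (line_number, column, character) for each non-ASCII character."""
    # Pick the opener from the file extension; both accept text mode.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line_number, line in enumerate(handle, start=1):
            for column, char in enumerate(line, start=1):
                if ord(char) > 127:
                    yield line_number, column, char
```

Calling list(find_non_ascii("pr2_version_5.0.0_SSU_UTAX.fasta.gz")) on the file named above should list the offending positions, which can then be replaced with a plain space.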

Microsporidia

There are seven sequences tagged as Microsporidia in PR2 4.12:

GU130407
EU589246
EF990668
GQ246188
GQ258752
JQ268567
KF830273

all with the following taxonomic assignment:

Eukaryota|Opisthokonta|Fungi|Microsporidiomycota|Microsporidiomycotina|Microsporidiomycotina_X|Microsporidia|Microsporidia_sp.

Using figure 1 from Bass et al., (2018), new reference sequences for Microsporidia could be added to PR2, with improved taxonomic assignments.

For instance, Bass et al. placed AB534337 in the "expanded Microsporidia" clade, and sub-clade "Laz X (Lazarus and James, 2015); 'Mitosporidium' (Corsaro et al 2016)".

Present assignment of AB534337 in PR2 is:

1         2            3     4       5        6         7          8
Eukaryota|Opisthokonta|Fungi|Fungi_X|Fungi_XX|Fungi_XXX|Fungi_XXXX|Fungi_XXXX_sp.

Assignment could be:

1         2            3     4            5             6                   7             8
Eukaryota|Opisthokonta|Fungi|Cryptomycota|Microsporidia|Microsporidia_Laz_X|Mitosporidium|Mitosporidium_sp.

Note that none of the seven sequence accessions tagged as Microsporidia in PR2 4.12 are present in figure 1 of Bass et al.

pr2_version_4.11.1_dada2 problem with v9 seqs

Hello,

I have problems using pr2_version_4.11.1_dada2.fasta with the assignTaxonomy function on a seqtab (dim = 5 x 1625) built from sequences of the V9 region of 18S, with lengths between 100 and 125 bp. I'm working on a MacBook Pro (8 GB RAM, 8th-generation i5, 4 cores), but the run gets stuck at some point. Doing the same with the Silva 16S database "rdp_train_set_16.fa", the run goes OK.

Could be something related to the short size of seqs? Any other option?

Thanks a lot

Soluna

Ranks vector missing in Decipher trainset version 5.0

Hi,

Thank you for making this information available.

I am using Decipher IDtaxa algorithm to assign taxonomic annotations to my taxonomic units, using the trainset of PR2 v5.0 as the reference. All worked fine, but upon revision of the output, I noticed that the string vector that should contain the headers of the taxonomic ranks is missing. I checked within the training set provided here, and as of today $rank: NULL. I can add this manually, but you might want to check this.

Regards,

Troubleshoot for generating reference database for blast analysis

Dear Users,
There is an error in the PR2 dataset (pr2_version_4.10.0_mothur.fasta).
The sequence ID CP000499.0.0_U has no nucleotide sequence.
This causes an error when generating a reference database for BLAST analysis.
You have to remove this record from the dataset before building the local database.
Hope this helps,
Regards,
Chetan
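The workaround described in this issue (removing records that have a header but no sequence) could be sketched as below. The helper names are hypothetical, and the parser is a minimal one that assumes plain FASTA with '>' header lines.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_empty_records(lines):
    """Split records into those with a sequence and the headers of empty ones."""
    kept, dropped = [], []
    for header, sequence in read_fasta(lines):
        if sequence:
            kept.append((header, sequence))
        else:
            dropped.append(header)
    return kept, dropped
```

Reporting the dropped headers, rather than discarding them silently, makes it easy to confirm that only the expected record was removed.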

Training a custom database for classification

Hello,
I used your script #26 to train a database for the IdTaxa function in the DECIPHER package. However, after classification, the majority of sequences are classified as a specific species that does not match the results from NCBI blast.

My database is built with around 1.5 million COI sequences from NCBI. I would appreciate your advice on:

Determining the best values for maxGroupSize and maxIterations for database training.
Understanding whether the length of the sequences in the database has any impact on the classification result (some of the sequences collected from NCBI in my database are over 1 million in length).

Thank you in advance for your assistance.

sqlite taxon_trophic_mode all 'None'

Hello,

I downloaded the sqlite database (pr2_version_4.12.0.sqlite) to be able to query it easily on an ec2 instance. However, when I query the pr2_taxonomy table and look at taxon_trophic_mode all entries are (None, ).

Here is more information if useful:
I am using sqlite3 to query the database through Python.

When I run:
for row in cur.execute('SELECT DISTINCT taxon_trophic_mode FROM pr2_taxonomy'): print(row)

I get solely:
(None,)

This set-up otherwise works fine to interact with the database.

Thanks,

Ben
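To tell a column that is entirely NULL apart from one that is only sparsely populated, the query above could be extended as in this sketch. It uses only the table and column names quoted in the issue (pr2_taxonomy, taxon_trophic_mode); the function name is hypothetical.

```python
import sqlite3


def trophic_mode_summary(connection):
    """Return (total_rows, rows_with_value, distinct_values) for taxon_trophic_mode."""
    cur = connection.cursor()
    # COUNT(column) ignores NULLs, so the two counts differ when values are missing.
    total, filled = cur.execute(
        "SELECT COUNT(*), COUNT(taxon_trophic_mode) FROM pr2_taxonomy"
    ).fetchone()
    distinct = [
        row[0]
        for row in cur.execute(
            "SELECT DISTINCT taxon_trophic_mode FROM pr2_taxonomy "
            "WHERE taxon_trophic_mode IS NOT NULL ORDER BY taxon_trophic_mode"
        )
    ]
    return total, filled, distinct
```

With the downloaded file this would be called as trophic_mode_summary(sqlite3.connect("pr2_version_4.12.0.sqlite")); a result whose second element is 0 would confirm that the field is empty in that release rather than mis-queried.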

DADA2 assignSpecies/addSpecies

We noticed that the latest PR2 release contains only one file for DADA2 annotation, compatible with assignTaxonomy based on a naive Bayesian classifier. However, we wanted to get species-level identification using exact matching implemented in the assignSpecies and addSpecies functions of DADA2. We made in-house files to perform the analysis and run it successfully. However, the following issues remain open:

It has been suggested that we remove sequences not annotated to a specific species, i.e. the ones ending with "_sp.". However, some of our sequences have an exact match both to one of these reference sequences and to another reference for which we have a species name. If the sequences annotated as ..._sp. were removed, we would classify these sequences to the named species, even though current results show that there are other, unnamed species with exactly matching sequences as well. Should the "_sp." annotations instead be removed after the taxonomic classification with DADA2 is done? (That is, only the ones which were the sole exact matches to a sequence.)
Some sequences which were annotated to one species using assignTaxonomy had exact matches to more than one species, e.g. to Skeletonema marinoi and Skeletonema costatum. (This might be an issue more relevant for DADA2 developers, but still important in the context of best practices for PR2-based analysis using DADA2)

amphibian wallaby

ABQO010458413.64.1383_U corresponds to a wallaby genome (https://www.ncbi.nlm.nih.gov/nuccore/ABQO010458413) but is assigned as Amphibia, ...

more chimera detected

Dear pr2 collaborators,
Thank you for the great work! It's the best database for identifying eukaryotes and protists in particular!
I was surprised to find sea slugs, a bryozoan and a pea as the most abundant eukaryotic reads in samples from the top of some mountains...
Checking these sequences (blast + aligned to similar sequences), I found them to be chimeric:

FJ917445 | Berthella californica bacterial insert, bases 619 to 815, and a large conserved part of the 18S missing.
FJ917457 | Berthella martensi first 141 bases are bacterial. Then nearly identical B. martensi MF958319.
EU650324 | Plumatella sp. bacterial (Mycoplasma) insert, bases 879 to 1278.
HO777700 | Phaseolus acutifolius is the sequence of a chloroplast.

I found other problematic sequences, are you interested in getting the whole list?
Best regards,
AM

Query fails on online DB

Hello, when I enter a 349-nt sequence into the query field online, I consistently get the following message: "An error has occurred. Check your logs or contact the app author for clarification."

I don't see anywhere to download or view the log files. The sequence is annotated on NCBI as belonging to the Order Dinophyceae. Any thoughts?

Training set derived from PR2 for IDTAXA (DECIPHER)

Hi Daniel,

I was wondering whether there was a training set available derived from the PR2 database adapted for the idtaxa function (DECIPHER package) or whether there was a tutorial/ some documentation providing clues on how to create it from the original database.

Thanks for your help!

Caroline

problem with Fragilariopsis

When reannotating the Tara Oceans V9 metabarcodes using PR2 v4.14, we noted the lack of Fragilariopsis in the new annotation. Using the most abundant Fragilariopsis OTU in the original dataset, we found that the best hits in PR2 v4.14 correspond to 100% identity to a group of sequences including a Dinophyceae (obviously an error in PR2 v4.14) and a bunch of diatom reference sequences poorly assigned as 'Raphid-pennate_X_sp', in addition to a Fragilariopsis sequence:

query:

1533bc0882cb58d28b5c56f49269c1c2
gtcgcacctaccgattgaatggtccggtgaagcctcgggattgtggttagtttcctttattggaagttagtcgcgagaacttgtctaaaccttatcatttagaggaaggtgaagtcgtaacaaggtttcc

Subjects (100 % id):
KC771185.1.1787_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X,s:Raphid-pennate_X_sp.
EF100371.1.1232_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X,s:Raphid-pennate_X_sp.
KC771174.1.1777_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X,s:Raphid-pennate_X_sp.
KC771149.1.1790_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X,s:Raphid-pennate_X_sp.
KC771193.1.1786_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X,s:Raphid-pennate_X_sp.
KC771190.1.1789_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X,s:Raphid-pennate_X_sp.
KC771168.1.1787_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X,s:Raphid-pennate_X_sp.
EF140623.1.1779_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Fragilariopsis,s:Fragilariopsis_curta
KJ757846.1.1801_U;tax=k:Eukaryota,d:Alveolata,p:Dinoflagellata,c:Dinophyceae,o:Dinophyceae_X,f:Dinophyceae_XX,g:Dinophyceae_XXX,s:Dinophyceae_XXX_sp.
KJ757881.1.1791_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X,s:Raphid-pennate_X_sp.
KJ758129.1.1800_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X,s:Raphid-pennate_X_sp.
KJ758140.1.1795_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X,s:Raphid-pennate_X_sp.
KJ758229.1.1792_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Fragilariopsis,s:Fragilariopsis_sp.

I wonder whether there are more mistakes in the last PR2 version. Maybe a way of checking this is by running some re-assignment test, or random checks to spot other mistakes, or maybe by detecting cases where the taxonomy of the best hit sequences (with identical %id) are strongly contradictory - a dinoflagellate followed by diatoms as in this case.
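The last suggestion (flagging tied best hits whose taxonomy is strongly contradictory) could be sketched as follows. It assumes hit labels in the UTAX header format shown above, with rank codes k, d, p, c, o, f, g, s; the function names are hypothetical, and deciding which rank counts as a "strong" contradiction is left to the caller.

```python
# Rank codes in the order they appear in the UTAX headers quoted above.
RANKS = ["k", "d", "p", "c", "o", "f", "g", "s"]


def parse_utax_header(header):
    """Split 'ID;tax=k:...,d:...' into (ID, {rank_code: name})."""
    seq_id, _, tax = header.lstrip(">").partition(";tax=")
    lineage = dict(item.split(":", 1) for item in tax.split(",") if ":" in item)
    return seq_id, lineage


def first_conflicting_rank(headers):
    """Return (rank, names) for the highest rank at which tied hits disagree,
    or None if all hits share the same lineage."""
    lineages = [parse_utax_header(header)[1] for header in headers]
    for rank in RANKS:
        names = {lineage.get(rank) for lineage in lineages}
        if len(names) > 1:
            return rank, sorted(name for name in names if name is not None)
    return None
```

Applied to the 13 subjects listed above, the first disagreement would be at rank d (Alveolata versus Stramenopiles), whereas ties that only differ at g or s are the ordinary case of closely related references.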

Remove PR2 sequences shorter than 500 bp

173 sequences are shorter than the min length (500 bp). Remove in version 4.11.0

Fungi

We regularly use your database, and an error has been reported to us concerning the 'Alternaria and Aspergillus' references below, which are Ascomycota and not Basidiomycota.
Thank you for your feedback,
Have a good day

Marina Moletta-Denat

KF747355.1.1082_U Eukaryota;Opisthokonta;Fungi;Basidiomycota;Agaricomycotina;Agaricomycetes;Alternaria;Alternaria+arborescens;
KF747361.1.1189_U Eukaryota;Opisthokonta;Fungi;Basidiomycota;Agaricomycotina;Agaricomycetes;Aspergillus;Aspergillus+tubingensis;
KF747362.1.1038_U Eukaryota;Opisthokonta;Fungi;Basidiomycota;Agaricomycotina;Agaricomycetes;Aspergillus;Aspergillus+niger;
KF747363.1.1166_U Eukaryota;Opisthokonta;Fungi;Basidiomycota;Agaricomycotina;Agaricomycetes;Aspergillus;Aspergillus+tubingensis;
KF747365.1.1153_U Eukaryota;Opisthokonta;Fungi;Basidiomycota;Agaricomycotina;Agaricomycetes;Alternaria;Alternaria+alternata;
KF747366.1.1138_U Eukaryota;Opisthokonta;Fungi;Basidiomycota;Agaricomycotina;Agaricomycetes;Alternaria;Alternaria+citri;

Cryothecomonas aestivalis

Hi there, I've noticed PR2 incorporates only 1 of 2 strains of Cryothecomonas aestivalis reference sequences from Kuhn et al. 2000 (https://www.sciencedirect.com/science/article/pii/S1434461004700322).

Strain 2 from that study (genbank accession AF290541.1) does not appear to be included in PR2 version 5.0.0 nor in 4.14.0.

I can't find any follow-up studies to suggest this reference should be excluded so just pointing out in case this exclusion is unintentional.

Thanks for all your hard work developing and maintaining this excellent resource.

taxon_trophic_mode all NA

Hi,

I'm trying to retrieve data from the taxon_trophic_mode field in PR2 v4.12.0, but all the values are NA?

Table pr2_taxonomy - add fields
taxon_trophic_mode - detailed trophic mode (e.g. "C-fixation constitutive; Mixotroph")

> library("pr2database")
> 
> data("pr2")
> colnames(pr2)

> unique(pr2$taxon_trophic_mode)
[1] NA

R sessionInfo:

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] pr2database_4.12.0

loaded via a namespace (and not attached):
[1] compiler_3.5.0   tools_3.5.0      pillar_1.3.1     rstudioapi_0.9.0 tibble_2.0.1     yaml_2.2.0       crayon_1.3.4     pkgconfig_2.0.2 
[9] rlang_0.3.1

pr2_version_4.14.0_SSU.decipher.trained.rds

Hello professor!

I tried using this trained dataset for my analyses, but it seems corrupt.

Thank you! :D

Chrysophyceae

Check Chrysophyceae clades

Andersen RA, Graf L, Malakhov Y, Yoon HS. (2017). Rediscovery of the Ochromonas type species Ochromonas triangulata (Chrysophyceae) from its type locality (Lake Veysove, Donetsk region, Ukraine). Phycologia 56:591–604.

How to use PR2 with assignTaxonomy from DADA2?

Hi!
I'm trying to use the PR2 database with the assignTaxonomy function from DADA2 to assign taxonomy to rRNA sequences in a fastq file. I found a note about using the taxLevels c("Kingdom","Supergroup","Division","Class","Order","Family","Genus","Species") but I wonder how this compares to taxonomic assignments for 16S sequences (which I suspect may also be included in my sequence file) where the taxLevels only go down to genus and then a separate training fasta is used to assign species?

If I want to make taxonomic assignments consistent between 18S and 16S, should I skip species from the PR2 training fasta header and add it separately with the addSpecies function?

Thanks in advance for any help and for a good resource!

entries that may be in the wrong orientation (reverse, complement or both)

While working with cutadapt (v3.4) to extract the SSU V4 region, I've noticed that some entries are not in the expected orientation (compared to the reference Saccharomyces cerevisiae). I wrote a script to search for all possible orientations:

## primers from Stoeck et al. 2010
ORIGINAL_PRIMER_F="CCAGCASCYGCGGTAATTCC"
ORIGINAL_PRIMER_R="ACTTTCGTTCTTGATYRA"


complement() {
    # complement a DNA/RNA IUPAC string
    [[ -z "${1}" ]] && { echo "error: empty string" ; exit 1 ; }
    local -r nucleotides="acgturykmbdhvswACGTURYKMBDHVSW"
    local -r complements="tgcaayrmkvhdbswTGCAAYRMKVHDBSW"
    tr "${nucleotides}" "${complements}" <<< "${1}"
}


trim_primers() {
    # trim and keep only entries with both primers
    # (thresholds are set first, so that OPTIONS is built with their values)
    local MIN_LENGTH="32"
    local ERROR_RATE="0.2"
    local OPTIONS="--minimum-length ${MIN_LENGTH} --discard-untrimmed --error-rate ${ERROR_RATE} --quiet"
    local CUTADAPT="$(which cutadapt) ${OPTIONS}"
    # require a match over at least 2/3 of each primer
    local MIN_F=$(( ${#PRIMER_F} * 2 / 3 ))
    local MIN_R=$(( ${#PRIMER_R} * 2 / 3 ))

    zcat "${SOURCE}.gz" | \
        dos2unix | \
        ${CUTADAPT} \
            --front "${PRIMER_F}" \
            --overlap "${MIN_F}" - | \
        ${CUTADAPT} \
            --adapter "${PRIMER_R}" \
            --overlap "${MIN_R}" - > "${OUTPUT}"
}

## download PR2 (UTAX version)
URL="https://github.com/pr2database/pr2database/releases/download"
VERSION="4.13.0"
SOURCE="pr2_version_${VERSION}_18S_UTAX.fasta"
[[ -e "${SOURCE}.gz" ]] || wget "${URL}/v${VERSION}/${SOURCE}.gz"


## search rev-comp entries
OUTPUT="revcomp.fas"
PRIMER_F="${ORIGINAL_PRIMER_R}"
PRIMER_R="$(complement "${ORIGINAL_PRIMER_F}" | rev)"
trim_primers


## search reverse entries
OUTPUT="reverse.fas"
PRIMER_F="$(complement "${ORIGINAL_PRIMER_R}")"
PRIMER_R="$(rev <<< "${ORIGINAL_PRIMER_F}")"
trim_primers


## search complement entries
OUTPUT="complement.fas"
PRIMER_F="$(complement "${ORIGINAL_PRIMER_F}")"
PRIMER_R="$(rev <<< "${ORIGINAL_PRIMER_R}")"
trim_primers

The script allows up to 20% errors in primer matching but requires the match to cover at least two-thirds of the primer length. That requirement should limit false positives.
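For readers who prefer Python, the primer transformations used by the script can be sketched as follows. The translation table mirrors the tr mapping in complement(), and min_overlap reproduces the integer arithmetic used for MIN_F and MIN_R; the function names are otherwise hypothetical.

```python
# IUPAC complement table, mirroring the tr mapping in the bash script.
_COMPLEMENT = str.maketrans(
    "acgturykmbdhvswACGTURYKMBDHVSW",
    "tgcaayrmkvhdbswTGCAAYRMKVHDBSW",
)


def complement(seq):
    """Complement a DNA/RNA IUPAC string without reversing it."""
    return seq.translate(_COMPLEMENT)


def orientations(seq):
    """Return the three non-native orientations of a primer or sequence."""
    return {
        "reverse": seq[::-1],
        "complement": complement(seq),
        "reverse_complement": complement(seq)[::-1],
    }


def min_overlap(primer):
    """Integer two-thirds of the primer length, as in $(( ${#PRIMER} * 2 / 3 ))."""
    return len(primer) * 2 // 3
```

With the Stoeck et al. primers, min_overlap gives 13 for the 20-nt forward primer and 12 for the 18-nt reverse primer, which are the --overlap values the script passes to cutadapt.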

The results are:

115 possible cases of 'reverse-complement' sequences (a frequent mistake),
30 possible cases of 'complement' sequences (rare, but I've seen 'complement' 18S entries on GenBank),
1 possible case of 'reverse' sequence

I attach the corresponding fasta files: misoriented_entries.zip

replicate entries in pr2 4.12.0

I'm getting replicate entries in v 4.12.0. Taxa entry is identical, but sequence entry is different. I'm trying to use the mothur fasta and tax files in a pipeline to train PR2 for use in QIIME2 ASV determination and tax assignment - these duplicate entries throw an error.

Replicate entries: KC486130.1.1651_U, AB275104.1.1590_U, EU087243.1.813_U, EU087247.1.847_U, FJ459738.1.1634_U

I know this was an issue from the last release and it was updated quickly. Is there an additional script you can include with each release that performs this de-duplication?
https://github.com/shu251/db-build-microeuks

Dereplicate PR2 sequences

Dereplicate PR2 sequences
Tag the longest
Create a dereplication table that maps the representative sequences to the other ones (use long format)
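One possible reading of these three steps is sketched below, assuming that "dereplicate" means grouping sequences that are identical to, or fully contained in, a longer sequence, with the longest one tagged as representative. That interpretation, the function name and the column layout are all assumptions, and the pairwise comparison is quadratic, so a dedicated tool would be needed at the scale of the full database.

```python
def dereplicate(records):
    """Group sequences identical to, or contained in, a longer sequence.

    records: dict mapping sequence id -> sequence string.
    Returns long-format rows (representative_id, member_id, is_representative).
    """
    # Longest first, so a representative is never shorter than its members;
    # ties are broken by id to keep the output deterministic.
    ordered = sorted(records.items(), key=lambda item: (-len(item[1]), item[0]))
    representatives = []  # (id, sequence) pairs tagged as the longest of a group
    rows = []
    for seq_id, seq in ordered:
        for rep_id, rep_seq in representatives:
            if seq in rep_seq:
                rows.append((rep_id, seq_id, False))
                break
        else:
            # No existing representative contains this sequence.
            representatives.append((seq_id, seq))
            rows.append((seq_id, seq_id, True))
    return rows
```

Each row of the result is one line of the long-format table: one row per sequence, with the representative repeated for every member of its group.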

Missing genus

Hi,

I've just noticed that the diatom genus Bacteriastrum is not represented in the database. Some examples that could be included are:
Accession MG972355.1 Bacteriastrum hyalinum
Accession GQ330314 Bacteriastrum hyalinum
Accession MG972357.1 Bacteriastrum jadranum

I don't know about other areas, but it is quite common in the southern Mediterranean basin.
Many thanks,

Taxonomy file

Dear Daniel,

I am trying to do taxonomy classification using KrakenUniq with the PR2 database.
In order to build a custom KrakenUniq database, a taxonomy database, specifically nodes.dmp and names.dmp, is required.
Could you please tell me where to download the PR2 taxonomy files?
Thank you in advance.

Cheers,
Yue

Identical entries in tax-file

Dear Daniel,
In the most recent version of the "tax"-file for mothur (downloaded 07.11.2018), GU824068.1.1173_U appears four times.
I also get the "'XXXX' is in your template file, and is not in your taxonomy file." for all the entries in PR2. I am running mothur on a unix-based computer cluster.

Thanks for keeping PR2 updated!

Best,
Elianne

Arthropoda level seems to be incorrect

Dear developer,

I'm using PR2 and mothur to do taxonomic assignment on my 18S metabarcoding data. I found that Maxillopoda, Branchiopoda, and Insecta etc. were assigned to the Family level, but they should be in the Class level. The entire Metazoa seems to be shifted... Could you please have a look at the issue? Thanks and have a good one!

Best,
Michelle

version 4.12 : presence of unrecognized characters in sequence ID

Hello,

I downloaded the fasta file (pr2_version_4.12.0_18S_taxo_long.fasta.gz) to format it for the FROGS pipeline (basically formatting for BLAST and RDP classifier).

I ran into some trouble with makeblastdb because of unrecognized characters in sequence identifiers:

Ã», which I replaced with a simple u; my guess is that it was a French û

The problematic sequence IDs are:

FJ660684.1.1753_U|18S_rRNA|nucleus|strain_14-fÃ©vr.
FJ609422.1.1913_U|18S_rRNA|nucleus|specimen_10-aoÃ»t
FJ660668.1.1752_U|18S_rRNA|nucleus|strain_02-fÃ©vr.
FJ660680.1.1752_U|18S_rRNA|nucleus|strain_13-fÃ©vr.
FJ660676.1.1751_U|18S_rRNA|nucleus|strain_12-fÃ©vr.
EU667999.1.1679_U|18S_rRNA|nucleus|strain_12-fÃ©vr.
EU667995.1.1682_U|18S_rRNA|nucleus|strain_07-fÃ©vr.
FJ660672.1.1752_U|18S_rRNA|nucleus|strain_11-fÃ©vr.
EU667999.1.1679_UC|18S_rRNA|nucleus|strain_12-fÃ©vr.
EU667995.1.1682_UC|18S_rRNA|nucleus|strain_07-fÃ©vr.

Just to let you know

Kindly

Maria

Missing Sequence

Hello there,

first of all, thanks for making this database available!

This is not a big problem but just something I noticed. I am using file: pr2_version_4.10.0_UTAX.fasta and got a warning formatting it:

WARNING: Empty sequence at line 2598276 in FASTA file /cluster/project/gdc/people/jwalser/db/pr2_version_4.10.0_UTAX.fasta, label >CP000499.0.0_U;tax=k:Eukaryota,d:Opisthokonta,p:Fungi,c:Ascomycota,o:Saccharomycotina,f:Saccharomycetales,g:Scheffersomyces,s:Scheffersomyces_stipitis

Adding the trimmed sequence (or removing the record) solved it.

Best

Errors in taxonomy ranks

Hi,

I have noticed a lot of errors in the taxonomy ranks when looking at the names. I first noticed the problem when using the already trained version (4.13). I then decided to use pr2_version_5.0.0_SSU_dada2 and train it myself, thinking the newest training set might give better results. However, I did not see any changes in the taxonomy ranks, as they appear to be mixed up.

This is what my results look like:

If we look at ASV1, Dreissena (genus) is correct, but the family is wrong: it should be Dreissenidae, the order should be Myida, Bivalvia is the class (and not the order), Mollusca is the phylum (not the class), and so on. One might think it is enough to change the names of the ranks, but that is not possible, because:

If we look at ASV6, Dinophyceae is a class (correct), Dinoflagellata is a superclass and Alveolata is a superphylum.

So the column names are incorrect, and the taxonomic ranks chosen are not even consistent. Looking back at the database, "Mollusca" and "Dinophyceae" are already in the same column (one is a phylum, the other is a class), which probably means that the database is mixed up even before the training step.

I am wondering how the names used for the taxonomic ranks are chosen. For some ASVs, it seems to be very random. Is there a way to fix this kind of issue?

Best,

Chlorophyta

Fix the following classes

Palmophyllophyceae
Picochlorophyceae
Micromonas species

PR2 database filtering criteria

Hello,

I am wondering if there is more information available somewhere on the filtering criteria that PR2 uses to include sequences? I have read the 2013 paper and I see that there have been some changes to the sequence lengths included and the number of ambiguous bases permitted. I am a little bit confused because I am also using the Silva 18S database and there are many sequences in Silva-Ref that are not in PR2. For example, the accession ID KP404780 is absent from PR2 v. 4.12 (but is included in Silva-Ref v. 132) and is annotated in GenBank as an 18S eukaryotic sequence and meets the PR2 criteria for sequence length and ambiguous bases.

Sorry if this is a silly question and thank you in advance!

dvutils package not available for R version 3.5.2

Hi,

I'm trying to use pr2 database and it fails when plotting with dvutils. I've tried to install dvutils package but I've got the following answer:

Warning in install.packages :
package ‘dvutils’ is not available (for R version 3.5.2)

Is there a solution for this issue?

thanks a lot.

Assigning taxonomy to ASVs by blastn

Hello PR2 team,

I made the pr2 custom database, like:
$ makeblastdb -in <input_fasta> -parse_seqids -blastdb_version 5 -title "custom_db_title" -dbtype nucl
now, I want to assign taxonomy to ASVs by blastn, the command as below:

#blastn
blastn -task megablast -db /../db/pr2/species_taxid.fasta -query ../ASV_sequences.fasta -out 1_ASV.blastn -perc_identity 70 -outfmt '7 qseqid sseqid pident qcovs mismatch gapopen qstart qend sstart send evalue bitscore' -max_target_seqs 50 -num_threads $(($processes*$threads))

#Add tax info

awk 'BEGIN {OFS=FS="\t"} NR==FNR{map[$1]=$2;next} {for(i=1;i<=NF;i++)$i=($i in map)?map[$i]:$i}1' /../db/pr2/taxonomy.tsv 1_ASV.blastn > 2_ASV_lineage.blastn
I used the tax_id (e.g. 1006) in taxonomy.tsv to match emu_db:1006, which has 111 replicates. If two ASVs have the same first field (i.e. 1006 in this example), they will receive the same classification information, which may result in two ASVs being incorrectly assigned to the same classification.
Do you have any suggestions to add taxonomy to the ASV table?
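One thing worth noting about the awk command above is that it replaces any field that matches a key, so with numeric keys such as 1006 a coordinate or count column holding the same number would also be rewritten. A possible alternative is to look up only the subject id column, as in this sketch. The function name, the "unassigned" placeholder and the assumption that the taxonomy has been loaded into a dict keyed by subject id are all hypothetical.

```python
def add_lineage(blast_lines, taxonomy, subject_column=1):
    """Append a lineage column to tabular BLAST rows.

    Only the subject id column (index 1 for the outfmt used above) is looked
    up, so numeric columns can never be mistaken for taxonomy keys.
    """
    for line in blast_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):  # skip outfmt 7 comment lines
            continue
        fields = line.split("\t")
        subject = fields[subject_column]
        fields.append(taxonomy.get(subject, "unassigned"))
        yield fields
```

Because the lineage is appended rather than substituted, the original subject id stays in the table, which keeps hits to different reference sequences distinguishable even when they share a lineage.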

Thank you very much!
Best, Wang
