pr2database / pr2database
Protist Ribosomal Reference database (PR2) - SSU rRNA gene database
License: MIT License
Hi Daniel,
I was wondering whether there was a training set available derived from the PR2 database adapted for the idtaxa function (DECIPHER package) or whether there was a tutorial/ some documentation providing clues on how to create it from the original database.
Thanks for your help!
Caroline
Hi,
I'm trying to use pr2 database and it fails when plotting with dvutils. I've tried to install dvutils package but I've got the following answer:
Warning in install.packages :
package ‘dvutils’ is not available (for R version 3.5.2)
Is there a solution for this issue?
thanks a lot.
We noticed that the latest PR2 release contains only one file for DADA2 annotation, compatible with assignTaxonomy, which is based on a naive Bayesian classifier. However, we wanted to get species-level identification using the exact matching implemented in the assignSpecies and addSpecies functions of DADA2. We made in-house files to perform the analysis and ran it successfully. However, the following issues remain open:
Hello,
I am wondering if there is more information available somewhere on the filtering criteria that PR2 uses to include sequences? I have read the 2013 paper and I see that there have been some changes to the sequence lengths included and the number of ambiguous bases permitted. I am a little bit confused because I am also using the Silva 18S database and there are many sequences in Silva-Ref that are not in PR2. For example, the accession ID KP404780 is absent from PR2 v. 4.12 (but is included in Silva-Ref v. 132) and is annotated in GenBank as an 18S eukaryotic sequence and meets the PR2 criteria for sequence length and ambiguous bases.
Sorry if this is a silly question and thank you in advance!
Hello PR2 team,
I made the pr2 custom database, like:
$ makeblastdb -in <input_fasta> -parse_seqids -blastdb_version 5 -title "custom_db_title" -dbtype nucl
now, I want to assign taxonomy to ASVs by blastn, the command as below:
#blastn
blastn -task megablast -db /../db/pr2/species_taxid.fasta -query ../ASV_sequences.fasta -out 1_ASV.blastn -perc_identity 70 -outfmt '7 qseqid sseqid pident qcovs mismatch gapopen qstart qend sstart send evalue bitscore' -max_target_seqs 50 -num_threads $(($processes*$threads))
awk 'BEGIN {OFS=FS="\t"} NR==FNR{map[$1]=$2;next} {for(i=1;i<=NF;i++)$i=($i in map)?map[$i]:$i}1' /../db/pr2/taxonomy.tsv 1_ASV.blastn > 2_ASV_lineage.blastn
I used the tax_id in taxonomy.tsv (e.g. 1006) to match emu_db:1006, which has 111 replicates. If two ASVs have the same first field (i.e. 1006 in this example), they will get the same classification information, which may result in two ASVs being incorrectly assigned the same classification.
Do you have any suggestions to add taxonomy to the ASV table?
Thank you very much!
Best, Wang
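One point worth noting about the awk command above: it substitutes every field that happens to equal a key in taxonomy.tsv, so numeric columns such as mismatch or sstart could also be rewritten. A minimal Python sketch of an alternative is given here, looking up only the subject-ID column. It assumes that taxonomy.tsv has two tab-separated columns (id, lineage) and that sseqid is the second column, as in the -outfmt string above; the function name annotate_blast is hypothetical.

```python
def annotate_blast(blast_lines, taxonomy_lines, subject_col=1):
    """Append a lineage column to BLAST tabular lines, keyed on the subject ID only."""
    # Assumption: each taxonomy line is "<id>\t<lineage>".
    lineage = {}
    for line in taxonomy_lines:
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) == 2:
            lineage[parts[0]] = parts[1]

    annotated = []
    for line in blast_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            # Comment lines written by -outfmt 7 are passed through unchanged.
            annotated.append(line)
            continue
        fields = line.split("\t")
        # Only the subject ID is looked up; all original columns stay untouched.
        annotated.append(line + "\t" + lineage.get(fields[subject_col], "NA"))
    return annotated
```

Each ASV keeps its own row and its own identity and coverage values; two ASVs that hit the same reference still receive the same lineage, so filtering on pident and qcovs before trusting the assignment remains a separate step.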
Dear pr2 collaborators,
Thank you for the great work! It's the best database for identifying eukaryotes and protists in particular!
I was surprised to find sea slugs, a bryozoan and a pea as the most abundant eukaryotic reads in samples from the top of some mountains...
Checking these sequences (blast + aligned to similar sequences), I found them to be chimeric:
FJ917445 | Berthella californica bacterial insert, bases 619 to 815, and a large conserved part of the 18S missing.
FJ917457 | Berthella martensi first 141 bases are bacterial. Then nearly identical B. martensi MF958319.
EU650324 | Plumatella sp. bacterial (Mycoplasma) insert, bases 879 to 1278.
HO777700 | Phaseolus acutifolius is the sequence of a chloroplast.
I found other problematic sequences, are you interested in getting the whole list?
Best regards,
AM
Hi there, I've noticed PR2 incorporates only 1 of 2 strains of Cryothecomonas aestivalis reference sequences from Kuhn et al. 2000 (https://www.sciencedirect.com/science/article/pii/S1434461004700322).
Strain 2 from that study (genbank accession AF290541.1) does not appear to be included in PR2 version 5.0.0 nor in 4.14.0.
I can't find any follow-up studies to suggest this reference should be excluded so just pointing out in case this exclusion is unintentional.
Thanks for all your hard work developing and maintaining this excellent resource.
A few PR2 ids do not follow the standard rule for constructing a PR2 id (Genbank.start.end_X). This needs to be corrected (see sequence ids 176963 -> 178133 in particular).
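A check of this rule could be sketched as follows. The regular expression is an assumption inferred from ids quoted elsewhere on this page (for example AB353770.1.1740_U and AY123745.1.924_UC), not an official PR2 specification, and check_pr2_id is a hypothetical helper name.

```python
import re

# Assumed pattern: <accession>.<start>.<end>_<suffix>, e.g. "AB353770.1.1740_U".
PR2_ID = re.compile(
    r"^(?P<accession>[A-Za-z0-9_]+)\.(?P<start>\d+)\.(?P<end>\d+)_(?P<suffix>[A-Z]+)$"
)

def check_pr2_id(pr2_id):
    """Return the parsed parts of an id, or None if it does not fit the assumed rule."""
    match = PR2_ID.match(pr2_id)
    if match is None:
        return None
    parts = match.groupdict()
    parts["start"] = int(parts["start"])
    parts["end"] = int(parts["end"])
    return parts
```

Running this over every id in a release and listing those that return None would give a first list of candidates to correct.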
Hello,
in the entry MF423350 (Heterocapsa steinii), the 'space' in the species name is not an ASCII space. It is encoded as the two-byte sequence $C2A0 (a UTF-8 non-breaking space), where it should be a simple one-byte value $20. That creates an error when processing this release with cutadapt:
>MF423350.1.1769_U;tax=k:Eukaryota,d:TSAR,p:Alveolata-Dinoflagellata,c:Dinophyceae,o:Peridiniales,f:Heterocapsaceae,g:Heterocapsa,s:Heterocapsa steinii
It seems to be the only non-ascii character in that release:
zgrep --color='auto' -P -n '[^\x00-\x7F]' pr2_version_5.0.0_SSU_UTAX.fasta.gz
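The same search can be sketched in Python, which also makes it easy to repair the header. This is only an illustration: the function names are hypothetical, and the replacement assumes that the non-breaking space (U+00A0) is the only offending character, as reported above.

```python
def find_non_ascii(lines):
    """Yield (line_number, column, character) for every non-ASCII character."""
    for line_no, line in enumerate(lines, start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                yield line_no, col, ch

def replace_nbsp(line):
    """Replace U+00A0 (UTF-8 bytes C2 A0) with a plain ASCII space."""
    return line.replace("\u00a0", " ")
```

For a gzipped release, the lines could be read with gzip.open(path, "rt", encoding="utf-8") and passed directly to find_non_ascii.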
Hi,
I have noticed a lot of errors in the taxonomy ranks when looking at the names. I first noticed the problem with the already trained version (4.13). I then decided to use pr2_version_5.0.0_SSU_dada2 and train it myself, thinking the newest training set might give better results. However, I did not see any change in the taxonomy ranks: they still appear to be mixed up.
This is what my results look like:
If we look at ASV1, Dreissena (genus) is correct, but the family is wrong: it should be Dreissenidae. The order should be Myida, Bivalvia is the class (not the order), Mollusca is the phylum (not the class), and so on. One might think it is enough to rename the ranks, but that is not possible, because:
If we look at ASV6, Dinophyceae is a class (correct), Dinoflagellata is a superclass and Alveolata is a superphylum.
So the column names are incorrect, and the taxonomic ranks chosen are not even consistent. Looking back at the database, "Mollusca" and "Dinophyceae" are already in the same column (one is a phylum, the other is a class), which probably means that the database is mixed up even before the training step.
I am wondering how the names used for each taxonomic rank are chosen. For some ASVs, it seems to be very random. Is there a way to fix this kind of issue?
Best,
173 sequences are shorter than the min length (500 bp). Remove in version 4.11.0
Dear developer,
I'm using PR2 and mothur to do taxonomic assignment on my 18S metabarcoding data. I found that Maxillopoda, Branchiopoda, and Insecta etc. were assigned to the Family level, but they should be in the Class level. The entire Metazoa seems to be shifted... Could you please have a look at the issue? Thanks and have a good one!
Best,
Michelle
Dear Users,
There is an error in the PR2 dataset (pr2_version_4.10.0_mothur.fasta).
The sequence Id CP000499.0.0_U is without a nucleotide sequence.
This creates an error for generating reference database for blast analysis.
You have to eliminate this sequence id from the dataset before you carry on with the make local dataset.
Hope this helps,
Regards,
Chetan
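Removing such records before building the BLAST database could be sketched as below. The FASTA reader is deliberately minimal and the function names are hypothetical; it is not a replacement for a full parser.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def drop_short(records, min_length=1):
    """Split records into (kept, dropped) according to sequence length."""
    kept, dropped = [], []
    for header, seq in records:
        (kept if len(seq) >= min_length else dropped).append((header, seq))
    return kept, dropped
```

With the default min_length=1 only empty records such as CP000499.0.0_U are dropped; a higher threshold would apply the same filter to sequences below a chosen minimum length.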
Check Chrysophyceae clades
Hello,
I downloaded the fasta file (pr2_version_4.12.0_18S_taxo_long.fasta.gz) to format it for the FROGS pipeline (basically formatting for BLAST and RDP classifier).
I encountered some trouble with makeblastdb because of unrecognized characters in the sequence identifiers:
é, which I changed to a simple e (my guess is that it was a French é)
û, which I changed to a simple u (my guess is that it was a French û)
The problematic sequence IDs are:
FJ660684.1.1753_U|18S_rRNA|nucleus|strain_14-févr.
FJ609422.1.1913_U|18S_rRNA|nucleus|specimen_10-août
FJ660668.1.1752_U|18S_rRNA|nucleus|strain_02-févr.
FJ660680.1.1752_U|18S_rRNA|nucleus|strain_13-févr.
FJ660676.1.1751_U|18S_rRNA|nucleus|strain_12-févr.
EU667999.1.1679_U|18S_rRNA|nucleus|strain_12-févr.
EU667995.1.1682_U|18S_rRNA|nucleus|strain_07-févr.
FJ660672.1.1752_U|18S_rRNA|nucleus|strain_11-févr.
EU667999.1.1679_UC|18S_rRNA|nucleus|strain_12-févr.
EU667995.1.1682_UC|18S_rRNA|nucleus|strain_07-févr.
Just to let you know.
Kindly
Maria
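The manual substitution described above can be sketched with Python's standard unicodedata module. The helper name to_ascii is hypothetical, and the approach assumes that dropping the accents is acceptable for these identifiers.

```python
import unicodedata

def to_ascii(text):
    """Decompose accented characters and drop the combining marks (é -> e, û -> u)."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Characters with no ASCII decomposition are silently removed by "ignore".
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

Because characters without an ASCII decomposition are removed rather than replaced, it is worth comparing the headers before and after to confirm that only the accents changed.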
ABQO010458413.64.1383_U corresponds to a wallaby genome (https://www.ncbi.nlm.nih.gov/nuccore/ABQO010458413) but is assigned as Amphibia, ...
Hey ho,
for the following pr2 entries, the NCBI entries were removed:
AY745555.1.1854_U, AY745597.1.1844_U, EF209781.1.1956_U, EF209774.1.1835_U, EF209794.1.1834_U
Hello,
I have problems using pr2_version_4.11.1_dada2.fasta with the assignTaxonomy function on a seqtab (dim = 5 x 1625) built from sequences of the 18S V9 region, 100-125 bp long. I'm working on a MacBook Pro (8 GB RAM, 8th-generation i5, 4 cores), but the run gets stuck at some point. Doing the same with the Silva 16S database "rdp_train_set_16.fa", the run goes fine.
Could it be related to the short length of the sequences? Is there any other option?
Thanks a lot
Soluna
Fix the following classes
While working with cutadapt (v3.4) to extract the SSU V4 region, I've noticed that some entries are not in the expected orientation (compared to the reference Saccharomyces cerevisiae). I wrote a script to search for all possible orientations:
## primers from Stoeck et al. 2010
ORIGINAL_PRIMER_F="CCAGCASCYGCGGTAATTCC"
ORIGINAL_PRIMER_R="ACTTTCGTTCTTGATYRA"
complement() {
    # complement a DNA/RNA IUPAC string
    [[ -z "${1}" ]] && { echo "error: empty string" ; exit 1 ; }
    local -r nucleotides="acgturykmbdhvswACGTURYKMBDHVSW"
    local -r complements="tgcaayrmkvhdbswTGCAAYRMKVHDBSW"
    tr "${nucleotides}" "${complements}" <<< "${1}"
}
trim_primers() {
    # trim and keep only entries with both primers
    # (thresholds are set before OPTIONS so they are expanded on the first call)
    MIN_LENGTH="32"
    ERROR_RATE="0.2"
    OPTIONS="--minimum-length ${MIN_LENGTH} --discard-untrimmed --error-rate ${ERROR_RATE} --quiet"
    CUTADAPT="$(which cutadapt) ${OPTIONS}"
    # require a match of at least 2/3 of each primer's length
    MIN_F=$(( ${#PRIMER_F} * 2 / 3 ))
    MIN_R=$(( ${#PRIMER_R} * 2 / 3 ))
    zcat "${SOURCE}.gz" | \
        dos2unix | \
        ${CUTADAPT} \
            --front "${PRIMER_F}" \
            --overlap "${MIN_F}" - | \
        ${CUTADAPT} \
            --adapter "${PRIMER_R}" \
            --overlap "${MIN_R}" - > "${OUTPUT}"
}
## download PR2 (UTAX version)
URL="https://github.com/pr2database/pr2database/releases/download"
VERSION="4.13.0"
SOURCE="pr2_version_${VERSION}_18S_UTAX.fasta"
[[ -e "${SOURCE}.gz" ]] || wget "${URL}/v${VERSION}/${SOURCE}.gz"
## search rev-comp entries
OUTPUT="revcomp.fas"
PRIMER_F="${ORIGINAL_PRIMER_R}"
PRIMER_R="$(complement "${ORIGINAL_PRIMER_F}" | rev)"
trim_primers
## search reverse entries
OUTPUT="reverse.fas"
PRIMER_F="$(complement "${ORIGINAL_PRIMER_R}")"
PRIMER_R="$(rev <<< "${ORIGINAL_PRIMER_F}")"
trim_primers
## search complement entries
OUTPUT="complement.fas"
PRIMER_F="$(complement "${ORIGINAL_PRIMER_F}")"
PRIMER_R="$(rev <<< "${ORIGINAL_PRIMER_R}")"
trim_primers
The script allows 20% of errors in primer matching but forces a match greater or equal to 2/3rd of the primer length. That parameter should limit false positives.
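For readers who prefer to see the three orientations spelled out, the primer pairs built by the shell script can be sketched in Python. The complement table is copied from the script; the function names and the dictionary layout are illustrative assumptions.

```python
# Same IUPAC complement table as the shell function above.
_COMPLEMENT = str.maketrans(
    "acgturykmbdhvswACGTURYKMBDHVSW",
    "tgcaayrmkvhdbswTGCAAYRMKVHDBSW",
)

def complement(seq):
    """Complement a DNA/RNA IUPAC string without reversing it."""
    return seq.translate(_COMPLEMENT)

def reverse(seq):
    """Reverse a string without complementing it."""
    return seq[::-1]

def reverse_complement(seq):
    """Complement, then reverse."""
    return complement(seq)[::-1]

def primer_pairs(forward, reverse_primer):
    """Return the (--front, --adapter) primers for each orientation searched by the script."""
    return {
        "revcomp": (reverse_primer, reverse_complement(forward)),
        "reverse": (complement(reverse_primer), reverse(forward)),
        "complement": (complement(forward), reverse(reverse_primer)),
    }
```

Calling primer_pairs("CCAGCASCYGCGGTAATTCC", "ACTTTCGTTCTTGATYRA") returns the three pairs that the script passes to cutadapt for the revcomp, reverse and complement searches.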
The results are:
I attach the corresponding fasta files: misoriented_entries.zip
Hi,
I'm trying to retrieve data from the taxon_trophic_mode field in PR2 v4.12.0, but all the values are NA?
Table pr2_taxonomy - add fields
taxon_trophic_mode - detailed trophic mode (e.g. "C-fixation constitutive; Mixotroph")
> library("pr2database")
>
> data("pr2")
> colnames(pr2)
> unique(pr2$taxon_trophic_mode)
[1] NA
R sessionInfo:
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] pr2database_4.12.0
loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0 pillar_1.3.1 rstudioapi_0.9.0 tibble_2.0.1 yaml_2.2.0 crayon_1.3.4 pkgconfig_2.0.2
[9] rlang_0.3.1
There are seven sequences tagged as Microsporidia in PR2 4.12:
GU130407
EU589246
EF990668
GQ246188
GQ258752
JQ268567
KF830273
all with the following taxonomic assignment:
Eukaryota|Opisthokonta|Fungi|Microsporidiomycota|Microsporidiomycotina|Microsporidiomycotina_X|Microsporidia|Microsporidia_sp.
Using figure 1 from Bass et al., (2018), new reference sequences for Microsporidia could be added to PR2, with improved taxonomic assignments.
For instance, Bass et al. placed AB534337 in the "expanded Microsporidia" clade, and sub-clade "Laz X (Lazarus and James, 2015); 'Mitosporidium' (Corsaro et al 2016)".
Present assignment of AB534337 in PR2 is:
1 2 3 4 5 6 7 8
Eukaryota|Opisthokonta|Fungi|Fungi_X|Fungi_XX|Fungi_XXX|Fungi_XXXX|Fungi_XXXX_sp.
Assignment could be:
1 2 3 4 5 6 7 8
Eukaryota|Opisthokonta|Fungi|Cryptomycota|Microsporidia|Microsporidia_Laz_X|Mitosporidium|Mitosporidium_sp.
Note that none of the seven sequence accessions tagged as Microsporidia in PR2 4.12 are present in figure 1 of Bass et al.
Hello, I downloaded the full PR2 database as a fasta file ("pr2_version_5.0.0_SSU_taxo_long.fasta"), which according to the website is suitable for making a local BLAST database. However, when I run makeblastdb using BLAST+ version 2.6.0, it fails to generate a database.
This command
makeblastdb -in pr2_version_5.0.0_SSU_taxo_long.fasta -dbtype nucl -title "PR2 Full Database"
generates only .nin, .nhr, and .nsq files, and shows the following output:
Building a new DB, current time: 10/18/2023 14:05:45
New DB name: /Users/nastassiapatin/DBs/PR2/PR2_full-MagicBlastDB/pr2_version_5.0.0_SSU_taxo_long.fasta
New DB title: PR2 Full Database
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 221085 sequences in 7.21831 seconds.
This command (with -parse_seqids)
makeblastdb -in pr2_version_5.0.0_SSU_taxo_long.fasta -dbtype nucl -parse_seqids -title "PR2 Full Database"
also generates only .nin, .nhr, and .nsq files, but these seem to be incomplete, as they have ".00" before the .n** suffixes. It produces the following output:
Building a new DB, current time: 10/18/2023 14:04:02
New DB name: /Users/nastassiapatin/DBs/PR2/PR2_full-MagicBlastDB/pr2_version_5.0.0_SSU_taxo_long.fasta
New DB title: PR2 Full Database
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
[1] 6917 abort makeblastdb -in pr2_version_5.0.0_SSU_taxo_long.fasta -dbtype nucl -title
Do you have any ideas why the full PR2 fasta might be failing to build a BLAST database? When I build one from only a select group of taxa (order Dinophyceae) it works fine.
Thanks,
Nastassia
When reannotating the Tara Oceans V9 metabarcodes using PR2 v4.14, we noted the lack of Fragilariopsis in the new annotation. Using the most abundant Fragilariopsis OTU in the original dataset, we found that the best hits in PR2 v4.14 correspond to 100% identity to a group of sequences including a Dinophyceae (obviously an error in PR2 v4.14) and a bunch of diatom reference sequences poorly assigned as 'Raphid-pennate_X_sp', in addition to a Fragilariopsis sequence:
query:
1533bc0882cb58d28b5c56f49269c1c2
gtcgcacctaccgattgaatggtccggtgaagcctcgggattgtggttagtttcctttattggaagttagtcgcgagaacttgtctaaaccttatcatttagaggaaggtgaagtcgtaacaaggtttcc
Subjects (100 % id):
KC771185.1.1787_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X,s:Raphid-pennate_X_sp.
EF100371.1.1232_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X,s:Raphid-pennate_X_sp.
KC771174.1.1777_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X,s:Raphid-pennate_X_sp.
KC771149.1.1790_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X,s:Raphid-pennate_X_sp.
KC771193.1.1786_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X,s:Raphid-pennate_X_sp.
KC771190.1.1789_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X,s:Raphid-pennate_X_sp.
KC771168.1.1787_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X,s:Raphid-pennate_X_sp.
EF140623.1.1779_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Fragilariopsis,s:Fragilariopsis_curta
KJ757846.1.1801_U;tax=k:Eukaryota,d:Alveolata,p:Dinoflagellata,c:Dinophyceae,o:Dinophyceae_X,f:Dinophyceae_XX,g:Dinophyceae_XXX,s:Dinophyceae_XXX_sp.
KJ757881.1.1791_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X,s:Raphid-pennate_X_sp.
KJ758129.1.1800_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X,s:Raphid-pennate_X_sp.
KJ758140.1.1795_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Raphid-pennate_X,s:Raphid-pennate_X_sp.
KJ758229.1.1792_U;tax=k:Eukaryota,d:Stramenopiles,p:Ochrophyta,c:Bacillariophyta,o:Bacillariophyta_X,f:Raphid-pennate,g:Fragilariopsis,s:Fragilariopsis_sp.
I wonder whether there are more mistakes in the last PR2 version. Maybe a way of checking this is by running some re-assignment test, or random checks to spot other mistakes, or maybe by detecting cases where the taxonomy of the best hit sequences (with identical %id) are strongly contradictory - a dinoflagellate followed by diatoms as in this case.
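The last suggestion, flagging tied best hits whose taxonomy is strongly contradictory, could be sketched as follows. It assumes UTAX-style headers like those listed above (";tax=k:...,d:...,...") and reports the highest rank at which the tied hits disagree; the function names are hypothetical.

```python
def parse_utax(header):
    """Split a UTAX-style header into (sequence_id, {rank_letter: name})."""
    seq_id, _, tax = header.partition(";tax=")
    ranks = {}
    for item in tax.split(","):
        letter, sep, name = item.partition(":")
        if sep:
            ranks[letter] = name
    return seq_id, ranks

def conflicting_rank(headers, rank_order="kdpcofgs"):
    """Return (rank_letter, names) for the highest rank where hits disagree, else None."""
    parsed = [parse_utax(header)[1] for header in headers]
    for letter in rank_order:
        names = {ranks.get(letter) for ranks in parsed}
        if len(names) > 1:
            return letter, sorted(name for name in names if name is not None)
    return None
```

For the hits above, the disagreement appears already at rank d (Stramenopiles versus Alveolata), which is the kind of case worth checking by hand; disagreement only at g or s is much less alarming.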
Hi,
I've just noticed that the diatom genus Bacteriastrum is not represented in the database. Some examples that could be included are:
Accession MG972355.1 Bacteriastrum hyalinum
Accession GQ330314 Bacteriastrum hyalinum
Accession MG972357.1 Bacteriastrum jadranum
I don't know about other areas, but it is quite common in the southern Mediterranean basin.
Many thanks,
Hi,
Within the group of Colpodellida (Alveolata) there is a genus named "Aphamonas" which, if I'm correct, should be named as 'Alphamonas'.
Best,
Máté
Hello, when I enter a 349-nt sequence into the query field online, I consistently get the following message: "An error has occurred. Check your logs or contact the app author for clarification."
I don't see anywhere to download or view the log files. The sequence is annotated on NCBI as belonging to the Order Dinophyceae. Any thoughts?
Hello professor!
I tried using this trained dataset for my analyses, but it seems corrupt.
Thank you! :D
We regularly use your database, and an error has been reported to us concerning the 'Alternaria and Aspergillus' references below, which are Ascomycota and not Basidiomycota.
Thank you for your feedback,
Have a nice day,
Marina Moletta-Denat
KF747355.1.1082_U Eukaryota;Opisthokonta;Fungi;Basidiomycota;Agaricomycotina;Agaricomycetes;Alternaria;Alternaria+arborescens;
KF747361.1.1189_U Eukaryota;Opisthokonta;Fungi;Basidiomycota;Agaricomycotina;Agaricomycetes;Aspergillus;Aspergillus+tubingensis;
KF747362.1.1038_U Eukaryota;Opisthokonta;Fungi;Basidiomycota;Agaricomycotina;Agaricomycetes;Aspergillus;Aspergillus+niger;
KF747363.1.1166_U Eukaryota;Opisthokonta;Fungi;Basidiomycota;Agaricomycotina;Agaricomycetes;Aspergillus;Aspergillus+tubingensis;
KF747365.1.1153_U Eukaryota;Opisthokonta;Fungi;Basidiomycota;Agaricomycotina;Agaricomycetes;Alternaria;Alternaria+alternata;
KF747366.1.1138_U Eukaryota;Opisthokonta;Fungi;Basidiomycota;Agaricomycotina;Agaricomycetes;Alternaria;Alternaria+citri;
Hi!
I'm trying to use the PR2 database with the assignTaxonomy function from DADA2 to assign taxonomy to rRNA sequences in a fastq file. I found a note about using the taxLevels c("Kingdom","Supergroup","Division","Class","Order","Family","Genus","Species"), but I wonder how this compares to taxonomic assignments for 16S sequences (which I suspect may also be included in my sequence file), where the taxLevels only go down to genus and a separate training fasta is then used to assign species.
If I want to make taxonomic assignments consistent between 18S and 16S, should I skip species from the PR2 training fasta header and add it separately with the addSpecies function?
Thanks in advance for any help and for a good resource!
Hi,
Thank you for making this information available.
I am using the DECIPHER IdTaxa algorithm to assign taxonomic annotations to my taxonomic units, using the training set of PR2 v5.0 as the reference. All worked fine, but on reviewing the output I noticed that the string vector that should contain the headers of the taxonomic ranks is missing. I checked within the training set provided here, and as of today $rank is NULL. I can add this manually, but you might want to check this.
Regards,
setwd("H:/pr2database/pr2database-master/pr2database-master")
library(shiny)
Warning message:
package 'shiny' was built under R version 4.3.2
runApp("H:/pr2database/pr2database-master/pr2database-master")
Warning in loadSupport(appDir, renv = sharedEnv, globalrenv = NULL) :
Loading R/ subdirectory for Shiny application, but this directory appears to contain an R package. Sourcing files in R/ may cause unexpected behavior.
Cannot use system.file
Cannot use full path
[1] "Using cloud bucket"
[1] "global.R done"
ℹ Loading pr2database
Cannot use system.file
Cannot use full path
[1] "Using cloud bucket"
[1] "global.R done"
Error after opening the shiny app and entering a gene sequence: Must supply .init when .x is empty.
For example, the taxo_id of the species Halteria grandinella recorded in the table pr2_version_5.0.0_merged.csv is 1512, whereas in the NCBI taxonomy 1512 stands for: cellular organisms; Bacteria; Terrabacteria group; Bacillota; Clostridia; Lachnospirales; Lachnospiraceae; Lachnoclostridium.
All the other ids also differ from those in the NCBI Taxonomy.
Is there an appropriate tool that will allow me to convert this to the NCBI taxonomy ID?
Dear Daniel,
I am trying to do taxonomy classification using KrakenUniq with the PR2 database.
In order to build a custom KrakenUniq database, a taxonomy database, specifically nodes.dmp and names.dmp, is required.
Could you please tell me where to download the PR2 taxonomy files?
Thank you in advance.
Cheers,
Yue
I'm getting replicate entries in v 4.12.0. Taxa entry is identical, but sequence entry is different. I'm trying to use the mothur fasta and tax files in a pipeline to train PR2 for use in QIIME2 ASV determination and tax assignment - these duplicate entries throw an error.
Replicate entries: KC486130.1.1651_U, AB275104.1.1590_U, EU087243.1.813_U, EU087247.1.847_U, FJ459738.1.1634_U
I know this was an issue from the last release and it was updated quickly. Is there an additional script you can include with each release that performs this de-duplication?
https://github.com/shu251/db-build-microeuks
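A de-duplication check of this kind could be sketched as below. It assumes a mothur-style taxonomy file with one "<id><whitespace><taxonomy>" entry per line; the function names are hypothetical and this is not a script shipped with PR2.

```python
from collections import Counter

def duplicate_ids(tax_lines):
    """Return {sequence_id: count} for ids that occur more than once."""
    counts = Counter(line.split(None, 1)[0] for line in tax_lines if line.strip())
    return {seq_id: n for seq_id, n in counts.items() if n > 1}

def keep_first(tax_lines):
    """Keep only the first line seen for each sequence id."""
    seen, kept = set(), []
    for line in tax_lines:
        if not line.strip():
            continue
        seq_id = line.split(None, 1)[0]
        if seq_id not in seen:
            seen.add(seq_id)
            kept.append(line)
    return kept
```

Since the duplicated entries reported here carry different sequences, which copy to keep is a curation decision; keep_first only illustrates the mechanics.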
Hello,
I downloaded the sqlite database (pr2_version_4.12.0.sqlite) to be able to query it easily on an ec2 instance. However, when I query the pr2_taxonomy table and look at taxon_trophic_mode all entries are (None, ).
Here is more information if useful:
I am using sqlite3 to query the database through Python.
When I run:
for row in cur.execute('SELECT DISTINCT taxon_trophic_mode FROM pr2_taxonomy'): print(row)
I get solely:
(None,)
This set-up otherwise works fine to interact with the database.
Thanks,
Ben
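To confirm that the column is empty for every row rather than for a subset, the query can be reduced to two counts. This is a small sketch using only the standard sqlite3 module; the helper name column_summary is hypothetical, and the table and column names are the ones quoted above.

```python
import sqlite3

def column_summary(conn, table, column):
    """Return (total_rows, non_null_rows) for one column of one table."""
    # Identifiers cannot be bound as SQL parameters, so they are quoted here;
    # only pass table and column names that you trust.
    query = f'SELECT COUNT(*), COUNT("{column}") FROM "{table}"'
    return conn.execute(query).fetchone()
```

With conn = sqlite3.connect("pr2_version_4.12.0.sqlite"), a result whose second value is 0 for column_summary(conn, "pr2_taxonomy", "taxon_trophic_mode") would mean the field is unpopulated in that release.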
Dear Daniel,
I hope you are well. I am currently analyzing a number of 18S MiSeq datasets with the software mothur (v. 1.39.5) and am using the PR2 database for the classification of sequences and OTUs. I have used PR2 version 4.7.2 for mothur in the past and found the classification much better in comparison to SILVA NR v. 132 – as was to be expected. 😊 I now tried to use the new version PR2 v. 4.9.0 for mothur but unfortunately encountered some errors that in the end cause mothur to abort. I checked the error messages I got by simply opening the fasta and tax files in Notepad and found that it was mainly formatting errors that caused the problems. I think/hope that these can easily be fixed.
FJ402948.1.1186_U & FJ402949.1.1210_U are both found twice in the tax file. As mothur requires each sequence name to be unique, these doublets cause an error.
“LC054937.1.1751_U is missing the final ';', ignoring. [ERROR]: ;Plagiodinium;Plagiodinium_belizeanum; is missing the final ';', ignoring.”
The sequence LC054937.1.1751_U has an accidental hard return after the name Prorocentraceae, therefore mothur does not recognize that the ; and the final part of the taxonomy (“;Plagiodinium;Plagiodinium_belizeanum;”) belong together.
“'AB353770.1.1740_U' is in your template file and is not in your taxonomy file. Please correct.”
This error is shown for each sequence in the PR2 database and it is what causes mothur to abort. I think that this has to do with incorrect hard returns and TABs in the new version. If I look at v 4.7.2 using Notepad, the fasta file looks like
AB353770.1.1740_U
ATGCTTGTCTCAAAGATTAAGCCATGCATGTCTCAGTATAAGC…
KC672520.1.1801_U
TACCTGGTTGATTCTGCCCCTATTCATATGCTTGTCTCAAAGATTAAGCC…
And in the taxonomy file the lines are all like
AB353770.1.1740_U Eukaryota;Alveolata;Dinophyta;Dinophyceae;Dinophyceae_X;Dinophyceae_XX;Peridiniopsis;Peridiniopsis_kevei;
KC672520.1.1801_U Eukaryota;Opisthokonta;Fungi;Ascomycota;Pezizomycotina;Leotiomycetes;Leotiomycetes_X;Leotiomycetes_X_sp.;
In version 4.9.0 the fasta file looks like
AB353770.1.1740_UATGCTTGTCTCAAAGATTAAGCCATGCATGTCTCAGTATAAGC…>KC672520.1.1801_UTACCTGGTTGATTCTGCCCCTATTCATATGCTTGTCTCAAAGATTAAGCC…
Similarly, the tax file looks like
AB353770.1.1740_U Eukaryota;Alveolata;Dinoflagellata;Dinophyceae;Peridiniales;Kryptoperidiniaceae;Unruhdinium;Unruhdinium_kevei;KC672520.1.1801_U Eukaryota;Opisthokonta;Fungi;Ascomycota;Pezizomycotina;Leotiomycetes;Leotiomycetes_X;Leotiomycetes_X_sp.;AB284159.1.1765_U Eukaryota;Alveolata;Dinoflagellata;Dinophyceae;Peridiniales;Protoperidiniaceae;Protoperidinium;Protoperidinium_bipes;AY123745.1.924_UC …
I assume that these formatting errors cause the mothur software to not recognize the corresponding names in the fasta and tax files.
René Groben
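Before running mothur, the correspondence between the two files can be checked independently of how an editor displays them. The sketch below compares the ids in a fasta file with those in a taxonomy file; the function names are hypothetical and the taxonomy format is assumed to be "<id><whitespace><taxonomy>" per line.

```python
def fasta_ids(fasta_lines):
    """Collect the first word of every FASTA header line."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and len(line.strip()) > 1
    }

def tax_ids(tax_lines):
    """Collect the first whitespace-delimited field of every non-empty line."""
    return {line.split(None, 1)[0] for line in tax_lines if line.strip()}

def compare_ids(fasta_lines, tax_lines):
    """Report ids present in only one of the two files."""
    in_fasta, in_tax = fasta_ids(fasta_lines), tax_ids(tax_lines)
    return {
        "only_in_fasta": sorted(in_fasta - in_tax),
        "only_in_tax": sorted(in_tax - in_fasta),
    }
```

Reading each file with str.splitlines() accepts LF, CRLF and CR line endings alike, so if the ids match here while mothur still rejects the files, the line endings themselves would be the next thing to inspect.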
Dear Daniel,
In the most recent version of the "tax"-file for mothur (downloaded 07.11.2018), GU824068.1.1173_U appears four times.
I also get the "'XXXX' is in your template file, and is not in your taxonomy file." for all the entries in PR2. I am running mothur on a unix-based computer cluster.
Thanks for keeping PR2 updated!
Best,
Elianne
Hello there,
first of all, thanks for making this database available!
This is not a big problem but just something I noticed. I am using file: pr2_version_4.10.0_UTAX.fasta and got a warning formatting it:
WARNING: Empty sequence at line 2598276 in FASTA file /cluster/project/gdc/people/jwalser/db/pr2_version_4.10.0_UTAX.fasta, label >CP000499.0.0_U;tax=k:Eukaryota,d:Opisthokonta,p:Fungi,c:Ascomycota,o:Saccharomycotina,f:Saccharomycetales,g:Scheffersomyces,s:Scheffersomyces_stipitis
Adding the trimmed sequence (or removing the record) solved it.
Best
Hello, I am afraid this may be a naive query rather than an issue. If so, I am sure it can be closed and discarded quickly. I have been running a large dataset against SILVA and PR2 (an older version, but my question still pertains) and am looking to reconcile the results from the two. I ran into a few cases where I could not understand how I could be getting high-identity matches to such different things in the two databases. In several of these cases, I retrieved the sequences from GenBank to align with the relevant OTUs and found a set of cases where our OTU did not match the GenBank entry as well as it matched the PR2 sequence with that same GenBank accession. I traced several of our problem cases to situations where the PR2 sequence with AccessionX was longer than the sequence associated with AccessionX in GenBank. PR2 version 4.9.2 appears to have more than 1000 sequences that are more than 100 bp longer than the associated GenBank sequence. Is this expected? I apologize if this is a silly question. I am just new to this type of analysis.
Hello,
I used your script #26 to train a database for the IdTaxa function in the DECIPHER package. However, after classification, the majority of sequences are classified as a specific species that does not match the results from NCBI blast.
My database is built with around 1.5 million COI sequences from NCBI. I would appreciate your advice on maxGroupSize and maxIterations for database training. Thank you in advance for your assistance.
Hi! Congrats for the great work!
From the README, we can see that plastid sequences are included in PR2. At the bottom, a reference to a plastid-only DB (PhytoREF) is provided. I guess it is not clear to what extent plastid sequences in PR2 overlap with PhytoREF, or whether PR2 is a superset of PhytoREF. In case it is not, and one wants to merge the two DBs, would dereplication be suggested?
Thank you.