calcofi / rcrux Goto Github PK

A repository for my work on rCRUX

License: GNU General Public License v3.0

R 98.88% JavaScript 0.30% Dockerfile 0.82%

noaa-omics-data noaa-omics-software noaa-omics-study

rcrux's Introduction

rCRUX: Generate CRUX metabarcoding reference libraries in R

Authors: Luna Gal, Zachary Gold, Ramon Gallego, Shaun Nielsen, Katherine Silliman, Emily Curd
Inspiration: The late, great Jesse Gomer. Coding extraordinaire and dear friend.
License: GPL-3
Support: Support for the development of this tool was provided by CalCOFI, NOAA, Landmark College, and VBRN.
Acknowledgments: This work benefited from the amazing input of many including Lenore Pipes, Sarah Stinson, Gaurav Kandlikar, and Maura Palacios Mejia.

Published-Manuscript

pre-print

pre-made databases

eDNA metabarcoding is increasingly used to survey biological communities using common universal and novel genetic loci. There is a need for an easy to implement computational tool that can generate metabarcoding reference libraries for any locus, and are specific and comprehensive. We have reimagined CRUX (Curd et al. 2019) and developed the rCRUX package R system for statistical computing R Core Team 2021 to fit this need by generating taxonomy and fasta files for any user defined locus. The typical workflow involves using get_seeds_local() or get_seeds_remote() to simulate in silico PCR (e.g. Ye et al. 2012) to acquire a set of sequences analogous to PCR products containing metabarcode primer sequences. The sequences or "seeds" recovered from the in silico PCR step are used to search databases for complementary sequence that lack one or both primers. This search step, blast_seeds() is used to iteratively align seed sequences against a local NCBI database for matches using a taxonomic rank based stratified random sampling approach. This step results in a comprehensive database of primer specific reference barcode sequences from NCBI. Using derep_and_clean_db(), the database is de-replicated by DNA sequence where identical sequences are collapsed into a representative read. If there are multiple possible taxonomic paths for a read, the taxonomic path is collapsed to the lowest taxonomic agreement.

Typical Workflow

Installation

Install from GitHub:

# install.packages(devtools)
devtools::install_github("CalCOFI/rCRUX", build_vignettes = TRUE)

library(rCRUX)

Dependencies

NOTE: These only need to be downloaded once or as NCBI updates databases. rCRUX can access and successfully build metabarcode references using databases stored on external drives.

BLAST+

NCBI's BLAST+ suite must be locally installed and accessible in the user's path. NCBI provides installation instructions for Windows, Linux, and Mac OS. Version 2.10.1+ through 2.13.0 are verified compatible with rCRUX.

The following is example shell script to download blast executables:

cd /path/to/Applications

wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.10.1/ncbi-blast-2.10.1+-x64-macosx.tar.gz

This link may help if you are using RStudio and having trouble adding blast+ to your path.

Blast-formatted database

rCRUX requires a local blast-formatted nucleotide database. These can be user generated or download a pre-formatted database from NCBI. NCBI provides a tool (perl script) for downloading databases as part of the blast+ package. A brief help page can be found here.

The following shell script can be used to download the blast-formatted nucleotide database. There are also taxon specific databases (e.g. nt_euk, nt_prok, and nt_viruses).


mkdir NCBI_blast_nt

cd NCBI_blast_nt

wget "ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.???.tar.gz*"

time for file in *.tar.gz; do tar -zxvf $file; done

cd ..

You can test your nt blast database using the following command into terminal:

blastdbcmd -db '/my/directory/ncbi_nt/nt' -dbtype nucl -entry MN937193.1 -range 499-633

If you do not get the following, something went wrong in the build.

>MN937193.1:499-633 Jaydia carinatus mitochondrion, complete genome
TTAGATACCCCACTATGCCTAGTCTTAAACCTAGATAGAACCCTACCTATTCTATCCGCCCGGGTACTACGAGCACCAGC
TTAAAACCCAAAGGACTTGGCGGCGCTTCACACCCACCTAGAGGAGCCTGTTCTA

Possible error include but are not limited to:

Partial downloads of database files. Extracting each TAR archive (e.g. nt.00.tar.gz.md5) should result in 8 files with the following extensions(.nhd, .nhi, .nhr, .nin, .nnd, .nni, .nog, and .nsq). If a few archives fail during download, you can re-download and unpack only those that failed. You do not have to re-download all archives.
You downloaded and built a blast database from ncbi fasta files but did not specify -parse_seqids

The nt database is ~242 GB (as of 8/31/22) and can take several hours (overnight) to build. Loss of internet connection can lead to partially downloaded files and blastn errors (see above).

Note: Several blast formatted databases can be searched simultaneously. See documentation for details.

Taxonomizr

rCRUX uses the taxonomizr package for taxonomic assignment based on NCBI Taxonomy id's (taxids). Many rCRUX functions require a path to a local taxonomizr readable sqlite database. This database can be built using taxonomizr's prepareDatabase function.

This database is ~72 GB (as of 8/31/22) and can take several hours (overnight) to build. Loss of internet connection can lead to partially downloaded files and taxonomizr run errors.

The following code can be used to build this database:

library(taxonomizr)

accession_taxa_sql_path <- "/my/accessionTaxa.sql"
prepareDatabase(accession_taxa_sql_path)

Note: For poor bandwidth connections, please see the taxononmizr readme for manual installation of the accessionTaxa.sql database. If built manually, make sure to delete any files other than the accessionTaxa.sql database (e.g. keeping nucl_gb.accession2taxid.gz leads to a warning message).

Example pipeline

The following example shows a simple rCRUX pipeline from start to finish. Note that this example will require internet access and considerable database storage (~314 GB, see section above), run time (mainly for blastn), and system resources to execute.

Note: Blast databases and the taxonomic assignment databases (accessionTaxa.sql) can be stored on external hard drive. It increases run time, but is a good option if computer storage capacity is limited.

There are two options to generate seeds for the database generating blast step blast_seeds_local() or blast_seeds_remote(). The local option is slower, however it is not subject to the memory limitations of using the NCBI primer_blast API. The local option is recommended if the user is building a large database, wants to include any taxid in the search, wants to use multiple forward or reverse primers, and / or has many degenerate sites in their primer set. It also cached run data so if a run is interrupted the user can pick it up from the last successful round of blast by resubmitting the original command.

get_seeds_local

This example uses default parameters, with the exception of evalue to minimize run time.


forward_primer_seq = "TAGAACAGGCTCCTCTAG"

reverse_primer_seq =  "TTAGATACCCCACTATGC"

output_directory_path <- "/my/directory/12S_V5F1_local_111122_e300" # path to desired output directory

metabarcode_name <- "12S_V5F1" # desired name of metabarcode locus

accession_taxa_sql_path <- "/my/directory/accessionTaxa.sql" # path to taxonomizr sql database

blast_db_path <- "/my/directory/ncbi_nt/nt"  # path to blast formatted database


get_seeds_local(forward_primer_seq,
                 reverse_primer_seq,
                 metabarcode_name,
                 output_directory_path,
                 accession_taxa_sql_path,
                 blast_db_path, evalue = 300)

Two output .csv files are automatically created at this path based on the arguments passed to get_seeds_local. One includes all unfiltered output the other is filtered based on user defined parameters and includes taxonomy.

A unique taxonomic rank summary file is also generated (e.g. the number of unique phyla, class, etc in the blast hits). If a taxonomic rank category contains NA's, they will be counted as a single unique rank. Sequence availability in NCBI for a given taxid is a limiting factor.

Also generated is a fasta file with the primers used for blast.

Example output can be found here.

If BLAST+ is not in your path do the following:


get_seeds_local(forward_primer_seq,
                 reverse_primer_seq,
                 metabarcode_name,
                 output_directory_path,
                 accession_taxa_sql_path,
                 blast_db_path, evalue = 300,
                 ncbi_bin = "/my/directory/ncbi-blast-2.10.1+/bin/")

get_seeds_remote

This example uses default parameters to minimize run time.

Searching jawless vertebrates (taxid: "1476529") and jawed vertebrates (taxid: "7776").


forward_primer_seq = "TAGAACAGGCTCCTCTAG"

reverse_primer_seq =  "TTAGATACCCCACTATGC"

output_directory_path <- "/my/directory/12S_V5F1_remote_111122" # path to desired output directory

metabarcode_name <- "12S_V5F1" # desired name of metabarcode locus

accession_taxa_sql_path <- "/my/directory/accessionTaxa.sql" # path to taxonomizr sql database



get_seeds_remote(forward_primer_seq,
          reverse_primer_seq,
          metabarcode_name,
          output_directory_path,
          accession_taxa_sql_path,
          organism = c("1476529", "7776"),
          return_table = FALSE)

Note: When using default parameters only 1047 hits are returned from NCBI's primer blast (run 11-11-22). Returns hit sizes and contents are variable depending on parameters, random blast sampling, and database updates.

Two output .csv files are automatically created at this path based on the arguments passed to get_seeds_remote. One includes all unfiltered output the other is filtered based on user defined parameters and includes taxonomy.

A unique taxonomic rank summary file is also generated (e.g. the number of unique superkingdon, phyla, class, etc in the blast hits). If a taxonomic rank category contains NA's, they will be counted as a single unique rank.

Sequence availability in NCBI for a given taxid is a limiting factor, as are degenerate bases and API memory allocation.

Modifying defaults can increase the number of returns by orders of magnitude.

Example output can be found here.

blast_seeds

Iterative searches are based on a stratified random sampling unique taxonomic groups for a given rank from the get_seeds_local or get_seeds_remote output table. For example, the default is to randomly sample one read from each genus. The user can select any taxonomic rank present in the get_seeds_local output table. The number of seeds selected may cause blastn to exceed the users available RAM, and for that reason the user can choose the maximum number of reads to blast at one time (max_to_blast, default = 1000). blast_seeds will subsample each set of seeds based on max_to_blast and process all seeds before starting a new search for seeds to blast. It saves the output from each round of blastn.


seeds_output_path <- '/my/directory/12S_V5F1_remote_111122/12S_V5F1_filtered_get_seeds_remote_output_with_taxonomy.csv' # this is output from get_seeds_local or get_seeds_remote

blast_db_path <- "/my/directory/blast_database/nt"  # path to blast formatted database

accession_taxa_sql_path <- "/my/directory/accessionTaxa.sql"  # path to taxonomizr sql database

output_directory_path <- '/my/directory/12S_V5F1_remote_111122/' # path to desired output directory

metabarcode_name <- "12S_V5F1"  # desired name of metabarcode locus


blast_seeds(seeds_output_path,
            blast_db_path,
            accession_taxa_sql_path,
            output_directory_path,
            metabarcode_name)

Note: After each round of blast, the system state is saved. If the script is terminated after a full round of blast, the user can pick up where they left off. The user can also change parameters at this point (e.g. change the max_to_blast or rank)

The output includes a summary table of all unique blast hits (summary.csv), a multi fasta file of all unique hits (metabarcode_name_.fasta), a taxonomy file of all unique hits (metabarcode_name_taxonomy.txt), a unique taxonomic rank summary file (metabarcode_name_taxonomic_rank_counts.txt), a list of all of the accessions not present in your blast database (e.g. relevant if you ran get_seeds_remote; blastdbcmd_failed.csv), and a list of accessions with 4 or more Ns in a row (default for that parameter is wildcards = "NNNN"; too_many_ns.csv). The default number of reads to blast per rank is 1 (default for that parameter is sample_size = 1). The script will error out if the user asks for more reads per rank than exist in the blast seeds table.

If BLAST+ is not in your path do the following


blast_seeds(seeds_output_path,
            blast_db_path,
            accession_taxa_sql_path,
            output_directory_path,
            metabarcode_name,
            ncbi_bin = "/my/directory/ncbi-blast-2.10.1+/bin/")

Example output can be found here.

Note: There will be variability between runs due to primer blast return parameters and random sampling of the blast seeds table that occurs during blast_seeds. However, variability can be decreased by changing parameters (e.g. randomly sampling species rather than genus will decrease run to run variability).

derep_and_clean_db

This function takes the output of blast_seeds and de-replicates identical sequences and collapses ambiguous taxonomy to generate a clean reference database.


output_directory_path <- '/my/directory/12S_V5F1_remote_111122/' # path to desired output directory

summary_path <- "/my/directory/12S_V5F1_remote_111122/blast_seeds_output/summary.csv" # this is the path to the output from blast_seeds

derep_and_clean_db(output_directory_path, summary_path, metabarcode_name)

Note: Accessions with the same sequence are collapsed into a representative sequence. If those accessions have different taxids (taxonomic paths), we determine the lowest taxonomic agreement across the multiple accessions with an identical sequence. For example, for the MiFish 12S locus, nearly all rockfishes in the genus Sebastes have identical sequences. Instead of including ~110 identical reference sequences, one for each individual species, we report a single representative sequence with a lowest common taxonomic agreement of the genus Sebastes. This prevents classification bias for taxa with more sequences and also provides accurate taxonomic resolution within the reference database.

We exclude all sequences with taxids that are NA. Such sequences are not immediately useful for classification of metabarcoding sequences. However, we caution that such results can be indicative of off target amplification of a given primer set. For example, the MiFish 12S primer set amplifies uncultured marine bacteria among other taxa (taxid = NA) indicating off target amplification of non-fish taxa. These sequences are saved in the References_with_NA_for_taxonomic_ranks.csv file.

The result of this function is a final clean reference database file set composed of a paired metabarcode_name_derep_and_clean.fasta and metabarcode_name_derep_and_clean_taxonomy.txt. A summary file of the number of unique taxonomic ranks is also generated: metabarcode_name_derep_and_clean_unique_taxonomic_rank_counts.txt. In addition, all representative sequences and associated accessions are saved in Sequences_with_lowest_common_taxonomic_path_agreement.csv, Sequences_with_mostly_NA_taxonomic_paths.csv, Sequences_with_multiple_taxonomic_paths.csv, and Sequences_with_single_taxonomic_path.csv files. These files allow for the traceback of representative sequences to multiple accessions.

Example output can be found here.

Detailed Explanation For The Major Functions

get_seeds_local

Overview

get_seeds_local takes a set of forward and reverse primer sequences (single or multiple forward and single or multiple reverse primers) and generates .csv summaries of data returned from a locally run adaptation of NCBI's primer blast. This function performs like in silicon to find possible full length barcode sequences containing forward and reverse primer matches. It also generates a count of unique instances of taxonomic ranks (Phylum, Class, Order, Family, Genus, and Species) found in the output.

This script is a local interpretation of get_seeds_remote that avoids querying NCBI's primer BLAST tool. Although it is slower than remotely generating blast seeds, it is not subject to the arbitrary throttling of jobs that require significant memory.

Expected Output

It creates a get_seeds_local directory at output_directory_path if one doesn't yet exist, then creates a subdirectory inside output_directory_path named after metabarcode_name. It creates three files inside that directory. One represents the unfiltered output and another represents the output after filtering with user modifiable parameters and with appended taxonomy. Also generated is a summary of unique taxonomic ranks after filtering and a fasta file of the primers used for blast.

Detailed Steps

get_seeds_local passes the forward and reverse primer sequence for a given PCR product to run_primer_blastn. In the case of a non degenerate primer set only two primers will be passed to run_primer_blast. In the case of a degenerate primer set, get_seeds_local will get all possible versions of the degenerate primer(s) (using primerTree's enumerate_primers() function), randomly sample a user defined number of forward and reverse primers, and generate a fasta file. The selected primers are subset and passed to run_primer_blastn which queries each primer against a blast formatted database using the task "blastn_short". This process continues until all of the selected primers are blasted. The result is an output table with the following columns of data: qseqid (query subject id), sgi (subject gi), saccver (subject accession version), mismatch (number of mismatches between the subject a query), sstart (subject start), send (subject end), staxids (subject taxids).

Temporary output is cached after each sucessful run of run_primer_blastn, so if a run is interrupted the user can resubmit the command and pick up where they left off. The user can modify parameters for the run with the exception of num_fprimers_to_blast and num_rprimers_to_blast. Temporary files are deleted at the end of the run.

The returned blast hits for the seqeunces are matched and checked to see if they generate plausible amplicons (e.g. amplify the same accession and are in the correct orientation to produce a PCR product). These hits are written to a file with the suffix _unfiltered_get_seeds_local_output.csv. These hits are further filtered for length and number of mismatches.

Taxonomy is appended to these filtered hits using get_taxonomizr_from_accession. The results are written to to file with the suffix _filtered_get_seeds_local_output_with_taxonomy.csv. The number of unique instances for each rank in the taxonomic path for the filtered hits are tallied (NAs are counted once per rank) and written to a file with the suffix _filtered_get_seeds_local_unique_taxonomic_rank_counts.txt.

Note: Information about the blastn parameters can be found in run_primer_blast, and by accessing blastn -help in your terminal. Default parameters were optimized to provide results similar to those generated through remote blast via primer-blast as implemented in iterative_primer_search and modifiedPrimerTree_Functions.

Parameters

forward_primer_seq

which which turns degenerate primers into into a list of all possible non degenerate primers and converts the primer(s) into to a fasta file to be past to run_primer_blastn.

  e.g. forward_primer_seq <- "TAGAACAGGCTCCTCTAG" or forward_primer_seq <- c("TAGAACAGGCTCCTCTAG", "GGWACWGGWTGAACWGTWTAYCCYCC")

reverse_primer_seq

which which turns degenerate primers into into a list of all possible non degenerate primers and converts the primer(s) into to a fasta file to be past to run_primer_blastn.

  e.g. reverse_primer_seq <-  "TTAGATACCCCACTATGC" or reverse_primer_seq <- c("TTAGATACCCCACTATGC", "TANACYTCNGGRTGNCCRAARAAYCA")

output_directory_path

the parent directory to place the data in.

  e.g. "/path/to/output/12S_V5F1_local_111122_e300_111122"

metabarcode_name

used to name the subdirectory and the files. get_seeds_local appends metabarcode_name to the beginning of each of the files it generates.
```
  e.g. metabarcode_name <- "12S_V5F1"
```

accession_taxa_sql_path

the path to sql database created by taxonomizr

  e.g. accession_taxa_sql_path <- "/my/accessionTaxa.sql"

mismatch

the highest acceptable mismatch value per hit. get_seeds_local removes each row with a mismatch greater than the specified value.
```
  The default is mismatch = 6
```

minimum_length

get_seeds_local removes each row that has a value less than minimum_length in the product_length column.
```
  The default is minimum_length = 5
```

maximum_length

get_seeds_local removes each row that has a value greater than maximum_length in the product_length column.
```
  The default is maximum_length = 500
```

blast_db_path

blast_db_path a directory containing one or more blast-formatted database. For multiple blast databases, separate them with a space and add an extra set of quotes.

   e.g blast_db_path <- "/my/ncbi_nt/nt" or blast_db_path <- '"/my/ncbi_nt/nt  /my/ncbi_ref_euk_rep_genomes/ref_euk_rep_genomes"'

task

passed to run_primer_blastn the task for blastn to perform

  The default is "blastn_short" - which is optimized for searches with queries < 50 bp

word_size

passed to run_primer_blastn is the fragment size used for blastn search - smaller word sizes increase sensitivity and time of the search.
```
  The default is word_size =  7
```

evalue

passed to run_primer_blastn is the number of expected hits with a similar quality score found by chance.
```
  The default is evalue = 3e-7
```

coverage

passed to run_primer_blastn is the minimum percent of the query length recovered in the subject hits.
```
  The default is coverage = 90
```

perID

passed to run_primer_blastn is the minimum percent identity of the query relative to the subject hits.
```
  The default is perID = 50
```

reward

passed to run_primer_blastn is the reward for nucleotide match.
```
  The default is reward = 2
```

align

is the maximum number of subject hits to return per query blasted.

   The default is align = '10000000'. - to few alignments will result in no matching pairs of forward and reverse primers.  To many alignments can result in an error due to RAM limitations.

num_fprimers_to_blast

is the maximum number of possible forward primers to blast. This is relevant for degenerate primers, all possible primers from a degenerate sequence are enumerated, and the user can choose a number to be randomly sampled and used for primer blast.

   The default is num_fprimers_to_blast = 50

num_rprimers_to_blast

is the maximum number of possible reverse primers to blast. This is relevant for degenerate primers, all possible primers from a degenerate sequence are enumerated, and the user can choose a number to be randomly sampled and used for primer blast.

   The default is num_rprimers_to_blast = 50

max_to_blast

is the number of primers to blast simultaneously.

   The default is max_to_blast = 2. - Increasing this number will decrease overall run time, but increase the amount of RAM required.

num_threads

is the number of CPUs to engage in the blastn search.
The default num_treads = NULL, uses [parallel::detectCores()] to determine the user's number of CPUs automatically and use that for the value of -num_threads. ncbi_bin
passed to run_primer_blastn is the path to blast+ tools if not in the user's path. Specify only if blastn and is not in your path.

  The default is ncbi_bin = NULL - if not specified in path do the following: ncbi_bin = "/my/local/ncbi-blast-2.10.1+/bin/".

Examples

 # Non degenerate primer example: 12S_V5F1 (Riaz et al. 2011)

 forward_primer_seq = "TAGAACAGGCTCCTCTAG"
 reverse_primer_seq =  "TTAGATACCCCACTATGC"
 output_directory_path <- "/my/directory/12S_V5F1_local_111122_species_750"
 metabarcode_name <- "12S_V5F1"
 accession_taxa_sql_path <- "/my/directory/accessionTaxa.sql"
 blast_db_path <- "/my/directory/ncbi_nt/nt"


 get_seeds_local(forward_primer_seq,
                 reverse_primer_seq,
                 metabarcode_name,
                 output_directory_path,
                 accession_taxa_sql_path,
                 blast_db_path,
                 minimum_length = 80,
                 maximum_length = 150)

 # adjusting the minimum_length and maximum_length parameters reduces the number of total hits by removing reads that could result from off target amplification


 # Degenerate primer example - mlCOIintF/jgHC02198 (Leray et al. 2013)
 # Note: this will take considerable time and computational resources

 forward_primer_seq <- "GGWACWGGWTGAACWGTWTAYCCYCC"
 reverse_primer_seq <- "TANACYTCNGGRTGNCCRAARAAYCA"
 output_directory_path <- "/my/directory/CO1_local"
 metabarcode_name <- "CO1"


 get_seeds_local(forward_primer_seq,
                 reverse_primer_seq,
                 metabarcode_name,
                 output_directory_path,
                 accession_taxa_sql_path,
                 blast_db_path,
                 minimum_length = 200,
                 maximum_length = 400,
                 aligns = '10000',
                 num_rprimers_to_blast = 200,
                 num_rprimers_to_blast = 2000,
                 max_to_blast = 10)


 # Non Degenerate but high return primer example - 18S (Amaral-Zettler et al. 2009)
 # Note: this will take considerable time and computational resources

 forward_primer_seq <- "GTACACACCGCCCGTC"
 reverse_primer_seq <- "TGATCCTTCTGCAGGTTCACCTAC"
 output_directory_path <- "/my/directory/18S_local"
 metabarcode_name <- "18S"


 get_seeds_local(forward_primer_seq,
                 reverse_primer_seq,
                 metabarcode_name,
                 output_directory_path,
                 accession_taxa_sql_path,
                 blast_db_path,
                 minimum_length = 250,
                 maximum_length = 350,
                 max_to_blast = 1)

 # blasting two primers at a time can max out a system's RAM, however blasting one at a time is more feasable for personal computers with 16 GB RAM

get_seeds_remote

Overview

get_seeds_remote takes a set of forward and reverse primer sequences and generates .csv summaries of NCBI's primer blast data returns. Only full length barcode sequences containing primer matches are captured. It also generates a count of unique instances of taxonomic ranks (Phylum, Class, Order, Family, Genus, and Species) captured in the seed library.

This script uses iterative_primer_search to perform tasks. Its parameters are very similar to primerTree's primer_search(), but it takes vectors for organism and for database and performs a primer search for each combination. For each combination it calls modifiedPrimerTree_Functions, which is a modified versions of primerTree's primer_search() and primerTree's parse_primer, to query NCBI's primer BLAST tool, filters the results, and aggregates them into a single data.frame.

It downgrades errors from primer_search and parse_primer_hits into warnings. This is useful when searching for a large number of different combinations, allowing the function to output successful results.

Expected Output

It creates a directory get_seeds_remote in the output_directory_path. It creates three files inside that directory. One represents the unfiltered output and another represents the output after filtering with user modifiable parameters and with appended taxonomy. Also generated is a summary of unique taxonomic ranks after filtering.

Detailed Steps

get_seeds_remote passes the forward and reverse primer sequence for a given PCR product to iterative_primer_search along with the taxid(s) of the organism(s) to blast, the database to search, and many additional possible parameters to NCBI's primer blast tool (see Note below). Degenerate primers are converted into all possible non degenerate sets and a user defined maximum number of primer combinations is passed to to the API using modifiedPrimerTree_Functions. Multiple taxids are searched independently, as are multiple databases (e.g. c('nt', 'refseq_representative_genomes'). The data are parsed and stored in a dataframe, which is also written to a file with the suffix _unfiltered_get_seeds_remote_output.csv.

These hits are further filtered using filter_primer_hits to calculate and append amplicon size to the dataframe. Only hits that pass with default or user modified length and number of mismatches parameters are retained.

Taxonomy is appended to these filtered hits using get_taxonomizr_from_accession. The results are written to to file with the suffix _filtered_get_seeds_remote_output_with_taxonomy.csv. The number of unique instances for each rank in the taxonomic path for the filtered hits are tallied (NAs are counted once per rank) and written to a file with the suffix _filtered_get_seeds_local_remote_taxonomic_rank_counts.txt

Notes: get_seeds_remote passes many parameters to NCBI's primer blast tool. See below for more information.

primer BLAST defaults to homo sapiens, so it is important that you supply a specific organism or organisms. NCBI's taxids can be found here. You can specify multiple organism by passing a character vector containing each of the options, like in the example below.

Often NCBI API will throttle higher taxonomic ranks (Domain, Phylum, etc.). One work around is to supply multiple lower level taxonomic ranks (Class, Family level, etc.) or use get_seeds_local.

Parameters

forward_primer_seq

passed to primer_search, which turns it into a list of all possible non degenerate primers, then passes a user defined number of primer set combinations to NCBI.

  e.g. forward_primer_seq <- "TAGAACAGGCTCCTCTAG"

reverse_primer_seq

passed to primer_search, which turns it into a list of all possible non degenerate primers, then passes a user defined number of primer set combinations to NCBI.

   e.g. reverse_primer_seq <-  "TTAGATACCCCACTATGC"

output_directory_path

the parent directory to place the data in.

   e.g. "/path/to/output/12S_V5F1_remote_111122"

metabarcode_name

used to name output files. get_seeds_remote appends metabarcode_name to the beginning of each of the two files it generates.
```
  e.g. metabarcode_name <- "12S_V5F1"
```

accession_taxa_sql_path

the path to sql created by taxonomizr

    e.g. accession_taxa_sql_path <- "/my/accessionTaxa.sql"

organism

a vector of character vectors. Each character vector is passed in turn to primer_search, which passes them to NCBI. get_seeds_remote aggregates all of the results into a single file

  e.g. organism = c("1476529", "7776")) - Note: increasing taxonomic rank (e.g. increasing from order to class) for this parameter can maximize primer hits, but can also lead to API run throttling due to memory limitations

num_permutations

the number of primer permutations to search, if the degenerate bases cause more than this number of permutations to exist, this number will be sampled from all possible permutations.

  The default is num_permutations = 50 - Note for very degenerate bases, searches may be empty due to poor mutual matches for a given forward and reverse primer combination.

mismatch

the highest acceptable mismatch value. parse_primer_hits returns a table with a mismatch column. get_seeds_remote removes each row with a mismatch greater than the specified value.

  The default is mismatch = 3 - Note: this is smaller than get_seeds_local because of differences in mismatch calculation between function.

minimum_length

parse_primer_hits returns a table with a product_length column. get_seeds_remote removes each row that has a value less than minimum_length in the product_length column.
```
  The default is minimum_length = 5
```

maximum_length

parse_primer_hits returns a table with a product_length column. get_seeds_remote removes each row that has a value greater than maximum_length in the product_length column
```
  The default is maximum_length = 500
```

primer_specificity_database

passed to primer_search, which passes it to NCBI.

  The default is primer_specificity_database = 'nt'.

HITSIZE

a primer BLAST search parameter. Set to a high vlaue to maximize the number of observations returned.

  The default HITSIZE = 50000 - Note: increasing this parameter can maximize primer hits, but can also lead to API run throttling due to memory limitations

NUM_TARGETS_WITH_PRIMERS

a primer BLAST search parameter set high to maximize the number of observations returned.

  The default is NCBI NUM_TARGETS_WITH_PRIMERS = 1000 - Note: increasing this parameter can maximize primer hits, but can also lead to API run throttling due to memory limitations

...

additional arguments passed to modifiedPrimerTree_Functions. See NCBI primer-blast tool for more information.

Check NCBI's primer blast for additional search options

get_seeds_remote passes many parameters to NCBI's primer blast tool. You can match the parameters to the fields available in the GUI here. First, use your browser to view the page source. Search for the field you are interested in by searching for the title of the field. It should be enclosed in a tag. Inside the label tag, it says for = "<name_of_parameter>". Copy the string after for = and add it to get_seeds_remote as the name of a parameter, setting it equal to whatever you like.

As of 2022-08-16, the primer blast GUI contains some options that are not implemented by primer_search. The [table below] documents some of the available options.

Name	Default
PRIMER_SPECIFICITY_DATABASE	nt
EXCLUDE_ENV	unchecked
ORGANISM	Homo sapiens
TOTAL_PRIMER_SPECIFICITY_MISMATCH	1
PRIMER_3END_SPECIFICITY_MISMATCH	1
TOTAL_MISMATCH_IGNORE	6
MAX_TARGET_SIZE	4000
HITSIZE	50000
EVALUE	30000
WORD_SIZE	7
NUM_TARGETS_WITH_PRIMERS	1000
MAX_TARGET_PER_TEMPLATE	100

You can check primerblast for more information on how to modify search options. For example, if want you to generate a larger hitsize, open the source of the primer designing tool and look for that string. You find the following:

<label for="HITSIZE" class="m ">Max number of sequences returned by Blast</label>
         <div class="input ">
                      <span class="sel si">
                      <select name="HITSIZE" id="HITSIZE" class= "opts checkDef" defVal="50000" >
                        <option  value="10">10</option>
                        <option  value="50">50</option>
                        <option  value="100">100</option>
                        <option  value ="250">250</option>
                        <option  value="500">500</option>
                        <option  value="1000">1000</option>
                        <option  value="10000">10000</option>
                        <option selected="selected"  value="50000">50000</option>
                        <option  value="100000">100000</option>
                      </select>                       
                    </span>
                    <a class="helplink hiding" title="help" id="hitsizeHelp" href="#"><i class="fas fa-question-circle"></i> <span class="usa-sr-only">Help</span></a>
                    <p toggle="hitsizeHelp" class="helpbox hidden">
                      Maximum number of database sequences (with unique sequence identifier) Blast finds for primer-blast to screen for primer pair specificities. Note that the actual number of similarity regions (or the number of hits) may be much larger than this (for example, there may be a large number of hits on a single target sequence such as a chromosome).   Choose a higher value if you need to perform more stringent search.
                    </p>

You can find the description and suggested values for this search option. HITSIZE ='1000000' is added to the search below along with several options that increase the number of entries returned from primer_search.

Example

forward_primer_seq = "TAGAACAGGCTCCTCTAG"
reverse_primer_seq =  "TTAGATACCCCACTATGC"
output_directory_path <- "/my/directory/12S_V5F1_remote_111122_modified_params"
metabarcode_name <- "12S_V5F1"
accession_taxa_sql_path <- "/my/directory/accessionTaxa.sql"

get_seeds_remote(forward_primer_seq,
                reverse_primer_seq,
                metabarcode_name,
                output_directory_path,
                accession_taxa_sql_path,
                HITSIZE ='1000000',
                evalue='100000',
                word_size='6',
                MAX_TARGET_PER_TEMPLATE = '5',
                NUM_TARGETS_WITH_PRIMERS ='500000', minimum_length = 50,
                MAX_TARGET_SIZE = 200,
                organism = c("1476529", "7776"), return_table = FALSE)

# This results in approximately 111500 blast seed returns (there is some variation due to database updates, etc.), note the default generated approximately 1047.
# This assumes the user is not throttled by memory limitations.

blast_seeds

Overview

blast_seeds takes the output from get_seeds_local or get_seeds_remote and iteratively blasts seed sequences using a stratified random sampling of a taxonomic rank (default is genus). The blast hits are de-duplicated, filtered, and returnedc as .csv files, a fasta file and a taxonomy file.

The intermediate results and metadata associated with a search in progress are saved as local files in the save directory blast_seeds_save. This allows the function to resume a partially completed blast, mitigating the consequences of encountering an error or experiencing other interruptions. To resume a partially completed blast, supply the same seeds and working directory. See the documentation of blast_datatable for more information.

Expected Output

During the blast_seeds the following data are cached as files in a temporary directory blast_seeds_save in the output_directory_path. These files are passed to and updated by blast_datatable: output_table.txt (most recent updates from the blast run), blast_seeds_passed_filter.txt (seed table that tracks the blast status of seeds), unsampled_indices.txt (list of seed indices that need to be blasted), too_many_ns.txt (tracks seeds that have been removed due to more consecutive Ns in a sequence than are acceptable (see parameter wildcards), blastdbcmd_failed.txt (tracks reads that are present in the seeds database, but not the local blast database. This is relevant for the results of get_seeds_remote, and lastly num_rounds.txt (tracks the number of completed blast round for a given seed file).

During the final steps of the function the final data is saved in rblast_seeds_output recording the results of the blast.The final output of blast_seeds are the following: summary.csv (blast output with appended taxonomy), {metabarcode_name}_.fasta, {metabarcode_name}.taxonomy, {metabarcode_name}_blast_seeds_summary_unique_taxonomic_rank_counts.txt, too_many_ns.txt, blastdbcmd_failed.txt.

Detailed Steps

blast_seeds passes a datatable returned by get_seeds_remote or get_seeds_local to blast_datatable, which uses a random stratified sample based on taxonomic rank to iteratively blast and process the seeds in the datatable. The user can specify how many sequences can be blasted simultaneously using max_to_blast. The randomly sampled seeds (or subsets of seeds) are sent to run_blastdbcmd_blastn_and_aggregate_resuts, which uses run_blastdbcmd to find a seed sequence that corresponds to the accession number and forward and reverse stops recorded in the seeds table. run_blastdbcmd outputs sequences as .fasta-formatted strings, which run_blastdbcmd_blastn_and_aggregate_resuts concatenates into a multi-line fasta, then passes to run_blastn as an argument. The output of run_blastn is de-replicated by accession, and only the longest read per replicates is retained in the output table. The run state is saved and passed back to blast_datatable.

For each blast iteration, once all of the seeds of the random sample are processed, they are removed from the seeds dataframe as are the seeds recovered through blast. blast-datatable repeats this process or stratified random sampling until there are fewer seed sequences remaining than max_to_blast, at which point it blasts all remaining seeds. The final aggregated results are cleaned for multiple blast taxids, hyphens, and wildcards and returned with taxonomy added using get_taxonomizr_from_accession.

Note: The blast db downloaded from NCBIs FTP site has representative accessions. This means that identical sequences have been collapsed across multiple accessions even if they have different taxids. Here we identify representative accessions with multiple taxids, and unpack all of the accessions that were collapsed into that representative accessions. blast_seeds does not identify or unpack representative accessions that report a single taxid.

Saving data: blast_datatable uses files generated in run_blastdbcmd_blastn_and_aggregate_resuts that store intermediate results and metadata about the search to local files as it goes. This allows the function to resume a partially completed blast, partially mitigating the consequences of encountering an error or experiencing other interruptions. Interruptions while blasting a subset of a random stratified sample will result in a loss of the remaining reads of the subsample, and may decrease overall blast returns. The local files are written to blast_seeds_save by rsave_state. Manually changing these files is not suggested as it can change the behavior of blast_datatable.

Restarting an interrupted blast_seeds run: To restart from an incomplete blast_seeds run, submit the previous command again. Do not modify the paths specified in the previous command, however parameter arguments (e.g. rank, max_to_blast) can be modified. blast_seeds will automatically detect save files and resume from where it left off.

Warning: If you are resuming from an interrupted blast, make sure you supply the same data.frame for blast_seeds. If you intend to start a new blast, make sure that there is not existing blast save data in the directory output_directory_path\blast_seeds_save.

Note: blast_datatable does not save intermediate data from run_blastdbcmd, so if it is interrupted while getting building the fasta to submit to run_blastn it will need to repeat some work when resumed. The argument max_to_blast controls the frequency with which it calls blastn, so it can be used to make blast_datatable save more frequently.

Parameters

seeds_output_path

a path to an output csv from get_seeds_local or get_seeds_remote

    e.g. seeds_output_path <- '/my/rCRUX_output_directory/12S_V5F1_filtered_get_seeds_remote_output_with_taxonomy.csv'

blast_db_path

blast_db_path a directory containing one or more blast-formatted database. For multiple blast databases, separate them with a space and add an extra set of quotes.

   e.g blast_db_path <- "/my/ncbi_nt/nt" or blast_db_path <- '"/my/ncbi_nt/nt  /my/ncbi_ref_euk_rep_genomes/ref_euk_rep_genomes"'

accession_taxa_sql_path

a path to the accessionTaxa sql created by taxonomizr.

    e.g. accession_taxa_sql_path <- "/my/accessionTaxa.sql"

output_directory_path

a directory in which to save partial and complete output.

    e.g. output_directory_path = "/path/to/output/12S_V5F1_local_111122_e300_111122"

metabarcode_name

a prefix for the output fasta, taxonomy, and count of unique ranks.

    e.g. metabarcode_name <- "12S_V5F1"

expand_vectors

logical, determines whether to expand too_many_Ns and not_in db into real tables and write them in the output directory.
```
  The default is expand_vectors = TRUE
```

warnings

value to set the "warn" option to during the function call. On exit it returns to the previous value. Setting this argument to NULL will not change the option. ...
additional arguments passed to blast_datatable sample_size
passed to blast_datatable is the the number of entries to sample per rank.

  The default sample_size = 1 - is recommended Note: unless the user is sampling higher order taxonomy.  If there are not enough seeds to sample per rank the run will end in an error.

max_to_blast

passed to blast_datatable and is the maximum number of entries to accumulate into a fasta before calling blastn.

   The default is max_to_blast = 1000 - Note: the optimal number of reads to blast will depend on the user's environment (available RAM) and the number of possible hits (determined by marker and parameters)

wildcards

passed to blast_datatable us a character vector that represents the minimum number of consecutive Ns the user will tolerate in a given seed or hit sequence.
```
  The default is wildcards = "NNNN"
```

rank

passed to blast_datatable is the data column representing the taxonomic rank to randomly sample.

  The default is rank = 'genus' - Note: sampling a lower rank  (e.g. species) will generate more total hits and take more time, conversely sampling a higher rank (e.g. family) will generate fewer total hits and take less time.

- Note: It is possible to blast all seeds using rank = 'all'. This may take a very long time but will produce the most complete output. ncbi_bin
passed to run_blastdbcmd and run_blastnis the path to blast+ tools if not in the user's path. Specify only if blastn and blastdbcmd are not in your path.

  The default is ncbi_bin = NULL - Note: if not specified in path do the following: ncbi_bin = "/my/local/ncbi-blast-2.10.1+/bin".

evalue

passed to run_blastn is the number of expected hits with a similar quality score found by chance.
```
 The default is evalue = 1e-6
```

coverage**

passed to run_blastn is the minimum percent of the query length recovered in the subject hits.
```
  The default is coverage = 50
```

perID

passed to run_blastn is the minimum percent identity of the query relative to the subject hits.
```
  The default is perID = 70
```

align

passed to run_blastn is the maximum number of subject hits to return per query blasted.
```
   The default is align = '50000'
```

minimum_length

removes each row that has a value less than the minimum_length in the product_length column.
```
    The default is minimum_length = 5
```

maximum_length

removes each row that has a value greater than maximum_length in the product_length column
```
   The default is maximum_length = 500
```

num_threads

is the number of CPUs to engage in the blastn search.
The default num_treads = NULL, uses [parallel::detectCores()] to determine the user's number of CPUs automatically and use that for the value of -num_threads.

Example

seeds_output_path <- "/my/directory/12S_V5F1_remote_111122_modified_params/blast_seeds_output/summary.csv""
output_directory_path <- "/my/directory/12S_V5F1_remote_111122_modified_params"
metabarcode_name <- "12S_V5F1"
accession_taxa_sql_path <- "/my/directory/accessionTaxa.sql"
blast_db_path <- "/my/directory/ncbi_nt/nt"


blast_seeds(seeds_output_path,
            blast_db_path,
            accession_taxa_sql_path,
            output_directory_path,
            metabarcode_name,
            rank = 'species',
            max_to_blast = 750)

# using the rank of species will increase the number of total unique blast hits
# modifying the max_to_blast submits fewer reads simultaneously and reduces overall RAM while extending the run

derep_and_clean_db

Overview

derep_and_clean_db takes the output from blast_seeds and de-replicates the dataset to identify representative sequences.

Expected Outputs

It generates an output directory called derep_and_clean_db at output_directory_path to store the output .csv files and the fasta and taxonomy file generated by the function.

Detailed Steps

Before de-replicating the data set, all sequences with NA taxonomy for phylum, class, order, family, and genus are removed from the dataset because they typically represent environmental samples with low value for taxonomic classification. These sequences are stored in a Sequences_with_mostly_NA_taxonomic_paths.csv

All sequences with the same length and composition are collapsed to a single database entry, where the accessions and taxids (if there are more than one) are concatenated. The sequences with a clean taxonomic path (e.g. no ranks with multiple entries) are written to Sequences_with_single_taxonomic_path.csv.

Sequences with multiple entries for a given taxonomic rank are written to Sequences_with_multiple_taxonomic_paths.cvs. These sequences are processed further by removing NAs from rank instances with more than one entry (e.g. "Chordata, NA" will mutate to "Chordata"). Any remaining instances of taxonomic ranks more than one taxid will be reduced to NA (e.g. "Badis assamensis, Badis badis" will mutate to "NA"). These sequences, with taxonomic paths shortened to the lowest taxonomic agreement, are written to Sequences_with_lowest_common_taxonomic_path_agreement.csv.

Lastly, the sequences from Sequences_with_single_taxonomic_path.csv and Sequences_with_lowest_common_taxonomic_path_agreement.csv are used to generate a fasta file and taxonomy file of representative NCBI accessions for each sequence. The number of accessions identical to the representative accession is given.

Parameters

output_directory_path

the path to the output directory

   e.g. "/path/to/output/12S_V5F1_remote_111122"

summary_path

the path to the input file

   e.g. "/path/to/output/12S_V5F1_remote_111122/blast_seeds_output/summary.csv"

metabarcode_name

used to name the subdirectory and the files.
```
   e.g. metabarcode_name <- "12S_V5F1"
```

Example

output_directory_path <- "/my/directory/12S_V5F1_remote_111122_modified_params"
summary_path <- "/my/directory/12S_V5F1_remote_111122_modified_params/blast_seeds_output/summary.csv"
metabarcode_name <- "12S_V5F1"


derep_and_clean_db(output_directory_path, summary_path, metabarcode_name)

Pre-Made Version Controlled Reference Databases

Primer Name	Gene	Length of Target	get_seeds_local() length minimum	get_seeds_local() length maximum	blast_seeds() length minimum	blast_seeds() length maximum	blast_seeds() max_to_blast	Forward Sequence (5'-3')	Reverse Sequence (5'-3')	Reference	Zenodo Link
MiFish Universal	12S	163–185	170	250	140	250	1000	GTGTCGGTAAAACTCGTGCCAGC	CATAGTGGGGTATCTAATCCCAGTTTG	Miya, M., Sato, Y., Fukunaga, T., Sado, T., Poulsen, J. Y., Sato, K., ... & Kondoh, M. (2015). MiFish, a set of universal PCR primers for metabarcoding environmental DNA from fishes: detection of more than 230 subtropical marine species. Royal Society open science, 2(7), 150088.	https://doi.org/10.5281/zenodo.7909637
MiFish Universal Expanded	12S	163–185	170	250	140	250	1000	GTGTCGGTAAAACTCGTGCCAGC	CATAGTGGGGTATCTAATCCCAGTTTG	Miya, M., Sato, Y., Fukunaga, T., Sado, T., Poulsen, J. Y., Sato, K., ... & Kondoh, M. (2015). MiFish, a set of universal PCR primers for metabarcoding environmental DNA from fishes: detection of more than 230 subtropical marine species. Royal Society open science, 2(7), 150088.	https://doi.org/10.5281/zenodo.7908864
Kelly 16s	16S	114-160	80	195	50	168	100	AGTTACYYTAGGGATAACAGCG	CCGGTCTGAACTCAGATCAYGT	Kelly, R. P., O’Donnell, J. L., Lowell, N. C., Shelton, A. O., Samhouri, J. F., Hennessey, S. M., ... & Williams, G. D. (2016). Genetic signatures of ecological diversity along an urbanization gradient. PeerJ, 4, e2444.	https://doi.org/10.5281/zenodo.7909659
Ford 16S	16S	330	279	477	231	429	1000	GCAATCACTTGTCTTTTAAATGAAGACC, GTAATCACTTGTCTTTTAAATGAAGACC	GGATTGCGCTGTTATCCCTA	Ford MJ, Hempelmann J, Hanson MB, Ayres KL, Baird RW, et al. (2016) Estimation of a Killer Whale (Orcinus orca) Population’s Diet Using Sequencing Analysis of DNA from Feces. PLOS ONE 11(1): e0144956. https://doi.org/10.1371/journal.pone.0144956	https://doi.org/10.5281/zenodo.7909642
MarVer3	16S	232-274	160	345	124	309	100	AGACGAGAAGACCCTRTG	GGATTGCGCTGTTATCCC	Valsecchi, E., Bylemans, J., Goodman, S. J., Lombardi, R., Carr, I., Castellano, L., ... & Galli, P. (2020). Novel universal primers for metabarcoding environmental DNA surveys of marine mammals and other marine vertebrates. Environmental DNA, 2(4), 460-476.	https://doi.org/10.5281/zenodo.7909663
MiDeca	16S	154-184	105	230	66	191	50	GGACGATAAGACCCTATAAA	ACGCTGTTATCCCTAAAGT	Komai, Tomoyuki, et al. "Development of a new set of PCR primers for eDNA metabarcoding decapod crustaceans." Metabarcoding and Metagenomics 3 (2019): e33835.	https://doi.org/10.5281/zenodo.7909669
Ceph18S	18S	150-190	105	235	65	195	100	CGCGGCGCTACATATTAGAC	GCACTTAACCGACCGTCGAC	D. S. W. de Jonge, V. Merten, T. Bayer, O. Puebla, T. B. H. Reusch, H.-J. T. Hoving, A novel metabarcoding primer pair for environmental DNA analysis of Cephalopoda (Mollusca) targeting the nuclear 18S rRNA region. R. Soc. Open Sci. 8, 201388 (2021)	https://doi.org/10.5281/zenodo.7909639
Scott Baker Dlp1.5-H/Oordlp4	D loop	350-390	245	495	205	455	100	TCACCCAAAGCTGRARTTCTA	GCGGGTTGCTGGTTTCACG	Baker, C. S., Steel, D., Nieukirk, S., & Klinck, H. (2018). Environmental DNA (eDNA) from the wake of the whales: Droplet digital PCR for detection and species identification. Frontiers in Marine Science, 5, 133.	https://doi.org/10.5281/zenodo.7909691
ITS2 Plants	ITS2	450 to 550 bp	315	600	270	560	100	ATGCGATACTTGGTGTGAAT	GACGCTTCTCCAGACTACAAT	Gu, W., Song, J., Cao, Y., Sun, Q., Yao, H., Wu, Q., ... & Duan, J. (2013). Application of the ITS2 region for barcoding medicinal plants of Selaginellaceae in Pteridophyta. PloS one, 8(6), e67818.	https://doi.org/10.5281/zenodo.7909655
trnl plants	trnl	~85bp	60	110	23	73	250	GGGCAATCCTGAGCCAA	TTTGAGTCTCTGCACCTATC	Coissac, E., Pompanon, F., Gielly, L., Miquel, C., Valentini, A., Vermat, T., ... & Willerslev, E. (2007). Power and limitations of the chloroplast trnL (UAA) intron for plant DNA barcoding. Nucleic Acids Research 3 (35),.(2007).	https://doi.org/10.5281/zenodo.7909700
MiSebastes	CytB	153	107	199	71	163	100	AAGCTCATTCAAGTGCTT	GACCACTTACACAATTCT	Min, M. A., Barber, P. H., & Gold, Z. (2021). MiSebastes: An eDNA metabarcoding primer set for rockfishes (genus Sebastes). Conservation Genetics Resources, 13(4), 447-456.	https://doi.org/10.5281/zenodo.7909674
18s SSU3/SSU4	18S v7	170	120	220	98	200	100	GGTCTGTGATGCCCTTAGATG	GGTGTGTACAAAGGGCAGGG	McInnes, J. C., Alderman, R., Deagle, B. E., Lea, M. A., Raymond, B., & Jarman, S. N. (2017). Optimised scat collection protocols for dietary DNA metabarcoding in vertebrates. Methods in Ecology and Evolution, 8(2), 192-202.	https://doi.org/10.5281/zenodo.7915854
Plant RBCL7/8	rbcl	180	170	250	140	250	100	CTCCTGAMTAYGAAACCAAAGA	GTAGCAGCGCCCTTTGTAAC	McFrederick, Q. S., and S. M. Rehan (2016). Characterization of pollen and bacterial community composition in brood provisions of a small carpenter bee. Molecular Ecology 25:2302–2311. & Spence, A. R., Wilson Rankin, E. E., & Tingley, M. W. (2022). DNA metabarcoding reveals broadly overlapping diets in three sympatric North American hummingbirds. The Auk, 139(1), ukab074.	https://doi.org/10.5281/zenodo.7909678
teleo L1848/H1913	12S	100	70	130	40	100	100	ACACCGCCCGTCACTCT	CTTCCGGTACACTTACCATG	Valentini, A., Taberlet, P., Miaud, C., Civade, R., Herder, J., Thomsen, P. F., ... & Gaboriaud, C. (2016). Next‐generation monitoring of aquatic biodiversity using environmental DNA metabarcoding. Molecular Ecology, 25(4), 929-942.	https://doi.org/10.5281/zenodo.7909697
Taberlet c/h	trnl	150	105	150	85	128	100	CGAAATCGGTAGACGCTACG	CCATTGAGTCTCTGCACCTATC	Taberlet, P., Gielly, L., Pautou, G., & Bouvet, J. (1991). Universal primers for amplification of three non-coding regions of. Plant molecular biology, 17, 1105-1109. & Taberlet, Pierre, Eric Coissac, François Pompanon, Ludovic Gielly, Christian Miquel, Alice Valentini, Thierry Vermat, Gerard Corthier, Christian Brochmann, and Eske Willerslev. Power and limitations of the chloroplast trn L (UAA) intron for plant DNA barcoding. Nucleic acids research 35, no. 3 (2007): e14-e14.	https://doi.org/10.5281/zenodo.7909695
Fungal_ITS gITS7/ITS4	FITS	150-350	105	500	65	461	100	GTGARTCATCGARTCTTTG	TCCTCCGCTTATTGATATGC	White, T. J., Bruns, T., Lee, S., & Taylor, J. (1990). Amplification and direct sequencing of fungal ribosomal RNA genes for phylogenetics. PCR Protocols: A Guide to Methods and Applications, 18(1), 315–322. & Ihrmark, K., Bödeker, I., Cruz-Martinez, K., Friberg, H., Kubartova, A., Schenck, J., Strid, Y., Stenlid, J., Brandström-Durling, M., & Clemmensen, K. E. (2012). New primers to amplify the fungal ITS2 region–evaluation by 454-sequencing of artificial and natural communities. FEMS Microbiology Ecology, 82(3), 666–677.	https://doi.org/10.5281/zenodo.7909648
18s v4 V4F-TAReuk454FWD1	18S	270	100	500	140	400	100	CCAGCASCYGCGGTAATTCC	ACTTTCGTTCTTGATYR	Stoeck, T., Bass, D., Nebel, M., Christen, R., Jones, M. D., BREINER, H. W., & Richards, T. A. (2010). Multiple marker parallel tag environmental DNA sequencing reveals a highly complex eukaryotic community in marine anoxic water. Molecular ecology, 19, 21-31.	https://doi.org/10.5281/zenodo.8407881
Leray CO1 mito	CO1	313	140	500	200	450	100	GGWACWGGWTGAACWGTWTAYCCYCC	TANACYTCnGGRTGNCCRAARAAYCA	Leray, M., Yang, J. Y., Meyer, C. P., Mills, S. C., Agudelo, N., Ranwez, V., ... & Machida, R. J. (2013). A new versatile primer set targeting a short fragment of the mitochondrial COI region for metabarcoding metazoan diversity: application for characterizing coral reef fish gut contents. Frontiers in zoology, 10(1), 34.	https://doi.org/10.5281/zenodo.8407603
Leray CO1 EMBL	CO1	313	140	500	200	450	100	GGWACWGGWTGAACWGTWTAYCCYCC	TANACYTCnGGRTGNCCRAARAAYCA	Leray, M., Yang, J. Y., Meyer, C. P., Mills, S. C., Agudelo, N., Ranwez, V., ... & Machida, R. J. (2013). A new versatile primer set targeting a short fragment of the mitochondrial COI region for metabarcoding metazoan diversity: application for characterizing coral reef fish gut contents. Frontiers in zoology, 10(1), 34.	https://doi.org/10.5281/zenodo.8407606
Leray CO1 searchterm	CO1	313	140	500	200	450	100	GGWACWGGWTGAACWGTWTAYCCYCC	TANACYTCnGGRTGNCCRAARAAYCA	Leray, M., Yang, J. Y., Meyer, C. P., Mills, S. C., Agudelo, N., Ranwez, V., ... & Machida, R. J. (2013). A new versatile primer set targeting a short fragment of the mitochondrial COI region for metabarcoding metazoan diversity: application for characterizing coral reef fish gut contents. Frontiers in zoology, 10(1), 34.	https://doi.org/10.5281/zenodo.8407620
Leray CO1 Combined	CO1	313	140	500	200	450	100	GGWACWGGWTGAACWGTWTAYCCYCC	TANACYTCnGGRTGNCCRAARAAYCA	Leray, M., Yang, J. Y., Meyer, C. P., Mills, S. C., Agudelo, N., Ranwez, V., ... & Machida, R. J. (2013). A new versatile primer set targeting a short fragment of the mitochondrial COI region for metabarcoding metazoan diversity: application for characterizing coral reef fish gut contents. Frontiers in zoology, 10(1), 34.	https://doi.org/10.5281/zenodo.8407632
18s v9	18S	87-186	50	300	50	300	100	TTGTACACACCGCCC	CCTTCYGCAGGTTCACCTAC	Amaral-Zettler, L. A., McCliment, E. A., Ducklow, H. W. & Huse, S. M. A method for studying protistan diversity using massively parallel sequencing of V9 hypervariable regions of small-subunit ribosomal RNA Genes. PLoS ONE 4, (2009).	https://doi.org/10.5281/zenodo.8407878
16S phytoplankton	16S	300	100	500	140	400	100	GTGYCAGCMGCCGCGGTAA	GGACTACNVGGGTWTCTAAT	Walters, W., Hyde, E. R., Berg-Lyons, D., Ackermann, G., Humphrey, G., Parada, A., ... & Knight, R. (2016). Improved bacterial 16S rRNA gene (V4 and V4-5) and fungal internal transcribed spacer marker gene primers for microbial community surveys. Msystems, 1(1), e00009-15.	https://doi.org/10.5281/zenodo.8407871
16S V4	16S	250	100	500	100	350	100	GTGYCAGCMGCCGCGGTAA	CCGYCAATTYMTTTRAGTTT	Parada, A. E., Needham, D. M. & Fuhrman, J. A. Every base matters: Assessing small subunit rRNA primers for marine microbiomes with mock communities, time series and global field samples. Environ. Microbiol. 18 (2016).	https://doi.org/10.5281/zenodo.8407899
MiFish Universal Expanded + FishCARD	12S	163–185	170	250	140	250	1000	GTGTCGGTAAAACTCGTGCCAGC	CATAGTGGGGTATCTAATCCCAGTTTG	Miya, M., Sato, Y., Fukunaga, T., Sado, T., Poulsen, J. Y., Sato, K., ... & Kondoh, M. (2015). MiFish, a set of universal PCR primers for metabarcoding environmental DNA from fishes: detection of more than 230 subtropical marine species. Royal Society open science, 2(7), 150088.	https://doi.org/10.5281/zenodo.8409238

References

Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.

Amaral-Zettler, L. A., McCliment, E. A., Ducklow, H. W., & Huse, S. M. (2009). A method for studying protistan diversity using massively parallel sequencing of V9 hypervariable regions of small-subunit ribosomal RNA Genes. PLoS ONE, 4(7), e6372. http://doi.org/10.1371/journal.pone.0006372

Camacho C., Coulouris G., Avagyan V., Ma N., Papadopoulos J., Bealer K., & Madden T.L. (2008) "BLAST+: architecture and applications." BMC Bioinformatics 10:421.

Curd, E.E., Gold, Z., Kandlikar, G.S., Gomer, J., Ogden, M., O'Connell, T., Pipes, L., Schweizer, T.M., Rabichow, L., Lin, M. and Shi, B., 2019. Anacapa Toolkit: An environmental DNA toolkit for processing multilocus metabarcode datasets. Methods in Ecology and Evolution, 10(9), pp.1469-1475. https://doi.org/10.1111/2041-210X.13214.

Hester, J., 2020. primerTree: Visually Assessing the Specificity and Informativeness of Primer Pairs.

Leray, M., Yang, J. Y., Meyer, C. P., Mills, S. C., Agudelo, N., Ranwez, V., Boehm, J. T., & Machida, R. J. (2013). A new versatile primer set targeting a short fragment of the mitochondrial COI region for metabarcoding metazoan diversity: Application for characterizing coral reef fish gut contents. Frontiers in Zoology , 10 (1), 1–14. https://doi.org/10.1186/1742-9994-10-34

R Core Team, R., 2021. R: A language and environment for statistical computing.

Riaz, T., Shehzad, W., Viari, A., Pompanon, F., Taberlet, P. and Coissac, E., 2011. ecoPrimers: inference of new DNA barcode markers from whole genome sequence analysis. Nucleic acids research, 39(21), pp.e145-e145.

Sherrill-Mix, S., 2019. taxonomizr: Functions to Work with NCBI Accessions and Taxonomy. R package version 0.5. 3.

Ye J, Coulouris G, Zaretskaya I, Cutcutache I, Rozen S, & Madden TL. (2012) "Primer-BLAST: a tool to design target-specific primers for polymerase chain reaction." BMC Bioinformatics 13:134.

rcrux's People

Contributors

Stargazers

Watchers

Forkers

guoyuh jungbluth ocstringham sformel-usgs

rcrux's Issues

Readme error

Likely the example pipeline section discussing "blast_seeds_local() or blast_seeds_remote()" is supposed to refer to get_seeds_local/remote?

Thanks!

update examples

Examples are run/executed on package checking .. so either:

wrap in \dontrun{...}
Use mock-db() in examples

Referencing magrittr::`%>%` pipe in functions

Use #' @importFrom magrittr %>%`` or usethis::use_pipe() instead of `%>%` <- magrittr::`%>%`. The usethis approach might be best, as it only needs to be stated once in the whole package (a file is created) ...

consistent use of packages

I saw readr was referenced once in get_seeds_local() .. it's a great package, so why not use it for all read_/write_ functions? or exclude it as a dependency?

Imports:
tibble,
dplyr,
magrittr,
taxonomizr,
data.table,
tidyr,
stringr,
lubridate,
primerTree,
progress,
zoo,
readr,
XML,
httr,
plyr,
phylotools

❯ checking dependencies in R code ... WARNING
'::' or ':::' imports not declared from:
'purrr' 'rlang'
Namespaces in Imports field not imported from:
'primerTree' 'zoo'
All declared Imports should be used.

clean/document modifiedPrimerTreeFunctions.R

Would be good to have proper documentation here.. where are these functions used? only in get_seeds_remote?

ncbi path variable not carried forward?

Getting this error in blast_seeds() after it successfully ran blastn; error happened after about an hour of runtime:

The number of unsampled indices is less than or equal to the maximum number to be blasted Running blastdbcmd on 1000 samples. Calling blastn. This may take a long time. 1969595 blast hits returned. 21715 unique blast hits after this round. Error in system2("blastdbcmd", args = c("-db", blast_db_path, "-dbtype", : error in running command

I'm thinking the ncbi_bin argument to blast_seeds() doesn't get scoped quite right, and so doesn't make it down to the function in which system2("blastdbcmd") gets called.

A `set.seed()` in randomisation functions e.g. species blast

Would be good to have the ability to reproduce randomisation parts .. would allow reproducible results for paper writing where randomisation occurs - means if analysis needs to be rerun, you should get the same numbers

Simplify logic in `blast_datatable` -> `run_blastdbcmd_blastn_and_aggregate_resuts()`

Address `check()` issues

There are quite a few from:
package imports
variable bindings
if() conditions comparing class() to string ..
documention of functions
examples in functions being run - either 'dont run' or use the mock-db
various others

just a matter of clean up

Length of sequence headers after derep and clean

I was trying to create a blastable database from the pre-compiled COI database CO1_combined_derep_and_clean.fasta . The header of the fasta sequnces after cleanup reads something like >KM252950.1_representative_of_12_identical_accessions. When I tried to use this file as the input fasta for makeblastdb , with the command
makeblastdb -in CO1_combined_derep_and_clean.fasta -dbtype nucl -parse_seqids -out rCRUX_COI -title "rCRUX_Leray"
I got an error message complaining about the length of the sequence headers
"BLAST Database creation error: Near line 1, the local id is too long. Its length is 52 but the maximum allowed local id length is 50. Please find and correct all local ids that are too long."
I checked this was the issue by shortening the sequence headers with sed
sed 's\_representative_of\\g' CO1_combined_derep_and_clean.fasta > CO1_ready.fasta
To save a headache to future Moncho, I also shortened the headers in the taxonomy file
sed 's\_representative_of\\g' CO1_combined_derep_and_clean_taxonomy.txt > CO1_tax_ready.txt

Now the headers read >KM252950.1_12_identical_accessions

I think we could modify derep_and_clean to account for this

Suggestion: multithread

A suggestion to make -num_threads an argument to blast_seeds(). It's otherwise buried; I think the right core function would be run_blastn(), which is doing the actual system2() call to blastn.

get_seeds_local() compares strings instead of numbers

Problem

line 351:
append_table <- read.csv(append_table_path, colClasses = "character")
lines 389-390:

f_and_r <- dplyr::mutate(f_and_r, product_length = dplyr::case_when((forward_start < reverse_start & forward_start < forward_stop & reverse_stop < reverse_start ) ~ (as.numeric(reverse_start) - as.numeric(forward_start)),
                                                                     (forward_start > reverse_start & forward_start > forward_stop & reverse_stop > reverse_start) ~ (as.numeric(forward_start) - as.numeric(reverse_start)),))

Solution

Something like this needs to go just before lines 389-390 ... or import those columns earlier as integers (would need to use readr::read_csv, however, as it lets you specify individual column classes unlike read.csv..

 f_and_r <-
    dplyr::mutate(f_and_r,
                  dplyr::across(c('forward_start', 'forward_stop', 'reverse_start', 'reverse_stop'), .fns = as.integer)
    )

Default evalue for run_primer_blastn is 3e+07

The documentation states that the default e-value for run_primer_blast is 3e-07, but looking at the run_primer_blastn code it is 3e+07, which seems really high. Which one should it be?

Improve `run_primer_blastn()` + ncbi_bin parameter

Can improve the code in run_primer_blastn() to deal with if the binary is in the PATH or a user supplied path. Currently the code is copied for each instance, but one has an extra "-num_alignments", "10000000",.

Goal is to have 1 system2() call, where the command is build given what we know about the PATH or a user supplied path

Add `Remotes` in description files for github repo files

Remotes:
cpauvert/psadd

Add `biocViews:` to description for bioconductor package installation

The packages "DECIPHER", "psadd", and "phyloseq" are required.

I believe these are from bioconductor

Consistent use of text output formats - csv, tsv

There are varying output formats across the package - write.table('*.txt', sep = ","), write.table('*.txt', sep = "\t"), write.csv('*.csv'), write.csv('*.txt')

We should decided on the preferred output and/or just be explicit when outputting - comma-delimited = .csv, tab-delimited =.txt ..i.e. should not have write.table to a .txt with a sep=","

NCBI download code

NCBI just added some smaller nucleotide databases to the FTP site, so you can get one for just eukaryotes if you want (see here. This also means the current code on the readme to download the nt database isn't quite right. If you just want the full nt database (not the partial databases), use this:

wget "ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.??.tar.gz*"

Having only the nt_euk DB might make COI a lot faster though...

Use of `paste0()` for file paths, can/should use `file.path()`

I've noticed heavy use of paste0 for file path building .. there is a base R file.path() function to do this.. i think the latter is better practice, although you need to keep consistent with it's use and idiosyncrasies (there will never be a trailing / ) .. so i did a quick change of the few that exist in get_seeds_local() but it didn't play nice with save_output_as_csv() 😅 given the later is pasting things again and it expected a trailing / in a path generated within get_seeds_local()

Depending on how interconnected different functions are that are pasting file.paths together will determine how easy of an update this would be.

ReadMe install pointing to old github repo

Went to update the rCrux package and noticed in the readme for the install of it that it is pointing to LunaGal/rCrux not the new location.

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.

Jobs

Jooble