PROTAX for aligned sequences
c/ and scripts/ contain C code and Perl scripts required to train a Probabilistic Taxonomic classifier (PROTAX). Instructions are in the file readme.txt.
c2/ contains C code for classification. Sequences are represented as vectors of 64-bit integers (16 consecutive nucleotides packed into one long integer) to speed up the distance calculations. The speedup can be measured by running classify_v1 (sequences represented as character strings) and classify_v2 (sequences represented as 64-bit integer vectors). Both programs measure the time taken to calculate all pairwise distances between the query sequence and the reference sequences, and the time taken to convert the sequence distances into taxon probabilities.
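To illustrate why the packed representation is faster, here is a minimal sketch of one way to store 16 nucleotides in a 64-bit word and compare two aligned sequences word by word. It is not taken from c2/. The 4-bit one-hot encoding (A=1, C=2, G=4, T=8, anything else 0), the function names, and the choice to ignore columns where either sequence has no base are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUC_PER_WORD 16  /* 16 nucleotides x 4 bits = 64 bits */

/* Assumed encoding: one bit per base, 0 for gap or unknown. */
static uint64_t encode_base(char c) {
    switch (c) {
    case 'A': case 'a': return 1u;
    case 'C': case 'c': return 2u;
    case 'G': case 'g': return 4u;
    case 'T': case 't': return 8u;
    default:            return 0u;
    }
}

/* Number of 64-bit words needed for an alignment of len columns. */
size_t packed_words(size_t len) {
    return (len + NUC_PER_WORD - 1) / NUC_PER_WORD;
}

/* Pack an aligned sequence into out[0 .. packed_words(len)-1]. */
void pack_sequence(const char *seq, size_t len, uint64_t *out) {
    assert(seq != NULL && out != NULL);
    size_t nwords = packed_words(len);
    for (size_t w = 0; w < nwords; w++)
        out[w] = 0;
    for (size_t i = 0; i < len; i++)
        out[i / NUC_PER_WORD] |= encode_base(seq[i]) << (4 * (i % NUC_PER_WORD));
}

/* Portable population count (Kernighan's method). */
static int popcount64(uint64_t x) {
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

/* One bit per 4-bit slot, set where the slot holds a base. */
static uint64_t occupied(uint64_t x) {
    return (x | (x >> 1) | (x >> 2) | (x >> 3)) & UINT64_C(0x1111111111111111);
}

/* Fraction of mismatching columns among the columns where both
   sequences have a base; 1.0 if there is no such column. */
double packed_distance(const uint64_t *a, const uint64_t *b, size_t nwords) {
    int compared = 0, matches = 0;
    for (size_t w = 0; w < nwords; w++) {
        compared += popcount64(occupied(a[w]) & occupied(b[w]));
        matches  += popcount64(a[w] & b[w]);  /* one-hot: one bit per match */
    }
    if (compared == 0)
        return 1.0;
    return (double)(compared - matches) / (double)compared;
}
```

Each loop iteration handles 16 alignment columns with a few bitwise operations, whereas a character-string comparison handles one column per iteration.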
In addition, there are two variants of classify_v2:
- classify_rseq: classifies reference sequences without using self-similarity.
- classify_info: prints the nearest and second-nearest reference sequence to the query sequence in each node, along with the predictions.
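The bookkeeping behind these two variants can be sketched as a single pass that keeps the two smallest distances seen so far. This is a hypothetical illustration, not code from c2/: it works on a flat array of distances rather than on the nodes of a taxonomy, and it assumes that excluding self-similarity amounts to skipping the reference's own index. The names two_nearest, find_two_nearest, and NO_INDEX are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define NO_INDEX ((size_t)-1)

typedef struct {
    size_t best, second;      /* indices, NO_INDEX if not found */
    double best_d, second_d;  /* meaningful only with a real index */
} two_nearest;

/* Scan dist[0..n-1] once and return the two smallest entries.
   Pass skip = NO_INDEX to consider every entry, or the index of the
   sequence itself to leave it out (assumed self-similarity rule). */
two_nearest find_two_nearest(const double *dist, size_t n, size_t skip) {
    assert(dist != NULL || n == 0);
    two_nearest r = { NO_INDEX, NO_INDEX, 0.0, 0.0 };
    for (size_t i = 0; i < n; i++) {
        if (i == skip)
            continue;
        if (r.best == NO_INDEX || dist[i] < r.best_d) {
            /* new nearest: the old nearest becomes second nearest */
            r.second = r.best;
            r.second_d = r.best_d;
            r.best = i;
            r.best_d = dist[i];
        } else if (r.second == NO_INDEX || dist[i] < r.second_d) {
            r.second = i;
            r.second_d = dist[i];
        }
    }
    return r;
}
```

Without the skip, a reference sequence classified against its own reference set would always find itself at distance 0 as the nearest hit.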
There are also several utility programs that use the same fast distance calculations as classify_v2:
- dist_best: for each query sequence, gives the most similar reference sequence and the distance.
- dist_matrix: calculates all pairwise distances between the queries, and reports those less than a given threshold.
- dist_bipart: calculates all pairwise distances between the queries and references, and reports those less than a given threshold.
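As a rough picture of what a dist_matrix-style pass does, the sketch below compares every pair of aligned sequences and records the pairs whose distance falls below a threshold. It is written for this readme and does not reflect the real programs' arguments or output format. For brevity it uses a plain character-string distance (as classify_v1 does) instead of the packed representation, and pair_hit, pdist, and pairs_below are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { size_t i, j; double d; } pair_hit;

static int is_base(char c) {
    return c == 'A' || c == 'C' || c == 'G' || c == 'T';
}

/* Mismatch fraction over columns where both sequences have a base;
   1.0 if there is no such column. */
double pdist(const char *a, const char *b, size_t len) {
    size_t compared = 0, mismatches = 0;
    for (size_t k = 0; k < len; k++) {
        if (!is_base(a[k]) || !is_base(b[k]))
            continue;
        compared++;
        if (a[k] != b[k])
            mismatches++;
    }
    return compared ? (double)mismatches / (double)compared : 1.0;
}

/* Compare all pairs i < j among n sequences of length len. Store up
   to max_out pairs with distance below threshold in out, and return
   the total number of such pairs found. */
size_t pairs_below(const char *const *seqs, size_t n, size_t len,
                   double threshold, pair_hit *out, size_t max_out) {
    assert(seqs != NULL && out != NULL);
    size_t found = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            double d = pdist(seqs[i], seqs[j], len);
            if (d < threshold) {
                if (found < max_out) {
                    out[found].i = i;
                    out[found].j = j;
                    out[found].d = d;
                }
                found++;
            }
        }
    }
    return found;
}
```

A dist_bipart-style variant would loop over queries in the outer loop and references in the inner loop instead of over pairs i < j within one set.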