Single Nucleotide Polymorphism finder

Given a reference nucleotide sequence $R$ and a long read $r$, this algorithm finds - if it does exist - the position $i$ on $r$ such that $r$ matches a substring of $R$, except for $r_i$.

This solution was built on top of the concept of Spaced Seed.

Compilation

Install unordered_dense;
compile ntHash by setting as prefix the root of the repository (see ntHash docs for further information);

run the following:

mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ../
make

Usage

./spacedSeeds run <reference FASTA> <read FASTA>

There are some optional arguments which can be used to set custom parameters:

--k sets the length of the spaced seeds in the second phase of the algorithm
--firstK sets the length of the exact k-mers which will be extracted during the first part of the algorithm
--bloomFilterThreshold sets the probability of the bloom filter of encountering a false positive
--firstThreshold sets the threshold under which the areas are not taken into account for the second part of the algorithm.

How does it work

The algorithm is designed to follow these steps:

Approximate identification of possible areas of match. This step is crucial for the algorithm performance, since it allows to exclude almost all of $R$ from the next step, improving drastically the speed on real data, requiring much less memory and time. It makes use of a bloom filter containing informations on the existence or not of the $R$ exact k-mers. This step requires $\mathcal{O}(|R|)$ time. In more detail, it proceeds as follows:
1. $R$ is split into subsequences of length $\left\lceil \frac{|r|}{2} \right\rceil$
2. A bloom filter containing the exact k-mers $K_r$ of $r$ is constructed
3. For each partition $p$ of $R$ with exact k-mers $K_p$, the metric $\textrm{sim}(b, r) = \frac{| K_p \cap K_r |}{|K_p|}$ is computed and it is decided whether to consider it or not a candidate partition by using a threshold $\tau$.
Matching between $r$ and the candidate areas of $R$. At this moment, each candidate partition is considered until a match is found. More specifically, the algorithm works as follows:
1. The hashes of the spaced seeds of $r$ are extracted. Not all the possible seeds are requested to be saved since they would represent redundant information, hence only the ones having the "do not care" symbol are considered for the extraction, except for the last $k$ ones, which are needed to complete the set of possible match positions.
2. An index for $r$ is constructed
3. For each partition $i$ of $R$ with similarity $\geq \tau$, an index representing $i$ and the adjacent positions is constructed and used to find the match position by means of the seed hashes contained in both the indices of $i$ and $r$. Note that only the spaced seeds in common with $r$ are saved in the $i$ index. Moreover, the match visits the seeds from the least frequent in $i$ and adjacents to the most frequent.

Overall structure of the indices used during the matching phase.

Benchmarks

Benchmarks on reads with at least one match. Reference extracted from the GRCh38 genome.

Benchmarks on reads with no matches. Reference extracted from the GRCh38 genome.

gabrielecanesi / dna-snp-finder Goto Github PK

dna-snp-finder's Introduction

Single Nucleotide Polymorphism finder

Compilation

Usage

How does it work

Benchmarks

Credits

dna-snp-finder's People

Contributors

Stargazers

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs