gabrielecanesi / dna-snp-finder

An algorithm for finding SNPs of a long read compared to a reference genome

License: MIT License

Languages: C++ 97.64%, CMake 2.36%
Topics: bioinformatics, dna-sequences, kmers, spaced-seed

Single Nucleotide Polymorphism finder

Given a reference nucleotide sequence $R$ and a long read $r$, this algorithm finds, if it exists, the position $i$ in $r$ such that $r$ matches a substring of $R$ everywhere except at $r_i$.
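
To make the problem statement concrete, here is a brute-force baseline in C++17. It is not the method used by this repository: it simply tries every alignment of $r$ on $R$ and accepts the first one with exactly one mismatch, in $\mathcal{O}(|R| \cdot |r|)$ time. The names `SnpMatch` and `findSnpNaive` are hypothetical and chosen only for illustration.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical result type: where the read aligns and which read position differs.
struct SnpMatch {
    std::size_t referenceOffset; // start of the matching substring in R
    std::size_t readPosition;    // index i such that r[i] != R[referenceOffset + i]
};

// Brute-force baseline (illustration only, not the repository's algorithm):
// returns the first alignment of `read` on `reference` with exactly one mismatch.
std::optional<SnpMatch> findSnpNaive(std::string_view reference, std::string_view read) {
    if (read.empty() || read.size() > reference.size()) return std::nullopt;
    for (std::size_t offset = 0; offset + read.size() <= reference.size(); ++offset) {
        std::size_t mismatches = 0;
        std::size_t position = 0;
        // Stop scanning this alignment as soon as a second mismatch is seen.
        for (std::size_t i = 0; i < read.size() && mismatches <= 1; ++i) {
            if (reference[offset + i] != read[i]) {
                ++mismatches;
                position = i;
            }
        }
        if (mismatches == 1) return SnpMatch{offset, position};
    }
    return std::nullopt;
}
```

For instance, `findSnpNaive("ACGTTGCA", "GTAG")` aligns the read at reference offset 2 and reports read position 2 as the differing nucleotide. An exact occurrence of the read (zero mismatches) is not reported by this sketch.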

The solution builds on the concept of spaced seeds.
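
As a small illustration of the idea, a spaced seed can be seen as a pattern of "care" and "do not care" positions: two k-mers match under the seed if they agree on every "care" position. The sketch below assumes a seed written as a string of `'1'` (care) and `'0'` (do not care) and compares masked strings directly. The repository hashes seeds with ntHash instead, so `applySeed` and `matchUnderSeed` are hypothetical helpers, not part of its API.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Projects a k-mer onto a spaced seed: positions marked '0' ("do not care")
// are replaced by a placeholder, positions marked '1' are kept.
std::string applySeed(std::string_view kmer, std::string_view seed) {
    std::string masked(kmer);
    for (std::size_t i = 0; i < seed.size() && i < masked.size(); ++i) {
        if (seed[i] == '0') masked[i] = '*';
    }
    return masked;
}

// Two k-mers match under the seed if they have the seed's length and are
// identical on all "care" positions.
bool matchUnderSeed(std::string_view a, std::string_view b, std::string_view seed) {
    return a.size() == seed.size() && b.size() == seed.size() &&
           applySeed(a, seed) == applySeed(b, seed);
}
```

With the seed `11011`, the k-mers `ACGTA` and `ACTTA` match, because their only difference falls on the "do not care" position. With `11101` they do not. This tolerance to a single substitution is what makes spaced seeds suitable for locating a read that differs from the reference in one nucleotide.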

Compilation

  1. Install unordered_dense;
  2. Compile ntHash, setting the root of this repository as the install prefix (see the ntHash docs for further information);
  3. Run the following:
    mkdir build && cd build
    cmake -DCMAKE_BUILD_TYPE=Release ../
    make

Usage

./spacedSeeds run <reference FASTA> <read FASTA>

There are some optional arguments which can be used to set custom parameters:

  • --k sets the length of the spaced seeds used in the second phase of the algorithm
  • --firstK sets the length of the exact k-mers extracted during the first phase of the algorithm
  • --bloomFilterThreshold sets the false-positive probability of the Bloom filter
  • --firstThreshold sets the similarity threshold below which areas are excluded from the second phase of the algorithm.
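
To give a feel for what a false-positive probability implies, the sketch below applies the standard textbook Bloom filter sizing formulas: for $n$ expected items and target probability $p$, the bit count is $m = \lceil -n \ln p / (\ln 2)^2 \rceil$ and the number of hash functions is about $(m/n) \ln 2$. Whether this repository sizes its filter this way is an assumption, and `BloomParameters` and `bloomParametersFor` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical container for the two derived Bloom filter parameters.
struct BloomParameters {
    std::size_t bits;   // m: size of the bit array
    std::size_t hashes; // number of hash functions
};

// Textbook sizing of a Bloom filter for `expectedItems` insertions and a
// target false-positive probability in (0, 1). Returns {0, 0} on invalid input.
BloomParameters bloomParametersFor(std::size_t expectedItems, double falsePositiveRate) {
    if (expectedItems == 0 || falsePositiveRate <= 0.0 || falsePositiveRate >= 1.0) {
        return {0, 0};
    }
    const double ln2 = std::log(2.0);
    const double n = static_cast<double>(expectedItems);
    const double m = std::ceil(-n * std::log(falsePositiveRate) / (ln2 * ln2));
    const double h = std::max(1.0, std::round(m / n * ln2));
    return {static_cast<std::size_t>(m), static_cast<std::size_t>(h)};
}
```

Under these formulas, lowering the probability increases memory use only logarithmically, so a stricter setting is usually cheap compared with the cost of examining false candidate areas.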

How it works

The algorithm is designed to follow these steps:

  1. Approximate identification of the possible match areas. This step is crucial for performance: it excludes almost all of $R$ from the next step, which drastically reduces running time and memory use on real data. It uses a Bloom filter to check, approximately, which exact k-mers of $R$ also occur in $r$. This step requires $\mathcal{O}(|R|)$ time. In more detail, it proceeds as follows:
    1. $R$ is split into partitions (contiguous subsequences) of length $\left\lceil \frac{|r|}{2} \right\rceil$.
    2. A Bloom filter containing the exact k-mers $K_r$ of $r$ is constructed.
    3. For each partition $p$ of $R$ with exact k-mers $K_p$, the metric $\textrm{sim}(p, r) = \frac{| K_p \cap K_r |}{|K_p|}$ is computed, and $p$ is kept as a candidate partition only if this value is at least a threshold $\tau$.
  2. Matching between $r$ and the candidate areas of $R$. Candidate partitions are examined one at a time until a match is found. More specifically, the algorithm works as follows:
    1. The hashes of the spaced seeds of $r$ are extracted. Not every possible seed needs to be stored, since that would be redundant: only the seeds containing the "do not care" symbol are extracted, apart from the last $k$ seeds, which are kept because they are needed to complete the set of possible match positions.
    2. An index for $r$ is constructed.
    3. For each partition $i$ of $R$ with similarity $\geq \tau$, an index covering $i$ and its adjacent positions is constructed and used to find the match position by means of the seed hashes contained in both the index of $i$ and the index of $r$. Note that only the spaced seeds shared with $r$ are stored in the index of $i$. Moreover, the matching visits the seeds in order of increasing frequency in $i$ and its adjacent positions, from the least frequent to the most frequent.
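
The first step above can be sketched as follows. This is a simplified illustration under stated assumptions: an exact `std::unordered_set` stands in for the Bloom filter (so there are no false positives), k-mers are stored as strings rather than hashes, and k-mers that span a partition boundary are ignored. `exactKmers` and `candidatePartitions` are hypothetical names, not functions of this repository.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_set>
#include <vector>

// Collects the distinct exact k-mers of a sequence.
std::unordered_set<std::string> exactKmers(std::string_view seq, std::size_t k) {
    std::unordered_set<std::string> kmers;
    for (std::size_t i = 0; i + k <= seq.size(); ++i) {
        kmers.insert(std::string(seq.substr(i, k)));
    }
    return kmers;
}

// Returns the start offsets of the partitions p of `reference` for which
// sim(p, r) = |K_p ∩ K_r| / |K_p| is at least `tau`.
std::vector<std::size_t> candidatePartitions(std::string_view reference,
                                             std::string_view read,
                                             std::size_t k, double tau) {
    std::vector<std::size_t> candidates;
    if (k == 0 || read.size() < k) return candidates;
    // Exact set used here as a stand-in for the Bloom filter over K_r.
    const auto readKmers = exactKmers(read, k);
    const std::size_t length = (read.size() + 1) / 2; // ceil(|r| / 2)
    for (std::size_t start = 0; start < reference.size(); start += length) {
        const auto partKmers = exactKmers(reference.substr(start, length), k);
        if (partKmers.empty()) continue; // partition shorter than k
        std::size_t shared = 0;
        for (const auto& kmer : partKmers) shared += readKmers.count(kmer);
        const double sim = static_cast<double>(shared) / static_cast<double>(partKmers.size());
        if (sim >= tau) candidates.push_back(start);
    }
    return candidates;
}
```

For example, with the reference `TTTTACGTACGTGGGG`, the read `ACGTACGT`, `k = 3` and `tau = 0.5`, the partitions have length 4 and only those starting at offsets 4 and 8 are kept. The two flanking partitions share no k-mer with the read and are discarded before any seed matching takes place.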

Figure (Indices): overall structure of the indices used during the matching phase.
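
The matching phase can be approximated by the sketch below. It rests on several assumptions: seeds are strings of `'1'` (care) and `'0'` (do not care), masked k-mer strings replace the ntHash hashes, a single candidate area is handled without its adjacent partitions, and every candidate alignment is verified by direct comparison. The index layout and the names `SeedIndex`, `seedKey`, `buildIndex` and `matchInArea` are hypothetical and do not reflect the repository's actual data structures.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>
#include <vector>

// Maps a spaced-seed key to the start positions where it occurs.
using SeedIndex = std::unordered_map<std::string, std::vector<std::size_t>>;

// Key of the window starting at `pos`, with "do not care" positions blanked.
std::string seedKey(std::string_view seq, std::size_t pos, std::string_view seed) {
    std::string key(seq.substr(pos, seed.size()));
    for (std::size_t i = 0; i < seed.size(); ++i)
        if (seed[i] == '0') key[i] = '*';
    return key;
}

// Indexes all seed windows of `seq`; with `filter`, keeps only shared keys.
SeedIndex buildIndex(std::string_view seq, std::string_view seed,
                     const SeedIndex* filter = nullptr) {
    SeedIndex index;
    for (std::size_t pos = 0; pos + seed.size() <= seq.size(); ++pos) {
        std::string key = seedKey(seq, pos, seed);
        if (filter == nullptr || filter->count(key) > 0)
            index[std::move(key)].push_back(pos);
    }
    return index;
}

// Returns the read position of the single mismatch if `read` aligns somewhere
// in `area` with exactly one mismatch, or nullopt otherwise.
std::optional<std::size_t> matchInArea(std::string_view area, std::string_view read,
                                       std::string_view seed) {
    if (seed.empty() || read.size() < seed.size() || area.size() < read.size())
        return std::nullopt;
    const SeedIndex readIndex = buildIndex(read, seed);
    const SeedIndex areaIndex = buildIndex(area, seed, &readIndex);

    // Visit shared seeds from the least to the most frequent in the area.
    std::vector<const SeedIndex::value_type*> order;
    for (const auto& entry : areaIndex) order.push_back(&entry);
    std::sort(order.begin(), order.end(), [](const auto* a, const auto* b) {
        return a->second.size() < b->second.size();
    });

    for (const auto* entry : order) {
        for (std::size_t areaPos : entry->second) {
            for (std::size_t readPos : readIndex.at(entry->first)) {
                if (areaPos < readPos) continue;
                const std::size_t offset = areaPos - readPos; // implied alignment
                if (offset + read.size() > area.size()) continue;
                std::size_t mismatches = 0, where = 0;
                for (std::size_t i = 0; i < read.size() && mismatches <= 1; ++i)
                    if (area[offset + i] != read[i]) { ++mismatches; where = i; }
                if (mismatches == 1) return where;
            }
        }
    }
    return std::nullopt;
}
```

Visiting rare seeds first keeps the number of alignments to verify small, since a seed that occurs once in the area implies a single candidate offset per occurrence in the read. As an example, `matchInArea("GGTACCATGCAAGTCC", "ACCAGGCA", "11011")` finds the alignment through the one shared seed whose "do not care" position covers the substitution, and returns read position 4.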

Benchmarks

Benchmarks with match

Benchmarks on reads with at least one match. Reference extracted from the GRCh38 genome.



Benchmarks without match

Benchmarks on reads with no matches. Reference extracted from the GRCh38 genome.

Credits

Contributors

gabrielecanesi
