This project forked from cpockrandt/genmap

License: Other

GenMap - Fast and Exact Computation of Genome Mappability

GenMap is a tool to compute the mappability, or alternatively the frequency, of nucleotide sequences (DNA and RNA). In particular, it computes the (k,e)-frequency, i.e., how often each k-mer from the sequence occurs with up to e errors in the sequence itself. The (k,e)-mappability is the inverse of the (k,e)-frequency. Hence, a mappability value of 1 at position i indicates that the k-mer starting at position i occurs only once in the sequence with up to e errors. A low mappability value indicates that the k-mer belongs to a repetitive region.

A small example of how to run GenMap is given below. For detailed examples, such as marker sequence computation on multiple fasta files, please check out our GitHub Wiki pages.

For questions or feature requests feel free to open an issue on GitHub or send an e-mail to christopher.pockrandt [ÄT] fu-berlin.de.

The corresponding paper will be uploaded to biorxiv.org in mid-March. Until then, major design changes of the interface and minor changes to its specification are possible.

Binaries

Your CPU must support the POPCNT instruction. If you have a modern CPU, you can use the optimized 64 bit version, which additionally uses instruction sets up to SSE4 (MMX, SSE, SSE2, SSE3, SSSE3, SSE4). This improves the running time by 10%. To verify whether your CPU supports these instruction sets, check the output of cat /proc/cpuinfo | grep -E "mmx|sse|popcnt" (Linux) or sysctl -a | grep -i -E "mmx|sse|popcnt" (Mac).
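
The same check can be scripted. The following is a minimal sketch, not part of GenMap: the function name missing_cpu_flags and the OPTIMIZED tuple are made up for illustration, and the exact set of flags the optimized binary needs is an assumption based on the list above. It parses the text of /proc/cpuinfo (Linux only) and reports which required flags are absent. Note that Linux reports SSE3 under the name pni, so it does not show up as "sse3".

```python
def missing_cpu_flags(cpuinfo_text, required=("popcnt",)):
    """Return the required flags that are absent from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "flags":
            present = set(value.split())
            return [flag for flag in required if flag not in present]
    # No 'flags' line found: treat every required flag as missing.
    return list(required)


# Assumed flag set for the optimized build, using Linux flag names
# (SSE3 is listed as "pni", SSE4 as "sse4_1" and "sse4_2").
OPTIMIZED = ("popcnt", "mmx", "sse", "sse2", "pni", "ssse3", "sse4_1", "sse4_2")

# Shortened, made-up sample of /proc/cpuinfo content.
sample = "processor : 0\nflags : fpu mmx sse sse2 pni ssse3 popcnt\n"
print(missing_cpu_flags(sample))             # []
print(missing_cpu_flags(sample, OPTIMIZED))  # ['sse4_1', 'sse4_2']
```

On a real Linux machine, pass open("/proc/cpuinfo").read() instead of the sample string. An empty list for the default argument means the standard binary should run; an empty list for OPTIMIZED suggests the optimized binary is an option.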

| Platform | Details | Additional requirements |
|---|---|---|
| Download Linux binaries | Linux 64 bit | - |
| | Linux 64 bit optimized | requires up to SSE4 |
| Download Mac binaries | Mac 64 bit | - |
| | Mac 64 bit optimized | requires up to SSE4 |

Building from source

Please note that building from source can easily take 10 minutes or longer, depending on your machine and compiler.

$ git clone --recursive https://github.com/cpockrandt/genmap.git
$ mkdir genmap-build && cd genmap-build
$ cmake ../genmap -DCMAKE_BUILD_TYPE=Release
$ make genmap
$ ./bin/genmap

If you are using a very old version of Git (< 1.6.5), the flag --recursive does not exist. In this case, you need to fetch the submodules separately before you can run cmake:

$ git clone https://github.com/cpockrandt/genmap.git
$ cd genmap
$ git submodule update --init --recursive

Requirements

Operating System: GNU/Linux, Mac
Architecture: Intel/AMD platforms that support POPCNT
Compiler: GCC ≥ 4.9, LLVM/Clang ≥ 3.8
Build system: CMake ≥ 3.0
Language support: C++14

Mappability example

The table below shows the (4,1)-mappability M and the (4,1)-frequency F of the nucleotide sequence T = ATCTAGCTTGCTAATCTA. Only mismatches (Hamming distance) are considered here. Support for insertions and deletions (edit/Levenshtein distance) is coming soon.

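| i | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| T[i] | A | T | C | T | A | G | C | T | T | G | C | T | A | A | T | C | T | A |
| M[i] | 0.33 | 0.33 | 0.33 | 0.5 | 0.25 | 0.5 | 0.5 | 0.5 | 0.5 | 0.25 | 0.5 | 1.0 | 1.0 | 0.33 | 0.33 | 0 | 0 | 0 |
| F[i] | 3 | 3 | 3 | 2 | 4 | 2 | 2 | 2 | 2 | 4 | 2 | 1 | 1 | 3 | 3 | 0 | 0 | 0 |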

The mappability value M[1] = 0.33 means that the 4-mer starting at position 1, T[1..4] = TCTA, occurs three times in the sequence with up to one mismatch, namely at positions 1 (TCTA), 9 (GCTA) and 14 (TCTA). The last three positions have no complete 4-mer, so their values are 0.
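
To make the definition concrete, the values in the table can be reproduced with a brute-force sketch. This is not GenMap's algorithm (GenMap works on an index); it simply compares every 4-mer against every other one, which only makes sense for tiny inputs. The function names are illustrative.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def frequency(text, k, e):
    """Brute-force (k,e)-frequency under Hamming distance.

    The last k-1 positions have no complete k-mer and keep frequency 0.
    """
    starts = range(len(text) - k + 1)
    freq = [0] * len(text)
    for i in starts:
        kmer = text[i:i + k]
        freq[i] = sum(hamming(kmer, text[j:j + k]) <= e for j in starts)
    return freq


def mappability(freq):
    """Inverse of the frequency; 0 where the frequency is 0."""
    return [1 / f if f else 0.0 for f in freq]


T = "ATCTAGCTTGCTAATCTA"
F = frequency(T, 4, 1)
M = mappability(F)
print(F)  # [3, 3, 3, 2, 4, 2, 2, 2, 2, 4, 2, 1, 1, 3, 3, 0, 0, 0]
print([round(m, 2) for m in M])
```

The running time is quadratic in the sequence length, so this is useful only for checking small examples by hand.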

The mappability can be exported in various formats that allow post-processing or display in genome browsers.

You can check out the (36,2)- and (24,1)-mappability on the human genome (GRCh38) in the UCSC Genome Browser for the plus strand (36, 2) / (24, 1) and for both strands (36, 2) / (24, 1).

Getting started

Building the index

First, you have to build an index of the fasta file(s) whose mappability you want to compute. This step only has to be performed once. You might want to check out the prebuilt indices available for download.

$ ./genmap index -G /path/to/fasta.fasta -I /path/to/index/folder

A new folder /path/to/index/folder will be created to store the index and all associated files.

Two algorithms are available for index construction: radix, which works in RAM, and skew, which uses secondary memory. Choose between them with -A radix or -A skew, depending on your main memory and disk quota limits. For skew, you can change the location of the temporary directory via the environment variable TMPDIR (e.g., to choose a directory with more quota):

$ export TMPDIR=/somewhere/else/with/more/space
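
When indexing is driven from a script, the algorithm choice and the temporary directory can be set per run instead of exporting TMPDIR globally. The sketch below is an assumption about how one might wrap the command shown above; index_command is a hypothetical helper, not part of GenMap, and it uses only the options mentioned in this section (-G, -I, -A). It builds the argument list and environment without executing anything.

```python
import os


def index_command(fasta, index_dir, algorithm=None, tmpdir=None, genmap="./genmap"):
    """Return (argv, env) for a `genmap index` run; nothing is executed here."""
    argv = [genmap, "index", "-G", fasta, "-I", index_dir]
    if algorithm is not None:
        if algorithm not in ("radix", "skew"):
            raise ValueError("algorithm must be 'radix' or 'skew'")
        argv += ["-A", algorithm]
    env = dict(os.environ)
    if tmpdir is not None:
        # Per the text above, the temp directory matters for skew.
        env["TMPDIR"] = tmpdir
    return argv, env


argv, env = index_command(
    "/path/to/fasta.fasta",
    "/path/to/index/folder",
    algorithm="skew",
    tmpdir="/somewhere/else/with/more/space",
)
print(" ".join(argv))
# ./genmap index -G /path/to/fasta.fasta -I /path/to/index/folder -A skew
```

To actually run it, pass the result to subprocess.run(argv, env=env, check=True).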

Computing the mappability

To compute the (30,2)-mappability of the previously indexed genome, simply run:

$ ./genmap map -E 2 -K 30 -I /path/to/index/folder -O /path/to/output/folder -t -w -b

This will create a text, wig and bed file in /path/to/output/folder, storing the computed mappability in different formats. You can skip formats that are not required by omitting the corresponding flags -t, -w or -b.

To output the frequency instead of the mappability, add the flag -fl to the previous command.
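
For scripted runs, the flags described above can be assembled programmatically. The following sketch is illustrative only: map_command is a hypothetical helper, not part of GenMap, and it relies solely on the flags named in this section (-E, -K, -I, -O, -t, -w, -b, -fl). It returns the argument list without running anything.

```python
def map_command(k, e, index_dir, output_dir, formats="twb",
                frequency=False, genmap="./genmap"):
    """Return the argv list for one `genmap map` run; nothing is executed."""
    argv = [genmap, "map", "-E", str(e), "-K", str(k),
            "-I", index_dir, "-O", output_dir]
    for letter in formats:
        if letter not in ("t", "w", "b"):
            raise ValueError(f"unknown output format flag: -{letter}")
        argv.append(f"-{letter}")
    if frequency:
        # Output the frequency instead of the mappability.
        argv.append("-fl")
    return argv


cmd = map_command(30, 2, "/path/to/index/folder", "/path/to/output/folder")
print(" ".join(cmd))
# ./genmap map -E 2 -K 30 -I /path/to/index/folder -O /path/to/output/folder -t -w -b
```

For example, map_command(30, 2, index_dir, output_dir, formats="w", frequency=True) requests only the wig file and the frequency output. The list can be passed to subprocess.run(cmd, check=True).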

Help pages and examples

A detailed list of arguments and explanations can be retrieved with --help:

$ ./genmap --help
$ ./genmap index --help
$ ./genmap map --help

More detailed examples can be found in the Wiki.

Pre-built indices

Building an index on a large genome takes some time and requires a lot of space. Hence, we provide indexed genomes for download. If you need other genomes indexed and do not have the computational resources, please send an e-mail to christopher.pockrandt [ÄT] fu-berlin.de.

| Genome | Index size (compressed) | Download |
|---|---|---|
| Human GRCh38 (hg38 patch 13) | 6.6 GB | GRCh38 index |
| Human GRCh37 (hg19 patch 13) | 6.4 GB | GRCh37 index |
| Mouse GRCm38 (mm10 patch 6) | 5.7 GB | GRCm38 index |
| Fruitfly D. melanogaster (dm6 rel. 6) | 0.3 GB | dm6 index |
| Worm C. elegans (ce11 WBcel235) | 0.2 GB | ce11 index |
