This project forked from scwatts/fastspar

Rapid and scalable correlation estimation for compositional data

License: GNU General Public License v3.0

FastSpar

Rapid and scalable correlation estimation for compositional data

Introduction

FastSpar is a C++ implementation of the SparCC algorithm that is up to several thousand times faster than the original Python 2 release and uses much less memory. It provides threading support and a p-value estimator that accounts for the possibility of repeated data permutations (see this paper for further details).

Noise removal and dimension reduction are important steps in correlation analysis. A common way to do both is distribution-based clustering of OTUs, which aims to reunite OTUs derived from sequencing error with their parent OTU by clustering raw OTUs on nucleotide edit distance and count distribution. FastSpar is paired with an efficient implementation of the popular distribution-based clustering method dbOTU3.

Citation

If you use this tool, please cite the FastSpar paper and the original SparCC paper.

Requirements

The pre-compiled static binaries have no requirements on 64-bit Linux distributions. Building and running dynamically linked binaries requires several libraries. For further information, see Compiling from source.

Installing

FastSpar can be installed via pre-compiled binaries, Bioconda, or from source.

GNU/Linux

For most 64-bit Linux distributions (e.g. Ubuntu, Debian, RedHat, etc.) the easiest way to obtain FastSpar is via the statically compiled binaries on the releases page. These binaries have no dependencies, so they can be downloaded and run immediately without any setup.

Bioconda

FastSpar is available through Bioconda (thanks to @epruesse):

conda install -c bioconda -c conda-forge fastspar

Compiling from source

Compiling from source requires these libraries and software:

C++11 (gcc-4.9.0+, clang-4.9.0+, etc)
OpenMP 4.0+
Gfortran
Armadillo 6.7+
LAPACK
BLAS (OpenBLAS is recommended)
GNU Scientific Library 2.1+
GNU getopt
GNU make
GNU autoconf
GNU autoconf-archive
GNU m4

After meeting the above requirements, compiling and installing FastSpar from source can be done by:

git clone https://github.com/scwatts/fastspar.git
cd fastspar
./autogen.sh
./configure --prefix=/usr/
make
make install

Once completed, the FastSpar executables can be run from the command line.
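
A quick way to confirm the install is to check that the executables used later in this README are on PATH. The sketch below is not part of FastSpar: check_tools is a hypothetical helper, and it only tests for the three tool names that appear in the Usage section.

```shell
# check_tools NAME...: hypothetical helper, not part of FastSpar.
# Prints whether each named executable is on PATH and returns the
# number of names that were not found (0 means all were found).
check_tools() {
    _missing=0
    for _tool in "$@"; do
        if command -v "$_tool" >/dev/null 2>&1; then
            echo "found:   $_tool"
        else
            echo "missing: $_tool"
            _missing=$((_missing + 1))
        fi
    done
    return "$_missing"
}

check_tools fastspar fastspar_bootstrap fastspar_pvalues \
    || echo "Some tools are not on PATH; check the install prefix."
```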

Usage

Correlation inference

To run FastSpar, you must have absolute OTU counts in a BIOM tsv format file (with no metadata). The file fake_data.tsv (from the original SparCC implementation) will be used as an example:

fastspar --otu_table tests/data/fake_data.tsv --correlation median_correlation.tsv --covariance median_covariance.tsv
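
Before running FastSpar on your own data, it can help to check that the table is well formed. The following is a minimal sketch, not a FastSpar feature. It assumes a tab-separated layout with one header line, an OTU identifier in the first column, and non-negative integer counts in the remaining columns. check_otu_table and the toy table are hypothetical.

```shell
# check_otu_table FILE: hypothetical helper, not part of FastSpar.
# Assumes a tab-separated table: one header line, then one row per OTU
# with an identifier in column 1 and integer counts in the other columns.
# Exits non-zero and prints a message for every problem it finds.
check_otu_table() {
    awk -F'\t' '
        NR == 1 { ncol = NF; next }          # remember the header width
        NF != ncol {
            printf "line %d: expected %d fields, found %d\n", NR, ncol, NF
            bad = 1
        }
        {
            for (i = 2; i <= NF; i++)
                if ($i !~ /^[0-9]+$/) {
                    printf "line %d, column %d: not an integer count: %s\n", NR, i, $i
                    bad = 1
                }
        }
        END { exit bad }
    ' "$1"
}

# Toy table written to a temporary file purely for illustration.
tmp_table=$(mktemp)
printf '#OTU ID\tsample_1\tsample_2\n' > "$tmp_table"
printf 'otu_1\t10\t0\n' >> "$tmp_table"
printf 'otu_2\t3\t25\n' >> "$tmp_table"

check_otu_table "$tmp_table" && echo "table looks well formed"
```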

The number of iterations (rounds of SparCC correlation estimation) and exclusion iterations (the number of times highly correlated OTU pairs are discovered and excluded) can also be tweaked:

fastspar --iterations 50 --exclude_iterations 20 --otu_table tests/data/fake_data.tsv --correlation median_correlation.tsv --covariance median_covariance.tsv

Further, the minimum threshold to exclude correlated OTU pairs can be increased:

fastspar --threshold 0.2 --otu_table tests/data/fake_data.tsv --correlation median_correlation.tsv --covariance median_covariance.tsv

Calculation of exact p-values

There are several methods to calculate p-values for inferred correlations. Here we have elected to use a robust permutation-based approach. This process involves inferring correlations from random permutations of the original OTU count data. The magnitude of each p-value is related to how often a more extreme correlation is observed for randomly permuted data. In the example below, we calculate p-values from 1000 bootstrap correlations.

First we generate the 1000 bootstrap counts:

mkdir bootstrap_counts
fastspar_bootstrap --otu_table tests/data/fake_data.tsv --number 1000 --prefix bootstrap_counts/fake_data

And then infer correlations for each bootstrap count (running in parallel on all available cores):

mkdir bootstrap_correlation
parallel fastspar --otu_table {} --correlation bootstrap_correlation/cor_{/} --covariance bootstrap_correlation/cov_{/} -i 5 ::: bootstrap_counts/*
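
Because the p-value step expects one correlation file per permutation, it can be worth confirming that every parallel job produced its output. The sketch below counts the files that share the prefix used in this example. count_matching is a hypothetical helper, and the expected value of 1000 simply mirrors the --number value used above.

```shell
# count_matching PREFIX: hypothetical helper, not part of FastSpar.
# Prints how many existing paths start with PREFIX.
count_matching() {
    set -- "$1"*                 # expand the glob into positional parameters
    if [ ! -e "$1" ]; then       # an unmatched glob stays a literal string
        echo 0
        return
    fi
    echo "$#"
}

expected=1000
found=$(count_matching bootstrap_correlation/cor_fake_data_)
if [ "$found" -ne "$expected" ]; then
    echo "expected $expected correlation files, found $found" >&2
fi
```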

From these correlations, the p-values are then calculated:

fastspar_pvalues --otu_table tests/data/fake_data.tsv --correlation median_correlation.tsv --prefix bootstrap_correlation/cor_fake_data_ --permutations 1000 --outfile pvalues.tsv
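
To make the idea of "how often a more extreme correlation is observed" concrete, here is a minimal sketch of a naive empirical p-value for a single OTU pair. It is not the estimator that fastspar_pvalues implements, which additionally accounts for repeated permutations. The add-one formula, the two-sided comparison on absolute values, the helper name empirical_pvalue, and the toy numbers are all assumptions made for illustration.

```shell
# empirical_pvalue OBSERVED FILE: hypothetical helper, not part of FastSpar.
# FILE holds one permuted correlation per line. Prints (b + 1) / (n + 1),
# where b counts permuted values at least as extreme as OBSERVED in
# absolute value and n is the number of permuted values.
empirical_pvalue() {
    awk -v obs="$1" '
        function abs(x) { return x < 0 ? -x : x }
        { n++; if (abs($1) >= abs(obs)) b++ }
        END { printf "%.4f\n", (b + 1) / (n + 1) }
    ' "$2"
}

# Toy permuted correlations, invented purely for illustration.
perm_file=$(mktemp)
printf '%s\n' 0.10 -0.35 0.05 0.42 -0.08 0.21 -0.15 0.02 0.30 > "$perm_file"

# Two of the nine values (-0.35 and 0.42) are at least as extreme as 0.33,
# so this prints (2 + 1) / (9 + 1) = 0.3000
empirical_pvalue 0.33 "$perm_file"
```

With only nine permutations the smallest p-value this estimate can return is 0.1, which is one reason the example above uses 1000.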

Threading

If FastSpar is compiled with OpenMP, threading can be used by invoking --threads <thread_number> at the command line for several tools:

fastspar --otu_table tests/data/fake_data.tsv --correlation median_correlation.tsv --covariance median_covariance.tsv --iterations 50 --threads 10
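
When --threads is combined with GNU parallel, as in the bootstrap step, running one job per core with several threads each can oversubscribe the machine. The sketch below shows one way to split cores between parallel jobs and FastSpar threads. jobs_for_threads is a hypothetical helper, the thread count of 4 is arbitrary, and nproc is assumed to be available (it ships with GNU coreutils).

```shell
# jobs_for_threads CORES THREADS: hypothetical helper, not part of FastSpar.
# Prints how many concurrent jobs fit when each job uses THREADS threads,
# using integer division and never returning fewer than one job.
jobs_for_threads() {
    _jobs=$(( $1 / $2 ))
    if [ "$_jobs" -lt 1 ]; then
        _jobs=1
    fi
    echo "$_jobs"
}

cores=$(nproc 2>/dev/null || echo 1)   # fall back to 1 if nproc is missing
threads=4
jobs=$(jobs_for_threads "$cores" "$threads")
echo "would run: parallel -j $jobs fastspar --threads $threads ..."
```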

Contributors

  • Scott Ritchie
    • Advised on the use of permutation-based statistical testing
    • Provided an example use of statmod::permp

License

GNU General Public License v3.0
