hongqin / gwas_flow

This project forked from joyvalley/gwas_flow

GPU accelerated GWAS framework based on TensorFlow

License: MIT License

GWAS_Flow

Citing

GWAS-Flow was written and published in the hope that you might find it useful. If you do and use it for your research, please cite the paper published alongside the software, which is currently publicly accessible on the bioRxiv preprint server: https://www.biorxiv.org/content/10.1101/783100v1 (doi: 10.1101/783100)

Introduction

GWAS_Flow is an open-source, Python-based software providing a GPU-accelerated framework for performing genome-wide association studies (GWAS), published under the MIT License. GWAS is a set of major algorithms in quantitative genetics used to find associations between phenotypes and their respective genotypes, with a broad range of applications ranging from plant breeding to medicine. In recent years the data sets used for those studies have increased rapidly in size, and accordingly the time necessary to perform these studies on conventional CPU-powered machines has increased exponentially. Here we use TensorFlow, a framework that is commonly used for machine learning applications, to utilize graphics processing units (GPUs) for GWAS.

Requirements

Required Software

Required python packages

Docker and Singularity

Installation

git and anaconda

This has been tested on multiple Linux systems with anaconda versions > 4.7

clone the repository directly with git

git clone https://github.com/Joyvalley/GWAS_Flow

create an anaconda environment and install the necessary packages

conda create -n gwas_flow python=3.7.3
conda activate gwas_flow
conda install -y tensorflow==1.14 # conda install tensorflow-gpu==1.14 for gpu usage
conda install -y scipy pandas numpy h5py
conda install -y -c conda-forge pandas-plink 
conda install -y -c conda-forge matplotlib 
pip install limix

docker

For the installation with docker the only required software is docker itself.

git clone https://github.com/Joyvalley/GWAS_Flow.git 
cd GWAS_Flow
docker build  -t gwas_flow  docker

singularity

git clone https://github.com/Joyvalley/GWAS_Flow.git 

docker build  -t gwas_flow docker

!! make sure to change /PATH/TO/FOLDER
docker run -v /var/run/docker.sock:/var/run/docker.sock -v /PATH/TO/FOLDER:/output --privileged -t singularityware/docker2singularity:1.11 gwas_flow:latest
Then rename the resulting image, e.g. gwas_flow_latest-2019-08-19-8c98f492dd54.img, to gwas_flow_sing.img

Execution with anaconda installation

Input data

GWAS_Flow is designed to work with several different input data formats. For all of them there are sample data available in the folder gwas_sample_data/. The minimal requirement is to provide a genotype and a phenotype file. If no kinship matrix is provided, a kinship matrix according to van Raden is calculated from the provided marker information. Depending on the size of the marker matrix this can take a while.
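
To illustrate what a van Raden kinship matrix looks like, here is a minimal NumPy sketch of the textbook formula K = ZZ' / (2 * sum(p * (1 - p))). It is not taken from gwas.py and the code GWAS_Flow actually runs may differ. The function name vanraden_kinship and the 0/1/2 allele-count coding of the marker matrix are assumptions made for this example.

```python
import numpy as np


def vanraden_kinship(markers):
    """Textbook VanRaden kinship (hypothetical helper, not the gwas.py code).

    markers: array of shape (n_individuals, n_markers), assumed to hold
    0, 1 or 2 copies of the counted allele per individual and marker.
    """
    markers = np.asarray(markers, dtype=float)
    p = markers.mean(axis=0) / 2.0        # allele frequency of each marker
    z = markers - 2.0 * p                 # centre every marker column
    denom = 2.0 * np.sum(p * (1.0 - p))   # scaling factor over all markers
    if denom == 0:
        raise ValueError("all markers are monomorphic")
    return z @ z.T / denom


# Small random example: 5 individuals, 200 markers
rng = np.random.default_rng(0)
geno = rng.integers(0, 3, size=(5, 200))
K = vanraden_kinship(geno)
print(K.shape)
```

The product z @ z.T touches every marker for every pair of individuals, which is why the calculation takes longer as the marker matrix grows.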

hdf5 input

python gwas.py -x gwas_sample_data/AT_geno.hdf5 -y gwas_sample_data/phenotype.csv -k gwas_sample_data/kinship_ibs_binary_mac5.h5py

csv input

python gwas.py -x gwas_sample_data/G_sample.csv -y gwas_sample_data/Y_sample.csv -k gwas_sample_data/K_sample.csv

plink input

To use the PLINK data format, add a bed, bim and fam file with the same prefix to the folder. You can tell GWAS-Flow to use those files by using prefix.plink as the option for the genotype file.

python gwas.py -x gwas_sample_data/my_plink.plink -y gwas_sample_data/pheno2.csv -k gwas_sample_data/kinship_ibs_binary_mac5.h5py

Flags and options are

-x , --genotype : file containing marker information in csv or hdf5 format of size
-y , --phenotype : file containing phenotype information in csv format
-k , --kinship : file containing kinship matrix of size k X k in csv or hdf5 format
-m : name of column to be used in phenotype file. Default m='phenotype_value' 
--cof: file with cofactor information (only one co-factor as of now)
-a , --mac_min : integer specifying the minimum minor allele count necessary for a marker to be included. Default a = 1
-bs, --batch-size : integer specifying the number of markers processed at once. Default -bs 500000
-p , --perm : perform n permutations
--plot : create a Manhattan plot
-o , --out : name of output file. Default -o results.csv  
-h , --help : prints help and command line options

use python gwas.py -h to see the command line options
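
The -a and -bs options describe two generic steps: dropping markers with a low minor allele count, and processing the remaining markers in batches. The sketch below shows one way these steps could look. It is an illustration under assumptions, not the gwas.py implementation: the helper names minor_allele_count and iter_batches are hypothetical, and markers are assumed to be coded as 0/1/2 allele counts in diploid individuals.

```python
import numpy as np


def minor_allele_count(markers):
    """Minor allele count per marker column (hypothetical helper).

    Assumes 0/1/2 coding, so each individual carries two alleles.
    """
    markers = np.asarray(markers)
    counted = markers.sum(axis=0)          # copies of the counted allele
    total = 2 * markers.shape[0]           # alleles per marker overall
    return np.minimum(counted, total - counted)


def iter_batches(n_markers, batch_size):
    """Yield (start, stop) column ranges of at most batch_size markers."""
    for start in range(0, n_markers, batch_size):
        yield start, min(start + batch_size, n_markers)


geno = np.array([[0, 0, 2, 1],
                 [0, 2, 2, 0],
                 [0, 0, 2, 1],
                 [0, 2, 0, 0]])

mac = minor_allele_count(geno)
filtered = geno[:, mac >= 1]               # analogous to a minimum count of 1

for start, stop in iter_batches(filtered.shape[1], batch_size=2):
    batch = filtered[:, start:stop]        # markers handled in one step
    print(start, stop, batch.shape)
```

A smaller batch size lowers peak memory use at the cost of more iterations, which can matter when GPU memory is the limiting factor.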

Execution with docker and singularity

Execute the docker container with the sample data

docker run --rm -u $UID:$GID -v $PWD:/data gwas_flow:latest  -x gwas_sample_data/AT_geno.hdf5 -y gwas_sample_data/phenotype.csv -k gwas_sample_data/kinship_ibs_binary_mac5.h5py

On Windows you can use something like this after activating the file sharing for the drive the repo is stored on:

cd c:\PATH\TO\REPO\GWAS_Flow
docker run -v c:/PATH/TO/REPO/GWAS_Flow:/data gwas_flow:latest -x gwas_sample_data/AT_geno.hdf5 -y gwas_sample_data/phenotype.csv -k gwas_sample_data/kinship_ibs_binary_mac5.h5py

!! The GPU versions of docker and singularity are still under development and might or might not work properly with your setup. To run GWAS-Flow on GPUs as of now, we recommend the usage of anaconda environments.

Execute the singularity image with the sample data

singularity run  gwas_flow_sing.img -x gwas_sample_data/AT_geno.hdf5 -y gwas_sample_data/phenotype.csv -k gwas_sample_data/kinship_ibs_binary_mac5.h5py

further options

Co-factor

So far GWAS-Flow is capable of using one co-factor. The co-factor is added to the analysis with the flag --cof FILENAME, e.g.

 python gwas.py -x gwas_sample_data/G_sample.csv -y gwas_sample_data/Y_sample.csv -k gwas_sample_data/K_sample.csv --cof gwas_sample_data/cof.csv 

Permutation

Add the flag --perm 100 to calculate a significance threshold based on 100 permutations. Change 100 to any integer larger than 2 to perform n permutations.
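
The general idea of a permutation-based threshold can be sketched as follows: shuffle the phenotype values, rerun the association test, record the smallest p-value of each permutation, and take a low quantile of those minima. This is a generic sketch, not the procedure implemented in gwas.py. The per-marker Pearson correlation test stands in for the mixed model, and the function names, the 5% level and the random data are all assumptions.

```python
import numpy as np
from scipy import stats


def marker_pvalues(markers, phenotype):
    """Two-sided p-values of a per-marker Pearson correlation test.

    Stand-in for the real association model; assumes no monomorphic markers.
    """
    n = markers.shape[0]
    x = markers - markers.mean(axis=0)
    y = phenotype - phenotype.mean()
    r = (x * y[:, None]).sum(axis=0) / np.sqrt((x ** 2).sum(axis=0) * (y ** 2).sum())
    t = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    return 2.0 * stats.t.sf(np.abs(t), df=n - 2)


def permutation_threshold(markers, phenotype, n_perm=100, alpha=0.05, seed=0):
    """alpha-quantile of the minimum p-value across n_perm permutations."""
    rng = np.random.default_rng(seed)
    min_p = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(phenotype)   # break genotype-phenotype link
        min_p[i] = marker_pvalues(markers, shuffled).min()
    return np.quantile(min_p, alpha)


rng = np.random.default_rng(1)
geno = rng.integers(0, 3, size=(50, 200))       # 50 individuals, 200 markers
pheno = rng.normal(size=50)
threshold = permutation_threshold(geno, pheno, n_perm=100)
print(threshold)
```

More permutations give a more stable estimate of the threshold, but every permutation repeats the full scan over all markers.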

Manhattan plot

By default no plot is generated. If you add --plot True, a Manhattan plot is generated.

[Figure: manhattan]

The dash-dotted line is the Bonferroni threshold of significance and the dashed line is the permutation-based threshold. The latter is only calculated if the flag --perm n was used with n > 2.
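
For reference, a Bonferroni threshold divides the chosen significance level by the number of tests, and a Manhattan plot usually shows it on the -log10 scale. The sketch below assumes a level of 0.05, which is not stated above, and uses a hypothetical helper name.

```python
import math


def bonferroni_threshold(n_markers, alpha=0.05):
    """Per-marker p-value cutoff; alpha=0.05 is an assumed default."""
    return alpha / n_markers


threshold = bonferroni_threshold(10000)     # e.g. 10000 tested markers
height = -math.log10(threshold)             # position of the line on the plot
print(threshold, height)
```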

Performance Benchmarking and Recommendations

[Figure: time_plot]

The image displays the average time of 10 runs with 10000 markers each and a varying number of phenotypes for GWAS_Flow on GPU and CPUs and a standard R script for GWAS. The computational time grows exponentially with an increasing number of phenotypes. With lower numbers of phenotypes (< 800), the CPU version is faster than the GPU version. The difference gets more and more lopsided the more phenotypes are included. All calculations have been performed on 16 i9 vCPUs and an NVIDIA Tesla P100 graphics card.

