statgen-mi

CircleCI · codecov

License: GNU General Public License v3.0

Brief Summary

multiple imputation for association studies
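In a multiple imputation analysis, the association test is run independently on each of several simulated genotype draws, and the per-draw estimates are then pooled. As a minimal sketch of the standard pooling step (Rubin's rules; this illustrates the general technique, not necessarily this workflow's exact implementation):

    import math

    def pool_rubins_rules(estimates, variances):
        # estimates: per-draw effect estimates (e.g. regression betas)
        # variances: per-draw squared standard errors
        m = len(estimates)
        q_bar = sum(estimates) / m      # pooled point estimate
        u_bar = sum(variances) / m      # within-imputation variance
        # between-imputation variance
        b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
        t = u_bar + (1.0 + 1.0 / m) * b # total variance
        return q_bar, math.sqrt(t)      # pooled estimate and standard error

    # hypothetical betas and squared SEs from three imputation draws
    beta, se = pool_rubins_rules([0.12, 0.10, 0.15], [0.040, 0.052, 0.045])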

Authors

  • Lightning Auriga (@lightning-auriga)

Usage

If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository and, if available, its DOI (see above).

Step 1: Obtain a copy of this workflow

  1. Clone this repository to your local system, into the place where you want to perform the data analysis.
    git clone https://github.com/lightning-auriga/statgen-mi.git

Step 2: Configure workflow

Configure the workflow according to your needs by editing the files in the config/ folder. Adjust config.yaml to configure the workflow execution, and manifest.tsv to specify your sample setup.

Configuration settings

The following settings are available in the primary user configuration file, config/config.yaml (a sketch example follows this list):

  • manifest: location of primary run manifest file; defaults to config/manifest.tsv
  • tools: configuration options for association tools supported by the workflow
    • bcftools: configuration options specific to bcftools
      • executable: command to launch bcftools (see note below)
      • plugin_path: path to plugin libraries for this version of bcftools
      • note: this workflow uses bcftools +setGT to randomize genotypes from imputed probabilities. This functionality is not actually present in upstream +setGT; it has been modded in locally (see this repo for the code). Eventually, this will hopefully be migrated somewhere more useful, such as conda, but a local build suffices for this early stage. It is recommended to place the bcftools repo at ../bcftools relative to this workflow
    • plink2: configuration options specific to plink2 --glm methods
      • executable: command to launch plink2. If using conda, this should remain the default plink2
      • maxthreads: maximum number of threads to deploy in a plink2 task
      • maxmem: maximum RAM (in MB) supplied to a plink2 task
      • mi_draws: number of simulated imputation draws to generate for plink2 MI runs
  • imputed_datasets: user-defined sets of imputed data that can be selected for analysis
    • each tag under imputed_datasets should be unique, and can be used to refer to the dataset in the manifest
    • each tag should contain under it:
      • type: descriptor of the imputed file type. Currently the only accepted value is minimac4
      • filename: full path to and name of the imputed data file(s). For minimac4, these are the dose.vcf.gz file(s). If multiple paths are specified in an array, the files will each be processed in turn and concatenated (in order) after run completion
  • regression_models: user-defined sets of phenotypes and covariates that can be selected for analysis
    • each tag under regression_models should be unique, and can be used to refer to the model in the manifest
    • each tag should contain under it:
      • filename: full path to and name of a plink-format phenotype file containing the relevant variables. Other variables may also be present
      • phenotype: primary outcome for this regression model, as the corresponding header entry in the phenotype file
      • covariates: (optional) list of covariates for this regression model, as the corresponding header entry or entries in the phenotype file
      • model: descriptor of the association type. Currently recognized options are linear or logistic
      • vif: (optional) for tools that support it (primarily plink): variance inflation factor cap above which a model is suppressed
  • queue: user-defined configuration data for compute queue
    • small_partition: slurm partition (or equivalent for other cluster profiles) for jobs with the following restrictions:
      • max RAM will never exceed 3500M
      • max time will be less than 10 minutes
      • this setting is exposed to reduce compute cost, but can simply be set to the same value as the partition below if desired
    • large_partition: slurm partition (or equivalent for other cluster profiles) for jobs using maximum per-tool analysis settings, with the following additional restrictions:
      • at least 8000M RAM should be available for a task
      • jobs on the partition should be permitted to run at least four hours before being killed
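As a sketch, a minimal config/config.yaml consistent with the descriptions above might look like the following; all paths, tags, and partition names are illustrative placeholders, and the exact key layout should be checked against the shipped default config:

    manifest: "config/manifest.tsv"
    tools:
      bcftools:
        executable: "../bcftools/bcftools"
        plugin_path: "../bcftools/plugins"
      plink2:
        executable: "plink2"
        maxthreads: 4
        maxmem: 16000
        mi_draws: 10
    imputed_datasets:
      example_dataset:
        type: "minimac4"
        filename:
          - "/path/to/chr1.dose.vcf.gz"
          - "/path/to/chr2.dose.vcf.gz"
    regression_models:
      example_model:
        filename: "/path/to/phenotypes.tsv"
        phenotype: "pheno1"
        covariates:
          - "age"
          - "sex"
        model: "linear"
    queue:
      small_partition: "short"
      large_partition: "standard"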

Run manifest

Each desired MI run should be configured as a row of the run manifest, by default at config/manifest.tsv. The following entries are required for each run (an example manifest follows the list):

  • analysis: unique identifier for this particular run
  • imputed_dataset: tag for desired imputed dataset to use, as enumerated in config/config.yaml
  • tool: supported association tool for analysis
  • regression_model: tag for desired regression model to use, as enumerated in config/config.yaml
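For example, a one-run config/manifest.tsv using the placeholder tags from the configuration sketch above would be (columns tab-separated):

    analysis    imputed_dataset    tool      regression_model
    run1        example_dataset    plink2    example_model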

Step 3: Install Snakemake

Install Snakemake using conda:

conda create -c bioconda -c conda-forge -n snakemake snakemake

For installation details, see the instructions in the Snakemake documentation.

Step 4: Execute workflow

Activate the conda environment:

conda activate snakemake

Test your configuration by performing a dry-run via

snakemake --use-conda -n

Execute the workflow locally via

snakemake --use-conda --cores $N

using $N cores, or run it in a cluster environment via

snakemake --use-conda --profile /path/to/slurm-profile --jobs 100

See the Snakemake documentation for further details.

Step 5: Investigate results

More information will be added here as it becomes available. For now, draft results are populated under results/{analysis}/{imputed_dataset}/{tool}/{regression_model}.
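For example, using the placeholder manifest above, the plink2 results for run1 would appear under:

    results/run1/example_dataset/plink2/example_model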

Step 6: Commit changes

Whenever you change something, don't forget to commit the changes back to your GitHub copy of the repository:

git commit -a
git push

Testing

Tests for the embedded Snakemake Python scripts live in workflow/scripts/tests and are handled with pytest. snakemake_unit_tests integration is TBD.
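For example, to run the script tests locally (assuming pytest is installed in the active environment):

    pytest workflow/scripts/tests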

Version History

  • See the changelog for details.
  • Note that this project was originally written under the name "Cameron Palmer".


statgen-mi's Issues

resolve bcftools +setGT modding issue

I had tried modding some randomization support into bcftools +setGT, but the solution was rather ephemeral and was not kept in sync with upstream. I'm not sure how I want to approach this, but since this project is unlikely to ever be used, it's not a rush :)

CI

Once pytest is operational, get this running with CircleCI.

add snptest back in

Depending on how interested I am in this, I may add the old snptest frequentist options back in for testing purposes.

plink2 logistic regression

While in theory this is supported, no actual baked-in tests are currently configured, so something is almost certainly broken in there somewhere.

LMMs

This package was once intended to be used in part with LMM tools. The tools have entirely changed in the intervening time, but the idea still stands.

pytest

Pipeline scripts are mostly structured in readiness for pytest, but the tests themselves still need to be implemented.
