gwasRtools

Some useful R functions for processing GWAS output

List of functions

get_loci()
get_nearest_gene()
lambda_gc()

Installation

To install the development version from GitHub use the remotes package:

remotes::install_github("lcpilling/gwasRtools")

remotes::install_github("lcpilling/gwasRtools@*release") # To install the latest release
remotes::install_github("lcpilling/[email protected]")   # To install a specific version (see tags)

Example dataset

The package includes a subset of variants from the Graham et al. 2021 GWAS of LDL in 1,320,016 Europeans (GWAS catalog GCST90239658). I will use this throughout.

library(gwasRtools)
head(gwas_example)
#>           SNP CHR        BP    A1 A2      MAF        BETA        SE     P
#> 1 rs370704455   1 104275751 CATCT  C 7.21e-02 -0.00350057 0.0040782 0.391
#> 2 rs545633289   1 104275758     C  G 5.25e-04 -0.36329200 0.2903640 0.211
#> 3 rs367651119   1 104275797     G  A 5.99e-03 -0.01142350 0.0164861 0.488
#> 4 rs560138913   1 104296901     T  G 3.67e-03 -0.05503450 0.0318479 0.084
#> 5 rs533558585   1 104296979     T  C 7.57e-05  0.00173860 0.1826840 0.992
#> 6 rs551694811   1 104297001     T  A 1.97e-03 -0.01075230 0.0316984 0.734

get_loci()

Determine loci from GWAS summary statistics. Use distance from lead significant SNP to estimate independent loci [default distance = 500kb]. Uses -log10(p) derived from BETA/SE so does not need P as input. Example below with default input.

gwas_loci = get_loci(gwas_example)
#> 
#> Locus size (bases) = 5e+05
#> P-value threshold = 5e-08
#> 
#> N variants = 319732
#> N variants p<threshold = 4132
#> N loci = 15

head(gwas_loci)
#>               SNP CHR        BP   A1 A2   MAF   BETA     SE        P locus  lead
#> 57882  rs12046439   1 107536799    T  C 0.248 0.0099 0.0017 5.01e-09     1 FALSE
#> 57900 rs143849791   1 107537916 CATG  C 0.325 0.0128 0.0016 5.85e-15     1 FALSE
#> 57922 rs113329442   1 107539252    A  G 0.330 0.0110 0.0014 1.27e-13     1 FALSE
#> 57987   rs3861909   1 107544176    G  A 0.327 0.0118 0.0015 3.52e-15     1 FALSE
#> 58025  rs17496332   1 107546375    A  G 0.331 0.0111 0.0014 8.70e-14     1 FALSE
#> 58091   rs2878349   1 107549245    G  A 0.327 0.0118 0.0014 2.33e-15     1 FALSE

gwas_loci |> dplyr::filter(lead==TRUE) |> head()
#>           SNP CHR        BP A1 A2   MAF    BETA     SE         P locus lead
#> 1 rs111232683   1 107566149  G  C 0.343  0.0135 0.0016  5.43e-17     1 TRUE
#> 2 rs114254196   1 108635400  C  T 0.008 -0.0448 0.0081  4.38e-08     2 TRUE
#> 3 rs115292790   1 109310728  G  A 0.013 -0.0563 0.0060  2.01e-20     3 TRUE
#> 4  rs12740374   1 109817590  G  T 0.219 -0.1482 0.0016 4.73e-305     4 TRUE
#> 5 rs140266316   1 110326545  G  A 0.016 -0.0577 0.0059  4.73e-22     5 TRUE
#> 6    rs657801   1 111736389  T  C 0.315  0.0090 0.0015  1.88e-09     6 TRUE

Loci are numbered. Variants within a locus (i.e., significant below the p_threshold and less than n_bases from last significant variant).
Lead variant for each locus is highlighted where lead==TRUE (i.e., smallest p-value for any variant within a locus)

The HLA region can be treated as one continuous locus by setting single_hla_locus to TRUE.

Use LD clumping to identify independent SNPs at the same locus

Setting option ld_clump=TRUE will use {ieugwasr} package ld_clump() function to run Plink LD clumping.

Default is to use a local Plink installation (this is faster) with EUR reference panel. But setting option ld_clump_local to FALSE will use the online IEU API. See the {ieugwasr} docs for details. Default R2 threshold for LD pruning is 0.01 (modify with ld_pruning_r2 option).

gwas_loci = get_loci(gwas_example, ld_clump=TRUE)
#> ** Performing LD clumping. Can take a few minutes
#> ** Local Plink installation will be called -- output appears in your R terminal
#> 
#> Locus size (bases) = 5e+05
#> P-value threshold = 5e-08
#> 
#> N variants = 319732
#> N variants p<threshold = 4132
#> N loci = 15
#> N independent variants (LD R2 threshold 0.01) = 153

head(gwas_loci)
#>               SNP CHR        BP   A1 A2   MAF   BETA     SE        P locus  lead lead_dist lead_ld
#> 57882  rs12046439   1 107536799    T  C 0.248 0.0099 0.0017 5.01e-09     1 FALSE     FALSE   FALSE
#> 57900 rs143849791   1 107537916 CATG  C 0.325 0.0128 0.0016 5.85e-15     1 FALSE     FALSE   FALSE
#> 57922 rs113329442   1 107539252    A  G 0.330 0.0110 0.0014 1.27e-13     1 FALSE     FALSE   FALSE
#> 57987   rs3861909   1 107544176    G  A 0.327 0.0118 0.0015 3.52e-15     1 FALSE     FALSE   FALSE
#> 58025  rs17496332   1 107546375    A  G 0.331 0.0111 0.0014 8.70e-14     1 FALSE     FALSE   FALSE
#> 58091   rs2878349   1 107549245    G  A 0.327 0.0118 0.0014 2.33e-15     1 FALSE     FALSE   FALSE

gwas_loci |> dplyr::filter(lead==TRUE) |> head()
#>           SNP CHR        BP A1 A2   MAF    BETA     SE        P locus lead lead_dist lead_ld
#> 1 rs111232683   1 107566149  G  C 0.343  0.0135 0.0016 5.43e-17     1 TRUE      TRUE    TRUE
#> 2 rs114254196   1 108635400  C  T 0.008 -0.0448 0.0081 4.38e-08     2 TRUE      TRUE    TRUE
#> 3 rs140300970   1 109020060  A  T 0.022 -0.0278 0.0049 2.13e-08     3 TRUE     FALSE    TRUE
#> 4 rs148503795   1 109166178  C  G 0.010 -0.0423 0.0070 2.47e-09     3 TRUE     FALSE    TRUE
#> 5  rs74896173   1 109167705  T  C 0.009 -0.0429 0.0076 2.08e-08     3 TRUE     FALSE    TRUE
#> 6 rs111751551   1 109242056  G  A 0.010 -0.0505 0.0069 4.32e-13     3 TRUE     FALSE    TRUE

Where before, locus 3 would only have had one lead SNP (based on distance/lowest p-value) ld_clump() has identified multiple independent variants in the region.

Note that there are now three lead columns:

lead_dist is the original lead column, simply based on distance and p-values
lead_ld is the direct results from ld_clump()
The lead column combines the two (some SNPs are missing from LD panel so it does not always choose the lowest p-value if only 1 variant identified at a locus).

** Note that ld_clump() only considers R^2 when defining independent variants, not D' -- you should perform additional checking/conditional analysis where relevant for lead variants in close proximity.

get_nearest_gene()

Get nearest gene from a set of variants using GENCODE data. Need to provide a data.frame of variant IDs (e.g., rsids), CHR and POS. Default column names are the same as for get_loci(). Default max distance from variant to gene is 100kb.

gwas_loci = get_nearest_gene(gwas_loci, build=37)
#> Using human genome build 37
#> Getting nearest gene for 4131 unique variants
#> (Removed 1 duplicated or missing variant IDs/positions)

head(gwas_loci)
#>           SNP CHR        BP   A1 A2   MAF   BETA     SE        P locus  lead  gene  dist
#> 1  rs12046439   1 107536799    T  C 0.248 0.0099 0.0017 5.01e-09     1 FALSE PRMT6 62468
#> 2 rs143849791   1 107537916 CATG  C 0.325 0.0128 0.0016 5.85e-15     1 FALSE PRMT6 61351
#> 3 rs113329442   1 107539252    A  G 0.330 0.0110 0.0014 1.27e-13     1 FALSE PRMT6 60015
#> 4   rs3861909   1 107544176    G  A 0.327 0.0118 0.0015 3.52e-15     1 FALSE PRMT6 55091
#> 5  rs17496332   1 107546375    A  G 0.331 0.0111 0.0014 8.70e-14     1 FALSE PRMT6 52892
#> 6   rs2878349   1 107549245    G  A 0.327 0.0118 0.0014 2.33e-15     1 FALSE PRMT6 50022

gwas_loci |> dplyr::filter(lead==TRUE) |> head()
#>           SNP CHR        BP A1 A2   MAF    BETA     SE        P locus lead     gene   dist
#> 1 rs111232683   1 107566149  G  C 0.343  0.0135 0.0016 5.43e-17     1 TRUE    PRMT6  33118
#> 2 rs114254196   1 108635400  C  T 0.008 -0.0448 0.0081 4.38e-08     2 TRUE SLC25A24  41258
#> 3 rs140300970   1 109020060  A  T 0.022 -0.0278 0.0049 2.13e-08     3 TRUE    NBPF6   6436
#> 4 rs148503795   1 109166178  C  G 0.010 -0.0423 0.0070 2.47e-09     3 TRUE  FAM102B -63467
#> 5  rs74896173   1 109167705  T  C 0.009 -0.0429 0.0076 2.08e-08     3 TRUE  FAM102B -64994
#> 6 rs111751551   1 109242056  G  A 0.010 -0.0505 0.0069 4.32e-13     3 TRUE  PRPF38B  -7111

If dist is positive, the variant is intergenic, and this is the distance to the closest gene.
If dist is negative, the variant is within a gene, and this is the distance to the start of the gene.
If dist is NA, the variant is not within n_bases of a gene in GENCODE.

lambda_gc()

Estimate inflation of test statistics. Lambda GC compares the median test statistic against the expected median test statistic under the null hypothesis of no association. For well-powered GWAS of traits with a known polygenic inheritance, we expect inflation of lambda GC. For traits with no expected association, we expect lambda GC to be around 1.

lambda_gc(gwas_example$P)
#> [1] 1.41112

lukepilling / gwasrtools Goto Github PK

gwasrtools's Introduction

gwasRtools

List of functions

Installation

Example dataset

get_loci()

Use LD clumping to identify independent SNPs at the same locus

get_nearest_gene()

lambda_gc()

gwasrtools's People

Contributors

Stargazers

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs