
ldsc's Introduction

LDSC (LD SCore) v1.0.1

ldsc is a command line tool for estimating heritability and genetic correlation from GWAS summary statistics. ldsc also computes LD Scores.

Getting Started

In order to download ldsc, you should clone this repository via the commands

git clone https://github.com/bulik/ldsc.git
cd ldsc

In order to install the Python dependencies, you will need the Anaconda Python distribution and package manager. After installing Anaconda, run the following commands to create an environment with LDSC's dependencies:

conda env create --file environment.yml
source activate ldsc

Once the above has completed, you can run:

./ldsc.py -h
./munge_sumstats.py -h

to print a list of all command-line options. If these commands fail with an error, then something has gone wrong during the installation process.

Short tutorials describing the four basic functions of ldsc (estimating LD Scores, h2 and partitioned h2, genetic correlation, the LD Score regression intercept) can be found in the wiki. If you would like to run the tests, please see the wiki.

Updating LDSC

You can update to the newest version of ldsc using git. First, navigate to your ldsc/ directory (e.g., cd ldsc), then run

git pull

If ldsc is up to date, you will see

Already up-to-date.

otherwise, you will see git output similar to

remote: Counting objects: 3, done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.
From https://github.com/bulik/ldsc
   95f4db3..a6a6b18  master     -> origin/master
Updating 95f4db3..a6a6b18
Fast-forward
 README.md | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

which tells you which files were changed. If you have modified the ldsc source code, git pull may fail with an error such as error: Your local changes to the following files would be overwritten by merge:.

In case the Python dependencies have changed, you can update the LDSC environment with

conda env update --file environment.yml

Where Can I Get LD Scores?

You can download European and East Asian LD Scores from 1000 Genomes here. These LD Scores are suitable for basic LD Score analyses (the LD Score regression intercept, heritability, genetic correlation, cross-sex genetic correlation). You can download partitioned LD Scores for partitioned heritability estimation here.

Support

Before contacting us, please try the following:

  1. The wiki has tutorials on estimating LD Scores, heritability, partitioned heritability, genetic correlation, and the LD Score regression intercept.
  2. Common issues are described in the FAQ.
  3. The methods are described in the papers cited below.

If that doesn't work, you can get in touch with us via the google group.

Issues with LD Hub? Email [email protected]

Citation

If you use the software or the LD Score regression intercept, please cite

Bulik-Sullivan, et al. LD Score Regression Distinguishes Confounding from Polygenicity in Genome-Wide Association Studies. Nature Genetics, 2015.

For genetic correlation, please also cite

Bulik-Sullivan, B., et al. An Atlas of Genetic Correlations across Human Diseases and Traits. Nature Genetics, 2015. Preprint available on bioRxiv doi: http://dx.doi.org/10.1101/014498

For partitioned heritability, please also cite

Finucane, HK, et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nature Genetics, 2015. Preprint available on bioRxiv doi: http://dx.doi.org/10.1101/014241

For stratified heritability using continuous annotation, please also cite

Gazal, S, et al. Linkage disequilibrium–dependent architecture of human complex traits shows action of negative selection. Nature Genetics, 2017.

If you find the fact that LD Score regression approximates HE regression to be conceptually useful, please cite

Bulik-Sullivan, Brendan. Relationship between LD Score and Haseman-Elston, bioRxiv doi: http://dx.doi.org/10.1101/018283

For LD Hub, please cite

Zheng, et al. LD Hub: a centralized database and web interface to perform LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis. Bioinformatics (2016)

License

This project is licensed under GNU GPL v3.

Authors

Brendan Bulik-Sullivan (Broad Institute of MIT and Harvard)

Hilary Finucane (MIT Department of Mathematics)

ldsc's People

Contributors

bulik, dariusz-ratman, dtaliun, emschorsch, hilaryfinucane, hmenager, mkanai, ofrei, rkwalters, stevengazal, tpoterba, wzhxu


ldsc's Issues

Using Meta-analysis Results Containing MetaboChip for LDSC

Hello,

According to the LDSC paper, MetaboChip data is not suitable for LDSC (at least ~1M SNPs are needed for LDSC, I think?).

Currently we have two meta-analysis datasets for LDSC correlation analysis, stage 1 (all GWAS, ~2M SNPs) and combined (GWAS, ~2M SNPs and MetaboChip, ~0.3M SNPs).

I am thinking of using stage 1 only, but the N is merely 10k, and the result has a relatively large SE. Using the combined one gives a seemingly better result, yet I am not sure if this is appropriate.

Best wishes,
Longda

Incorrect command in tutorial? --out to wrong file?

Near the top of 'Heritability and Genetic Correlation', there is a command just after "If you just want to compute heritability and the LD Score regression intercept, replace the last two commands with "

I think this command should be something like --out scz instead of --out scz_bip. The important thing is that it matches the file passed to less in the immediately following line.

Also, it might help to insert a space just before the \ in the --h2 parameter of the same command.

Feature request

--h2-observed-to-liability FN

where FN is a text file. Input has any number of rows, each row has label (like mdd_old), h2_obs, K, Ncase, Ncontrol. Output repeats the input and adds h2_liab.

Others will need this too, and it would be bad if people reported h2-obs as h2-liability in papers.
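For reference, the conversion this feature would batch over is the standard Lee et al. (2011) transformation. A minimal sketch, with a hypothetical helper name and Python's stdlib NormalDist standing in for scipy; K is the population prevalence and P the sample case proportion:

```python
from statistics import NormalDist

def h2_obs_to_liab(h2_obs: float, P: float, K: float) -> float:
    """Convert observed-scale h2 to the liability scale (Lee et al. 2011).

    P: proportion of cases in the GWAS sample; K: population prevalence.
    """
    nd = NormalDist()
    thresh = nd.inv_cdf(1.0 - K)   # liability threshold for prevalence K
    z = nd.pdf(thresh)             # standard normal density at the threshold
    return h2_obs * K**2 * (1 - K)**2 / (P * (1 - P) * z**2)
```

With K = 0.01 and P = 0.5, an observed-scale h2 of 0.5 maps to roughly 0.28 on the liability scale; the requested flag would just apply this row by row and append an h2_liab column.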

Can't find munge_sumstats.py script

Hello,

Thank you for making ldsc available !

I'm trying to do the LD Score tutorial from the wiki. I downloaded the PGC bip and scz files, but I can't find the munge_sumstats.py script. I tried the script ldsc/ldscore/sumstats.py instead; it produced two .pyc files, but no results. This is the command I used:
python sumstats.py --sumstats $datadir/pgc.cross.scz/pgc.cross.SCZ17.2013-05.txt --N 17115 --out scz

best,
roberto

CCA weights

The weight functions, as written, constrain the h2 estimates to be in the interval [0,1] and may return negative weights otherwise.

This is fine for QT traits, but is it right for CCA GWAS, where h2obs may well be > 1? This will require simulation and/or pencil and paper.

Probably not a huge issue

--h2 w/ suffix

would be nice if --h2 took the file name instead of prefix.

./ldsc.py -h ERROR

Hi,

I just installed ldsc, seem to have all the python packages installed, but still get this message:

-bash-4.1$ ./ldsc.py -h
Traceback (most recent call last):
File "./ldsc.py", line 13, in <module>
import ldscore.parse as ps
File "/data/sgg/zoltan/bin/ldsc/ldscore/parse.py", line 100
new_col_dict = {c: c + '_' + str(i) for c in y.columns if c != 'SNP'}
^
SyntaxError: invalid syntax

Does it give you a hint what may have gone wrong?

Thanks a lot

Zoltan

munge_sumstats.py error

Hi-
Thanks for making this great software available. I'm running into an error when using munge_sumstats.py.

Here's the error message

Call:
./munge_sumstats.py
--out IGANTMP
--merge-alleles /ifs/scratch/msph/db2175/data/LDSC/PartitionedHeritability/LDSCORE/w_hm3.snplist
--a1-inc
--N 11946.0
--snp rsID
--sumstats META_ALL5COHORTS_IGAN_08RSQ.txt

Interpreting column names as follows:
Allele2: Allele 2, interpreted as non-ref allele for signed sumstat.
MarkerName: Variant ID (e.g., rs number)
rsID: Variant ID (e.g., rs number)
Allele1: Allele 1, interpreted as ref allele for signed sumstat.
P-value: p-Value

Reading list of SNPs for allele merge from /ifs/scratch/msph/db2175/data/LDSC/PartitionedHeritability/LDSCORE/w_hm3.snplist
Read 1217311 SNPs for allele merge.
Reading sumstats from META_ALL5COHORTS_IGAN_08RSQ.txt into memory 5000000.0 SNPs at a time.
.
ERROR converting summary statistics:

Traceback (most recent call last):
File "/ifs/scratch/msph/biostat/db2175/ldsc/ldsc/munge_sumstats.py", line 639, in munge_sumstats
dat = parse_dat(dat_gen, cname_translation, merge_alleles, log, args)
File "/ifs/scratch/msph/biostat/db2175/ldsc/ldsc/munge_sumstats.py", line 250, in parse_dat
if ii.sum() == 0:
File "/ifs/scratch/msph/biostat/db2175/ldsc/lib/python2.7/site-packages/pandas/core/generic.py", line 698, in __nonzero__
.format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Conversion finished at Tue Mar 15 20:09:26 2016

This is my script:

#!/bin/bash

wd=`pwd`
out=IGANTMP
f=META_ALL5COHORTS_IGAN_08RSQ.txt
N=11946
hapmap=/ifs/scratch/msph/db2175/data/LDSC/PartitionedHeritability/LDSCORE/w_hm3.snplist
cd /ifs/scratch/msph/biostat/db2175/ldsc
source bin/activate
cd $wd

python /ifs/scratch/msph/biostat/db2175/ldsc/ldsc/munge_sumstats.py --sumstats $f \
--merge-alleles $hapmap \
--out $out --a1-inc --N $N --snp rsID

I have the right version of pandas (and of the other packages as well):

Python 2.7.6 (default, Dec 10 2013, 14:55:31)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

import pandas as pd
pd.__version__
'0.15.2'

Thanks for your help!
Daniel

space

295G of the 400G I have on orchestra are taken up by reference panels. The 83 cell-type specific reference panels take up 265G. What do you think about compressing and un-compressing?

change .l2.ldscore file format

we're not using the CM or MAF fields.

the chr and bp fields are only used for sorting the LD Scores in parse.py, but this might be overcautious, since ldsc always prints sorted LD Scores

error in rg

I am running rg and I get the following ERROR. However, I can run each of the datasets with another one...

Call:
./ldsc.py
--ref-ld-chr eur_w_ld_chr/
--out d1y_DBP
--rg /home/mbustamante/ldsc/diarrhea/d1y.sumstats.gz,DBP.sumstats.gz
--w-ld-chr eur_w_ld_chr/

Beginning analysis at Tue Dec 15 16:09:48 2015
Reading summary statistics from /home/mbustamante/ldsc/diarrhea/d1y.sumstats.gz ...
Read summary statistics for 1148956 SNPs.
Reading reference panel LD Score from eur_w_ld_chr/[1-22] ...
Read reference panel LD Scores for 1293150 SNPs.
Removing partitioned LD Scores with zero variance.
Reading regression weight LD Score from eur_w_ld_chr/[1-22] ...
Read regression weight LD Scores for 1293150 SNPs.
After merging with reference panel LD, 1147988 SNPs remain.
After merging with regression SNP LD, 1147988 SNPs remain.
Computing rg for phenotype 2/2
Reading summary statistics from DBP.sumstats.gz ...
Read summary statistics for 1217311 SNPs.
After merging with summary statistics, 1147988 SNPs remain.
973560 SNPs with valid alleles.
ERROR computing rg for phenotype 2/2, from file DBP.sumstats.gz.
Traceback (most recent call last):
File "/home/mbustamante/ldsc/ldscore/sumstats.py", line 343, in estimate_rg
rghat = _rg(loop, args, log, M_annot, ref_ld_cnames, w_ld_cname, i)
File "/home/mbustamante/ldsc/ldscore/sumstats.py", line 465, in _rg
intercept_gencov=intercepts[2], n_blocks=n_blocks, twostep=args.two_step)
File "/home/mbustamante/ldsc/ldscore/regressions.py", line 699, in __init__
np.multiply(hsq1.tot_delete_values, hsq2.tot_delete_values))
FloatingPointError: invalid value encountered in sqrt

Summary of Genetic Correlation Results
p1 p2 rg se z p h2_obs h2_obs_se h2_int h2_int_se gcov_int gcov_int_se
/home/mbustamante/ldsc/diarrhea/d1y.sumstats.gz DBP.sumstats.gz NA NA NA N A NA NA NA NA NA NA

repeat rsids

We only care whether there are repeats in the intersection of the 3 sets of SNPs. So we should (a) ignore all other repeat rsids, and (b) take out the offending SNPs while printing a warning instead of just throwing an error.
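A sketch of behavior (b), under the assumption that the rsIDs arrive as a plain list (helper name and warning text are hypothetical):

```python
from collections import Counter

def drop_repeat_rsids(rsids):
    """Remove every copy of any duplicated rsID, warning instead of raising."""
    counts = Counter(rsids)
    n_dup = sum(1 for c in counts.values() if c > 1)
    if n_dup:
        print(f"WARNING: dropping {n_dup} rsIDs that appear more than once")
    return [s for s in rsids if counts[s] == 1]
```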

no-intercept produces h2=4.84...

Hello,
I'm analysing a quantitative trait (brain volume) in a population of unrelated subjects, controlled for population stratification, without genomic control. If I don't constrain the intercept, I obtain:
h2=0.1743 (0.0527)
but if I add the --no-intercept flag, then
h2=4.8404 (0.0757) !
what can be the reason?
thank you in advance!

rg regression weights

possibly a math error in the rg regression weights -- should be

E[chi^2_1] E[chi^2_2] + 1*(rg stuff)

rather than

E[chi^2_1] E[chi^2_2] + 2*(rg stuff)

(check this)

Large Ratio values in univariate heritability estimations

Hello,
For a series of phenotypes I'm getting Ratio values of 17% and even 30%. The GWAS data is not genomic-controlled. However, when I tried a GC version of the same analyses, the results were not much different (I am not doing the GWAS analyses myself), and in general smaller than what I had obtained using GCTA. The number of subjects is N~13,000, European ancestry. What can explain the large Ratio values and the absence of difference between GC and non-GC heritability estimates?
Thank you!

better error messages for --rg

currently if one file in rg doesn't work, ldsc will print an error message then move on to the next file and will place an NA in the table at the end. would be better to have a 'completed with errors' message at the bottom of the log file + a copy of the error message (prevents having to scroll all the way up to find the error message in the middle of a big log file).

Missing tutorial/FAQ links

Hi. Thank you for providing a great software.

I just found out the links for the tutorial/FAQ in README.md were 404. They should be the links to the wiki, but they are currently pointing to ./tutorials or ./docs/FAQ.

Another issue I just found is an inconsistent explanation in the tutorial.
In the "Estimating Genetic Correlation" section, --ref-ld-chr takes eur_ref_ld_chr/ but the explanation says eur_w_ld_chr/....

typing --ref-ld-chr eur_ref_ld_chr/ tells ldsc to use the files eur_w_ld_chr/1.l2.ldscore, ... , eur_w_ld_chr/22.l2.ldscore.

Throughout the document, you sometimes mix-up the two, and it will possibly confuse beginners.

h2 is negative -1.105

Hi
my h2 estimate is negative (shown below). I didn't have any issue during data munging, so I don't know what's wrong. BTW, my sample size is 300+; do you think it is too small for estimating heritability or genetic correlation? Thank you

Total Observed scale h2: -1.105 (1.5709)
Lambda GC: 1.0436
Mean Chi^2: 1.0409
Intercept: 1.0478 (0.0093)
Ratio: 1.1669 (0.2275)
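For context, the Ratio printed in such logs is derived from the two lines above it, (intercept - 1) / (mean chi^2 - 1): the share of test-statistic inflation that the intercept attributes to confounding rather than polygenicity. A small sketch (helper name hypothetical):

```python
def ldsc_ratio(intercept: float, mean_chi2: float) -> float:
    """(intercept - 1) / (mean chi^2 - 1): fraction of inflation
    not attributed to polygenicity."""
    return (intercept - 1.0) / (mean_chi2 - 1.0)
```

With the rounded log values above, (1.0478 - 1) / (1.0409 - 1) is about 1.17, matching the reported Ratio; values well above 1 only arise when the intercept exceeds the mean chi-square.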

unexpected keyword argument 'subset'

I ran the following command:

python ldsc.py --h2 mydata.sumstats --ref-ld-chr eur_w_ld_chr/ --w-ld-chr eur_w_ld_chr/ --out myresults_ldsc

And got the following error:

Traceback (most recent call last):
File "ldsc.py", line 617, in
sumstats.estimate_h2(args, log)
File "/home/genetics/Desktop/ptsd_analysis_10/ldsc/ldscore/sumstats.py", line 258, in estimate_h2
args, log, args.h2)
File "/home/genetics/Desktop/ptsd_analysis_10/ldsc/ldscore/sumstats.py", line 234, in _read_ld_sumstats
sumstats = _read_sumstats(args, log, fh, alleles=alleles, dropna=dropna)
File "/home/genetics/Desktop/ptsd_analysis_10/ldsc/ldscore/sumstats.py", line 165, in _read_sumstats
sumstats = sumstats.drop_duplicates(subset='SNP')
TypeError: drop_duplicates() got an unexpected keyword argument 'subset'

Any advice?
Thanks

fatal error while running munge_sumstats.py

Hello,

After successfully installing ldsc on my system, I tried to run the munge_sumstats.py script to convert my input files to ldsc-format files. I thought this would be a smooth process, but it's not. While running munge_sumstats.py, I get a fatal error about the median value of beta. For reference, I have pasted the script and the fatal error below:

SCRIPT:
python munge_sumstats.py
--sumstats inFile.txt
--N 18759
--out out
--merge-alleles hapmap.txt

ERROR:
ERROR converting summary statistics:

Traceback (most recent call last):
File "/local/projects-t2/SIGN/packages/ldsc/munge_sumstats.py", line 654, in munge_sumstats
check_median(dat.SIGNED_SUMSTAT, signed_sumstat_null, 0.1, sign_cname))
File "/local/projects-t2/SIGN/packages/ldsc/munge_sumstats.py", line 366, in check_median
raise ValueError(msg.format(F=name, M=expected_median, V=round(m, 2)))
ValueError: WARNING: median value of beta is -3.6 (should be close to 0). This column may be mislabeled.

After this exercise, I thought that my input file has negative beta values that might cause the error. So I converted all negative values to positive and ran the same script. Unfortunately, the same error occurred. For reference, I have pasted the error below:

ERROR converting summary statistics:

Traceback (most recent call last):
File "/local/projects-t2/SIGN/packages/ldsc/munge_sumstats.py", line 654, in munge_sumstats
check_median(dat.SIGNED_SUMSTAT, signed_sumstat_null, 0.1, sign_cname))
File "/local/projects-t2/SIGN/packages/ldsc/munge_sumstats.py", line 366, in check_median
raise ValueError(msg.format(F=name, M=expected_median, V=round(m, 2)))
ValueError: WARNING: median value of beta is 3.6 (should be close to 0). This column may be mislabeled.

I hope that I am clear enough in explaining my issue.

Can someone explain why I am getting this error?

Thanks.

median value issue for munge_sumstats.py

I have a problem using munge_sumstats.py for one of my GWAS datasets. I used the effect size as the signed summary statistic; the error is shown below:

Traceback (most recent call last):
File "/nas02/home/k/x/kxia/software/ldsc/munge_sumstats.py", line 654, in munge_sumstats
check_median(dat.SIGNED_SUMSTAT, signed_sumstat_null, 0.1, sign_cname))
File "/nas02/home/k/x/kxia/software/ldsc/munge_sumstats.py", line 366, in check_median
raise ValueError(msg.format(F=name, M=expected_median, V=round(m, 2)))
ValueError: WARNING: median value of SIGNED_SUMSTATS is 2.2 (should be close to 0.0). This column may be mislabeled.

Any method to fix this?
Thank you
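The check that fails in both of these reports is a simple sanity test: a signed statistic should be centred on its null value (0 for betas and Z-scores, 1 for odds ratios). A median of 2.2 or 3.6 usually means the column holds odds ratios (or some other unsigned quantity) where betas or log-odds are expected; stripping signs, as tried in the previous report, cannot fix that. A minimal sketch of the check, simplified from the tracebacks above (exact behavior assumed):

```python
import statistics

def check_median(values, expected_median, tol=0.1, name="SIGNED_SUMSTAT"):
    """Raise if a signed-sumstat column's median is far from its null value."""
    m = statistics.median(values)
    if abs(m - expected_median) > tol:
        raise ValueError(
            f"WARNING: median value of {name} is {round(m, 2)} "
            f"(should be close to {expected_median}). "
            "This column may be mislabeled.")
    return m
```

The usual remedy is to pass the correct column and null via --signed-sumstats (e.g. an odds-ratio column has null 1, not 0) or to convert odds ratios to log-odds before munging.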

rename ldsc or ldsc.py

I'd like to import a function from ldsc.py, but when I try, it thinks I'm trying to import a module from ldsc. Could we rename one of those two? I don't want to make the change on my own because of the disastrous renaming attempt on bitbucket.

Number of EUR individuals in 1000 Genomes Project

Thanks for the well-documented software! It is really helpful. After reading the paper and wiki, I am wondering why you used 378 1000 Genomes Europeans individuals in the analysis. It seems that the number of phased EUR individuals is 379.

Some test failures

A few of the tests suggested in the wiki fail on my Windows machine, even with a factory-fresh Python and ldsc installation; see output below.

ldsc still seems to be working fine (e.g. the tutorial schizophrenia/bipolar correlation), but it's a bit disconcerting.

Windows 7
Python 2.7.10 :: Anaconda 2.3.0 (64-bit)
ldsc v1.0.0 commit 34394e8 (August 3)

Test results:

$ nosetests -A 'not slow'
..............................................................F.F...............
.........................................................
======================================================================
FAIL: test_n_cas_con_flag (test_munge_sumstats.test_process_n)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\Users\xxxxxx\ldsc\test\test_munge_sumstats.py", line 104, in test_n_c
as_con_flag
    assert_series_equal(dat.N, self.N_const)
  File "c:\Users\xxxxxx\AppData\Local\Continuum\Anaconda\lib\site-packages\panda
s\util\testing.py", line 701, in assert_series_equal
    assert_attr_equal('name', left, right)
  File "c:\Users\xxxxxx\AppData\Local\Continuum\Anaconda\lib\site-packages\panda
s\util\testing.py", line 552, in assert_attr_equal
    assert_equal(left_attr,right_attr,"attr is not equal [{0}]" .format(attr))
  File "c:\Users\xxxxxx\AppData\Local\Continuum\Anaconda\lib\site-packages\panda
s\util\testing.py", line 533, in assert_equal
    assert a == b, "%s: %r != %r" % (msg.format(a,b), a, b)
AssertionError: attr is not equal [name]: 'N' != None

======================================================================
FAIL: test_n_flag (test_munge_sumstats.test_process_n)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\Users\xxxxxx\ldsc\test\test_munge_sumstats.py", line 97, in test_n_fl
ag
    assert_series_equal(dat.N, self.N_const)
  File "c:\Users\xxxxxx\AppData\Local\Continuum\Anaconda\lib\site-packages\panda
s\util\testing.py", line 701, in assert_series_equal
    assert_attr_equal('name', left, right)
  File "c:\Users\xxxxxx\AppData\Local\Continuum\Anaconda\lib\site-packages\panda
s\util\testing.py", line 552, in assert_attr_equal
    assert_equal(left_attr,right_attr,"attr is not equal [{0}]" .format(attr))
  File "c:\Users\xxxxxx\AppData\Local\Continuum\Anaconda\lib\site-packages\panda
s\util\testing.py", line 533, in assert_equal
    assert a == b, "%s: %r != %r" % (msg.format(a,b), a, b)
AssertionError: attr is not equal [name]: 'N' != None

----------------------------------------------------------------------
Ran 137 tests in 3.229s

FAILED (failures=2)




$ nosetests -A 'slow'
.........................F....
======================================================================
FAIL: test_sumstats.Test_RG_Statistical.test_hsq_int_se
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\Users\xxxxxx\AppData\Local\Continuum\Anaconda\lib\site-packages\nose\
case.py", line 197, in runTest
    self.test(*self.arg)
  File "c:\Users\xxxxxx\ldsc\test\test_sumstats.py", line 274, in test_hsq_int_s
e
    map(t('intercept'), map(t('hsq2'), self.rg))), atol=0.1)
  File "c:\Users\xxxxxx\AppData\Local\Continuum\Anaconda\lib\site-packages\numpy
\testing\utils.py", line 1297, in assert_allclose
    verbose=verbose, header=header)
  File "c:\Users\xxxxxx\AppData\Local\Continuum\Anaconda\lib\site-packages\numpy
\testing\utils.py", line 665, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-07, atol=0.1

(mismatch 100.0%)
 x: array(0.5372853385973988)
 y: array(0.6539727418594192)

----------------------------------------------------------------------
Ran 30 tests in 43.639s

FAILED (failures=1)

Question regarding LD Score Estimation

Hi,

I am using the LD Score regression method to compute the genetic correlation between two different phenotypes in patients of European ancestry (using summary statistic data). I have noticed that some of my SNPs which reach genome-wide significance in both phenotypes are not present in the 'w_hm3.snplist' file or the 'eur_w_ld_chr' folder with the precomputed LD scores. Do the authors think this will make a difference to the estimate? If so, would it be wise to compute additional LD scores from the reference panel from which I imputed the data? Furthermore, if I do this, do I need to update the snplist file used to munge/reformat the data prior to analysis? If so, could the authors please suggest the best way to do this (including filtering)? Many thanks.

BW

Amit

Appropriate Error Message for Missing signed-sumstats

Hi.

If you don't specify --signed-sumstats and there is no column name ldsc understands for it, munge_sumstats.py currently fails with the error below.

Traceback (most recent call last):
  File "munge_sumstats.py", line 628, in <module>
    munge_sumstats(parser.parse_args(), p=True)
  File "munge_sumstats.py", line 533, in munge_sumstats
    sign_cname = sign_cnames[0]
IndexError: list index out of range

I think a more sophisticated error message should be provided by checking whether len(sign_cnames) == 0.

Thanks.
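The suggested fix is small; a sketch (helper name and message wording hypothetical; Z/BETA/OR are examples of signed columns, not an exhaustive list):

```python
def pick_sign_cname(sign_cnames):
    """Return the signed-sumstat column, with a readable error when none is found."""
    if len(sign_cnames) == 0:
        raise ValueError(
            "Could not find a signed summary statistic column "
            "(e.g. Z, BETA, OR). Specify one with --signed-sumstats.")
    return sign_cnames[0]
```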

munge_sumstats.py

Hello,

Thank you for making this interesting piece of software; I'm keen on using it on my own data sets. I've run into a spot of bother using the munge_sumstats.py provided (downloaded on 20/01/2015). I'm trying to use it on the data from the Psychiatric Genomics Consortium as per the tutorial. When I do, I receive the following error:

File "munge_sumstats.py", line 11, in
from ldscore import sumstats
ImportError: cannot import name sumstats

I've ensured that the ldscore.py files are in my working folder. Any help here would be great.

Best, William Hill

Specifying sample size when GWAS included related individuals

Hello,

I have performed a GWAS for a quantitative trait using a twin sample containing MZ and DZ siblings. To account for the effect of shared environment between siblings, I used a general estimating equation as described here: http://www.nature.com/mp/journal/v19/n11/full/mp2014121a.html

I would like to use the results in LD Score regression. Am I correct in thinking that I need to calculate and use the effective sample size?

Thanks,

Ollie

Option for map column to be in Morgans or cM

The Plink documentation says that the 3rd column of the map file should be in Morgans, so that's how we have coded all our files. I think that ldsc is expecting that column to be cM. Please use Morgans as the default and have an option to use cM instead.
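The conversion itself is one multiplication (1 Morgan = 100 cM), so such an option would only need a guard like this hypothetical helper applied to the map column:

```python
def map_to_cm(value: float, units: str = "cM") -> float:
    """Normalize a genetic-map position to centimorgans.

    units: "cM" (passed through) or "M" (Morgans, multiplied by 100).
    """
    if units == "M":
        return value * 100.0
    if units == "cM":
        return value
    raise ValueError(f"unknown map units: {units!r}")
```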

--overlap-annot and _keep_ld/_check_variance

Does --overlap-annot play nice with _keep_ld and _check_variance (in sumstats)? I found a to-do note in the code suggesting not

# check that M_annot == np.sum(annot_matrix,axis=0) and n_annot == annot_matrix.shape[1]
# make --overlap-annot versions of _keep_ld and _check_variance

misformatted .annot doesn't raise sensible error

I passed ldsc --l2 a mis-formatted annot file with only SNP / annot1 / annot2 columns and got an uninterpretable error message. Should fix this.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/bulik/ldsc_dev/ldsc.py in <module>()
    609                                 args.pq_exp = 1
    610
--> 611                         ldscore(args, log)
    612                 # summary statistics
    613                 elif (args.h2 or args.rg) and (args.ref_ld or args.ref_ld_chr) and (args.w_ld or args.w_ld_chr):

/Users/bulik/ldsc_dev/ldsc.py in ldscore(args, log)
    307         new_colnames = geno_array.colnames + ldscore_colnames
    308         df = pd.DataFrame.from_records(np.c_[geno_array.df, lN])
--> 309         df.columns = new_colnames
    310         if args.print_snps:
    311                 if args.print_snps.endswith('gz'):

/Library/Python/2.7/site-packages/pandas/core/generic.pyc in __setattr__(self, name, value)
   1956         try:
   1957             object.__getattribute__(self, name)
-> 1958             return object.__setattr__(self, name, value)
   1959         except AttributeError:
   1960             pass

/Library/Python/2.7/site-packages/pandas/lib.so in pandas.lib.AxisProperty.__set__ (pandas/lib.c:41295)()

/Library/Python/2.7/site-packages/pandas/core/generic.pyc in _set_axis(self, axis, labels)
    404
    405     def _set_axis(self, axis, labels):
--> 406         self._data.set_axis(axis, labels)
    407         self._clear_item_cache()
    408

/Library/Python/2.7/site-packages/pandas/core/internals.pyc in set_axis(self, axis, new_labels)
   2215         if new_len != old_len:
   2216             raise ValueError('Length mismatch: Expected axis has %d elements, '
-> 2217                              'new values have %d elements' % (old_len, new_len))
   2218
   2219         self.axes[axis] = new_labels

ValueError: Length mismatch: Expected axis has 6 elements, new values have 5 elements
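One way to fail earlier with a readable message would be to validate the .annot header before estimating anything, assuming the expected CHR/BP/SNP/CM leading columns (helper hypothetical):

```python
REQUIRED_ANNOT_COLS = ("CHR", "BP", "SNP", "CM")

def check_annot_header(columns):
    """Raise a readable error if a .annot file lacks the leading columns."""
    missing = [c for c in REQUIRED_ANNOT_COLS if c not in columns]
    if missing:
        raise ValueError(
            f".annot file is missing required columns {missing}; "
            f"found {list(columns)}")
```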

error for generate ldscore file when no annotation in that chromosome

Hi,
I found that when the annot file is an all-zero matrix, the script cannot generate the ldscore file.
Any way to fix that?

The log is below:
Call:
./ldsc.py
--print-snps LDSCORE/hm.22.snp
--ld-wind-cm 1.0
--out /home/unix/zzhang/hptmp/LDscoreTmp/igap_pval5.rsid.22
--bfile LDSCORE/1000G.mac5eur.22
--annot /broad/hptmp/zhizhuo/LDscoreAnnot/igap_pval5.rsid.22.annot.gz
--l2

Beginning analysis at Thu Jun 11 11:17:35 2015
Read list of 129364 SNPs from LDSCORE/1000G.mac5eur.22.bim
Read 53 annotations for 129364 SNPs from /broad/hptmp/zhizhuo/LDscoreAnnot/igap_pval5.rsid.22.annot.gz
Read list of 379 individuals from LDSCORE/1000G.mac5eur.22.fam
Reading genotypes from LDSCORE/1000G.mac5eur.22.bed
After filtering, 129364 SNPs remain
Estimating LD Score.
Traceback (most recent call last):
File "ldsc-master/ldsc.py", line 606, in
ldscore(args, log)
File "ldsc-master/ldsc.py", line 318, in ldscore
df.columns = new_colnames
File "/broad/software/free/Linux/redhat_6_x86_64/pkgs/anaconda_2.1.0/lib/python2.7/site-packages/pandas/core/generic.py", line 1958, in __setattr__
return object.__setattr__(self, name, value)
File "pandas/src/properties.pyx", line 65, in pandas.lib.AxisProperty.__set__ (pandas/lib.c:41294)
File "/broad/software/free/Linux/redhat_6_x86_64/pkgs/anaconda_2.1.0/lib/python2.7/site-packages/pandas/core/generic.py", line 406, in _set_axis
self._data.set_axis(axis, labels)
File "/broad/software/free/Linux/redhat_6_x86_64/pkgs/anaconda_2.1.0/lib/python2.7/site-packages/pandas/core/internals.py", line 2217, in set_axis
'new values have %d elements' % (old_len, new_len))
ValueError: Length mismatch: Expected axis has 6 elements, new values have 58 elements

Run failure

Hello Folks,

I would like to use this project to calculate the genetic correlation between depression and stroke using LD Score regression, so I started following the instructions in the documentation. I have successfully cloned the repository, but when I try to run the following command line, I get the ImportError described below:

Command line: /usr/local/packages/Python-2.7.8/bin/python ldsc/ldsc.py -h

Error:
Traceback (most recent call last):
File "ldsc/munge_sumstats.py", line 11, in
from scipy.stats import chi2
File "/usr/local/packages/Python-2.7.8/lib/python2.7/site-packages/scipy/stats/__init__.py", line 338, in
from .stats import *
File "/usr/local/packages/Python-2.7.8/lib/python2.7/site-packages/scipy/stats/stats.py", line 180, in
import scipy.special as special
File "/usr/local/packages/Python-2.7.8/lib/python2.7/site-packages/scipy/special/__init__.py", line 629, in
from .basic import *
File "/usr/local/packages/Python-2.7.8/lib/python2.7/site-packages/scipy/special/basic.py", line 18, in
from . import orthogonal
File "/usr/local/packages/Python-2.7.8/lib/python2.7/site-packages/scipy/special/orthogonal.py", line 101, in
from scipy import linalg
File "/usr/local/packages/Python-2.7.8/lib/python2.7/site-packages/scipy/linalg/__init__.py", line 190, in
from ._decomp_update import *
File "scipy/linalg/_decomp_update.pyx", line 1, in init scipy.linalg._decomp_update (scipy/linalg/_decomp_update.c:39096)
ImportError: /usr/local/packages/Python-2.7.8/lib/python2.7/site-packages/scipy/linalg/cython_lapack.so: undefined symbol: zlacn2

I have installed all required Python modules for the same version of Python which I am using to run the script.

Can someone please help me to solve this issue?

Thanks.

Best,
Tushar
