omerwe / polyfun Goto Github PK
View Code? Open in Web Editor NEWPolyFun (POLYgenic FUNctionally-informed fine-mapping)
License: MIT License
PolyFun (POLYgenic FUNctionally-informed fine-mapping)
License: MIT License
When df_snpvar
contains SNPs that aren't in df_sumstats
, the script throws an error and doesn't continue. I added several lines to simply remove the extra snps from df_snpvar
and continue.
lines 634-639
if df_snpvar.shape[0] < df_sumstats.shape[0]:
# raise ValueError('not all SNPs in the sumstats file are also in the annotations file')
# BMS edit:: Remove extra SNPs instead of stopping.
logging.info('Not all SNPs in the sumstats file are also in the annotations file.')
snp_filt = df_snpvar.SNP.isin(df_sumstats.SNP)
logging.info('Removing %d extra SNPs...' % (sum(snp_filt)))
df_snpvar = df_snpvar.loc[snp_filt, :]
In my case, this edit resulted in the removal of 34872 SNPs (using the sample annotations downloaded with the repo and Nalls et al GWAS summary stats).
Hello.
I noticed the PolyFun results on your home page (https://alkesgroup.broadinstitute.org/polyfun_results/), the results for each trait only contain more than 200,000 SNPs. But in the UKBB directory of your home page (https://alkesgroup.broadinstitute.org/UKBB/) , the sumstats file for each trait contains tens of millions of SNPs. Why is the number of SNPs so different?
Thank you!
In the paper it says:
multi-ethnic fine-mapping and incorporation of tissue-specific functional annotations may further increase fine-mapping power. Incorporating these into our fine-mapping framework is an avenue for future work.
https://www.biorxiv.org/content/10.1101/807792v3.full
What is the timeline for this? Or can you suggest a workaround?
For scenarios where someone's summary stats file has both SNP and CHR/POS, I modified the merging procedure so that column names don't get duplicated.
lines 66-76
#merge the dfs
# BMS edits
print('Merging metadata files by...')
if all(x in df_snps.columns for x in ['SNP', 'BP', 'CHR']):
logging.info(' SNP and position.')
df = df_meta.merge(df_snps, on=['SNP', 'A1', 'A2', 'SNP', 'BP', 'CHR'], how='inner')
elif 'SNP' in df_snps.columns:
print(' SNP.')
df = df_meta.merge(df_snps, on=['SNP', 'A1', 'A2'], how='inner')
else:
print(' position.')
df = df_meta.merge(df_snps, on=['CHR', 'BP', 'A1', 'A2'], how='inner')
Hello,
I would like to estimate the relative contribution of eQTLs to the heritability of a complex trait. Can you please tell em if your software could be used for that purpose. If not can you please recommend any other software that would work with GWAS data?
Cheers,
Ana
Hi Omer, I recently found a bug w/ the call munge function where A1
and A2
that are lowercase make all SNPs thrown out during fine mapping. The failure was noticed during Susie fine mapping when 0 SNPs
left after overlapping w/ ref and ld matrices. It was thrown as a warning rather than error. I propose to make them all upper to match with the UKB SNPs. I tested this and this saved 3/20 GWAS sum stats that I'm working with.
Also btw, pandas v1.0.0 doesn't seem to work with one obscure polyfun_utils.py function due to new syntax. Works with later versions.
def rename_df_columns(df_sumstats):
chr_column = find_df_column(df_sumstats, ['CHR', 'CHROMOSOME', 'CHROM'])
bp_column = find_df_column(df_sumstats, ['BP', 'POS', 'POSITION', 'COORDINATE', 'BASEPAIR'])
snp_column = find_df_column(df_sumstats, ['SNP', 'RSID', 'RS', 'NAME','MarkerName'])
a1freq_col = find_df_column(df_sumstats, ['A1FREQ', 'freq', 'MAF', 'FRQ'], allow_missing=True)
info_col = find_df_column(df_sumstats, 'INFO', allow_missing=True)
beta_col = find_df_column(df_sumstats, ['BETA', 'EFF', 'EFFECT', 'EFFECT_SIZE', 'OR'], allow_missing=True)
se_col = find_df_column(df_sumstats, ['SE'], allow_missing=True)
pvalue_col = find_df_column(df_sumstats, ['P_BOLT_LMM', 'P', 'PVALUE', 'P-VALUE', 'P_value', 'PVAL'], allow_missing=True)
z_col = find_df_column(df_sumstats, ['Z', 'ZSCORE', 'Z_SCORE'], allow_missing=True)
n_col = find_df_column(df_sumstats, ['N', 'sample_size'], allow_missing=True)
ncase_col = find_df_column(df_sumstats, ['N_cases', 'Ncase', 'Nca','Total_NCase'], allow_missing=True)
ncontrol_col = find_df_column(df_sumstats, ['N_controls', 'Ncontrol','Nco','Total_NControl'], allow_missing=True)
try:
allele1_col = find_df_column(df_sumstats, ['ALLELE1', 'A1', 'a_1'])
allele0_col = find_df_column(df_sumstats, ['ALLELE0', 'A0', 'a_0'])
except ValueError:
allele1_col = find_df_column(df_sumstats, ['ALLELE1', 'A1'])
allele0_col = find_df_column(df_sumstats, ['ALLELE2', 'A2'])
# make alleles uppercase
df_sumstats[allele1_col] = df_sumstats[allele1_col].str.upper()
df_sumstats[allele0_col] = df_sumstats[allele0_col].str.upper()
# rename columns
return df_sumstats.rename(columns={snp_column:'SNP', allele1_col:'A1',
allele0_col:'A2', a1freq_col:'MAF', bp_column:'BP',
chr_column:'CHR', info_col:'INFO', beta_col:'BETA',
se_col:'SE', pvalue_col:'P', z_col:'Z', n_col:'N',
ncase_col:'N_CASES', ncontrol_col:'N_CONTROLS'}, errors='ignore')
I get this error when running finemapper.py
using susie. It seems that df_ld_snps
and df_sumstats
have the same snpid
but the order is different. I sorted snpvar_ridge_constrained.gz
by position which fixed the problem, but I'm not sure why this file is not sorted. My munged sumstats are sorted by chr and position...
[INFO] Computing LD matrix for chromosome 17 BP 29786491-31538425
[INFO] Running LDStore...
[INFO] done in 7.71 seconds
[INFO] Running LDStore meta...
[INFO] Extracting LD matrix to a text file
[INFO] Running LDStore LD extraction...
[INFO] done in 15.35 seconds
Traceback (most recent call last):
File "/project2/xinhe/software/polyfun/run_finemapper.py", line 124, in <module>
verbose=args.verbose, ld=ld, df_ld_snps=df_ld_snps, debug_dir=args.debug_dir)
File "/project2/xinhe/software/polyfun/finemapper.py", line 401, in finemap
assert np.all(self.df_ld_snps['SNP'] == self.df_sumstats_locus['SNP'])
File "/project2/xinhe/software/miniconda3/envs/polyfun/lib/python3.6/site-packages/pandas/core/ops/common.py", line 64, in new_method
return method(self, other)
File "/project2/xinhe/software/miniconda3/envs/polyfun/lib/python3.6/site-packages/pandas/core/ops/__init__.py", line 521, in wrapper
raise ValueError("Can only compare identically-labeled Series objects")
ValueError: Can only compare identically-labeled Series objects
original 1.snpvar_ridge_constrained.gz
CHR SNP BP A1 A2 SNPVAR Z N
1 rs10875231 100000012 T G 1.6348e-08 1.2727e+00 80000
1 rs186077422 10000006 A G 1.8599e-08 -1.7100e+00 80000
1 rs6678176 100000827 T C 2.2472e-08 1.1075e+00 80000
1 rs78286437 100000843 T C 1.6348e-08 2.5414e-01 80000
1 rs144406489 100001138 A G 1.6348e-08 1.9074e-02 80000
sorted by BP
column:
CHR SNP BP A1 A2 SNPVAR Z N
1 rs575272151 11008 C G 2.3577e-08 -1.0408e+00 80000
1 rs544419019 11012 C G 2.3577e-08 -1.0408e+00 80000
1 rs540538026 13110 A G 3.8092e-08 -1.1355e+00 80000
1 rs62635286 13116 T G 3.6471e-08 -1.9663e-01 80000
1 rs200579949 13118 A G 3.6471e-08 -1.9663e-01 80000
Thanks.
Is this deliberate? It seems to mean you can't read compressed files since it overrides the automatic compression detection.
Hi,
I am interested in creating priors with PolyFun using approach 3. I would like to pass estimates of LD from the data I have, stored in .bcor files. Is this possible? The --ld-ukb flag does something similar, but seems to be set up for using UK Biobank specifically. I can see how to do the finemapping step with my bcor LD estimates, but I suspect accurate LD estimation in creating the priors will also be very important.
Thanks,
Joni
Thank you for your answer. When I was using Polyfun, I found that there were more than 9 million SNPs in my statistical summary file and I used the most complete reference file you gave, but I only took the first 5 annotations due to memory reasons, and many SNPs were deleted after one after another merge weight LDscore.After the chi^2 step, there are only over 1000 SNPs remain, so there are too few SNPs left. Could you please explain the reason to me?How is it possible to get more SNPs and get better results
Hi, I am receiving the following error after executing the test run for polyfun. I ran the following command in the polyfun conda environment:
python test_polyfun.py --python3 python --ldstore ldstore_v1.1_x86_6
4/ldstore
The following was the error message:
Writing fine-mapping results to /tmp/tmp4xe7wax0/finemap.1.46000001.49000001.gz Traceback (most recent call last): File "test_polyfun.py", line 226, in <module> test_finemapper(temp_dir, args.ldstore, args.python3) File "test_polyfun.py", line 208, in test_finemapper compare_dfs(tmpdir, gold_dir, outfile) File "test_polyfun.py", line 35, in compare_dfs assert np.all(df1[c] == df2[c]), 'found mismatch between %s and %s in column %s'%(file1, file2, c) AssertionError: found mismatch between /tmp/tmp4xe7wax0/finemap.1.46000001.49000001.gz and /home/tkamath/polyfun/gold/finemap.1.46000001.49000001.gz in column SNP
Hello Omer,
I used UKB LD in my fine mapping process with finemap method as following:
python finemapper.py
--ld ../UKB_LD/chr19_41000001_44000001
--sumstats output/ea.bolt.19.snpvar_ridge_constrained.gz
--n 80000
--chr 19
--start 40853017
--end 41853107
--method finemap
--finemap-exe finemap/default/finemap
--max-num-causal 10
--cache-dir LD_cache
--out output/finemap.finemap_exe.1.46000001.49000001.gz
I got following message:
[INFO] Loading LD file ../UKB_LD/chr19_41000001_44000001
[INFO] Done in 45.04 seconds
Traceback (most recent call last):
File "/group/tools/polyfun/finemapper.py", line 1129, in
verbose=args.verbose, ld_file=args.ld, debug_dir=args.debug_dir, allow_missing=args.allow_missing, susie_outfile=args.susie_outfile)
File "/group/tools/polyfun/finemapper.py", line 880, in finemap
self.sync_ld_sumstats(ld_arr, df_ld_snps, allow_missing=allow_missing)
File "/group/tools/polyfun/finemapper.py", line 275, in sync_ld_sumstats
raise ValueError(error_msg)
ValueError: not all SNPs in the sumstats file were found in the LD matrix! You could drop the missing SNPs with the flag --allow-missing, but please note that these omitted SNPs may be
I add --allow-missing option
python finemapper.py
--ld ../UKB_LD/chr19_41000001_44000001
--sumstats output/ea.bolt.19.snpvar_ridge_constrained.gz
--n 80000
--chr 19
--start 40853017
--end 41853107
--method finemap
--finemap-exe finemap/default/finemap
--max-num-causal 10
--cache-dir LD_cache
--out output/finemap.finemap_exe.1.46000001.49000001.gz
--allow-missing
I got:
[INFO] Loading sumstats file...
[INFO] Loaded sumstats for 272448 SNPs in 4.43 seconds
Traceback (most recent call last):
File "/group/tools/polyfun/finemapper.py", line 1129, in
verbose=args.verbose, ld_file=args.ld, debug_dir=args.debug_dir, allow_missing=args.allow_missing, susie_outfile=args.susie_outfile)
File "/group/tools/polyfun/finemapper.py", line 841, in finemap
raise ValueError('FINEMAP object does not support --allow-missing')
ValueError: FINEMAP object does not support --allow-missing
Best regards,
Yunling
Hi Omer,
I seem to be having an issue with pyarrow (on our cluster, but not on my laptop) in extract_snpvar.py
. On our cluster, I get the the error:
from pyarrow import ArrowIOError
ImportError: cannot import name 'ArrowIOError' from 'pyarrow' (unknown location)
Is there a specific version of pyarrow that works best? Here's my Conda env:
(echoR) [schilb03@li03c04 Fine_Mapping]$ conda list
# packages in environment at /sc/orga/work/schilb03/conda/envs/echoR:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 1_llvm conda-forge
bitarray 1.2.1 py38h516909a_0 conda-forge
ca-certificates 2019.11.28 hecc5488_0 conda-forge
certifi 2019.11.28 py38_0 conda-forge
decorator 4.4.1 py_0 conda-forge
fastparquet 0.3.3 py38hc1659b7_0 conda-forge
joblib 0.14.1 py_0 conda-forge
ld_impl_linux-64 2.33.1 h53a641e_8 conda-forge
libblas 3.8.0 14_openblas conda-forge
libcblas 3.8.0 14_openblas conda-forge
libffi 3.2.1 he1b5a44_1006 conda-forge
libgcc-ng 9.2.0 h24d8f2e_2 conda-forge
libgfortran-ng 7.3.0 hdf63c60_5 conda-forge
liblapack 3.8.0 14_openblas conda-forge
libllvm8 8.0.1 hc9558a2_0 conda-forge
libopenblas 0.3.7 h5ec1e0e_7 conda-forge
libstdcxx-ng 9.2.0 hdf63c60_2 conda-forge
llvm-openmp 9.0.1 hc9558a2_2 conda-forge
llvmlite 0.31.0 py38h8b12597_0 conda-forge
ncurses 6.1 hf484d3e_1002 conda-forge
networkx 2.4 py_0 conda-forge
numba 0.48.0 py38hb3f55d8_0 conda-forge
numpy 1.18.1 py38h95a1406_0 conda-forge
openssl 1.1.1d h516909a_0 conda-forge
pandas 1.0.1 py38hb3f55d8_0 conda-forge
pip 20.0.2 py_2 conda-forge
pyarrow 0.16.0 pypi_0 pypi
python 3.8.1 h357f687_2 conda-forge
python-dateutil 2.8.1 py_0 conda-forge
pytz 2019.3 py_0 conda-forge
readline 8.0 hf8c457e_0 conda-forge
scikit-learn 0.22.1 py38hcdab131_1 conda-forge
scipy 1.4.1 py38h921218d_0 conda-forge
setuptools 45.2.0 py38_0 conda-forge
six 1.14.0 py38_0 conda-forge
sqlite 3.30.1 hcee41ef_0 conda-forge
tabix 0.2.6 ha92aebf_0 bioconda
thrift 0.11.0 py38he1b5a44_1001 conda-forge
tk 8.6.10 hed695b0_0 conda-forge
tqdm 4.42.1 py_0 conda-forge
wheel 0.34.2 py_1 conda-forge
xz 5.2.4 h14c3975_1001 conda-forge
zlib 1.2.11 h516909a_1006 conda-forge
Thanks,
Brian
Hello, I want to use Polyfun to calculate the posterior probability, but when calculating the method 2 of the first step, the code always shows dimension error, I don't understand the relationship between dimension and sample size N, what size does this sample size N refer to?
I noticed in the original S-LDSC that a .results
file with the enrichment scores for each annotation is also produced. Is PolyFun also capable of producing this file (or something comparable)?
Thanks,
Brian
From the LDSC wiki:
This file has the results of the analysis in tab-delimited form. If any category contains all SNPs, then that category will not appear in this file. There is one row for each category and columns summarizing the results: Proportion of SNPs, Proportion of heritability, Enrichment, and standard errors. Enrichment is (Prop. heritability) / (Prop. SNPs). If you use the --print-coefficients flag, then there will also be columns for the regression coefficients. (See Finucane, Bulik-Sullivan et al., bioRxiv for a discussion of the relationship between the coefficients and proportions of heritability.)
Hi,
Thanks for creating this awesome tool. I am following the wiki for creating annotations (https://github.com/omerwe/polyfun/wiki/2.-Using-and-creating-functional-annotations).
I think this page is missing the fact that polyfun.py also looks for files in the form of annot..l2.M. The tutorial says we only need 22 files but we also need the .M file for each chromosome. polyfun.py will result in a FileNotFoundError if the .M file is not present.
Best,
Alan
In line 62 I noticed that when the columns are renamed, freq and MAF seem to be treated the same. But couldn't these two things be different in a summary stats file? Perhaps one way would be to check if 1-freq ≤ .5, and if it is then you know it's the minor allele (and then can flip the ref/alt alleles and effect, though the specifics of this might depend on the particular file format).
Currently, when Not all SNPs in the SNPs file were found in the meta file. Wrote a list of missing SNPs
appears, it registers as an error and stops the script. In my fork I modified this so that it simply removes the missing SNPs and continues (lines 99-102
).
munge_sumstats from LDSC handles (log) odds ratios in place of beta. Is it intentional to not handle this in PolyFun?
Thanks.
In the README, you say to use zcat to view .gz files, but only gzcat seems to work for me. Could this just be a Mac vs. Windows thing?
Versioning details:
Apple gzip 272.250.1
MacBook Pro, Mojave V10.14.5
The python module bitarray
also seems to be required for polyfun (in step 2 of Approach 2), but isn't mentioned in the README.
(Sorry, this one is my fault following #34...!)
When trying to run create_finemapper_jobs.py, it fails with the following error:
File "resources/finemapping/polyfun/create_finemapper_jobs.py", line 6, in
from polyfun import configure_logger, check_package_versions
File "/scratch/mdd-meta/resources/finemapping/polyfun/polyfun.py", line 13, in
from compute_ldscores_ukb import compute_ldscores_chr
ModuleNotFoundError: No module named 'compute_ldscores_ukb'
I think this is because "compute_ldscores_ukb.py" is now "compute_ldscores_from_ld.py"?
Hello,
Thank for create_finemapper_jobs.py, the genome wide finemapping become much easier. At the same time, I got email from system admin that I filled /tmp folder quickly with ld.npz files so nothing else will dispatch. I can clean up the /tmp folder after each job finished. I thought it might be a good idea to have codes to do clean up inside the finemapper.py. I used susie in the finemapper.py.
Best regards,
Yunling
Kind of a random (and possibly dumb) question; does it make any sense to run LDSC on a single user-defined locus (e.g. a locus from a GWAS, that spans anywhere between 500 and 5000 bps)? Or does LDSC always require genome-wide data?
Hey Omer,
It looks like susieR has been updated so that in the newest version susie_bhat() is no longer a function (it was merged into the new function susie_suff_stat ()
. Unfortunately this change causes an error in run_finemapper.py
:
python ./echolocatoR/tools/polyfun/run_finemapper.py --ld Data/GWAS/Nalls23andMe_2019/LRRK2/plink/chr12_40000001_43000001 --sumstats Data/GWAS/Nalls23andMe_2019/_genome_wide/PolyFun/output/PD_GWAS.12.snpvar_constrained.gz --n 1474097 --chr 12 --start 40515388 --end 40713899 --method susie --max-num-causal 10 --out Data/GWAS/Nalls23andMe_2019/LRRK2/finemap.UKBB.LRRK2.gz
*********************************************************************
* Fine-mapping Wrapper
* Version 1.0.0
* (C) 2019 Omer Weissbrod
*********************************************************************
[INFO] Loaded an LD matrix for 22522 SNPs from Data/GWAS/Nalls23andMe_2019/LRRK2/plink/chr12_40000001_43000001.npz
[INFO] Loading sumstats file...
[INFO] Loaded sumstats for 373589 SNPs
[1] "/usr/local/anaconda3/envs/polyfun_venv/lib/R"
[INFO] Removing 21839 SNPs in the LD matrix that are not in the sumstats file
[WARNING] Flipping the alleles of 380 SNPs in the LD matrix that are flipped compared to the sumstats file
[INFO] Starting functionally-informed SuSiE fine-mapping for chromosome 12 BP 40515388-40713899 (683 SNPs)
Traceback (most recent call last):
File "./echolocatoR/tools/polyfun/run_finemapper.py", line 118, in <module>
verbose=args.verbose, ld=ld, df_ld_snps=df_ld_snps)
File "/Users/schilder/Desktop/Fine_Mapping/echolocatoR/tools/polyfun/finemapper.py", line 492, in finemap
susie_obj = self.susieR.susie_bhat(
AttributeError: module 'susieR' has no attribute 'susie_bhat'
To handle this, in my R scripts I wrote a little snippet of code that searches for the new function and if it can't find it, uses the old one. Maybe you could include a pythonic version of this in your script?:
susie_func <- ifelse(length(find("susie_bhat"))==0,
susieR::susie_suff_stat, susieR::susie_bhat)
fitted_bhat <- susie_func(...)
Moreover, as a way of dealing with these little dependency/versioning issues, would it be possible for you to provide a yaml file that contains all of the python, R, and other packages necessary for running PolyFun? This would be a really great way to make the right Conda environment and get started with PolyFun right away. Here's one example from a related repo by OpenTargets.
Thanks,
Brian
Hi, I am running S-LDSC with baseline and my own annotations and am running into the following error:
overlap_matrix_prop[i, :] = overlap_matrix[i, :] / M_annot
ValueError: operands could not be broadcast together with shapes (1,121) (1,137)
Baseline has 16 annotations and my own annotations have 121. Both have the SNP,CHR,BP,A1,A2 columns. Both have .M files that agree with the number of annotations. Both have the same number of SNPs in each chromosome, and they are all tab-delimited.
There must be some mismatch between the two I suppose. I tried running each individually and there were no problems.
So I'm on step 2 of Polyfun approach 3 and I'm getting the following error (despite having the --allow-missing
flag on. I'm using 1KG Phase 1 as a reference (--bfile-chr
) at the moment. Do you know what might be going on?
python ./echolocatoR/tools/polyfun/polyfun.py --compute-ldscores --output-prefix ./Data/GWAS/Nalls23andMe_2019/_genome_wide/PolyFun/output/PD_GWAS --bfile-chr /sc/orga/projects/pd-omics/data/1000_Genomes/Phase1/1000G.mac5eur. --chr 12 --allow-missing
*********************************************************************
* PolyFun (POLYgenic FUNctionally-informed fine-mapping)
* Version 1.0.0
* (C) 2019 Omer Weissbrod
*********************************************************************
[WARNING] no ld-wind argument specified. PolyFun will use --ld-cm 1.0
Traceback (most recent call last):
File "./echolocatoR/tools/polyfun/polyfun.py", line 902, in <module>
polyfun_obj.polyfun_main(args)
File "./echolocatoR/tools/polyfun/polyfun.py", line 835, in polyfun_main
self.compute_ld_scores(args)
File "./echolocatoR/tools/polyfun/polyfun.py", line 697, in compute_ld_scores
df_ldscores_chr = self.compute_ldscores_plink_chr(args, chr_num, df_bins_chr)
File "./echolocatoR/tools/polyfun/polyfun.py", line 747, in compute_ldscores_plink_chr
raise ValueError('Not all SNPs were assigned a bin (meaning some SNPS are not in the annotation files)')
ValueError: Not all SNPs were assigned a bin (meaning some SNPS are not in the annotation files)
Hi!
I'm trying to run the following example from the Wiki page:
#run fine-mapper
python3 run_finemapper.py \
--ld LD_temp/chr1_46000001_49000001 \
--sumstats example_data/chr1.finemap_sumstats.txt.gz \
--n 383290 \
--chr 1 \
--start 46000001 \
--end 49000001 \
--method susie \
--max-num-causal 5 \
--out output/finemap.UKB.1.46000001.49000001.gz
...but I'm receiving the following error:
*********************************************************************
* Fine-mapping Wrapper
* Version 1.0.0
* (C) 2019 Omer Weissbrod
*********************************************************************
[INFO] Loaded an LD matrix for 19286 SNPs from LD_temp/chr1_46000001_49000001.npz
[INFO] Loading sumstats file...
[INFO] Loaded sumstats for 966 SNPs
Traceback (most recent call last):
File "run_finemapper.py", line 115, in <module>
cache_dir=args.cache_dir)
File "/Users/schilder/Desktop/Fine_Mapping/echolocatoR/tools/polyfun/finemapper.py", line 428, in __init__
self.RNULLType = rpy2.rinterface.RNULLType
AttributeError: module 'rpy2.rinterface' has no attribute 'RNULLType'
I checked and rpy2 is up to date for both Python 2 and 3.7 (the latter of which I'm using to run the command).
Here's my versioning details:
MacBook Pro, Mojave V10.14.5
rpy2==3.1.0
R version 3.6.1 (2019-07-05), Action of the Toes
Thanks in advance,
Brian
I am attempting to run step 3. Compute LD-scores for each SNP bin of PolyFun approach 3: Computing prior causal probabilities non-parametrically. My code is essentially the same as the example. The only difference is I have added the flag --allow-missing
mkdir -p LD_cache
python polyfun.py \
--compute-ldscores \
--output-prefix output/testrun \
--ld-ukb \
--ld-dir LD_cache \
--chr 1 \
--allow-missing
When I run this code with the current HEAD commit df5c072, I get the following error message:
Traceback (most recent call last):
File "bin/polyfun/polyfun.py", line 848, in <module>
polyfun_obj.polyfun_main(args)
File "bin/polyfun/polyfun.py", line 776, in polyfun_main
self.compute_ld_scores(args)
File "bin/polyfun/polyfun.py", line 637, in compute_ld_scores
df_ldscores_chr = compute_ldscores_chr(df_bins_chr, ld_dir)
File "/redacted/bin/polyfun/compute_ldscores_from_ld.py", line 193, in compute_ldscores_chr
if args.ukb:
File "/redacted/mambaforge/envs/polyfun/lib/python3.7/site-packages/pandas/core/generic.py", line 5179, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'ukb'
The issue appears to be that polyfun.py
passes a pandas DataFrame:
Line 637 in df5c072
when compute_ldscores_chr()
expects args
to be the first argument:
polyfun/compute_ldscores_from_ld.py
Line 190 in df5c072
I was going to attempt to fix this by passing args
to compute_ldscores_chr()
and replacing args.ukb
with args.ld_ukb
. However, I noticed that args.ukb
is valid when invoking compute_ldscores_from_ld.py
directly, as documented in Computing LD-scores with pre-computed UK Biobank LD matrices.
polyfun/compute_ldscores_from_ld.py
Line 302 in df5c072
Thus it's unclear to me how best to update compute_ldscores_chr()
to support both of these use cases. In case it could be helpful, I traced the function signature change from compute_ldscores_chr(df_annot_chr, ld_dir, no_cache=False)
to compute_ldscores_chr(args, df_annot_chr)
in commit b6674ce, which is the same commit where compute_ldscores_ukb.py
was renamed to compute_ldscores_from_ld.py
.
Hi Omer,
So I finally got my Conda issues sorted out! Since then things have been running pretty smoothly, until I got to this step. I'm realizing it's because I haven't provided any plink reference files.
In the documentation it says to use a closely related reference database (e.g. 1000 Genomes), but since I'm using UKBB for every other step, is there somewhere I can find the UKBB ref files? (either on your server or UKBB's).
Thanks in advance!
Brian
(polyfun_venv) [schilb03@lh03c03 Fine_Mapping]$ python ./echolocatoR/tools/polyfun/polyfun.py --compute-ldscores --output-prefix ./Data/GWAS/Nalls23andMe_2019/_genome_wide/PolyFun/output/PD_GWAS --bfile-chr /sc/orga/projects/pd-omics/tools/polyfun/annotations/baselineLF2.2.UKB/reference. --allow-missing
*********************************************************************
* PolyFun (POLYgenic FUNctionally-informed fine-mapping)
* Version 1.0.0
* (C) 2019 Omer Weissbrod
*********************************************************************
[WARNING] no ld-wind argument specified. PolyFun will use --ld-cm 1.0
Traceback (most recent call last):
File "./echolocatoR/tools/polyfun/polyfun.py", line 898, in <module>
check_files(args)
File "./echolocatoR/tools/polyfun/polyfun.py", line 130, in check_files
get_file_name(args, 'bim', chr_num, verify_exists=True)
File "./echolocatoR/tools/polyfun/polyfun.py", line 238, in get_file_name
raise IOError('%s file not found: %s'%(file_type, file_name))
OSError: bim file not found: /sc/orga/projects/pd-omics/tools/polyfun/annotations/baselineLF2.2.UKB/reference.1.bim
The very end of [this wiki page]((https://github.com/omerwe/polyfun/wiki/1.-Computing-prior-causal-probabilities-with-PolyFun) says "see below" but the documentation it's referring to is now on another page.
Hey Omer,
Finally getting around to trying the finemapper script. One thing that might be nice is being able to specify different locations/prefixes for LD and SumStats files (I noticed it uses ld_prefix for both). In my case, I have sum stats from the PD_GWAS and LD from UKBB.
E.g.:
def load_ld(ld_prefix, sumstats_prefix):
...
...
Thanks,
Brian
In regards to the--max-num-causal
argument described here:
From my understanding, I thought Susie took the max number of SNPs as a parameter and iteratively fine-maps all scenarios from 1 to max SNPs to identify the optimal number of causal SNPs for the given data. FINEMAP on the other hand takes the exact number of SNPs. Do I have this mixed up?
Hello,
I am trying to run the compute_ldscores.py to calculate LDscores for my annotations. I am get an error at this step [INFO] Applying initial ldscores loop, this is the error:
Traceback (most recent call last):
File "compute_ldscores.py", line 190, in
df_ldscores = compute_ldscores(args)
File "compute_ldscores.py", line 149, in compute_ldscores
ldscores = geno_array.ldScoreVarBlocks(block_left, args.chunk_size, annot=annot_values)
File "/home/doug/Documents/Polyfun/polyfun/ldsc_polyfun/ldscore.py", line 127, in ldScoreVarBlocks
return self.corSumVarBlocks(block_left, c, func, snp_getter, annot)
File "/home/doug/Documents/Polyfun/polyfun/ldsc_polyfun/ldscore.py", line 200, in corSumVarBlocks
cor_sum[l_A:l_A+b, :] += np.dot(rfuncAB, annot[l_B:l_B+c, :])
File "<array_function internals>", line 6, in dot
TypeError: can't multiply sequence by non-int of type 'float'
I am not sure what is causing this error, do you have any ideas?
This is the command I am using:
python compute_ldscores.py
--bfile ../AD31chr1polyfun
--annot ../gFIUDWchr1anno.gz
--out ../chr1.parquet
--ld-wind-kb 2000
Where AD31chrpolyfun is the stem for plink formatted bed,bim,fam files. The snp name format in the plink files is chr:pos:A1:A2, where A1 and A2 are organised alphabetically.
The format of gFIUDWchr1anno.gz looks like this
SNP CHR BP A1 A2 rsID Z N ANNO1 ANNO2
1:100000012:G:T 1 100000012 T G rs10875231 -1.2578195807 652942.54 1 0
Both files have the exact same snps in them. Any ideas what I can do to resolve the problem?
Cheers,
Doug
Hey Omer,
Awesome tool. Really enjoying it so far in our simulations. It would be great to include a flag to output the penalized taus, similar to the --print-coefficient
flag in LDSC.
Thanks!
While running finemapper.py, susieR sometimes crashes with the error:
rpy2.rinterface_lib.embedded.RRuntimeError: Error in (function (bhat, shat, R, n, var_y = 1, XtX, Xty, yty, maf = NULL, :
XtX is not a symmetric matrix.
I was able to fix it with the code below. It will be good to have a fix for this.
727 try:
728 #make LD matrix symmetric
729 ldm = np.triu(self.df_ld.values)
730 ldm = ldm + ldm.T - np.diag(np.diag(ldm))
731
732 susie_obj = self.susieR.susie_suff_stat(
733 bhat=bhat.reshape((m,1)),
734 shat=np.ones((m,1)),
735 R=ldm,
736 n=self.n,
737 L=num_causal_snps,
738 scaled_prior_variance=(0.0001 if (prior_var is None) else prior_var),
739 estimate_prior_variance=(prior_var is None),
740 residual_variance=(self.R_null if (residual_var is None) else residual_var),
741 estimate_residual_variance=(residual_var is None),
742 verbose=verbose,
743 prior_weights=(prior_weights.reshape((m,1)) if use_prior_causal_prob else self.R_null)
744 )
Hey Omer,
I've been looking for ways to save space on my computer when using the UKB LD files, so I extended the load_ld()
function in run_finemapper.py
to be able to download LD directly from the alkesgroup server.
I also added a helper function find_ld_prefix()
to identify which LD file you need based on some info.
Just wanted to share these in case you found them helpful to integrate into PolyFun.
Best,
Brian
import numpy as np
import pandas as pd
import logging
import os
import scipy.sparse as sparse
from io import BytesIO
import requests
def find_ld_prefix(chrom, min_pos):
# min_pos = 40590585; chrom=12
alkes_url = "https://data.broadinstitute.org/alkesgroup/UKBB_LD"
bp_starts = list(range(1, 252000001, 1000000))
bp_ends = [x+3000000 for x in bp_starts]
i = max([i for i,x in enumerate(bp_starts) if x<=min_pos])
prefix = "chr%d_%d_%d" % (chrom, bp_starts[i], bp_ends[i])
ld_prefix = os.path.join(alkes_url, prefix)
return ld_prefix
def load_ld(ld_prefix, server=True):
# ld_prefix="https://data.broadinstitute.org/alkesgroup/UKBB_LD/chr12_41000001_44000001"
# load SNPs info
snps_filename_parquet = ld_prefix + '.parquet'
snps_filename_gz = ld_prefix + '.gz'
if os.path.exists(snps_filename_parquet) | server==False:
df_ld_snps = pd.read_parquet(snps_filename_parquet)
elif os.path.exists(snps_filename_gz) | server==True:
df_ld_snps = pd.read_table(snps_filename_gz, delim_whitespace=True)
else:
raise ValueError('couldn\'t find SNPs file %s or %s' % (snps_filename_parquet, snps_filename_gz))
# load LD matrix
R_filename = ld_prefix + '.npz'
if server:
print("Reading .npz file from alkesgroup server")
r = requests.get(R_filename, stream=True)
R = sparse.load_npz(BytesIO(r.raw.read())).toarray()
else:
if not os.path.exists(R_filename):
raise IOError('%s not found' % (R_filename))
R = sparse.load_npz(R_filename).toarray()
R = R + R.T
assert np.allclose(np.diag(R), 1.0)
logging.info('Loaded an LD matrix for %d SNPs from %s' % (R.shape[0], R_filename))
# sanity checks
assert R.shape[0] == R.shape[1]
if R.shape[0] != df_ld_snps.shape[0]:
raise ValueError('LD matrix has a different number of SNPs than the SNPs file')
return R, df_ld_snps
It seems SNP was missing from the columns that get renamed. Added this.
lines 75-77
# BMS edit:: SNP col wasn't being renamed below
return df.rename( columns={snp_column: 'SNP', allele1_col: 'A1', allele0_col: 'A2', a1freq_col: 'MAF', bp_column: 'BP',
chr_column: 'CHR', info_col: 'INFO', beta_col: 'BETA', se_col: 'SE', pvalue_col: 'P', z_col: 'Z'}, errors='ignore')
PolyFun seems to be working fine on my laptop with the example annotations, but when I try to run it on our cluster with the full annotations I'm running into some errors I can't seem to figure out.
Input:
ml python/3.7.3 && python3 ./echolocatoR/tools/polyfun/munge_polyfun_sumstats.py --sumstats /sc/orga/projects/pd-omics/data/nallsEtAl2019/combined_meta/nallsEtAl2019_allSamples_allVariants.mod.txt --n 1474097 --out ./Data/GWAS/Nalls23andMe_2019/_genome_wide/PolyFun/sumstats_munged.parquet --min-info 0 --min-maf 0
Output:
[INFO] Reading sumstats file...
./echolocatoR/tools/polyfun/munge_polyfun_sumstats.py:201: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
df_sumstats = pd.read_table(args.sumstats, delim_whitespace=True)
[INFO] Done in 9.44 seconds
Traceback (most recent call last):
File "./echolocatoR/tools/polyfun/munge_polyfun_sumstats.py", line 205, in <module>
df_sumstats = rename_df_columns(df_sumstats, min_info_score=args.min_info, min_maf=args.min_maf)
File "./echolocatoR/tools/polyfun/munge_polyfun_sumstats.py", line 76, in rename_df_columns
chr_column:'CHR', info_col:'INFO', beta_col:'BETA', se_col:'SE', pvalue_col:'P', z_col:'Z'}, errors='ignore')
File "/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/pandas/util/_decorators.py", line 197, in wrapper
return func(*args, **kwargs)
File "/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/pandas/core/frame.py", line 4025, in rename
return super(DataFrame, self).rename(**kwargs)
File "/hpc/packages/minerva-centos7/py_packages/3.7/lib/python3.7/site-packages/pandas/core/generic.py", line 1072, in rename
'argument "{0}"'.format(list(kwargs.keys())[0]))
TypeError: rename() got an unexpected keyword argument "errors"
It seems do be related to the summ stats renaming but I can't find anything wrong with that part of the code.
are you planning to include the "polyfun version" of files related to --ref-ld-chr and --w-ld-chr in your download ftp site:
https://data.broadinstitute.org/alkesgroup/LDSCORE/
python polyfun.py
--compute-h2-L2
--no-partitions
--output-prefix output/testrun
--sumstats example_data/sumstats.parquet
--ref-ld-chr example_data/annotations.
--w-ld-chr example_data/weights.
I am trying to run LDSC with multiple annotation files. I would like to specify the set of annotations from each file but I believe there is a bug in the code that does not allow that to happen. I noticed in the function
read_ld_sumstats
function the following code is problematic:
if args.anno is not None:
cols_to_keep = np.zeros(len(ref_ld.columns), dtype=np.bool)
annotations = args.anno.split(',')
if np.any(~np.isin(annotations, ref_ld.columns.str[:-2])):
raise ValueError('Not all annotations specified with --anno are found in the LD scores file')
cols_to_keep = (ref_ld.columns.str[:-2].isin(annotations)) | (ref_ld.columns.isin(['CHR', 'SNP']))
assert np.sum(cols_to_keep) == len(annotations)+2
M_cols_to_keep = ref_ld.drop(columns=['CHR', 'SNP']).columns.str[:-2].isin(annotations)
assert np.sum(M_cols_to_keep) == len(annotations)
ref_ld = ref_ld.loc[:, cols_to_keep]
M_annot = M_annot[:, M_cols_to_keep]
log.log('Keeping only annotations specified with --anno')
the code
ref_ld.columns.str[:-2]
is a problem because if there are multiple files then the ref_ld matrix will have column names with something like "L2_0" or "L2_1" instead of just L2. This causes a mismatch in the annotation names and the column names in the ref_ld column. I thought I could modify it by just putting a -4 throughout that part but that seems to cause an error with the following function call
overlap_matrix, M_tot = _read_annot(args, log)
It's not clear how I can easily fix this problem. I'm wondering if you have any suggestions for this situation.
Chirag Lakhani
Hi Omer,
finemapper.py pads a 0 in front of chromosome number 1-9, this creates a snp mismatch between the bcor file and the input passed from finemapper.py script during the execution of FINEMAP. There should likely be a note in the Wiki about it.
If the .bcor file is prepared independent of finemapper.py then the chromosome position can be represented as 1 or 01.
if df_z['CHR'].iloc[0]<10: df_z['CHR'] = '0' + df_z['CHR'].astype(str)
hello there,
Thanks for this wonderful software and Wikipedia. I was able to install it seamlessly using conda.
I have GWAS summary statistics for an admixed minority population. The summary statistics follow CHR:POS:Allele1:Allele2 format. Data are imputed in-house, therefore SNP names follow this pattern than rsids.
I would like create parquet format for the summary stats file.
May I know what column headers are needed with munge_polyfun_sumstats.py
? Also, is there any specific order or columns that is must with the script?
with SNP rsids, chromosome and base pair info, and either a p-value, an effect size estimate and its standard error, a Z-score or a p-value.
Sample data from BOLT-LMM has columns as: SNP, CHR, BP, INFO, BETA, SE
Thanks.
Hi Omer,
There is this code block in the polyfun_utils.py that checks for indels.
def set_snpid_index(df, copy=False, allow_duplicates=False):
if copy:
df = df.copy()
is_indel = (df['A1'].str.len()>1) | (df['A2'].str.len()>1)
df['A1_first'] = (df['A1'] < df['A2']) | is_indel
df['A1s'] = df['A2'].copy()
the logic here creates new snpid index which in the case of indels fails when the those indels are flipped in the summary stats compared to the LD panel. example below:
snpid
22.51164013.T.TGTG 22_51164013_226995 22 51164013 T TGTG
this ends up throwing a warning X variants with sumstats were not found in the LD file and will be omitted
I am not sure if this is the intended behavior or a bug. If i comment out the is_indel in polyfun_utils.py then the finemapper.py identifies the flipped snps between the LD panel and the summary stats that are indels. Any help here is appreciated.
Best
Jerome
Hi Omer,
Susie code base has been improved and the current version adds a bunch of bug fix and performance improvements.
One of the things that has changed is the susie_bhat function
761 except:
762 susie_obj = self.susieR.susie_bhat(
763 bhat=bhat.reshape((m,1)),
764 shat=np.ones((m,1)),
765 R=self.df_ld.values,
766 n=self.n,
767 L=num_causal_snps,
768 scaled_prior_variance=(0.0001 if (prior_var is None) else prior_var),
769 estimate_prior_variance=(prior_var is None),
770 residual_variance=(self.R_null if (residual_var is None) else residual_var),
771 estimate_residual_variance=(residual_var is None),
772 verbose=verbose,
773 prior_weights=(prior_weights.reshape((m,1)) if use_prior_causal_prob else self.R_null)
774 )
which is removed from the package and the susie_rss is been used instead. This might throw an error for people using the newer susie version if the code exits with an exceptions
Should these be calculated as part of my compute_ldscore.py
run? I thought it was just something like the number of SNPs in each annotation, but I'm not sure what the equivalent is for continuous annotations. I'm not finding the answer easily in the LDSC wiki either. Thanks!
Hi,
I tried to run the sample data successfully. Then, I tried to analyze a significant hit with UKB data downloaded. After renaming the files according to example_data names, I am able to run the first step:
python munge_polyfun_sumstats.py
--sumstats chr19.sumstat.txt.gz
--n 637154
--out chr19.sumstat_munged.parquet
--min-info 0.6
--min-maf 0.001
python polyfun.py
--compute-h2-L2
--no-partitions
--output-prefix 19_40853017_41853107_sumstats_munged
--sumstats 19_40853017_41853107_sumstats_munged.parquet
--ref-ld-chr ../baselineLF2.2.UKB/annotations.
--w-ld-chr ../baselineLF2.2.UKB/weights.
python polyfun/polyfun.py \
--compute-h2-L2
--no-partitions
--output-prefix 19_40853017_41853107_sumstats_munged
--sumstats 19_40853017_41853107_sumstats_munged.parquet
--ref-ld-chr ../baselineLF2.2.UKB/annotations.
--w-ld-chr ../baselineLF2.2.UKB/weights.
[INFO] Reading summary statistics from 19_40853017_41853107_sumstats_munged.parquet ...
[INFO] Read summary statistics for 5042 SNPs.
[INFO] Reading reference panel LD Score from ../baselineLF2.2.UKB/annotations.[1-22] ...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 22/22 [05:01<00:00, 13.69s/it]
[INFO] Read reference panel LD Scores for 19386297 SNPs.
[INFO] Reading regression weight LD Score from ../baselineLF2.2.UKB/weights.[1-22] ...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 22/22 [02:08<00:00, 5.83s/it]
[INFO] Read regression weight LD Scores for 18275613 SNPs.
[INFO] After merging with reference panel LD, 4791 SNPs remain.
[INFO] After merging with regression SNP LD, 4784 SNPs remain.
[WARNING] number of SNPs is smaller than 200k; this is almost always bad.
Traceback (most recent call last):
File "polyfun/polyfun.py", line 848, in
polyfun_obj.polyfun_main(args)
File "polyfun/polyfun.py", line 772, in polyfun_main
self.polyfun_h2_L2(args)
File "polyfun/polyfun.py", line 594, in polyfun_h2_L2
self.run_ldsc(args, use_ridge=True, nn=False, evenodd_split=False, keep_large=False)
File "polyfun/polyfun.py", line 231, in run_ldsc
nnls_exact=args.nnls_exact
File "polyfun/ldsc_polyfun/regressions.py", line 414, in init
nnls_exact=nnls_exact
File "/group/tools/polyfun/ldsc_polyfun/regressions.py", line 257, in init
skip_ridge_jackknife=skip_ridge_jackknife, num_chr_sets=num_chr_sets)
File "polyfun/ldsc_polyfun/jackknife.py", line 582, in init
assert len(np.unique(chr_num)) > 1
AssertionError
Any suggestion?
I'm trying to understand how the regularization is affected by the magnitude of the coefficient for continuous annotations (which is affected by the scale at which the continuous annotation is reported). Should novel continuous annotations be normalized prior to inclusion in polyfun?
Thanks!
Gabe
My schizophrenia summary data does not have a MAF column. Can I simply set the MAF to 0?Is that going to make a big difference
Hello, I used Polyfun to get the prior probability files of more than 800,000 SNPs on 22 chromosomes of my schizophrenia, and then wanted to calculate the posterior probability. However, I see that the reference you gave is divided according to chromosome segments, are there some segments that don't get SNPs, resulting in? I used my prior probability file to test the FineMap method of Example 3 in Method 3 you gave, and the posterior probability was only 1 or 0. Could you tell me why? And I'm a little confused about the choice of parameters. Is there an efficient way to get the PIPs of all SNPs on 22 chromosomes?
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.