zhengxwen / hibag

R package – HLA Genotype Imputation with Attribute Bagging (development version only)

Home Page: https://hibag.s3.amazonaws.com/index.html

R 41.62% C++ 57.37% TeX 0.42% C 0.59%

hla mhc imputation snp r bioinformatics gpu

hibag's Introduction

HLA Genotype Imputation with Attribute Bagging

Kernel Version: 1.5

GNU General Public License, GPLv3

Features

HIBAG is a state-of-the-art software package for imputing HLA types from SNP data. It relies on a training set of HLA and SNP genotypes, but researchers can use published parameter estimates instead of needing access to large training datasets. It combines attribute bagging, an ensemble classifier method, with haplotype inference for SNPs and HLA types. Attribute bagging improves the accuracy and stability of classifier ensembles by using bootstrap aggregating and random variable selection.
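As a rough illustration of the two ingredients named above (bootstrap aggregating and random variable selection), the base-R sketch below draws, for each classifier, a bootstrap sample of individuals and a random subset of SNPs. It is a simplified sketch and not HIBAG's actual training code; the function name `draw_bagging_sets` and its parameters are hypothetical.

```r
# Hypothetical helper: for each classifier, draw a bootstrap sample of
# individuals (with replacement) and a random subset of SNPs (without replacement).
draw_bagging_sets <- function(n_samples, n_snps, n_classifiers, mtry) {
  lapply(seq_len(n_classifiers), function(i) {
    in_bag <- sample.int(n_samples, n_samples, replace = TRUE)
    list(
      in_bag     = in_bag,                                 # individuals used for fitting
      out_of_bag = setdiff(seq_len(n_samples), in_bag),    # individuals left out
      snps       = sort(sample.int(n_snps, mtry))          # candidate SNP indices
    )
  })
}

set.seed(1)
sets <- draw_bagging_sets(n_samples = 60, n_snps = 200, n_classifiers = 5, mtry = 14)
```

Each element of `sets` describes one ensemble member; the out-of-bag individuals are the ones a bagging method could use to estimate that member's accuracy.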

Bioconductor Package

Release Version: 1.40.0

http://www.bioconductor.org/packages/HIBAG/

Changes in Bioconductor Version (since v1.26.0, Y2020):

Kernel Version: v1.5
The kernel v1.5 generates the same training model as v1.4, but 2-6x faster, by taking advantage of Intel AVX, AVX2 and AVX512 intrinsics if available

Changes in Bioconductor Version (since v1.14.0, Y2017):

Kernel Version: v1.4
The kernel v1.4 outputs exactly the same model parameter estimates as v1.3, and the model training with v1.4 is 1.2 times faster than v1.3
Modify the kernel to support the GPU extension

Changes in Bioconductor Version (since v1.3.0, Y2013):

Kernel Version: v1.3
Optimize the calculation of hamming distance using SSE2 and hardware POPCNT instructions if available
Hardware POPCNT: 2.4x speedup for large-scale data, compared to the implementation in v1.2.4
SSE2 popcount implementation without hardware POPCNT: 1.5x speedup for large-scale data, compared to the implementation in v1.2.4
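
To make the idea behind the popcount optimization concrete, here is a minimal base-R sketch of a Hamming distance computed as XOR followed by a bit count. It only illustrates the arithmetic; the kernel itself does this in C++ with SSE2 or hardware POPCNT, and the helper names `popcount32` and `hamming_packed` are hypothetical.

```r
# Count the set bits of a 32-bit integer (a software "popcount").
popcount32 <- function(x) sum(as.integer(intToBits(x)))

# Hamming distance between two bit-packed integer vectors of equal length:
# XOR marks the differing bits, popcount counts them.
hamming_packed <- function(a, b) {
  stopifnot(length(a) == length(b))
  sum(vapply(bitwXor(a, b), popcount32, integer(1)))
}

d <- hamming_packed(c(5L, 0L), c(3L, 7L))
```

A hardware POPCNT instruction replaces the per-bit loop hidden inside `popcount32` with a single operation per machine word, which is where the reported speedup comes from.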

Package Author & Maintainer

Dr. Xiuwen Zheng

Pre-fit Model Download

https://hibag.s3.amazonaws.com/index.html
Platform-specific HLARES models: https://hibag.s3.amazonaws.com/hlares_index.html

Citation

Zheng, X. et al. HIBAG-HLA genotype imputation with attribute bagging. Pharmacogenomics Journal 14, 192-200 (2014). doi: 10.1038/tpj.2013.18

Zheng, X. (2018) Imputation-Based HLA Typing with SNPs in GWAS Studies. In: Boegel S. (eds) HLA Typing. Methods in Molecular Biology, Vol 1802. Humana Press, New York, NY. doi: 10.1007/978-1-4939-8546-3_11

Installation

Bioconductor repository:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("HIBAG")

Development version from Github (for developers/testers only):

library("devtools")
install_github("zhengxwen/HIBAG")

The install_github() approach requires that you build from source, i.e. make and compilers must be installed on your system -- see the R FAQ for your operating system; you may also need to install dependencies manually.

Acceleration

CPU with Intel Intrinsics

GCC (>= v6.0) is strongly recommended to compile the HIBAG package (Intel ICC is not suggested).
HIBAG::hlaSetKernelTarget("max") can be used to maximize the algorithm efficiency.

GPU with OpenCL

GPU acceleration is provided by the HIBAG.gpu package, which requires HIBAG (>= v1.28.0).

hibag's Issues

Imputed genotypes to train HIBAG Models?

Hello, it's not really an issue, but I wonder whether imputed genotypes can be used to train a HIBAG model. If so, what are the pros and cons of this approach compared with using genotyped SNPs? Thank you for this fantastic package.
Aditya

Amino Acid test with continuous outcome

I'm trying to use HIBAG with a continuous outcome for the allelic test and the amino acid test. The allelic test ran quickly with no issues. However, the amino acid test asked me to convert my outcome to a factor, then ran very slowly and produced no result. Does the amino acid test support continuous phenotypes?

In theory the "glm" model is supposed to handle this, but the documentation of HIBAG somehow makes me think it's not designed for AA test with continuous outcome.

Please let me know if I'm understanding it right.

Training in parallel causes error for loci "other" than HLA

I am trying to train HIBAG models for loci other than HLA. Training the models in parallel causes the following error:

> hlaAllele(true_kir_types_train$sample.id, H1=true_kir_types_train$allele1, H2=true_kir_types_train$allele2, locus="any")
....
....
....
> hlaParallelAttrBagging(cl, train.allele, traingeno, nclassifier=10, auto.save="output.RData")

Calculating matching proportion:
Error in hlaCombineAllele(res, rv[[i]]) : 
  H1$pos.start == H2$pos.start is not TRUE

However, if I change the locus parameter in hlaAllele to some HLA gene, no error is thrown:

> hlaAllele(true_kir_types_train$sample.id, H1=true_kir_types_train$allele1, H2=true_kir_types_train$allele2, locus="A")

hg38 hapmap data

In my recent usage of HIBAG, I have HLA-typed genetic data called against the hg38 reference genome. Though I was able to lift over the HapMap training data positions, I believe it would be much more accurate if the HapMap_CEU_Geno data were available on the hg38 assembly, especially since this assembly has been out for 10 years. When do you plan to add this data or make it available in the package?

hlaConvSequence function: No matching: 02:02

Dear Xiuwen,

Thank you so much for developing the wonderful HIBAG package.

I am currently using HIBAG to analyze the association between HLA and an outcome. However, I encountered an issue with the hlaConvSequence function, specifically an error indicating "No matching for HLA_DQB1*02:02." This results in missing sequences for the relevant samples, which leads to a reduction in my sample size.

Is there a way to resolve this issue to prevent the reduction in sample size? Any guidance or suggestions you can provide would be greatly appreciated.

Possible Incorrect Accuracy Estimates from HIBAG - Please help !

It seems that HIBAG does not include samples in the accuracy calculation when they carry alleles that are not found in the training model. For example, my training dataset has no copies of the A23:17 allele, but my test dataset has many copies of it. I see that every sample with at least one copy of A23:17 has been removed from the accuracy calculation. Is this intended, or am I missing something?
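
One way to quantify the situation described in this report is to count how many test samples carry an allele the training set never contained. The base-R sketch below uses made-up allele tables; the object names and values are hypothetical, and it makes no claim about how HIBAG itself computes accuracy.

```r
# Hypothetical data: alleles present in the training model and true test genotypes.
model_alleles <- c("01:01", "02:01", "03:01")
test_truth <- data.frame(
  sample.id = c("S1", "S2", "S3", "S4"),
  allele1   = c("01:01", "23:17", "02:01", "03:01"),
  allele2   = c("02:01", "01:01", "23:17", "03:01"),
  stringsAsFactors = FALSE
)

# Flag samples carrying at least one allele the model has never seen.
unseen <- !(test_truth$allele1 %in% model_alleles) |
          !(test_truth$allele2 %in% model_alleles)
n_unseen    <- sum(unseen)    # number of affected samples
frac_unseen <- mean(unseen)   # fraction of the test set affected
```

Reporting `frac_unseen` alongside any accuracy figure makes it clear how much of the test set the figure could not cover.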

missing SNPs

Hi,
I am trying to predict the HLA-B using the pre-fitted model and got the following output.
I do not understand why there are 72.6% missing SNPs for the Pos+Allele matching type. Is that a normal phenomenon, given the highly polymorphic properties in HLA regions?

My raw dataset was genotyped using Global Screening Array and I used the corresponding model.

Thank you in advance.

###############Output#########
HIBAG model for HLA-B:
500 individual classifiers
791 SNPs
88 unique HLA alleles: 07:02, 07:04, 07:05, ...
Prediction:
based on the averaged posterior probabilities
Model assembly: hg19, SNP assembly: hg19
Matching the SNPs between the model and the test data:
match.type="--" missing SNPs #
Position 26 (3.3%) being used [1]
Pos+Allele 574 (72.6%)* [2]
RefSNP+Position 27 (3.4%)
RefSNP 27 (3.4%)
[1]: useful if ambiguous strands on array-based platforms
[2]: suggested if the model and test data have been matched to the same reference genome
Model platform: Illumina 1M Duo / Infinium Global Screening Array
# of SNP loci with flipped alleles: 367
# of SNP loci with swapped strands: 365
# of samples: 4050
CPU flags: 64-bit
# of threads: 8
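
The "missing SNPs" percentages in output like the above can be reproduced in spirit by comparing the SNP positions a model expects with the positions present in the test data. The base-R sketch below uses invented positions and assumes both sets are on the same genome assembly; it is an illustration, not the matching logic used inside the package.

```r
# Hypothetical SNP positions (same assembly assumed) for a model and a test data set.
model_pos <- c(31321000, 31322500, 31324000, 31325500, 31327000)
test_pos  <- c(31322500, 31327000, 31330000)

# A model SNP counts as missing when its position is absent from the test data.
missing     <- !(model_pos %in% test_pos)
n_missing   <- sum(missing)
pct_missing <- 100 * mean(missing)
msg <- sprintf("%d of %d model SNPs missing (%.1f%%)",
               n_missing, length(model_pos), pct_missing)
```

A high percentage under position-plus-allele matching but a low one under position-only matching would point to allele coding or strand differences rather than to absent markers.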

"there are 0 individuals in common" and "IDs in PLINK bed are not unique!"

Hi, I have been trying to predict HLA allele type using HIBAG on two different datasets, one with all SNPs and the other with WGS data.

With the SNP dataset, I could not get the function hlaCompareAllele to work. The following is how I used the function;

> rv_ct0_sea730k <- hlaCompareAllele(true_b, hla_b_sea730k, call.threshold = 0)
Calling 'hlaCompareAllele': there are 0 individuals in common.

> rv_ct5_sea730k <- hlaCompareAllele(true_b, hla_b_sea730k, call.threshold = 0.5)
Calling 'hlaCompareAllele': there are 0 individuals in common.

I also tried training the data;

> sea730k_model <- hlaParallelAttrBagging(10, true_b, train.geno_sea730k, nclassifier = 100)
Error in .DynamicClusterCall(cl, fun = function(job, hla, snp, mtry, prune,  : 
  One node produced an error: There is no common sample between 'hla' and 'snp'.

With the WGS dataset, I also could not get hlaBED2Geno to work.

> geno_dusun <- hlaBED2Geno("BNF_HLA.bed","BNF_HLA.bim","BNF_HLA.fam", assembly = "hg38")
Open "BNF_HLA.bed" in the SNP-major mode.
Error in hlaBED2Geno("BNF_HLA.bed", "BNF_HLA.bim", "BNF_HLA.fam", assembly = "hg38") : 
  IDs in PLINK bed are not unique!

The WGS dataset was converted from vcf to plink format using the plink tool.

For both the WGS and SNP dataset, can I resolve this by adjusting the data to a certain format?

The SNP dataset is obtained from https://evolbio.ut.ee/SEA/ and
the WGS dataset is obtained from https://www.simonsfoundation.org/simons-genome-diversity-project/ with the focus being on the Southeast Asian two Dusun individuals.
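
Both error messages in this report concern sample identifiers, so a first diagnostic is to compare the ID vectors directly. The base-R sketch below uses invented IDs. Whether the reporter's data actually has a family-prefix mismatch or repeated individual IDs is an assumption made only for illustration.

```r
# Hypothetical sample IDs from an HLA allele table and from a genotype object.
hla_ids  <- c("HG001", "HG002", "HG003")
geno_ids <- c("FAM1-HG001", "FAM1-HG002", "FAM2-HG004")

# No overlap while the ID formats differ.
common <- intersect(hla_ids, geno_ids)

# Strip an assumed "family-" prefix and compare again.
geno_ids_stripped <- sub("^[^-]+-", "", geno_ids)
common_after <- intersect(hla_ids, geno_ids_stripped)

# Duplicate individual IDs, as might appear in the second column of a PLINK .fam file.
fam_iid <- c("0", "0", "S3")
dup_ids <- unique(fam_iid[duplicated(fam_iid)])
```

If `common` is empty but `common_after` is not, harmonizing the ID format on one side should give the two objects shared individuals.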

Difference between h.pval and others

Dear colleague,

Many thanks for the fantastic tool. I was wondering what the difference is between h.pval and the other p-values. I appreciate there are differences between the chi-squared and Fisher's tests, but it is unclear to me what h.pval means when doing HLA association testing.

Many thanks

HLA typing for SNP2HLA output

Hi, I wonder if I could feed the bed, bim and fam files output from SNP2HLA into HIBAG to get the HLA class I and II typing.

Thank you in advance!

[Installation Error for HIBAG.gpu] Cannot find/define the path for opencl.h

Dear HIBAG author,

I have encountered a problem in order to install your HIBAG.gpu version of the package onto a supercomputer (reposted at zhengxwen/HIBAG.gpu#2).

The issue is defining the path for opencl.h file. Based on the documentation, I cannot find any place where I can define the path of opencl.h file. Is there a way to resolve this?

Here is the error log:

/usr/apps/general/spack/sw/linux-rhel8-zen/gcc-8.5.0/gcc-11.3.0-
cwx43q6qt46zl5olgckurx67xtg4nuyd/bin/g++ -std=gnu++14 -
I"/usr/apps/general/spack/sw/linux-rhel8-zen/gcc-11.3.0/r-4.2.0-
2liuw4vmic27cmqhyyt6jmvwbezn6mlx/rlib/R/include" -DNDEBUG  -
I'/usr/apps/general/spack/sw/linux-rhel8-zen/gcc-11.3.0/r-4.2.0-
2liuw4vmic27cmqhyyt6jmvwbezn6mlx/rlib/R/library/HIBAG/include' -
I/usr/local/include   -fpic  -g -O2  -c LibHLA_gpu.cpp -o LibHLA_gpu.o
In file included from LibHLA_gpu.cpp:37:
LibOpenCL.h:28:17: fatal error: CL/opencl.h: No such file or directory
   28 | #       include <CL/opencl.h>
      |                 ^~~~~~~~~~~~~
compilation terminated.
make: *** [/usr/apps/general/spack/sw/linux-rhel8-zen/gcc-11.3.0/r-4.2.0-
2liuw4vmic27cmqhyyt6jmvwbezn6mlx/rlib/R/etc/Makeconf:177: LibHLA_gpu.o] 
Error 1

Thanks!

VCF files

Hello!

Many thanks for the great package!
How can I extract a VCF with an INFO score from the HLA prediction output?

cl <- makeCluster(20)
set.seed(1000)
parseCommandArgs(evaluate=TRUE)
model.obj <- get(load(file1))
model <- hlaModelFromObj(model.obj)
summary(model)
p1=plot(model)
yourgeno <- hlaBED2Geno(bed.fn=bed.file, fam.fn=fam.file, bim.fn=bim.file)
summary(yourgeno)
pred.guess <- hlaPredict(model, yourgeno, match.type="Position")

summary(pred.guess)

hlaAlleleToVCF(hlaAlleleSubset(pred.guess, 1:4),DS=TRUE, verbose=TRUE, outfn=vcf.out)

save(pred.guess, p1, file=file2)

After I execute this script I get a VCF file with dosages but no INFO score.

Many thanks!

Issue with hlaAttrBagClass

Dear Xiuwen Zheng

I installed the HIBAG package yesterday. Unfortunately I am stuck on the example provided on the webpage: my R session hangs when it reaches hlaAttrBagClass.

Some info on my R session.

It would be great to have your help!

Thanks a lot,

Nicolas

Association testing across all HLA type simultaneously

Many thanks for this fantastic tool.

I was wondering if you had a suggestion for testing all the inferred HLA types simultaneously in a case-control cohort. The vignette looks at only a single HLA type at a time. I could run the script for each HLA type manually and then correct the p-values for multiple testing myself, but I was wondering whether such a tool is implemented internally.

All the best
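
The manual route mentioned in this question (test each locus separately, then correct the pooled p-values) needs only base R. The sketch below uses invented p-values and hypothetical allele labels; `p.adjust` is the standard function from the stats package.

```r
# Hypothetical p-values pooled from per-locus association tests.
pvals <- c(A_0101 = 0.0004, A_0206 = 0.023, B_0702 = 0.31, DRB1_1501 = 0.02)

# Family-wise (Bonferroni) and false-discovery-rate (Benjamini-Hochberg) corrections.
p_bonf <- p.adjust(pvals, method = "bonferroni")
p_fdr  <- p.adjust(pvals, method = "BH")

# Alleles that remain significant at 0.05 after Bonferroni correction.
signif_bonf <- names(pvals)[p_bonf < 0.05]
```

Pooling all loci before calling `p.adjust` matters: correcting within each locus separately would understate the number of tests performed.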

Old aa alignment files

The package is shipped with aa alignment files from 2015. Would it be possible to update these files to current release?
Cheers

Invalid prefix in the PLINK BED file error message

The hlaBED2Geno function fails when I try to read in the BED file. If I use it as written, I get the error message: Error in hlaBED2Geno(fam.fn = ".fam", bim.fn = ".bim", bed.fn = ".bed") :
Cannot open the file .bed.

If I name the bed.fn separately, I get the message in the subject line. PLINK has no issues with the BED file. I made a PED and MAP file and then remade the BED file.

Thank you for any assistance.
Martha Butterworth

The numbers of SNPs are not consistent

Hi, I am using HIBAG for HLA imputation. My dataset is GSA-genotyped data.

I found that when I use the function "hlaBED2Geno" to import the genotype data, there are around 8,000 SNPs in the MHC region. But when I use PLINK to extract the MHC region from my genotype data, there are around 40,000 SNPs.

I am wondering where the discrepancy comes from. And for the HLA Manhattan plot, I am trying to merge the SNPs in the MHC region with imputed HLA alleles. Is there any function that I can use in HIBAG to accomplish that?

Thank you in advance!

Force monomorphic SNPs in model?

I have a dataset that I am training a HIBAG model on, which I would then like to combine with an existing HIBAG model so that I can incorporate HLA alleles outside of my training set. In order to do so, I have set the snpid in my training data to the pre-existing model's snp.id set (N = 966 SNPs). I can confirm that the length of my train.geno$snp.sel is in fact 966, but when I begin to train my HIBAG model it immediately removes 6 SNPs with the line: Exclude 6 monomorphic SNPs

Removing monomorphic SNPs before training a HIBAG model makes plenty of sense, but is there any parameter that allows me to force these SNPs into the model? Without those 6 SNPs in my model, I cannot combine my own trained model with the other model of interest. Here is the error code I receive: Error: identical(obj1$snp.id, obj2$snp.id) is not TRUE, where the only differences between the snp.id values are the 6 monomorphic SNP sites. Is there some way to force monomorphic SNPs into the model so that I can ultimately combine them?
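
To see in advance which SNPs would be dropped as monomorphic, the genotype matrix can be screened before training. The base-R sketch below assumes a matrix with SNPs in rows, samples in columns, and allele counts 0, 1, 2 or NA as entries; the data are invented and the check is an illustration, not the package's own filter.

```r
# Hypothetical genotype matrix: rows are SNPs, columns are samples,
# entries are allele counts (0, 1, 2) or NA.
geno <- rbind(
  rs1 = c(0, 1, 2, 1),
  rs2 = c(2, 2, 2, NA),
  rs3 = c(0, 0, 0, 0),
  rs4 = c(1, NA, 0, 2)
)

# A SNP is treated as monomorphic when its non-missing genotypes are all 0 or all 2.
is_mono <- apply(geno, 1, function(g) {
  g <- g[!is.na(g)]
  length(g) == 0 || all(g == 0) || all(g == 2)
})
mono_ids <- rownames(geno)[is_mono]
```

Knowing `mono_ids` up front shows exactly which SNP IDs would differ between the newly trained model and the pre-existing one.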

association test

Hi Xiuwen,
I would like to ask how to interpret the result of the association test. Is it necessary to calculate Bonferroni-corrected p-values?
I also have a question about converting alleles to PLINK format. If I understand correctly, genotype information is needed in this process. Is it possible to convert an hlaAllele object to PLINK format directly?
Thank you in advance!

Error Installing HIBAG

On R 4.1.2. Have tried to install using BiocManager, using the dev version, and from source. Each time have the same error:

LibHLA_ext_avx512vpopcnt.cpp: In function ‘int hamm_d(const TGenoStruct_512vpopcnt&, const HLA_LIB::THaplotype&, const HLA_LIB::THaplotype&)’:
LibHLA_ext_avx512vpopcnt.cpp:173:15: error: ‘_mm_popcnt_epi64’ was not declared in this scope

Can anyone suggest how to install?

Question: Genotype QC for training

Since many SNPs in the HLA region typically don't obey HWE, do you recommend skipping the HWE filter in the quality control steps before training? Can you advise, please? Thank you.

HLA prediction quality

Is there any way to obtain the quality (r2 or INFO) of imputing HLA-* ?
So far, after doing the prediction I can only see the alleles and only one probability.

More than 50% of SNPs are missing

Hi, I am currently working on the HLA imputation of a Norwegian cohort, using SNPs. Every time I launch the HIBAG R script, whether I use the pre-fit models or build and predict in parallel, I always get the warning “More than 50% of SNPs are missing!”. I have checked the input PLINK files I use, and they contain most of SNPs IDs (252/275) coming from the HapMap_CEU_Geno$snp.id list. Could you tell me what can trigger this warning to appear in the R code you wrote, so I can correct my use of your software?
Thank you.

Error loading sample pre-built models

I'm trying to get Hibag working with just the pre-built example data to get myself started and I'm already stuck. On Hibag v1.20 running on R 3.5.1, this is what I am experiencing:

library (HIBAG)
HIBAG (HLA Genotype Imputation with Attribute Bagging)
Kernel Version: v1.4
Supported by Streaming SIMD Extensions (SSE2) [64-bit]
model_obj <- get (load ("European-HLA4-hg19.RData"))
model <- hlaModelFromObj (model_obj)
Error in hlaModelFromObj(model_obj) :
inherits(obj, "hlaAttrBagObj") is not TRUE

I tried "solving the problem myself" by making a copy of the hlaModelFromObj function and removing the stopifnot () check at the beginning, but that did not work. I've unfortunately reached the level of my R debugging skills. I can look into model_obj and do a summary () of it and the data seems valid.

If you can tell me what I am doing wrong, feel free to let me know.

Thanks!
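
One possible explanation, offered here as an assumption rather than a confirmed diagnosis, is that a multi-locus model file holds a named list of per-locus model objects, so the loaded object itself does not have the class `hlaAttrBagObj`. The base-R sketch below reproduces that situation with mock objects (not real models) to show how the check behaves.

```r
# Stand-in for an object loaded from a multi-locus model file: a named list
# whose elements carry the class "hlaAttrBagObj" (mock objects, not real models).
model_list <- list(
  A = structure(list(hla.locus = "A"), class = "hlaAttrBagObj"),
  B = structure(list(hla.locus = "B"), class = "hlaAttrBagObj")
)

loci        <- names(model_list)                            # loci stored in the file
whole_ok    <- inherits(model_list, "hlaAttrBagObj")        # the list itself
element_ok  <- inherits(model_list[["A"]], "hlaAttrBagObj") # a single locus
```

If `names(model_obj)` in the real session lists loci in the same way, passing a single element such as `model_obj[["A"]]` to hlaModelFromObj would be the thing to try.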

glm does not work

Hi, the association analysis reported "glm does not work". How can I fix it?

hlaAssocTest(hla_a, Outcome ~ h, data=HLA_A, prob.threshold=0.5,showOR=TRUE,model = "additive")

Logistic regression (additive model) with 983 individuals:
Warning in hlaAssocTest.hlaAlleleClass(hla_a, Outcome ~ h, data = HLA_A, :
glm does not work.
       [-]  [h] %.[-] %.[h]  chisq.st chisq.p fisher.p
01:01 1636  330  54.5  44.2 1.124e+01 <0.001*  <0.001*
02:06 1961    5  52.9   0.0 3.684e+00   0.055   0.023*

hlaAssocTest()

Dear Xiuwen,

I am running an association test between disease and HLA. When I run hlaAssocTest(), I get the following message:
"Warning in hlaAssocTest.hlaAlleleClass(hla, disease ~ h, data = df_short) :
glm does not work."

I found that the HLA types shipped with the package are from 2015. Could the problem be that some of my HLA alleles are not in the package's database?
Would it be right to just add the updated HLA alleles to HIBAG/extdata/v3.22.0/hla_nom_g.txt?
Do you know why glm might not work?

thank you!

Installation failed: Not Found (404) with install_github("zhengxwen/HIBAG.gpu")

Hi,

I followed the Acceleration section of README.md to use the GPU with OpenCL, but ran into a problem:

> install_github("zhengxwen/HIBAG.gpu")
Downloading GitHub repo zhengxwen/HIBAG.gpu@master
from URL https://api.github.com/repos/zhengxwen/HIBAG.gpu/zipball/master
Installation failed: Not Found (404)

> packageVersion("HIBAG")
[1] ‘1.17.2’
>

Is there anything I'm doing wrong?