
genepi's Introduction

GenEpi

GenEpi is a package that uses a machine learning approach to uncover epistasis associated with phenotypes. It was developed by Yu-Chuan Chang at c4Lab of National Taiwan University and Taiwan AI Labs.

The architecture and modules of GenEpi.

Introduction

GenEpi is designed to group SNPs by a set of loci in the genome. For example, a locus could be a gene; in other words, we use gene boundaries to group SNPs. A locus can be generalized to any particular region in the genome, e.g. promoters, enhancers, etc. In the first stage, GenEpi considers the genetic variants within a particular region as features, because it is believed that SNPs within a functional region might have a higher chance of interacting with each other and influencing molecular functions.

GenEpi adopts two-element combinatorial encoding when producing features and models them by L1-regularized regression with stability selection.

In the first stage (STAGE 1) of GenEpi, the genotype features from each single gene are combinatorially encoded and modeled independently by L1-regularized regression with stability selection. In this way, we can estimate the prediction performance of each gene and detect within-gene epistasis with a low false positive rate. In the second stage (STAGE 2), both the individual SNP features and the within-gene epistasis features selected by STAGE 1 are pooled together to generate cross-gene epistasis features, which are modeled again by L1-regularized regression with stability selection, as in STAGE 1. Finally, the user can combine the selected genetic features with environmental factors such as clinical features to build the final prediction models.
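As a rough illustration of the two-element combinatorial encoding described above, the sketch below turns per-sample genotype calls into binary single-SNP features and pairwise features. The function name `encode_pairs`, the genotype labels AA/AB/BB, and the feature naming are assumptions made for illustration; they are not taken from the GenEpi source. The regression and stability selection steps are not shown.

```python
from itertools import combinations, product

GENOTYPES = ("AA", "AB", "BB")  # assumed biallelic genotype labels


def encode_pairs(genotypes):
    """Encode genotype calls as binary single-SNP and pairwise features.

    genotypes: dict mapping an RSID to a list of genotype labels, one per sample.
    Returns a dict mapping a feature name to a list of 0/1 values per sample.
    """
    features = {}
    # Single-SNP features: one indicator per (SNP, genotype).
    for rsid, calls in genotypes.items():
        for g in GENOTYPES:
            features[f"{rsid}_{g}"] = [int(c == g) for c in calls]
    # Pairwise features: one indicator per genotype combination of two SNPs.
    for (rs1, calls1), (rs2, calls2) in combinations(genotypes.items(), 2):
        for g1, g2 in product(GENOTYPES, repeat=2):
            name = f"{rs1}_{g1} {rs2}_{g2}"
            features[name] = [int(a == g1 and b == g2) for a, b in zip(calls1, calls2)]
    return features


# Toy data with made-up SNP names and four samples.
toy = {
    "rs1": ["AA", "AB", "BB", "AA"],
    "rs2": ["BB", "AB", "BB", "AA"],
}
encoded = encode_pairs(toy)
print(len(encoded))               # 2 SNPs * 3 + 1 pair * 9 = 15 features
print(encoded["rs1_BB rs2_BB"])   # [0, 0, 1, 0]
```

Each pairwise feature is 1 only for samples that carry both genotypes, which is the kind of feature reported in the results table as two RSID_genotype entries.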

Standalone App

(Latest Update!) The standalone, installation-free app AppGenEpi (v.beta) is now released. Just download it and have fun.

OS      Version    Link
MacOS   Catalina   AppGenEpi_MacOS_beta
Linux   CentOS 7   AppGenEpi_Linux_beta

The snapshot of AppGenEpi.

For MacOS

1) Unzip AppGenEpi_MacOS_beta.zip; 2) drag AppGenEpi.app to Applications; 3) allow AppGenEpi.app to run under System Preferences > Security & Privacy (we are not an identified developer so far).

For Linux

1) Change the directory to AppGenEpi; 2) run ./AppGenEpi.

Citing

Please consider citing the following paper if you use GenEpi in a scientific publication:

[1] Yu-Chuan Chang, June-Tai Wu, Ming-Yi Hong, Yi-An Tung, Ping-Han Hsieh, Sook Wah Yee, Kathleen M. Giacomini, Yen-Jen Oyang, and Chien-Yu Chen. "GenEpi: Gene-Based Epistasis Discovery Using Machine Learning." BMC Bioinformatics 21, 68 (2020). https://doi.org/10.1186/s12859-020-3368-2

Quickstart

This section gets you started quickly. For the complete GenEpi documentation, see Welcome to GenEpi’s docs!

Installation

$ pip install GenEpi

NOTE: GenEpi is a memory-consuming package, which might cause memory errors when calculating the epistasis of a gene containing a large number of SNPs. We recommend more than 256 GB of memory for running GenEpi.
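To get a feel for why memory use grows quickly with the number of SNPs in a gene, the following back-of-the-envelope sketch counts the features produced by pairwise encoding. It assumes three genotypes per SNP and one byte per cell of a dense matrix; both are simplifying assumptions, not measurements of GenEpi's actual memory use.

```python
from math import comb


def pairwise_feature_count(n_snps, n_genotypes=3):
    """Number of single-SNP plus pairwise genotype-combination features."""
    single = n_snps * n_genotypes
    pairs = comb(n_snps, 2) * n_genotypes ** 2
    return single + pairs


def dense_matrix_gib(n_samples, n_features, bytes_per_cell=1):
    """Size of a dense samples-by-features matrix in GiB."""
    return n_samples * n_features * bytes_per_cell / 2 ** 30


# Hypothetical cohort of 1000 samples and genes of increasing size.
for n in (10, 100, 1000):
    n_feat = pairwise_feature_count(n)
    print(n, n_feat, round(dense_matrix_gib(1000, n_feat), 3))
```

The pairwise term grows quadratically, so a gene with ten times as many SNPs produces roughly a hundred times as many features.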

Running a quick test

Please use the following command to run a quick test; you will obtain all the outputs of GenEpi in your current folder.

$ GenEpi -g example -p example -o ./

Interpreting the main results table

GenEpi will automatically generate three folders (snpSubsets, singleGeneResult, crossGeneResult) beside your .GEN file. Go directly to the crossGeneResult folder to find the main epistasis table in Result.csv.

RSID                          -Log10(χ2 p-value)  Odds Ratio  Genotype Frequency  Gene Symbol
rs157580_BB rs2238681_AA      8.4002              9.3952      0.1044              TOMM40
rs449647_AA rs769449_AB       8.0278              5.0877      0.2692              APOE
rs59007384_BB rs11668327_AA   8.0158              12.0408     0.0824              TOMM40
rs283811_BB rs7254892_AA      8.0158              12.0408     0.0824              PVRL2
rs429358_AA                   5.7628              0.1743      0.5962              APOE
rs73052335_AA rs429358_AA     5.6548              0.1867      0.5714              APOC1*APOE

The first column lists each feature by its RSID and genotype (denoted as RSID_genotype); pairwise epistasis features are represented by two SNPs. The last column gives the genes in which the SNPs are located according to their genomic coordinates; a star sign denotes epistasis between genes. The p-values of the χ2 test are also included (for quantitative tasks, Student's t-test is used instead). An odds ratio far from 1 indicates whether a feature is a potentially causal or protective genotype. Since a low genotype frequency may cause unreliable odds ratios, the genotype frequency is also listed in the table.
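The columns can be read with a little arithmetic. The sketch below converts a -log10 p-value back to a p-value and computes an odds ratio from a 2x2 count table. The counts are made up for illustration and the function names are hypothetical; this is not GenEpi's own code.

```python
def p_from_neg_log10(neg_log10_p):
    """Convert a -log10(p) score back to the p-value."""
    return 10.0 ** (-neg_log10_p)


def odds_ratio(case_with, case_without, control_with, control_without):
    """Odds ratio of carrying a feature in cases versus controls."""
    return (case_with * control_without) / (case_without * control_with)


# First row of the table: -log10(p) = 8.4002
print(f"{p_from_neg_log10(8.4002):.2e}")      # 3.98e-09

# Made-up counts: 30 of 100 cases and 10 of 100 controls carry the feature.
print(round(odds_ratio(30, 70, 10, 90), 4))   # 3.8571
```

An odds ratio above 1 means the feature is more common among cases, and one below 1 means it is more common among controls.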

Options

To list all the optional arguments, use --help:

$ GenEpi --help

You will obtain the following argument list:

usage: GenEpi [-h] -g G -p P [-s S] [-o O] [-m {c,r}] [-k K] [-t T]
              [--updatedb] [-b {hg19,hg38}] [--compressld] [-d D] [-r R]

optional arguments:
  -h, --help      show this help message and exit
  -g G            filename of the input .gen file
  -p P            filename of the input phenotype
  -s S            self-defined genome regions
  -o O            output file path
  -m {c,r}        choose model type: c for classification; r for regression
  -k K            k of k-fold cross validation
  -t T            number of threads

update UCSC database:
  --updatedb      enable this function
  -b {hg19,hg38}  human genome build

compress data by LD block:
  --compressld    enable this function
  -d D            threshold for compression: D prime
  -r R            threshold for compression: R square
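When GenEpi is launched from a script, the options above can be assembled into an argument list. The sketch below is a hypothetical helper (`build_genepi_command` is not part of GenEpi) that only uses flags shown in the help text; it builds the command but does not run it.

```python
import shlex


def build_genepi_command(gen, phenotype, output="./", model="c", k=None, threads=None):
    """Build a GenEpi command line from the options listed in --help."""
    if model not in ("c", "r"):
        raise ValueError("model must be 'c' (classification) or 'r' (regression)")
    cmd = ["GenEpi", "-g", gen, "-p", phenotype, "-o", output, "-m", model]
    if k is not None:
        cmd += ["-k", str(k)]          # k of k-fold cross validation
    if threads is not None:
        cmd += ["-t", str(threads)]    # number of threads
    return cmd


cmd = build_genepi_command("example", "example", k=2, threads=4)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where GenEpi is installed.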

Meta

Chester (Yu-Chuan Chang) - [email protected]
Distributed under the MIT license. See LICENSE for more information.
https://github.com/Chester75321/GenEpi/

genepi's Issues

Cannot Install GenEpi

I am getting some errors while installing GenEpi:

Could not find a version that satisfies the requirement PyQt5-Qt5>=5.15.2 (from PyQt5>=5.14.0->GenEpi) (from versions: )
No matching distribution found for PyQt5-Qt5>=5.15.2 (from PyQt5>=5.14.0->GenEpi)

Please guide me.

Inquiry about the datasets.

Hello, I've been studying your paper and code and it is very interesting. I think that I understand the general concepts but I'm very new to working with genetics data so I have a few questions.

Could you please point me to the source of the example data? (sample.gen and sample.csv.)

I understand that sample.gen is the genotype data and sample.csv is the phenotype data, but I've yet to find the source of either of them.

From what I've managed to figure out so far:
Genotype data is: [chromosome_num, SNP_reference_sequence_id, position_on_chromosome, base_pairs?, ?].
Phenotype data is : [?, Odds ratio, class]

I've managed to get similar data to yours by using the GAMETES software that you mentioned in your paper, but a lot of data that is present in the example inputs is missing.

Thank you for your time.

EDIT: I found the documentation
https://genepi.readthedocs.io/en/latest/format.html#input-genotype-data
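For readers with the same question about the genotype columns, here is a minimal sketch of reading one line of a .gen file. It assumes the usual Oxford-style layout: five leading fields followed by three genotype probabilities (AA, AB, BB) per sample. The helper names and the example line are made up, so check the linked documentation for the exact format GenEpi expects.

```python
def parse_gen_line(line):
    """Parse one line of an Oxford-style .gen file.

    Assumed layout: five leading fields (chromosome or SNP ID, RSID, base-pair
    position, allele A, allele B) followed by three genotype probabilities
    (AA, AB, BB) per sample.
    """
    fields = line.split()
    chrom, rsid, pos, allele_a, allele_b = fields[:5]
    probs = [float(x) for x in fields[5:]]
    if len(probs) % 3 != 0:
        raise ValueError("expected three probabilities per sample")
    samples = [tuple(probs[i:i + 3]) for i in range(0, len(probs), 3)]
    return {
        "chrom": chrom, "rsid": rsid, "pos": int(pos),
        "alleles": (allele_a, allele_b), "samples": samples,
    }


def hard_call(triplet, labels=("AA", "AB", "BB")):
    """Return the most probable genotype label for one sample."""
    return labels[max(range(3), key=lambda i: triplet[i])]


# Made-up line with three samples.
record = parse_gen_line("19 rsExample 12345 T C 1 0 0 0 1 0 0 0 1")
print(record["rsid"], [hard_call(t) for t in record["samples"]])
```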

Problem downloading GenEpi

Hi,
I have a problem when I download your software.
I got an error message: No matching distribution found for matplotlib>=3.1.1.

The matplotlib in my Python 3.7 has been updated to 3.0.3, and I cannot get the newest version.
My Ubuntu version is 16.04.
What can I do to download GenEpi correctly?

ModuleNotFoundError in randomized_l1.py

Hi,
I keep getting the error below. None of the sklearn modules imported in randomized_l1.py can be loaded (ModuleNotFoundError).

Is it because there is a circular dependency somewhere?

(YOUR_PYTHON) [yrj21@login-e-9 GenEpi]$ GenEpi
Traceback (most recent call last):
  File "path/bin/GenEpi", line 5, in <module>
    from genepi.GenEpi import main
  File "path/lib/python3.6/site-packages/genepi/__init__.py", line 15, in <module>
    from .step4_singleGeneEpistasis_Logistic import SingleGeneEpistasisLogistic
  File "path/python3.6/site-packages/genepi/step4_singleGeneEpistasis_Logistic.py", line 34, in <module>
    from genepi.tools import randomized_l1
  File "path/python3.6/site-packages/genepi/tools/__init__.py", line 9, in <module>
    from . import randomized_l1
  File "path/python3.6/site-packages/genepi/tools/randomized_l1.py", line 24, in <module>
    from sklearn.linear_model.base import _preprocess_data
ModuleNotFoundError: No module named 'sklearn.linear_model.base'
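This traceback is consistent with a scikit-learn release in which `sklearn.linear_model.base` is no longer importable (the module was renamed to the private `sklearn.linear_model._base`), rather than with a circular dependency. A minimal sketch of a compatibility import follows; `import_first` is a hypothetical helper, and because `_preprocess_data` is a private function its signature may differ between releases, so installing an older scikit-learn may be the safer route.

```python
import importlib


def import_first(candidates, attr):
    """Return `attr` from the first importable module in `candidates`, else None."""
    for module_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue
        if hasattr(module, attr):
            return getattr(module, attr)
    return None


# Newer scikit-learn keeps this helper in the private module `_base`;
# older releases exposed it as `sklearn.linear_model.base`.
_preprocess_data = import_first(
    ["sklearn.linear_model._base", "sklearn.linear_model.base"],
    "_preprocess_data",
)
print("found" if _preprocess_data is not None else "not available")
```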

Error in step05

Hi,

I successfully ran GenEpi up to step 04, but I am getting the error below in step 05.

Traceback (most recent call last):
  File "/home/n10398406/PD_Data/GenEpi_PD/Trial_chr21_gen/GenEpiL21/bin/GenEpi", line 8, in <module>
    sys.exit(main())
  File "/home/n10398406/PD_Data/GenEpi_PD/Trial_chr21_gen/GenEpiL21/lib/python3.7/site-packages/genepi/GenEpi.py", line 209, in main
    float_score_train, float_score_test = CrossGeneEpistasisLogistic(os.path.join(str_outputFilePath, "singleGeneResult"), str_inputFileName_phenotype, int_kOfKFold=int(args.k), int_nJobs=1)
TypeError: cannot unpack non-iterable float object

What can I do to get rid of this error? I would really appreciate your help.
Thank you.

Linduni
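The message in the traceback above means that the call returned a single float while the call site expects two values. Why `CrossGeneEpistasisLogistic` returned a single float in this run is not visible from the traceback. The sketch below only reproduces the error and shows a hypothetical defensive wrapper (`as_train_test` is not part of GenEpi).

```python
def as_train_test(result):
    """Normalize a score result to a (train, test) pair.

    Hypothetical helper: accepts either a 2-tuple or a single float and
    returns a pair, using None for the missing test score.
    """
    if isinstance(result, (tuple, list)) and len(result) == 2:
        return float(result[0]), float(result[1])
    return float(result), None


# Reproduce the error: unpacking a bare float into two names fails.
try:
    train, test = 0.0
except TypeError as err:
    # On Python 3.7 and later: cannot unpack non-iterable float object
    print(err)

print(as_train_test(0.0))
print(as_train_test((0.9, 0.8)))
```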

Compatibility with newer versions of scikit-learn

Hello,

I am running into some issues when running GenEpi, as a consequence of changes to the interface of some scikit-learn functions. So far, I have run into the following:

from sklearn.cross_validation import KFold -> from sklearn.model_selection import KFold
from sklearn import grid_search -> from sklearn.model_selection import GridSearchCV
The interface of the KFold function has changed.

Cheers,
Héctor.
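The KFold change mentioned above can be sketched as follows. In current scikit-learn, the splitter is configured with `n_splits` and the data is passed to `.split()`, instead of passing the sample count to the constructor. The wrapper `kfold_indices` is hypothetical and falls back to a plain Python split when scikit-learn is not installed, so the snippet runs either way.

```python
try:
    # Location in current scikit-learn releases.
    from sklearn.model_selection import KFold
except ImportError:  # scikit-learn missing or too old
    KFold = None


def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) lists for unshuffled k-fold CV."""
    if KFold is not None:
        # New-style API: configure the splitter, then call .split() on the data.
        splitter = KFold(n_splits=n_splits)
        for train, test in splitter.split(list(range(n_samples))):
            yield train.tolist(), test.tolist()
        return
    # Pure-Python fallback with contiguous folds; the first
    # n_samples % n_splits folds get one extra sample.
    sizes = [n_samples // n_splits + (i < n_samples % n_splits) for i in range(n_splits)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


for train_idx, test_idx in kfold_indices(6, 3):
    print(train_idx, test_idx)
```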

Issue in UCSC database access

Hi,
I tried to get the output files for the example data using the command $ GenEpi -g example -p example -o ./, but I got an error:
pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on 'genome-mysql.cse.ucsc.edu' (timed out)")

I searched for "genome-mysql.cse.ucsc.edu", but it is not available.

I would really appreciate any suggestion for overcoming this problem.

Thank you in advance.
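A timeout like this can come from a retired host name or from a firewall blocking outbound MySQL traffic. The sketch below checks whether a TCP connection to port 3306 can be opened at all. The second host name is the one UCSC documents for its public MySQL server to the best of our knowledge; treat it as an assumption and verify it. Whether GenEpi lets you change the host is not covered here.

```python
import socket


def can_connect(host, port=3306, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Host names to try: the one from the error message, then the assumed
# current name of the UCSC public MySQL server.
CANDIDATE_HOSTS = ["genome-mysql.cse.ucsc.edu", "genome-mysql.soe.ucsc.edu"]


def first_reachable(hosts, port=3306, timeout=5.0):
    """Return the first reachable host name, or None."""
    for host in hosts:
        if can_connect(host, port, timeout):
            return host
    return None
```

Calling `first_reachable(CANDIDATE_HOSTS)` from the machine that runs GenEpi shows whether either host can be reached; a result of `None` points to a network restriction rather than to GenEpi itself.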

EM algorithm

Hi, I was curious about the EM algorithm that is used in the software. Would it be possible to tell us which equation or reference the design of the algorithm is based on? Thanks.

Best,
Yu-Ping

try:
    ### EM algorithm
    for idx_loop in range(0, 10000):
        ### E(num_AB|prob_AB) = 2 * num_AABB + num_AABb + num_AaBB +
        ### (prob_AB * (1 + prob_AB - prob_A - prob_B) * num_AbBb) /
        ### ((prob_A - prob_AB) * (prob_B - prob_AB) + prob_AB * (1 + prob_AB - prob_A - prob_B))
        float_num_AB_estimateByEM = 2 * float(np_contigency[0, 0]) + float(np_contigency[0, 1]) + float(np_contigency[1, 0]) + (float_probability_AB * (1 + float_probability_AB - float_probability_A - float_probability_B) * float(np_contigency[1, 1])) / ((float_probability_A - float_probability_AB) * (float_probability_B - float_probability_AB) + float_probability_AB * (1 + float_probability_AB - float_probability_A - float_probability_B))
        float_probability_AB_estimateByEM = float_num_AB_estimateByEM / (int_num_subject * 2)
        if abs(float_probability_AB_estimateByEM - float_probability_AB) < 0.0000001:
            break
        else:
            float_probability_AB =
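The update in the excerpt above has the shape of the usual EM scheme for estimating a two-locus haplotype frequency from unphased genotypes. Only double heterozygotes (AaBb) have ambiguous phase, so the E-step splits them between the AB/ab and Ab/aB configurations in proportion to the current haplotype frequencies, and the M-step recomputes the AB frequency. Below is a self-contained sketch under that reading. The function name, the table layout, and the starting value are assumptions, and the reference the GenEpi author relied on is not stated in the excerpt.

```python
def estimate_p_ab(table, tol=1e-7, max_iter=10000):
    """Estimate the AB haplotype frequency from a 3x3 genotype count table by EM.

    table[i][j] counts subjects with genotype i at the first SNP (0=AA, 1=Aa, 2=aa)
    and genotype j at the second SNP (0=BB, 1=Bb, 2=bb).
    """
    n = sum(sum(row) for row in table)
    if n == 0:
        raise ValueError("empty table")
    # Allele frequencies of A and B from the marginal genotype counts.
    p_a = (2 * sum(table[0]) + sum(table[1])) / (2 * n)
    p_b = (2 * sum(r[0] for r in table) + sum(r[1] for r in table)) / (2 * n)
    p_ab = p_a * p_b  # assumed start: linkage equilibrium
    for _ in range(max_iter):
        cis = p_ab * (1 + p_ab - p_a - p_b)    # P(AB) * P(ab)
        trans = (p_a - p_ab) * (p_b - p_ab)    # P(Ab) * P(aB)
        # E-step: expected share of double heterozygotes carrying AB/ab.
        share = cis / (cis + trans) if cis + trans > 0 else 0.0
        n_ab = 2 * table[0][0] + table[0][1] + table[1][0] + share * table[1][1]
        # M-step: re-estimate the haplotype frequency.
        new_p_ab = n_ab / (2 * n)
        if abs(new_p_ab - p_ab) < tol:
            return new_p_ab
        p_ab = new_p_ab
    return p_ab


# Toy table: 10 AABB and 10 aabb subjects, i.e. only AB and ab haplotypes.
print(estimate_p_ab([[10, 0, 0], [0, 0, 0], [0, 0, 10]]))  # 0.5
```

From the estimated AB frequency and the two allele frequencies, linkage disequilibrium measures such as D prime and R square (the thresholds exposed by -d and -r) can then be derived.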
