This project forked from krishnanlab/expresto

Data, results, and code accompanying the manuscript: "A Flexible, Interpretable, and Accurate Approach for Imputing the Expression of Unmeasured Genes"

License: Other

Expresto

This repository contains the data and code needed to generate the results and reproduce the figures and tables in A Flexible, Interpretable, and Accurate Approach for Imputing the Expression of Unmeasured Genes, published in Nucleic Acids Research. The work introduces SampleLASSO, a gene expression imputation method that uses the LASSO machine learning algorithm in a way that captures context-specific, biologically relevant information to guide imputation.

This repo provides:

  1. The data, results, and figures presented in the manuscript.
  2. Code to regenerate the results and figures.
  3. A function that takes a user-supplied dataset, uses SampleLASSO to impute the unmeasured genes, and reports which expression samples in the training data were most helpful for the imputation.
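
The idea behind item 3 can be illustrated in a few lines. The following is a minimal sketch, not the repository's implementation: the data are synthetic, the regularization strength (alpha=0.01) is arbitrary, and all variable names are made up. Each training sample acts as a feature and each measured gene as an example, so the fitted coefficients indicate which training samples contributed most to the imputation.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)

# Synthetic training compendium: samples along rows, genes along columns.
n_train_samples, n_genes = 50, 200
train_exp = rng.normal(size=(n_train_samples, n_genes))

# Hypothetical new sample built as a mix of training samples 3 and 17.
new_sample = (0.7 * train_exp[3] + 0.3 * train_exp[17]
              + rng.normal(scale=0.01, size=n_genes))

measured_idx = np.arange(0, 150)    # genes measured in the new sample
target_idx = np.arange(150, 200)    # genes to impute

# Fit on measured genes: rows are genes, columns are training samples.
model = Lasso(alpha=0.01)  # alpha chosen arbitrarily for this sketch
model.fit(train_exp[:, measured_idx].T, new_sample[measured_idx])

# Impute the target genes from the same training samples.
imputed = model.predict(train_exp[:, target_idx].T)

# Training samples with the largest coefficients.
top_samples = np.argsort(-model.coef_)[:5]
print("most helpful training samples:", top_samples)
```

Because the synthetic sample is constructed from training samples 3 and 17, those two should receive the largest coefficients, mirroring how the real function reports the most helpful training samples.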

Section 1: Pre-computed Data, Results, and Figures/Tables

Data

The data used in this study (networks, embeddings, and genesets) are available on Zenodo. To get the data, run:

sh get_data.sh

Results

PDF versions of the figures can be found in figures/. The notebook that generates the figures can be found at src/make_figures.ipynb.

Section 2: Regenerating the Results and Figures/Tables

Dependencies

This code was tested on an Anaconda distribution of Python. The major packages used are:

python 3.7 
numpy 1.16.4
scipy 1.3.0
pandas 0.24.2
scikit-learn 0.20.3
matplotlib 3.0.3
seaborn 0.9.0
statsmodels 0.9.0
tensorflow-gpu 1.14.0 (this was run with python 3.6)
keras-gpu 2.2.4 (this was run with python 3.6)
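
To see how a local environment compares with the versions listed above, a small helper along the following lines may be useful. This is a hypothetical convenience script, not part of the repository; it assumes each package exposes a `__version__` attribute, and it covers only the packages listed for Python 3.7.

```python
import importlib

# Versions listed above, keyed by import name (scikit-learn imports as sklearn).
EXPECTED = {
    "numpy": "1.16.4",
    "scipy": "1.3.0",
    "pandas": "0.24.2",
    "sklearn": "0.20.3",
    "matplotlib": "3.0.3",
    "seaborn": "0.9.0",
    "statsmodels": "0.9.0",
}


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def compare_versions(expected):
    """Map each module name to (expected version, installed version or None)."""
    return {name: (want, installed_version(name))
            for name, want in expected.items()}
```

Calling `compare_versions(EXPECTED)` returns a dictionary of (expected, installed) pairs; a mismatch does not necessarily mean the code will fail, only that the environment differs from the one that was tested.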

The parallelization of the code was tested with Slurm on the high performance computing cluster at Michigan State University.

Running LASSO and KNN code

  1. main.py: Main script that generates imputed values
  2. main_utls.py: Helper functions for main.py
  3. main_slurm.py: A python script that will submit numerous jobs through slurm
  4. run_GeneKNN_val_jobs.sh, run_GeneLasso_val_job.sh, run_SampleKNN_val_jobs.sh, run_SampleLasso_val_job.sh: Scripts that start the relevant jobs
  5. main_knitting.py: Combines all predictions for one hyperparameter set into one file
  6. main_evalautions.py: Makes a file that has evaluations for different metrics
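
The last two steps, combining per-job predictions and scoring them, might look roughly like the following. This is an illustrative sketch only: the in-memory arrays, the splitting of work by gene, and the choice of metrics (per-gene Spearman correlation and RMSE) are assumptions and do not describe the file formats or metrics used by main_knitting.py or main_evalautions.py.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.RandomState(1)

# Hypothetical ground truth: 30 test samples x 12 genes.
truth = rng.normal(size=(30, 12))

# Pretend three separate jobs each imputed a block of four genes.
job_outputs = [truth[:, i:i + 4] + rng.normal(scale=0.1, size=(30, 4))
               for i in range(0, 12, 4)]

# "Knitting": join the per-job blocks back into one samples x genes array.
knitted = np.concatenate(job_outputs, axis=1)


def evaluate_per_gene(y_true, y_pred):
    """Return per-gene Spearman correlation and RMSE (assumed metrics)."""
    rho = np.array([spearmanr(y_true[:, g], y_pred[:, g])[0]
                    for g in range(y_true.shape[1])])
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))
    return rho, rmse


rho, rmse = evaluate_per_gene(truth, knitted)
print("median Spearman:", np.median(rho))
print("median RMSE:", np.median(rmse))
```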

Running DNN code

  1. DNN_main.py: Main script that generates imputed values, and makes the evaluation file
  2. DNN_slurm.py: A python script that submits all relevant DNN jobs

Running GAN code

  1. GGAN_main.py: Main script that generates imputed values, and makes the evaluation file
  2. GGAN_slurm.py: A python script that submits all relevant GGAN jobs
  3. weightnorm.py: A utility file for GGAN_main.py

Running SEEK data code

  1. seek_*.py: These files generate the results, where the * is replaced with an identifier for a given imputation method
  2. seek_slurm.py: A python script that submits all relevant SEEK jobs

Running Normalization code

  1. Normalization_Analysis.py: Main script that generates normalization analysis results
  2. Normalization_Analysis.sb: An sbatch file that allocates a slurm job for normalization script

Running Beta Analysis code

  1. beta_main.py: Main script that generates imputed values
  2. betas_slurm.py: A python script that submits the jobs through slurm
  3. betas_knitting_evals_move.py: Combines all predictions for one hyperparameter set into one file and makes a file with evaluations for different metrics

Section 3: User function for imputing any data

To impute new data, use the function found at src/user_function.py, which has the following arguments:

  1. -mgf, --measured_genes_file: The path to a tab separated file where the rows are the different genes, the first column contains the gene IDs and the rest of the columns contain the expression data to be imputed.
  2. -t, --targets: The path to a text file containing the gene IDs of the unmeasured genes that need to be imputed. If this path is not given, then all the genes in the training set that are not in the measured_genes_file will be imputed.
  3. -td, --training_data: The path to the data to be used for training (currently this needs to be a numpy array that has samples along the rows and genes along the columns)
  4. -id, --gene_ids: The path to the file that maps the columns in the training data to gene IDs
  5. -tk, --training_key: The path that maps the GSE and GSM IDs to the samples in the training set
  6. -upd, --use_all_paper_data: If this argument is set to either Microarray or RNAseq, the function will ignore arguments 3-5 and use the pre-supplied data used in this work.

An example run:

cd src
python user_function.py -mgf ../data/example_data.tsv -t ../data/example_targets.tsv -td ../data/Microarray_Trn_Exp.npy -id ../data/GeneIDs.txt -tk ../data/Microarray_Trn_Key.tsv
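
To run the function on your own data, the input files need to follow the layouts described in the argument list. The following is a rough sketch of how such files could be written with pandas and numpy. The gene IDs, sample names, file names, the presence of a header row, and the one-ID-per-line targets layout are all assumptions made for illustration; compare against ../data/example_data.tsv and ../data/example_targets.tsv before relying on them.

```python
import os
import tempfile

import numpy as np
import pandas as pd

rng = np.random.RandomState(2)
out_dir = tempfile.mkdtemp()

# Measured genes: one row per gene, gene IDs first, one column per sample.
# The gene IDs, sample names, and header row are made up for illustration.
measured = pd.DataFrame(
    rng.normal(size=(4, 2)),
    columns=["sample_A", "sample_B"],
)
measured.insert(0, "GeneID", ["1", "2", "9", "10"])
measured_path = os.path.join(out_dir, "my_measured_genes.tsv")
measured.to_csv(measured_path, sep="\t", index=False)

# Targets: gene IDs to impute, written one per line (assumed layout).
targets_path = os.path.join(out_dir, "my_targets.txt")
with open(targets_path, "w") as handle:
    handle.write("\n".join(["12", "13"]) + "\n")

# Custom training data: samples along rows, genes along columns.
train_path = os.path.join(out_dir, "my_training.npy")
np.save(train_path, rng.normal(size=(20, 6)))

print(measured_path, targets_path, train_path, sep="\n")
```

The resulting paths would then be passed to -mgf, -t, and -td respectively, together with matching -id and -tk files for the custom training data.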

This function writes 4 files to a subdirectory of user_results that is labeled with the timestamp YYYY-MM-DD-HH-SS:

  1. predictions.tsv: A tab separated file with the first column being the Gene IDs and the rest of the columns being the imputed expression values
  2. top_betas.tsv: A tab separated file that lists, for each GSM that was imputed, the 100 training samples with the highest model coefficients
  3. unusable_measured_genes.txt: A text file containing gene IDs in the uploaded measured_genes_file that were not in the training set
  4. unusable_targets.txt: A text file that lists the gene IDs of target genes that were not imputed because they were also in the measured_genes_file
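
Once a run has finished, the predictions can be loaded for downstream analysis. The sketch below is hypothetical: it builds a stand-in results directory so that it runs on its own, and it assumes that the timestamped subdirectory names are zero-padded (so they sort chronologically as strings) and that predictions.tsv has a header row. Only the file name predictions.tsv and its gene-IDs-first layout come from the description above.

```python
import os
import tempfile

import pandas as pd

# Stand-in for user_results/<timestamp>/predictions.tsv so the sketch runs
# on its own; the gene IDs, sample names, and values are invented.
results_root = tempfile.mkdtemp()
for stamp in ["2020-01-01-10-00", "2020-06-15-09-30"]:
    run_dir = os.path.join(results_root, stamp)
    os.makedirs(run_dir)
    pd.DataFrame({"GeneID": ["12", "13"],
                  "sample_A": [5.1, 7.3],
                  "sample_B": [4.8, 7.9]}
                 ).to_csv(os.path.join(run_dir, "predictions.tsv"),
                          sep="\t", index=False)


def load_latest_predictions(root):
    """Load predictions.tsv from the most recent timestamped subdirectory."""
    # Zero-padded timestamps sort chronologically as plain strings.
    latest = sorted(os.listdir(root))[-1]
    table = pd.read_csv(os.path.join(root, latest, "predictions.tsv"), sep="\t")
    # First column holds the gene IDs; keep them as strings and use as index.
    id_column = table.columns[0]
    table[id_column] = table[id_column].astype(str)
    return latest, table.set_index(id_column)


latest, predictions = load_latest_predictions(results_root)
print(latest)
print(predictions)
```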

Section 4: Additional Information

Support

For support please contact Chris Mancuso at [email protected] or Jake Canfield at [email protected].

License

See LICENSE.md for license information for all data used in this project.

Citation

If you use this work, please cite:
Mancuso CA, Canfield JL, Singla D, Krishnan A (2020) A flexible, interpretable, and accurate approach for imputing the expression of unmeasured genes. Nucleic Acids Research, 48:e125 https://doi.org/10.1093/nar/gkaa881.

Authors

Christopher A Mancuso#, Jake Canfield#, Deepak Singla, Arjun Krishnan*

# These authors are joint first authors.
* General correspondence should be addressed to AK at [email protected].

Funding

This work was primarily supported by US National Institutes of Health (NIH) grant R35 GM128765 to AK and in part by MSU start-up funds to AK and NIH F32 Fellowship F32GM134595 to CM.

Acknowledgements

We are grateful for the support from the members of the Krishnan Lab.

References

ARCHS4

  • Lachmann A, Torre D, Keenan AB, Jagodnik KM, Lee HJ, Wang L, Silverstein MC, Ma’ayan A. Massive mining of publicly available RNA-seq data from human and mouse. Nature Communications 9. Article number: 1366 (2018), doi:10.1038/s41467-018-03751-6

NCBI GEO

  • Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002 Jan 1;30(1):207-10.

  • Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, Yefanov A, Lee H, Zhang N, Robertson CL, Serova N, Davis S, Soboleva A. NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res. 2013 Jan;41(Database issue):D991-5.

URSA-HD

  • Lee YS, Krishnan A, Oughtred R, Rust J, Chang CS, Ryu J, Kristensen VN, Dolinski K, Theesfeld CL, Troyanskaya OG. (2019) A Computational Framework for Genome-wide Characterization of the Human Disease Landscape Cell Systems 8(2):P152-162 DOI: 10.1016/j.cels.2018.12.010

  • Lee YS, Krishnan A, Zhu Q, Troyanskaya OG. (2013) Ontology-aware classification of tissue and cell-type signals in gene expression profiles across platforms and technologies. Bioinformatics 29(23):3036-44 DOI https://doi.org/10.1093/bioinformatics/btt529

SEEK

  • Zhu A, Wong AK, Krishnan A, Aure MR, Tadych A, Zhang R, Corney DC, Greene CS, Bongo LA, Kristensen VN, Charikar M, Li K & Troyanskaya OG (2015) Targeted exploration and analysis of large cross-platform human transcriptomic compendia Nature Methods 12(3):211-4 DOI: 10.1038/nmeth.3249
