Expresto

This repository contains the data and code needed to generate the results and reproduce the figures and tables in "A Flexible, Interpretable, and Accurate Approach for Imputing the Expression of Unmeasured Genes", published in Nucleic Acids Research. This work introduces SampleLASSO, a method for imputing gene expression that uses the LASSO machine learning algorithm in a way that captures context-specific, biologically relevant information to guide imputation.

This repo provides:

The data, results, and figures presented in the manuscript.
Code to regenerate the results and figures.
A function that lets a user supply a dataset to be imputed. SampleLASSO then fills in the unmeasured genes and reports which expression samples in the training data were the most helpful for imputation.
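
As a rough illustration of the SampleLASSO idea, the sketch below fits one LASSO model per new sample: the training samples act as features, the measured genes act as training examples, and the fitted coefficients are then applied to the unmeasured genes. This is a minimal sketch under our own assumptions, not the code in src/. The function name sample_lasso_impute, the alpha value, and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso


def sample_lasso_impute(train, measured_idx, target_idx, new_measured,
                        alpha=0.01, top_k=3):
    """Impute target genes for one new sample (illustrative only).

    train        : array (n_train_samples, n_genes)
    measured_idx : column indices of genes measured in the new sample
    target_idx   : column indices of genes to impute
    new_measured : array (len(measured_idx),), the new sample's expression
    """
    # Rows = measured genes, columns = training samples (the features).
    X_fit = train[:, measured_idx].T
    model = Lasso(alpha=alpha, max_iter=10000)
    model.fit(X_fit, new_measured)

    # Apply the same per-sample coefficients to the unmeasured genes.
    X_pred = train[:, target_idx].T
    imputed = model.predict(X_pred)

    # Training samples with the largest coefficients, largest first.
    top_samples = np.argsort(model.coef_)[::-1][:top_k]
    return imputed, top_samples


# Synthetic data: 20 training samples, 60 genes (all values hypothetical).
rng = np.random.RandomState(0)
train = rng.normal(size=(20, 60))
measured_idx = np.arange(40)
target_idx = np.arange(40, 60)

# The "new" sample is a copy of training sample 4, so that sample should
# receive the largest coefficient.
new_measured = train[4, measured_idx]
imputed, top_samples = sample_lasso_impute(
    train, measured_idx, target_idx, new_measured
)
print(imputed.shape, top_samples[0])
```

Ranking the coefficients is what makes the approach interpretable: the highest-weighted training samples are the ones the model leaned on most for that new sample.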

Section 1: Pre-computed Data, Results, and Figures/Tables

Data

The data used in this study (networks, embeddings, and genesets) is available on Zenodo. To get the data, run:

sh get_data.sh

Results

PDF versions of the figures can be found in figures/. The notebook that generates the figures can be found at src/make_figures.ipynb.

Section 2: Regenerating the Results and Figures/Tables

Dependencies

This code was tested on an Anaconda distribution of Python. The major packages used are:

python 3.7 
numpy 1.16.4
scipy 1.3.0
pandas 0.24.2
scikit-learn 0.20.3
matplotlib 3.0.3
seaborn 0.9.0
statsmodels 0.9.0
tensorflow-gpu 1.14.0 (this was run with python 3.6)
keras-gpu 2.2.4 (this was run with python 3.6)

The parallelization of the code was tested with Slurm on the high performance computing cluster at Michigan State University.

Running LASSO and KNN code

main.py: Main script that generates imputed values
main_utls.py: Helper function for main.py
main_slurm.py: A python script that will submit numerous jobs through slurm
run_GeneKNN_val_jobs.sh, run_GeneLasso_val_job.sh, run_SampleKNN_val_jobs.sh, run_SampleLasso_val_job.sh: Scripts that start the relevant jobs
main_knitting.py: Combines all predictions for one hyperparameter set into one file
main_evalautions.py: Makes a file that has evaluations for different metrics

Running DNN code

DNN_main.py: Main script that generates imputed values, and makes the evaluation file
DNN_slurm.py: A python script that submits all relevant DNN jobs

Running GAN code

GGAN_main.py: Main script that generates imputed values, and makes the evaluation file
GGAN_slurm.py: A python script that submits all relevant GGAN jobs
weightnorm.py: A utility file for GGAN_main.py

Running SEEK data code

seek_*.py: These files generate the results, where the * is replaced with an identifier for a given imputation method
seek_slurm.py: A python script that submits all relevant SEEK jobs

Running Normalization code

Normalization_Analysis.py: Main script that generates normalization analysis results
Normalization_Analysis.sb: An sbatch file that allocates a slurm job for normalization script

Running Beta Analysis code

beta_main.py: Main script that generates imputed values
betas_slurm.py: A python script that submits the jobs through slurm
betas_knitting_evals_move.py: Combines all predictions for one hyperparameter set into one file and makes a file with evaluations for different metrics

Section 3: User function for imputing any data

To impute new data, use the function found at src/user_function.py, which has the following arguments:

-mgf, --measured_genes_file: The path to a tab separated file where the rows are the different genes, the first column contains the gene IDs and the rest of the columns contain the expression data to be imputed.
-t, --targets: The path to a text file containing the gene IDs of unmeasured genes that need to be imputed. If this path is not given, then all the genes in the training set that are not in the measured_genes_file will be imputed.
-td, --training_data: The path to the data to be used for training (currently this must be a numpy array that has samples along the rows and genes along the columns)
-id, --gene_ids: The path to the file that maps the columns in the training data to gene IDs
-tk, --training_key: The path to the file that maps the GSE and GSM IDs to the samples in the training set
-upd, --use_all_paper_data: If this argument is set to either Microarray or RNAseq, the function will ignore arguments 3-5 (-td, -id, and -tk) and use the pre-supplied data used in this work.
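
The input formats described above can be produced with a few lines of Python. The sketch below writes a measured-genes file and a targets file using only the standard library. The gene IDs, expression values, and file names are made up, and two details are assumptions rather than documented behavior: that no header row is needed, and that the targets file holds one gene ID per line. Compare against ../data/example_data.tsv and ../data/example_targets.tsv before relying on it.

```python
import csv
import os
import tempfile


def write_measured_genes(path, gene_ids, expression_rows):
    """Write one row per gene: gene ID first, then one value per sample."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        for gene_id, values in zip(gene_ids, expression_rows):
            writer.writerow([gene_id] + list(values))


def write_targets(path, gene_ids):
    """Write one gene ID per line (assumed layout of the targets file)."""
    with open(path, "w") as fh:
        for gene_id in gene_ids:
            fh.write(f"{gene_id}\n")


# Hypothetical data: three measured genes across two samples.
out_dir = tempfile.mkdtemp()
measured_path = os.path.join(out_dir, "my_measured_genes.tsv")
targets_path = os.path.join(out_dir, "my_targets.txt")

write_measured_genes(
    measured_path,
    ["1", "2", "9"],
    [[5.1, 4.8], [7.3, 7.9], [2.2, 2.0]],
)
write_targets(targets_path, ["10", "12"])
```

The resulting paths can then be passed to -mgf and -t.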

An example run:

cd src
python user_function.py -mgf ../data/example_data.tsv -t ../data/example_targets.tsv -td ../data/Microarray_Trn_Exp.npy -id ../data/GeneIDs.txt -tk ../data/Microarray_Trn_Key.tsv

This function writes four files to the directory user_results, in a subdirectory labeled with the timestamp YYYY-MM-DD-HH-SS:

predictions.tsv: A tab separated file with the first column being the Gene IDs and the rest of the columns being the imputed expression values
top_betas.tsv: A tab separated file that lists, for each GSM that was imputed, the 100 training samples with the highest model coefficients
unusable_measured_genes.txt: A text file containing gene IDs in the uploaded measured_genes_file that were not in the training set
unusable_targets.txt: A text file that lists gene IDs of target genes not imputed because they were also in the measured_genes_file
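
Once a run finishes, predictions.tsv can be read back for downstream analysis. The helper below is a standard-library sketch. read_predictions is a hypothetical name, and because it is not stated here whether the file starts with a header row, the code treats a first row with non-numeric values as a header and otherwise assumes there is none. The demo file it reads is made up for illustration.

```python
import csv
import os
import tempfile


def _is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def read_predictions(path):
    """Return (sample_names or None, {gene_id: [imputed values]})."""
    with open(path, newline="") as fh:
        rows = [row for row in csv.reader(fh, delimiter="\t") if row]

    # Treat the first row as a header only if its value columns are not numeric.
    sample_names = None
    if rows and not all(_is_number(value) for value in rows[0][1:]):
        sample_names = rows[0][1:]
        rows = rows[1:]

    predictions = {row[0]: [float(v) for v in row[1:]] for row in rows}
    return sample_names, predictions


# Hypothetical stand-in for a real predictions.tsv.
demo_path = os.path.join(tempfile.mkdtemp(), "predictions.tsv")
with open(demo_path, "w") as fh:
    fh.write("GeneID\tGSM_A\tGSM_B\n10\t3.5\t3.7\n12\t6.0\t5.5\n")

sample_names, predictions = read_predictions(demo_path)
```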

Section 4: Additional Information

Support

For support please contact Chris Mancuso at [email protected] or Jake Canfield at [email protected].

License

See LICENSE.md for license information for all data used in this project.

Citation

If you use this work, please cite:
Mancuso CA, Canfield JL, Singla D, Krishnan A (2020) A flexible, interpretable, and accurate approach for imputing the expression of unmeasured genes. Nucleic Acids Research, 48:e125 https://doi.org/10.1093/nar/gkaa881.

Authors

Christopher A Mancuso#, Jake Canfield#, Deepak Singla, Arjun Krishnan*

# These authors are joint first authors.
* General correspondence should be addressed to AK at [email protected].

Funding

This work was primarily supported by US National Institutes of Health (NIH) grant R35 GM128765 to AK, and in part by MSU start-up funds to AK and NIH F32 Fellowship F32GM134595 to CM.

Acknowledgements

We are grateful for the support from the members of the Krishnan Lab.

References

ARCHS4

Lachmann A, Torre D, Keenan AB, Jagodnik KM, Lee HJ, Wang L, Silverstein MC, Ma’ayan A. Massive mining of publicly available RNA-seq data from human and mouse. Nature Communications 9. Article number: 1366 (2018), doi:10.1038/s41467-018-03751-6

NCBI GEO

Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002 Jan 1;30(1):207-10.
Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, Yefanov A, Lee H, Zhang N, Robertson CL, Serova N, Davis S, Soboleva A. NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res. 2013 Jan;41(Database issue):D991-5.

URSA-HD

Lee YS, Krishnan A, Oughtred R, Rust J, Chang CS, Ryu J, Kristensen VN, Dolinski K, Theesfeld CL, Troyanskaya OG. (2019) A Computational Framework for Genome-wide Characterization of the Human Disease Landscape. Cell Systems 8(2):P152-162 DOI: 10.1016/j.cels.2018.12.010
Lee YS, Krishnan A, Zhu Q, Troyanskaya OG. (2013) Ontology-aware classification of tissue and cell-type signals in gene expression profiles across platforms and technologies. Bioinformatics 29(23):3036-44 DOI https://doi.org/10.1093/bioinformatics/btt529

SEEK

Zhu Q, Wong AK, Krishnan A, Aure MR, Tadych A, Zhang R, Corney DC, Greene CS, Bongo LA, Kristensen VN, Charikar M, Li K, Troyanskaya OG. (2015) Targeted exploration and analysis of large cross-platform human transcriptomic compendia. Nature Methods 12(3):211-4 DOI: 10.1038/nmeth.3249
