greenelab / linear_signal

Comparing the performance of linear and nonlinear models in transcriptomic prediction

License: BSD 3-Clause "New" or "Revised" License

Jupyter Notebook 90.29% Python 9.41% Shell 0.01% R 0.28%
analysis deep-learning gene-expression machine-learning tool

linear_signal's Introduction

Linear and Nonlinear Signals

This repo contains the code to reproduce the results of the manuscript "The Effects of Nonlinear Signal on Expression-Based Prediction Performance". In short, we compare linear and nonlinear models on multiple prediction tasks and find that their predictive ability is roughly equivalent. This similarity holds even though predictive nonlinear signal exists in the data for each task.

[Model comparison figure]

Installation

Python dependencies

The Python dependencies for this project are managed via Conda. To install them and activate the environment, use the following commands in bash:

conda env create --file environment.yml
conda activate linear_models

R setup

The R dependencies for this project are managed via renv. To set up renv for the repository, run the commands below in R while working in the linear_signal repo:

install.packages('renv')
renv::init()
renv::restore()

Sex prediction setup

Before running scripts involving sex prediction, you need to download the Flynn et al. labels from this link and put the results in the saged/data directory. Because of the settings on the figshare repo, it isn't possible to incorporate that part of the data download into the Snakefile; otherwise I would.

Neptune setup

If you want to log training results, you will need to sign up for a free neptune account here.

  1. The neptune module is already installed as part of the project's conda environment, but you'll need to grab an API token from the website.
  2. Create a neptune project for storing your logs.
  3. Store the token in secrets.yml in the format neptune_api_token: "<your_token>", and update the neptune_config file to use your info.
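
For reference, here's a minimal sketch (not part of the repo's code, and assuming PyYAML is available in the environment) of reading the token back out of secrets.yml:

import yaml

with open('secrets.yml') as secrets_file:
    secrets = yaml.safe_load(secrets_file)

# The token can then be passed to the Neptune client when logging in
neptune_api_token = secrets['neptune_api_token']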

Reproducing results

The pipeline to download all the data and produce all the results shown in the manuscript is managed by Snakemake. To reproduce all results files and figures, run

snakemake -j <NUM_CORES>

Successfully running the full pipeline takes a few months on a single machine. For reference, my machine has 64 GB of RAM, an AMD Ryzen 7 3800XT processor, and an NVIDIA 3090 GPU. You can get by with less RAM, VRAM, and fewer processor cores by reducing the degree of parallelism. I imagine the analyses can comfortably fit on a machine with 32 GB of RAM and a ~1080 Ti GPU, but I haven't tested the pipeline in such an environment.

If you want to speed up the process and see similar results, you can run the pipeline without hyperparameter optimization with

snakemake -s no_hyperopt_snakefile -j <NUM_CORES>

If you are going to be running the pipeline in a cluster environment, it may be helpful to read through the file slurm_snakefile. This blog post might also be helpful.

Intermediate steps

When running the full pipeline via snakemake, the required data will be downloaded automatically (excluding the sex prediction labels mentioned in the Sex prediction setup section above). If you'd like to skip the data download (and save yourself about a week of downloading and processing), you can rehydrate this Zenodo archive into the data/ dir.

Likewise, if you'd like to download the results files, they can be found here. If you only need the saved models, they can be found here.

Directory Layout

File/dir Description
Snakefile Contains the rules Snakemake uses to run the full project
environment.yml Lists the Python dependencies and their versions in a format readable by Conda
neptune.yml Lists information for Neptune logging
secrets.yml Stores the Neptune API token (see the Neptune setup section)
data/ Stores the raw and intermediate data files used for training models
dataset_configs/ Stores config information telling Dataset objects how to construct themselves
figures/ Contains images visualizing the results of the various analyses
logs/ Holds serialized versions of trained models
model_configs/ Stores config information for models such as default hyperparameters
notebook/ Stores notebooks used for visualizing results
results/ Records the accuracies of the models on various tasks
src/ The source code used to run the analyses
test/ Tests for the source code (runnable with pytest)

linear_signal's People

Contributors

ben-heil


Forkers

ben-heil nmateyko

linear_signal's Issues

Documentation: Describe Metadata

keep_ratios.py and other benchmark scripts require metadata files to build the RefineBioDataset objects. At some point I need to document what these files look like and which parts are actually needed to construct the dataset objects.

See #37 (comment) for more

Label encoding can change if I'm not careful

Right now labels are encoded at dataset creation time using sklearn's LabelEncoder, which sorts the labels before encoding them. This means that for classification tasks where the label sets are the same between datasets, the labels' encodings should also be the same. However, if there are more labels in the training set than in the validation set, the validation might end up comparing apples to oranges.

This might be the cause of the bad performance in the all_labels analysis, and it's definitely something to get right. I'll be adding the logic to account for this (LabeledDataset.get_label_encoder() and LabeledDataset.set_label_encoder()) to the next PR.
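
A minimal sketch (not repo code) of the problem and the shared-encoder fix:

from sklearn.preprocessing import LabelEncoder

train_labels = ['blood', 'brain', 'liver']
val_labels = ['blood', 'liver']  # 'brain' is missing from the validation set

train_encoder = LabelEncoder().fit(train_labels)
val_encoder = LabelEncoder().fit(val_labels)
print(train_encoder.transform(['liver']))  # [2]
print(val_encoder.transform(['liver']))    # [1] -- same label, different encoding

# Fitting one encoder and sharing it keeps the mapping consistent
shared_encoder = LabelEncoder().fit(train_labels)
print(shared_encoder.transform(val_labels))  # [0 2]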

CanonicalPathways distribution

Here's the distribution of the number of genes per pathway in the canonicalPathways dataset from PLIER.

Obtained via

library(PLIER)
data(canonicalPathways)
hist(colSums(canonicalPathways), breaks=30)

[Histogram of genes per pathway in canonicalPathways]

Recount3 Label Stats

More documentation than issue: here are the sample counts and study counts for the manually labeled samples in recount3.

Samples per tissue: [table not reproduced here]

Unique studies per tissue: [table not reproduced here]

Training Performance Issues

Scaling to ~100k samples leads to a substantial slowdown in the non-model-update parts of the training code. It takes a while to subset samples, split data by study, and even pull given samples from a dataset by their IDs.

I'll need to look into how I can speed these processes up.
Ideas:

  • Use less data (boooo)
  • Presplit data into train/test sets with different seeds (sketched below). Expensive(ish) in terms of disk space, but should be much faster
  • Look into why index access into a 100k-element dataframe is so slow. Is it a limitation of pandas itself, or am I doing something inefficient (e.g., is my access pattern appropriate given that pandas stores data column-major)?
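
A rough sketch of the presplitting idea (not repo code; the metadata dataframe, its 'study' column, and the output paths are assumptions):

import pickle
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def presplit_by_study(metadata: pd.DataFrame, seeds=(0, 1, 2), test_size=0.2):
    # metadata is assumed to be indexed by sample ID with a 'study' column
    for seed in seeds:
        splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
        train_idx, test_idx = next(splitter.split(metadata, groups=metadata['study']))
        split = {'train_samples': metadata.index[train_idx].tolist(),
                 'test_samples': metadata.index[test_idx].tolist()}
        with open(f'data/splits/split_seed_{seed}.pkl', 'wb') as out_file:
            pickle.dump(split, out_file)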

Setup steps missing from snakemake

I cloned the repo to my local computer to run two branches' evaluations at once, and I found that the Snakefile doesn't include the logic to download the compendium and subset it to only the samples containing blood gene expression.

Nondeterminism Issue

The CV splits used in the all_labels analysis (and probably others) aren't the same between runs. I feel like this could be another case of a set being cast to a list, but I'm unsure (see the sketch after the list below).

Code to check:

  • get_cv_splits
  • dataset constructors
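
A minimal demonstration of the set-to-list failure mode (not repo code):

import random

studies = {'SRP012345', 'SRP067890', 'SRP024680', 'SRP013579'}

unstable = list(studies)   # order depends on Python's hash randomization,
random.seed(42)            # so even a seeded shuffle of it can differ
random.shuffle(unstable)   # between interpreter runs

stable = sorted(studies)   # sorting first pins the order,
random.seed(42)            # making the seeded shuffle reproducible
random.shuffle(stable)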

Weird Predictions

sample_split_control and tissue_split_control initialize the output size of the neural network models to be the same as the input size (because they were based on the imputation pretraining script).

Because neural nets are neural nets, they just learn not to use the extra ~20k output classes, but I should fix that issue and rerun in the future.
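
A rough illustration of the fix (illustrative names, not the scripts' actual code):

import torch.nn as nn

input_dim = 20000   # number of genes
n_classes = 2       # number of labels actually being predicted

model = nn.Sequential(
    nn.Linear(input_dim, 512),
    nn.ReLU(),
    nn.Linear(512, n_classes),  # output sized by class count, not by input_dim
)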

Feature: Early Stopping

PytorchSupervised models should implement early stopping at some point. The current train/tune split doesn't work well for this, though; maybe use ideas from #28 (comment)?
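
One possible shape for it, as a sketch rather than the eventual implementation (train_one_epoch and evaluate are stand-ins for existing training code):

import copy

def train_with_early_stopping(model, train_loader, tune_loader, max_epochs=100, patience=10):
    best_loss = float('inf')
    best_state = copy.deepcopy(model.state_dict())
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model, train_loader)
        tune_loss = evaluate(model, tune_loader)

        if tune_loss < best_loss:
            best_loss = tune_loss
            best_state = copy.deepcopy(model.state_dict())
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no tune-set improvement for `patience` epochs

    model.load_state_dict(best_state)  # roll back to the best checkpoint
    return model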

Benchmarking Code

The code required to compare different models will have a lot of different components, so I'll list the elements here and check them off as they're implemented.

  • Create a base Dataset class

  • Implement a Dataset class for refinebio data

  • Implement extracting data for a given label

  • Implement subsetting data based on label

  • Implement train/test split and cross-validation, by study instead of by sample

  • Create a base Model class (probably using the sklearn API? See the sketch after this list)

  • Implement Models for skl methods

  • Implement Models for some semi-supervised methods

  • Implement non-semi-supervised versions of the same models, or separate unsupervised and supervised steps

  • Implement data simulation/augmentation with VAEs

  • Determine a way to parallelize the benchmarking process

  • Write the logic to take a dataset and benchmark models on it. Probably build this into the dataset object itself?
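
As a rough sketch (not the actual class), a base Model following the sklearn convention might look like:

from abc import ABC, abstractmethod
import numpy as np

class BaseModel(ABC):
    """Interface all benchmarked models would implement."""

    @abstractmethod
    def fit(self, X: np.ndarray, y: np.ndarray) -> 'BaseModel':
        """Train the model on expression data X and labels y."""

    @abstractmethod
    def predict(self, X: np.ndarray) -> np.ndarray:
        """Return predicted labels for expression data X."""

    @abstractmethod
    def save_model(self, out_path: str) -> None:
        """Serialize the trained model to disk."""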

CI is slow

It takes about 7 minutes for conda to download dependencies on each CI run. I should probably just switch to the setup-miniconda action and enable caching.

`utils.load_compendium_data` is slow

For small models, ~40 percent of the runtime (~3.5 minutes) is spent just loading the subset compendium into a dataframe. This appears to be due to the slowness of the pandas.read_csv function rather than a disk-read limitation, because loading the file into memory with compendium_file.readlines() takes only 8 seconds.

Potential fixes:

  • Pickle the dataframe and load it (sketched after this list)
  • Load the file into a numpy array and use the numpy constructor for the dataframe
  • Try dask again for parallelism since we appear to be CPU bound instead of disk bound
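
A minimal sketch of the pickle-caching fix (not repo code; the path and separator are assumptions):

import os
import pandas as pd

def load_compendium(tsv_path: str) -> pd.DataFrame:
    pickle_path = tsv_path + '.pkl'
    if os.path.exists(pickle_path):
        return pd.read_pickle(pickle_path)  # fast path: cached from a previous run

    compendium = pd.read_csv(tsv_path, sep='\t', index_col=0)  # slow path
    compendium.to_pickle(pickle_path)
    return compendium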

PytorchSupervised Memory Leak

https://ui.neptune.ai/ben-heil/saged/e/SAG-4/monitoring and https://ui.neptune.ai/ben-heil/saged/e/SAG-5/monitoring correspond to the first and second fold of cross validation in all_label_comparison.py. The second fold starts with a higher GPU memory usage, and as a result errors out.

This implies GPU memory is being held across folds. It isn't immediately apparent why this is happening, and it doesn't fit in the scope of the current PR, so I've opened this issue.
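
One thing to try, sketched here rather than implemented (build_model, train, and evaluate stand in for the benchmark code's functions):

import gc
import torch

def run_cv(cv_splits, build_model, train, evaluate):
    results = []
    for train_data, val_data in cv_splits:
        model = build_model().to('cuda')
        train(model, train_data)
        results.append(evaluate(model, val_data))

        del model                 # drop the references held by this fold
        gc.collect()
        torch.cuda.empty_cache()  # release cached GPU memory back to the driver
    return results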

Refactor

There's a lot of duplicated code between PytorchSupervised and PytorchPseudolabel that can be condensed.

Likewise, more of the benchmark scripts' code should be pulled into utility functions and tested.
