greenelab / linear_signal Goto Github PK

Comparing the performance of linear and nonlinear models in transcriptomic prediction

License: BSD 3-Clause "New" or "Revised" License

Jupyter Notebook 90.29% Python 9.41% Shell 0.01% R 0.28%

analysis deep-learning gene-expression machine-learning tool

linear_signal's Issues

Label encoding can change if I'm not careful

Right now labels are encoded at dataset creation time using sklearn.LabelEncoder, which sorts the labels before encoding them method one, method 2. This means that for classification where the labels are the same between datasets, the labels' encodings should also be the same. However, if there are more labels in the training set than validation set, the validation might be comparing apples to oranges.

This might be the cause of the bad performance in the all_labels analysis, and definitely is something to make sure to get right. I'll be adding the logic to account for this (LabeledDataset.get_label_encoder() and LabeledDataset.set_label_encoder()) to the next PR

Feature: Early Stopping

PytorchSupervised models should implement early stopping at some point. The current train/tune split doesn't work great though, maybe use ideas from #28 (comment) ?

Refactor

There's a lot of duplicated code between PytorchSupervised and PytorchPseudolabel that can be condensed.

Likewise more of the benchmark scripts' code should be pulled into utility functions and tested

Outliers in lupus/tb model performance. Maybe a data leak?

See https://github.com/greenelab/saged/blob/331a77a57564b02d8d40b1dd7b66b53a0f1b50c5/notebook/analysis/benchmark_results.ipynb tb and lupus visualizations

Need to add `create_simulation_metadata` to snakefile

Setup steps missing from snakemake

I cloned the repo to my local computer to run two branches' evaluations at once, and I found that the Snakefile doesn't include the logic to download and the compendium and subset it to include only the samples containing blood gene expression

CI is slow

It takes 7 minutes to use conda to download dependencies each time. I should probably just switch to the setup-miniconda action and enable caching.

Training Performance Issues

Scaling to ~100k samples leads to a substantial slowdown in the non-model update parts of the training code. It takes awhile to subset samples, split data by study, and even pull given samples from a dataset using their IDs.

I'll need to look into how I can speed these processes up.
Ideas:

Use less data (boooo)
Presplit data into train/test sets with different seeds. Expensive(ish) in terms of disk space, but should be much faster
Look into why index access into a 100k element dataframe is so slow. Is it a limitation of pandas itself, or am I doing something inefficient (like is my access pattern correct in the light of pandas using column major order)

Nondeterminism Issue

The CV splits used in the all_labels analysis (and probably others) aren't the same between runs. I feel like this could be another set being cast to a list issue, but I'm unsure.

Code to check:

get_cv_splits
dataset constructors

Recount3 Label Stats

More documentation than issue, here are the sample counts and study counts for the manually labeled samples in recount3

Samples per tissue:

Unique studies per tissue:

`utils.load_compendium_data` is slow

For small models ~40 percent of the runtime (~3.5 minutes) is spend just loading the subset compendium into a dataframe. This appears to be due to the slowness of the pandas.read_csv function as opposed to a disk reading limitation, because loading the file into memory with compendium_file.readlines() only takes 8 seconds.

Potential fixes:

Pickle the dataframe and load it
Load the file into a numpy array and use the numpy constructor for the dataframe
Try dask again for parallelism since we appear to be CPU bound instead of disk bound

https://ui.neptune.ai/ben-heil/saged/e/SAG-4/monitoring and https://ui.neptune.ai/ben-heil/saged/e/SAG-5/monitoring correspond to the first and second fold of cross validation in all_label_comparison.py. The second fold starts with a higher GPU memory usage, and as a result errors out.

This implies there is gpu memory being held across folds. It isn't immediately apparent why this is happening, and it doesn't fit in the scope of the current PR, so I've opened this issue.

CanonicalPathways distribution

Here's the distribution of the number of genes in the canonicalpathways dataset from PLIER.

Obtained via

library(PLIER)
data(canonicalPathways)
hist(colSums(canonicalPathways), breaks=30)

greenelab / linear_signal Goto Github PK

linear_signal's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs