openproblems-bio / openproblems
Formalizing and benchmarking open problems in single-cell genomics
License: MIT License
Describe the problem concisely.
Include references to papers where the task is attempted.
Propose datasets
Include links to at least one publicly available dataset that could be used.
Propose methods
Include links to codebases of at least two methods that perform the task.
Propose metrics
Describe at least one metric by which to measure success on the task. It must be able to be applied to the proposed datasets.
Describe the problem concisely.
Cell types have historically been characterized by morphology and functional assays, but a key advantage of single-cell transcriptomics is its ability to define cell identities and their underlying regulatory networks. We developed scRFE to produce interpretable gene lists for all cell types in a given dataset; these lists can be used to design experiments or to aid biological data mining. It would be valuable to benchmark scRFE alongside other methods from the community.
Propose datasets
So far we have used Tabula Muris Senis.
Propose methods
NEED TO DO
Propose metrics
Compare with ground truth marker genes (https://docs.google.com/spreadsheets/d/1SsKS2vMvZqLdJ8hOv_dBeXo3k4Gcwcy-oPBhFTCo0tY/edit?usp=sharing)
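One simple way to score against the linked ground-truth marker genes is set overlap between each method's top-ranked genes and the known markers. A minimal sketch; the `marker_overlap` function and the gene names are illustrative, not part of the codebase:

```python
# Hypothetical sketch: score a predicted marker-gene list against ground-truth
# markers using Jaccard overlap of the top-k predictions.

def marker_overlap(predicted, ground_truth, k=20):
    """Jaccard overlap between the top-k predicted genes and known markers."""
    top_k = set(predicted[:k])
    truth = set(ground_truth)
    return len(top_k & truth) / len(top_k | truth)

# Example: a method that ranks Cd19 and Ms4a1 highly for B cells
predicted = ["Cd19", "Ms4a1", "Actb", "Gapdh"]
truth = ["Cd19", "Ms4a1", "Cd79a"]
print(marker_overlap(predicted, truth, k=4))  # 2 shared / 5 total = 0.4
```

A rank-aware metric (e.g. recall at k for several values of k) would be a natural extension when the ground-truth sheet orders markers by importance.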
To save time during active development.
Name, Paper, Paper URL, Paper year, Website, Dataset, Metric, Value, Ranksum rank, Runtime.
| Method | kNN AUC | MSE | Paper |
| --- | --- | --- | --- |
| Procrustes | 0.00298303 | 0.833141 | Procrustes Methods in the Statistical Analysis of Shape |
| Cheat | 0.0587961 | 0 | Cheat |
| MNN | 0.062768 | 1.05657 | Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors |
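The proposed "Ranksum rank" column could be derived by ranking methods on each metric and summing the ranks. A sketch using the values above, under the assumption (not stated in the issue) that higher kNN AUC is better and lower MSE is better:

```python
# Sketch of a "Ranksum rank": rank methods per metric, then sum the ranks.
# Assumption: higher kNN AUC is better, lower MSE is better.
methods = {
    "Procrustes": {"knn_auc": 0.00298303, "mse": 0.833141},
    "Cheat": {"knn_auc": 0.0587961, "mse": 0.0},
    "MNN": {"knn_auc": 0.062768, "mse": 1.05657},
}

def ranks(scores, higher_is_better):
    """Map each method name to its 1-based rank on one metric."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {name: i + 1 for i, name in enumerate(ordered)}

auc_rank = ranks({m: v["knn_auc"] for m, v in methods.items()}, True)
mse_rank = ranks({m: v["mse"] for m, v in methods.items()}, False)
ranksum = {m: auc_rank[m] + mse_rank[m] for m in methods}
print(ranksum)  # {'Procrustes': 5, 'Cheat': 3, 'MNN': 4}
```

As expected, the "Cheat" baseline gets the best (lowest) rank sum, which is a useful sanity check on the ranking logic.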
Create a place to compute all metrics on an arbitrary output to facilitate development
Move stuff to optional deps
Describe the problem concisely.
Many biological problems are first studied in nonhuman animals, and cell types are often carefully characterized in model organisms such as mouse. We would like to take the cell type labels from a mouse dataset and project them onto another species, such as human.
Propose datasets
We have datasets for both of these, subset down to 1:1 orthologous genes across the two species, with a unified "narrow group" of cell types across the species.
Propose methods
Existing batch correction or dataset alignment methods could be applied here.
Propose metrics
Metrics measuring how accurately the projected labels match the ground truth, e.g. per-cell accuracy or macro-averaged F1.
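Per-cell accuracy is the simplest such metric; a macro-averaged F1 (e.g. `sklearn.metrics.f1_score`) would additionally weight rare cell types fairly. A dependency-free sketch, where `label_accuracy` is a hypothetical helper:

```python
# Sketch of the simplest label-projection metric: per-cell accuracy.

def label_accuracy(y_true, y_pred):
    """Fraction of cells whose projected label matches the ground truth."""
    assert len(y_true) == len(y_pred)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = ["B cell", "T cell", "B cell", "NK cell"]
y_pred = ["B cell", "T cell", "T cell", "NK cell"]
print(label_accuracy(y_true, y_pred))  # 0.75
```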
Failing on Travis with: `urllib.error.HTTPError: HTTP Error 404: Not Found`
Describe the problem concisely.
Detecting differentially expressed genes or other features in augmented data to verify correct type I error control. The task can be modeled after [this paper](http://www.nature.com/doifinder/10.1038/nmeth.4612).
Propose datasets
Datasets used in the study are made available as the "conquer" repository.
Propose methods
edgeR, MAST, diffxpy, limma, etc.
Propose metrics
Metrics from the paper: Type I error control, enrichment of labels on false-positive genes, etc.
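Type I error control can be checked on null data, where no gene is truly differentially expressed: the fraction of genes called significant at level alpha should be close to alpha. A self-contained sketch using a crude z-test (purely illustrative, not a real differential-expression method):

```python
# Sketch: on null data (both groups drawn from the same distribution),
# a well-calibrated test should reject ~alpha of the time.
import random

def type_one_error_rate(n_genes=2000, n_cells=50, seed=0):
    rng = random.Random(seed)
    n_reject = 0
    for _ in range(n_genes):
        a = [rng.gauss(0, 1) for _ in range(n_cells)]
        b = [rng.gauss(0, 1) for _ in range(n_cells)]
        # crude two-sample z-test on means, known unit variance
        mean_diff = sum(a) / n_cells - sum(b) / n_cells
        se = (2 / n_cells) ** 0.5
        if abs(mean_diff / se) > 1.96:  # two-sided alpha = 0.05
            n_reject += 1
    return n_reject / n_genes

rate = type_one_error_rate()
# rate should be close to 0.05 for a well-calibrated test
```

A real benchmark would substitute the candidate method (edgeR, MAST, ...) for the z-test and use resampled real data rather than Gaussian noise, as in the referenced paper.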
What is the method?
MNN correct via fastMNN
Where is the code located?
https://rdrr.io/github/LTLA/batchelor/man/fastMNN.html
Which task(s) could it be used for?
Already implemented in multimodal data integration
I think this might need to be defined for each package, e.g.:

```python
def _magic_impute(adata):
    import magic

    __version__ = magic.__version__
```
I added a new dataset but it didn't appear on the website. https://github.com/singlecellopenproblems/SingleCellOpenProblems/blob/master/website/data/results/multimodal_data_integration/citeseq_cmbc.json
Hey everyone,
During the call today I had a few thoughts and concerns about the current infrastructure (back-end). Disclaimer: I'm not 100% familiar with the current back-end, so please point out if I'm making any incorrect assumptions about how things currently work. I'm mostly relying on my limited experience adding a new task and on brief descriptions of the infrastructure from Scott.
I really like the "snakemake approach" where new "builds" are only triggered when one of the inputs changes. Given the broad scope of this project (many tasks, many datasets, many methods), it seems clear that a new "build" should only be triggered when a new dataset or method is added, and only for those specific datasets/methods (no need to re-run all the other methods on all the other datasets). Currently I think everything gets run with each new commit, and I don't think that's a sustainable solution.
In thinking about this snakemake approach some more, I think it's interesting to think about what should happen when a new metric gets added to a particular task. Naively, this would require all methods to run on all datasets again, which seems like a huge effort. The only way to avoid this would be to store the result of each method on each dataset. This way, any new metric could be run simply by comparing the stored method result to the "ground truth". This suggests that metric calculation should be decoupled from actually running the methods. Currently, method results are not stored anywhere.
The previous points 1) and 2) suggest to me that we should store both the docker images (one per method and version thereof) and the input data (one file/directory per dataset) as files on a server (probably a cloud bucket). This would make it clear when we have to run a method: namely, when we encounter a previously unseen combination of docker image and input data. This is very different from the current setup, where both methods and datasets are fundamentally represented as code, not as files, which in my mind makes it very difficult to implement logic for what has already been run and what still needs to be run.
In summary, I really think the infrastructure needs to be modified to allow more modularity in terms of when to run which method on which dataset and for decoupling metrics from everything else.
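The caching logic sketched in these points could look roughly like this. All names are hypothetical and this is not the current infrastructure, just a sketch of the "run only previously unseen combinations, store outputs for metric reuse" idea:

```python
# Sketch: run a method only when its (method image, dataset) combination has
# not been seen; keep the raw output so new metrics can reuse it later.
import hashlib

def cache_key(method_image: str, dataset_path: str) -> str:
    """Stable key identifying one (method version, dataset) combination."""
    return hashlib.sha256(f"{method_image}::{dataset_path}".encode()).hexdigest()

def run_if_needed(method_image, dataset_path, cache, run_method):
    key = cache_key(method_image, dataset_path)
    if key not in cache:
        # Previously unseen combination: actually run the method and store
        # its output (in production this would be a cloud bucket, not a dict).
        cache[key] = run_method(method_image, dataset_path)
    return cache[key]

cache = {}
out1 = run_if_needed("magic:1.0", "pancreas.h5ad", cache,
                     lambda m, d: f"result({m}, {d})")
out2 = run_if_needed("magic:1.0", "pancreas.h5ad", cache,
                     lambda m, d: "SHOULD NOT RUN")
# out2 == out1: the second call hit the cache instead of re-running the method
```

With stored outputs, adding a new metric only requires re-reading cached results, never re-running methods, which is exactly the decoupling argued for above.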
What is the dataset?
Describe it briefly and include a citation.
Where is the data located?
Include a link to the publicly available data.
Which tasks could it be used for?
If the dataset is to be used for a task that is not yet included in the code base, use the "Propose a new task" issue template instead.
What is the method?
MARS is a meta-learning approach for identifying and annotating cell types. It overcomes the heterogeneity of cell types by transferring latent cell representations across multiple datasets, using deep learning to learn a cell embedding function together with a set of landmarks in the embedding space.
Where is the code located?
https://github.com/snap-stanford/mars
Which task(s) could it be used for?
Label projection
Describe the problem concisely.
Data integration task following the recent [benchmarking paper](https://www.biorxiv.org/content/10.1101/2020.05.22.111161v2).
Propose datasets
All datasets used in the paper can be used. Quite a few are already on Figshare ([here](https://figshare.com/articles/dataset/Benchmarking_atlas-level_data_integration_in_single-cell_genomics_-integration_task_datasets_Immune_and_pancreas/12420968)).
Propose methods
Classical methods are already out there: MNN, Seurat v3, scVI, BBKNN, CONOS, etc.
Propose metrics
14 metrics are in the paper already, these would be a good starting point.
Describe the problem concisely.
Comparison of trajectory inference methods as performed by the Saeys group in [this paper](http://www.nature.com/articles/s41587-019-0071-9).
Propose datasets
Several simulated and real datasets are available in the dynverse package.
Propose methods
72 methods are named in the paper: PAGA, Slingshot, Monocle v3, DPT, etc.
Propose metrics
Dynverse includes a comparison of TI results after mapping the results to a standardized framework for topology comparison.
```
/home/travis/virtualenv/python3.6.7/lib/python3.6/site-packages/sklearn/linear_model/_logistic.py:764: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
```
Suggested fix: use `sklearn.preprocessing.StandardScaler().fit(X_train)` to scale the features first.
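A sketch of that fix, scaling inside a Pipeline so the scaler is fit on training data only. The data here is synthetic and only illustrates the pattern:

```python
# Scale features before logistic regression, as the warning above suggests.
# Using a Pipeline ensures the scaler is fit on training data only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5)) * 1000  # badly scaled features
y_train = (X_train[:, 0] > 0).astype(int)   # label depends on feature 0

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(clf.score(X_train, y_train))
```

At prediction time the same pipeline applies the stored scaling automatically, so train and test data are transformed consistently.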
I feel as though the ground-truth labels should not be visible to everyone doing the task. Instead they should be set to `np.nan` wherever `adata.obs['is_train'] == False`. That way we can be sure that nobody trains their model on more than the training data (especially important for hyperparameter optimization, which we can't control otherwise). That would, however, require the metrics to load the ground-truth labels from somewhere else. Can we hide objects in GitHub?
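The proposed masking might look like this. Column names follow the convention in the issue; this is a sketch, not existing code:

```python
# Sketch: hide ground-truth labels for held-out cells by setting them to NaN,
# so methods can only train where is_train is True.
import numpy as np
import pandas as pd

obs = pd.DataFrame({
    "labels": ["B cell", "T cell", "NK cell", "B cell"],
    "is_train": [True, True, False, False],
})

# Keep labels where is_train is True, replace the rest with NaN
obs["labels_masked"] = obs["labels"].where(obs["is_train"], np.nan)
print(obs["labels_masked"].tolist())  # ['B cell', 'T cell', nan, nan]
```

The unmasked `labels` column would then live only on the evaluation side, where the metrics are computed.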
Describe the problem concisely.
Include references to papers where the task is attempted.
The reference benchmarks over 50 methods for calculating target genes from chromatin accessibility, evaluating them by correlation with gene expression (see figure 2b). The software is released in a GitHub repo; two of the typical models are shown in figure 2c and the manual gene-score section.
Propose datasets
Propose methods
Propose metrics
We discussed today that it may be useful to have a mapping between metrics and datasets for a task so that you can run a subset of metrics on a particular dataset. E.g., a denoising dataset may have a ground truth assigned, so that you can build a simple metric to test if the data corresponds to the ground truth. This may however not be the case for other datasets.
The alternative would be to use a new task for these datasets; that might make the task names a bit less clear, though.
Options:
- `adata.X` in `openproblems/data`, keeping raw counts in `adata.layers["counts"]`
- `openproblems/utils.py`
- https://travis-ci.com/github/singlecellopenproblems/SingleCellOpenProblems/jobs/376995264#L910
```
File "evaluate.py", line 34, in evaluate_task
  result.append(evaluate_dataset(task, dataset))
File "evaluate.py", line 23, in evaluate_dataset
  r = evaluate_method(task, adata.copy(), method)
File "/home/travis/virtualenv/python3.6.7/lib/python3.6/site-packages/anndata/_core/anndata.py", line 1449, in copy
  obsm=self.obsm.copy(),
File "/home/travis/virtualenv/python3.6.7/lib/python3.6/site-packages/anndata/_core/aligned_mapping.py", line 87, in copy
  d[k] = v.copy()
MemoryError: Unable to allocate 2.45 GiB for an array with shape (26022, 25313) and data type float32
```
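One possible mitigation, assuming downstream methods accept sparse input (which must be checked per method): keep count matrices in `scipy.sparse` format, since scRNA-seq matrices are mostly zeros, so copies stay small. A sketch:

```python
# Sketch: a mostly-zero matrix stored sparsely takes far less memory,
# so copies like adata.copy() no longer allocate gigabytes.
import numpy as np
from scipy import sparse

dense = np.zeros((1000, 500), dtype=np.float32)
dense[::50, ::25] = 5.0  # a few nonzero "counts"

X = sparse.csr_matrix(dense)
dense_bytes = dense.nbytes
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(dense_bytes, sparse_bytes)  # the sparse copy is far smaller
```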
In the label projection task, all the methods have the same version number.
WARNING: Testing via this command is deprecated and will be removed in a future version. Users looking for a generic test entry point independent of test runner are encouraged to use tox.
Describe the problem concisely.
Preserving measures of stress / manifold topology in low-dimensional embeddings of single-cell data.
Propose datasets
Include links to at least one publicly available dataset that could be used.
Propose methods
UMAP, PHATE, PCA
Propose metrics
https://github.com/scottgigante/DEMaP
https://github.com/scikit-learn/scikit-learn/blob/0fb307bf39bbdacd6ed713c00724f8f871d60370/sklearn/manifold/_mds.py#L20
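A simplified DEMaP-style metric is the rank correlation between pairwise distances before and after embedding. DEMaP itself compares geodesic distances on simulated data; the sketch below uses plain Euclidean distances and a hypothetical `distance_correlation` helper:

```python
# Sketch: Spearman correlation of pairwise distances in the original space
# vs. the embedding. Higher means better-preserved geometry.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def distance_correlation(X_high, X_low):
    """Spearman correlation of condensed pairwise-distance vectors."""
    return spearmanr(pdist(X_high), pdist(X_low))[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
good = X[:, :10]                 # embedding keeping half the coordinates
bad = rng.normal(size=(50, 2))   # unrelated random embedding

print(distance_correlation(X, good), distance_correlation(X, bad))
```

The embedding that retains real structure scores much higher than the random one, which is the property a DR metric should reward.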
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
```python
import openproblems

adata = openproblems.tasks.label_projection.datasets.pancreas_batch()
type(adata.X)
# <class 'numpy.ndarray'>
adata.X.sum(axis=0)
# 0.0
```
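A dataset sanity check in the test suite could catch regressions like this all-zero matrix. A sketch; the `dataset_is_sane` helper is hypothetical:

```python
# Sketch: reject matrices that are empty, all-zero, or contain NaNs,
# which would have caught the all-zero adata.X bug above.
import numpy as np

def dataset_is_sane(X) -> bool:
    """Basic sanity checks for a loaded expression matrix."""
    X = np.asarray(X)
    return X.size > 0 and np.abs(X).sum() > 0 and not np.isnan(X).any()

assert dataset_is_sane(np.array([[0.0, 1.0], [2.0, 0.0]]))
assert not dataset_is_sane(np.zeros((3, 3)))  # the reported bug
```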
Website
Tasks
Longer term
What is the dataset?
Protein and RNA measured jointly (CITE-seq): https://www.nature.com/articles/nmeth.4380
Where is the data located?
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE100866
Which tasks could it be used for?
Multimodal data integration