openproblems-bio / openproblems
Formalizing and benchmarking open problems in single-cell genomics
License: MIT License
Describe the problem concisely.
Include references to papers where the task is attempted.
Propose datasets
Include links to at least one publicly available dataset that could be used.
Propose methods
Include links to codebases of at least two methods that perform the task.
Propose metrics
Describe at least one metric by which to measure success on the task. It must be able to be applied to the proposed datasets.
Describe the problem concisely.
Cell types have historically been characterized by morphology and functional assays, but a key advantage of single-cell transcriptomics is its ability to define cell identities and their underlying regulatory networks. We developed scRFE to produce interpretable gene lists for all cell types in a given dataset; these lists can be used to design experiments or to aid biological data mining. It would be valuable to benchmark scRFE alongside other methods from the community.
Propose datasets
So far we have used Tabula Muris Senis.
Propose methods
NEED TO DO
Propose metrics
Compare with ground truth marker genes (https://docs.google.com/spreadsheets/d/1SsKS2vMvZqLdJ8hOv_dBeXo3k4Gcwcy-oPBhFTCo0tY/edit?usp=sharing)
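One simple way to score against the linked ground-truth marker genes is set overlap between each method's top-ranked genes and the known markers. A minimal sketch; the `marker_overlap` function and the gene names are illustrative, not part of the codebase:

```python
# Hypothetical sketch: score a predicted marker-gene list against ground-truth
# markers using Jaccard overlap of the top-k predictions.

def marker_overlap(predicted, ground_truth, k=20):
    """Jaccard overlap between the top-k predicted genes and known markers."""
    top_k = set(predicted[:k])
    truth = set(ground_truth)
    return len(top_k & truth) / len(top_k | truth)

# Example: a method that ranks Cd19 and Ms4a1 highly for B cells
predicted = ["Cd19", "Ms4a1", "Actb", "Gapdh"]
truth = ["Cd19", "Ms4a1", "Cd79a"]
print(marker_overlap(predicted, truth, k=4))  # 2 shared / 5 total = 0.4
```

A rank-aware metric (e.g. recall at k for several values of k) would be a natural extension when the ground-truth sheet orders markers by importance.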
To save time during active development.
Name, Paper, Paper URL, Paper year, Website, Dataset, Metric, Value, Ranksum rank, Runtime.
| Method | kNN AUC | MSE | Paper |
| --- | --- | --- | --- |
| Procrustes | 0.00298303 | 0.833141 | Procrustes Methods in the Statistical Analysis of Shape |
| Cheat | 0.0587961 | 0 | Cheat |
| MNN | 0.062768 | 1.05657 | Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors |
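The proposed "Ranksum rank" column could be derived by ranking methods on each metric and summing the ranks. A sketch using the values above, under the assumption (not stated in the issue) that higher kNN AUC is better and lower MSE is better:

```python
# Sketch of a "Ranksum rank": rank methods per metric, then sum the ranks.
# Assumption: higher kNN AUC is better, lower MSE is better.
methods = {
    "Procrustes": {"knn_auc": 0.00298303, "mse": 0.833141},
    "Cheat": {"knn_auc": 0.0587961, "mse": 0.0},
    "MNN": {"knn_auc": 0.062768, "mse": 1.05657},
}

def ranks(scores, higher_is_better):
    """Map each method name to its 1-based rank on one metric."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {name: i + 1 for i, name in enumerate(ordered)}

auc_rank = ranks({m: v["knn_auc"] for m, v in methods.items()}, True)
mse_rank = ranks({m: v["mse"] for m, v in methods.items()}, False)
ranksum = {m: auc_rank[m] + mse_rank[m] for m in methods}
print(ranksum)  # {'Procrustes': 5, 'Cheat': 3, 'MNN': 4}
```

As expected, the "Cheat" baseline gets the best (lowest) rank sum, which is a useful sanity check on the ranking logic.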
Create a place to compute all metrics on an arbitrary output to facilitate development
Move stuff to optional deps
Describe the problem concisely.
Many biological problems are first studied in nonhuman animals, and cell types are often carefully characterized in model organisms such as mouse. We would like to take the cell type labels from a mouse dataset and project them onto another species, such as human.
Propose datasets
We have datasets for both of these, subset down to 1:1 orthologous genes across the two species, with a unified "narrow group" of cell types across the species.
Propose methods
Existing batch correction or dataset alignment methods could be applied here.
Propose metrics
Metrics measuring how accurately the projected labels match the ground truth, e.g. per-cell accuracy or macro-averaged F1.
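Per-cell accuracy is the simplest such metric; a macro-averaged F1 (e.g. `sklearn.metrics.f1_score`) would additionally weight rare cell types fairly. A dependency-free sketch, where `label_accuracy` is a hypothetical helper:

```python
# Sketch of the simplest label-projection metric: per-cell accuracy.

def label_accuracy(y_true, y_pred):
    """Fraction of cells whose projected label matches the ground truth."""
    assert len(y_true) == len(y_pred)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = ["B cell", "T cell", "B cell", "NK cell"]
y_pred = ["B cell", "T cell", "T cell", "NK cell"]
print(label_accuracy(y_true, y_pred))  # 0.75
```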
Failing on Travis with: `urllib.error.HTTPError: HTTP Error 404: Not Found`
Describe the problem concisely.
Detecting differentially expressed genes or other features in augmented data to verify correct type I error control. The task can be modeled after [this paper](http://www.nature.com/doifinder/10.1038/nmeth.4612).
Propose datasets
Datasets used in the study are made available as the "conquer" repository.
Propose methods
edgeR, MAST, diffxpy, limma, etc.
Propose metrics
Metrics from the paper: Type I error control, enrichment of labels on false-positive genes, etc.
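Type I error control can be checked on null data, where no gene is truly differentially expressed: the fraction of genes called significant at level alpha should be close to alpha. A self-contained sketch using a crude z-test (purely illustrative, not a real differential-expression method):

```python
# Sketch: on null data (both groups drawn from the same distribution),
# a well-calibrated test should reject ~alpha of the time.
import random

def type_one_error_rate(n_genes=2000, n_cells=50, seed=0):
    rng = random.Random(seed)
    n_reject = 0
    for _ in range(n_genes):
        a = [rng.gauss(0, 1) for _ in range(n_cells)]
        b = [rng.gauss(0, 1) for _ in range(n_cells)]
        # crude two-sample z-test on means, known unit variance
        mean_diff = sum(a) / n_cells - sum(b) / n_cells
        se = (2 / n_cells) ** 0.5
        if abs(mean_diff / se) > 1.96:  # two-sided alpha = 0.05
            n_reject += 1
    return n_reject / n_genes

rate = type_one_error_rate()
# rate should be close to 0.05 for a well-calibrated test
```

A real benchmark would substitute the candidate method (edgeR, MAST, ...) for the z-test and use resampled real data rather than Gaussian noise, as in the referenced paper.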
What is the method?
MNN correct via fastMNN
Where is the code located?
https://rdrr.io/github/LTLA/batchelor/man/fastMNN.html
Which task(s) could it be used for?
Already implemented in multimodal data integration
I think this might need to be defined for each package, e.g.:

```python
def _magic_impute(adata):
    import magic

    __version__ = magic.__version__
```
I added a new dataset but it didn't appear on the website. https://github.com/singlecellopenproblems/SingleCellOpenProblems/blob/master/website/data/results/multimodal_data_integration/citeseq_cmbc.json
Hey everyone,
During the call today I had a few thoughts and concerns about the current infrastructure (back-end). Disclaimer: I'm not 100% familiar with the current back-end, so please point out if I'm making any incorrect assumptions about how things currently work. I'm mostly relying on my limited experience adding a new task and on brief descriptions of the infrastructure from Scott.
I really like the "snakemake approach" where new "builds" are only triggered when one of the inputs changes. Given the broad scope of this project (many tasks, many datasets, many methods), it seems clear that a new "build" should only be triggered when a new dataset or method is added, and only for those specific datasets/methods (no need to re-run all the other methods on all the other datasets). Currently I think everything gets run with each new commit, and I don't think that's a sustainable solution.
In thinking about this snakemake approach some more, I think it's interesting to think about what should happen when a new metric gets added to a particular task. Naively, this would require all methods to run on all datasets again, which seems like a huge effort. The only way to avoid this would be to store the result of each method on each dataset. This way, any new metric could be run simply by comparing the stored method result to the "ground truth". This suggests that metric calculation should be decoupled from actually running the methods. Currently, method results are not stored anywhere.
The previous points 1) and 2) suggest to me that we should store both the docker images (one per method and version thereof) and the input data (one file/directory per dataset) as files on a server (probably a cloud bucket). This would make it clear when we have to run a method: namely, when we encounter a previously unseen combination of docker image and input data. This is very different from the current setup, where both methods and datasets are fundamentally represented as code, not as files, which in my mind makes it very difficult to implement logic for what has already been run and what still needs to be run.
In summary, I really think the infrastructure needs to be modified to allow more modularity in terms of when to run which method on which dataset and for decoupling metrics from everything else.
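The caching logic sketched in these points could look roughly like this. All names are hypothetical and this is not the current infrastructure, just a sketch of the "run only previously unseen combinations, store outputs for metric reuse" idea:

```python
# Sketch: run a method only when its (method image, dataset) combination has
# not been seen; keep the raw output so new metrics can reuse it later.
import hashlib

def cache_key(method_image: str, dataset_path: str) -> str:
    """Stable key identifying one (method version, dataset) combination."""
    return hashlib.sha256(f"{method_image}::{dataset_path}".encode()).hexdigest()

def run_if_needed(method_image, dataset_path, cache, run_method):
    key = cache_key(method_image, dataset_path)
    if key not in cache:
        # Previously unseen combination: actually run the method and store
        # its output (in production this would be a cloud bucket, not a dict).
        cache[key] = run_method(method_image, dataset_path)
    return cache[key]

cache = {}
out1 = run_if_needed("magic:1.0", "pancreas.h5ad", cache,
                     lambda m, d: f"result({m}, {d})")
out2 = run_if_needed("magic:1.0", "pancreas.h5ad", cache,
                     lambda m, d: "SHOULD NOT RUN")
# out2 == out1: the second call hit the cache instead of re-running the method
```

With stored outputs, adding a new metric only requires re-reading cached results, never re-running methods, which is exactly the decoupling argued for above.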
What is the dataset?
Describe it briefly and include a citation.
Where is the data located?
Include a link to the publicly available data.
Which tasks could it be used for?
If the dataset is to be used for a task that is not yet included in the code base, use the "Propose a new task" issue template instead.
What is the method?
MARS is a meta-learning approach for identifying and annotating cell types. It overcomes the heterogeneity of cell types by transferring latent cell representations across multiple datasets, using deep learning to learn a cell embedding function together with a set of landmarks in the embedding space.
Where is the code located?
https://github.com/snap-stanford/mars
Which task(s) could it be used for?
Label projection
Describe the problem concisely.
Data integration task following the recent [benchmarking paper](https://www.biorxiv.org/content/10.1101/2020.05.22.111161v2).
Propose datasets
All datasets used in the paper can be used. Quite a few are already on Figshare ([here](https://figshare.com/articles/dataset/Benchmarking_atlas-level_data_integration_in_single-cell_genomics_-integration_task_datasets_Immune_and_pancreas/12420968)).
Propose methods
Classical methods are already out there: MNN, Seurat v3, scVI, BBKNN, CONOS, etc.
Propose metrics
14 metrics are in the paper already, these would be a good starting point.
Describe the problem concisely.
Comparison of trajectory inference methods as performed by the Saeys group in [this paper](http://www.nature.com/articles/s41587-019-0071-9).
Propose datasets
Several simulated and real datasets are available in the dynverse package.
Propose methods
72 methods are named in the paper: PAGA, Slingshot, Monocle v3, DPT, etc.
Propose metrics
Dynverse includes a comparison of TI results after mapping the results to a standardized framework for topology comparison.
```
/home/travis/virtualenv/python3.6.7/lib/python3.6/site-packages/sklearn/linear_model/_logistic.py:764: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
```
Suggested fix: use `sklearn.preprocessing.StandardScaler().fit(X_train)` to scale the features first.
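A sketch of that fix, scaling inside a Pipeline so the scaler is fit on training data only. The data here is synthetic and only illustrates the pattern:

```python
# Scale features before logistic regression, as the warning above suggests.
# Using a Pipeline ensures the scaler is fit on training data only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5)) * 1000  # badly scaled features
y_train = (X_train[:, 0] > 0).astype(int)   # label depends on feature 0

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(clf.score(X_train, y_train))
```

At prediction time the same pipeline applies the stored scaling automatically, so train and test data are transformed consistently.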
I feel as though the ground-truth labels should not be visible to everyone doing the task. Instead they should be set to `np.nan` wherever `adata.obs['is_train'] == False`. That way we can be sure that nobody trains their model on more than the training data (especially important for hyperparameter optimization, which we can't control otherwise). That would, however, require the metrics to load the ground-truth labels from somewhere else. Can we hide objects in GitHub?
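The proposed masking might look like this. Column names follow the convention in the issue; this is a sketch, not existing code:

```python
# Sketch: hide ground-truth labels for held-out cells by setting them to NaN,
# so methods can only train where is_train is True.
import numpy as np
import pandas as pd

obs = pd.DataFrame({
    "labels": ["B cell", "T cell", "NK cell", "B cell"],
    "is_train": [True, True, False, False],
})

# Keep labels where is_train is True, replace the rest with NaN
obs["labels_masked"] = obs["labels"].where(obs["is_train"], np.nan)
print(obs["labels_masked"].tolist())  # ['B cell', 'T cell', nan, nan]
```

The unmasked `labels` column would then live only on the evaluation side, where the metrics are computed.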
Describe the problem concisely.
Include references to papers where the task is attempted.
The reference benchmarks over 50 methods for calculating target genes from chromatin accessibility, evaluating them by correlation with gene expression (see figure 2b). The software is released in a GitHub repo; two of the typical models are shown in figure 2c and the manual gene-score section.
Propose datasets
Propose methods
Propose metrics
We discussed today that it may be useful to have a mapping between metrics and datasets for a task so that you can run a subset of metrics on a particular dataset. E.g., a denoising dataset may have a ground truth assigned, so that you can build a simple metric to test if the data corresponds to the ground truth. This may however not be the case for other datasets.
The alternative would be to use a new task for these datasets; that might make the task names a bit less clear, though.
Options:
- `adata.X` in `openproblems/data`, keeping raw counts in `adata.layers["counts"]`
- `openproblems/utils.py`
- https://travis-ci.com/github/singlecellopenproblems/SingleCellOpenProblems/jobs/376995264#L910
```
File "evaluate.py", line 34, in evaluate_task
  result.append(evaluate_dataset(task, dataset))
File "evaluate.py", line 23, in evaluate_dataset
  r = evaluate_method(task, adata.copy(), method)
File "/home/travis/virtualenv/python3.6.7/lib/python3.6/site-packages/anndata/_core/anndata.py", line 1449, in copy
  obsm=self.obsm.copy(),
File "/home/travis/virtualenv/python3.6.7/lib/python3.6/site-packages/anndata/_core/aligned_mapping.py", line 87, in copy
  d[k] = v.copy()
MemoryError: Unable to allocate 2.45 GiB for an array with shape (26022, 25313) and data type float32
```
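One possible mitigation, assuming downstream methods accept sparse input (which must be checked per method): keep count matrices in `scipy.sparse` format, since scRNA-seq matrices are mostly zeros, so copies stay small. A sketch:

```python
# Sketch: a mostly-zero matrix stored sparsely takes far less memory,
# so copies like adata.copy() no longer allocate gigabytes.
import numpy as np
from scipy import sparse

dense = np.zeros((1000, 500), dtype=np.float32)
dense[::50, ::25] = 5.0  # a few nonzero "counts"

X = sparse.csr_matrix(dense)
dense_bytes = dense.nbytes
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(dense_bytes, sparse_bytes)  # the sparse copy is far smaller
```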
In the label projection task, all the methods have the same version number.
WARNING: Testing via this command is deprecated and will be removed in a future version. Users looking for a generic test entry point independent of test runner are encouraged to use tox.
Describe the problem concisely.
Preserving measures of stress / manifold topology in low-dimensional embeddings of single-cell data.
Propose datasets
Include links to at least one publicly available dataset that could be used.
Propose methods
UMAP, PHATE, PCA
Propose metrics
https://github.com/scottgigante/DEMaP
https://github.com/scikit-learn/scikit-learn/blob/0fb307bf39bbdacd6ed713c00724f8f871d60370/sklearn/manifold/_mds.py#L20
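A simplified DEMaP-style metric is the rank correlation between pairwise distances before and after embedding. DEMaP itself compares geodesic distances on simulated data; the sketch below uses plain Euclidean distances and a hypothetical `distance_correlation` helper:

```python
# Sketch: Spearman correlation of pairwise distances in the original space
# vs. the embedding. Higher means better-preserved geometry.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def distance_correlation(X_high, X_low):
    """Spearman correlation of condensed pairwise-distance vectors."""
    return spearmanr(pdist(X_high), pdist(X_low))[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
good = X[:, :10]                 # embedding keeping half the coordinates
bad = rng.normal(size=(50, 2))   # unrelated random embedding

print(distance_correlation(X, good), distance_correlation(X, bad))
```

The embedding that retains real structure scores much higher than the random one, which is the property a DR metric should reward.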
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
```python
import openproblems

adata = openproblems.tasks.label_projection.datasets.pancreas_batch()
type(adata.X)
# <class 'numpy.ndarray'>
adata.X.sum(axis=0)
# 0.0
```
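A dataset sanity check in the test suite could catch regressions like this all-zero matrix. A sketch; the `dataset_is_sane` helper is hypothetical:

```python
# Sketch: reject matrices that are empty, all-zero, or contain NaNs,
# which would have caught the all-zero adata.X bug above.
import numpy as np

def dataset_is_sane(X) -> bool:
    """Basic sanity checks for a loaded expression matrix."""
    X = np.asarray(X)
    return X.size > 0 and np.abs(X).sum() > 0 and not np.isnan(X).any()

assert dataset_is_sane(np.array([[0.0, 1.0], [2.0, 0.0]]))
assert not dataset_is_sane(np.zeros((3, 3)))  # the reported bug
```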
Website
Tasks
Longer term
What is the dataset?
Protein and RNA measured jointly (CITE-seq): https://www.nature.com/articles/nmeth.4380
Where is the data located?
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE100866
Which tasks could it be used for?
Multimodal data integration