
openproblems-bio / openproblems


Formalizing and benchmarking open problems in single-cell genomics

License: MIT License

Python 41.22% Shell 0.13% Jupyter Notebook 52.23% Dockerfile 0.66% R 1.48% TeX 4.28%

openproblems's People

Contributors

atong01, danielstrobl, dbdimitrov, dburkhardt, dependabot[bot], giovp, github-actions[bot], lazappi, luckymd, m0hammadl, michalk8, mvinyard, mxposed, olgabot, qinqian, rcannood, scottgigante, scottgigante-immunai, vitkl, wes-lewis


openproblems's Issues

Add data denoising task

Describe the problem concisely.
Include references to papers where the task is attempted.

Propose datasets
Include links to at least one publicly available dataset that could be used.

Propose methods
Include links to codebases of at least two methods that perform the task.

Propose metrics
Describe at least one metric by which to measure success on the task. It must be applicable to the proposed datasets.

[Task Proposal] - Marker genes identification

Describe the problem concisely.
Cell types have historically been characterized by morphology and functional assays, but one of the main advantages of single-cell transcriptomics is its ability to define cell identities and their underlying regulatory networks. We have developed scRFE to produce interpretable gene lists for all cell types in a given dataset, which can be used to design experiments or to further aid biological data mining, but it would be great to benchmark scRFE and other methods in the community.

Propose datasets
So far we have used Tabula Muris Senis.

Propose methods
NEED TO DO

Propose metrics
Compare with ground truth marker genes (https://docs.google.com/spreadsheets/d/1SsKS2vMvZqLdJ8hOv_dBeXo3k4Gcwcy-oPBhFTCo0tY/edit?usp=sharing)
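
A minimal sketch of such a comparison, measuring the overlap between a predicted marker list and the curated ground truth for one cell type (the gene lists below are hypothetical placeholders):

# Overlap between predicted and ground-truth markers for a single cell type.
# Both sets are hypothetical placeholders for real method output and the curated sheet.
predicted_markers = {"Cd19", "Ms4a1", "Cd79a", "Cd79b"}
ground_truth_markers = {"Cd19", "Ms4a1", "Cd79a", "Cr2"}

jaccard = len(predicted_markers & ground_truth_markers) / len(predicted_markers | ground_truth_markers)
recall = len(predicted_markers & ground_truth_markers) / len(ground_truth_markers)
print(f"Jaccard: {jaccard:.2f}, recall of known markers: {recall:.2f}")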

Cross-species label projection

Describe the problem concisely.

Many biological problems are first studied in nonhuman animals, and those cell types are often carefully characterized in, e.g., mouse. We would like to take the cell type labels from a mouse dataset and project them onto another species, such as human.

Propose datasets

We have datasets for both species, subset down to 1:1 orthologous genes, with a unified "narrow group" of cell types across the species.

Propose methods

Existing batch correction or dataset alignment methods can work here

Propose metrics

Metrics for how correctly the labels are predicted (a minimal sketch follows the list):

  • F1 score
  • Adjusted Rand Index
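
Both metrics are available in scikit-learn; a minimal sketch, assuming the true and predicted cell type labels are given as per-cell arrays (the toy labels below are placeholders):

from sklearn.metrics import adjusted_rand_score, f1_score

# Placeholder per-cell labels; in practice these would come from adata.obs.
y_true = ["B cell", "T cell", "T cell", "NK cell"]
y_pred = ["B cell", "T cell", "NK cell", "NK cell"]

f1 = f1_score(y_true, y_pred, average="macro")  # per-class F1, averaged across classes
ari = adjusted_rand_score(y_true, y_pred)       # agreement up to label permutation
print(f"F1: {f1:.2f}, ARI: {ari:.2f}")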

Differential expression testing

Describe the problem concisely.
Detecting differentially expressed genes or other features in augmented data while maintaining correct type I error control. The task can be modeled after [this paper](http://www.nature.com/doifinder/10.1038/nmeth.4612).

Propose datasets
The datasets used in the study are made available in the "conquer" repository.

Propose methods
edgeR, MAST, diffxpy, limma, etc.

Propose metrics
Metrics from the paper: Type I error control, enrichment of labels on false-positive genes, etc.
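
A minimal sketch of the type I error metric, assuming a method has returned one p-value per gene on a null comparison (two groups drawn from the same condition), so every rejection is a false positive (the p-values below are simulated placeholders):

import numpy as np

# Placeholder for p-values returned by a DE method on a mock/null comparison.
pvals = np.random.uniform(size=5000)

alpha = 0.05
type_i_error = np.mean(pvals < alpha)  # fraction of null genes called significant
print(f"Observed type I error at alpha={alpha}: {type_i_error:.3f} (target <= {alpha})")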

Add issue templates

  • propose a new task
  • propose a new dataset
  • propose a new metric
  • propose adding an existing dataset to another task

Infrastructure (back-end) thoughts and concerns

Hey everyone,

During the call today I had a few thoughts and concerns about the current infrastructure (back-end). Disclaimer: I'm not 100% familiar with the current back-end, so please point out if I'm making any incorrect assumptions about how things currently work. I'm mostly relying on my limited experience of adding a new task and on brief descriptions of the infrastructure from Scott.

  1. I really like the "snakemake approach", where new "builds" are only triggered when one of the inputs changes. Given the broad scope of this project (many tasks, many datasets, many methods), it seems clear that a new "build" should only be triggered when a new dataset or a new method gets added, and only for those specific datasets/methods (no need to re-run all the other methods on all the other datasets). Currently I think everything gets run with each new commit, and I don't think that's a sustainable solution.

  2. In thinking about this snakemake approach some more, I think it's interesting to think about what should happen when a new metric gets added to a particular task. Naively, this would require all methods to run on all datasets again, which seems like a huge effort. The only way to avoid this would be to store the result of each method on each dataset. This way, any new metric could be run simply by comparing the stored method result to the "ground truth". This suggests that metric calculation should be decoupled from actually running the methods. Currently, method results are not stored anywhere.

  3. The previous points 1) and 2) suggest to me that we should store both the docker images (one image for each method and each version thereof) and the input data (one file/directory per dataset) as files on a server (probably a cloud bucket). This would make it clear when we have to run a method, namely when we encounter a previously unseen combination of docker image and input data. This is also very different from how things are set up currently, where both methods and datasets are fundamentally represented as code, not as files, which in my mind makes it very difficult to implement logic for what has already been run and what still needs to be run.

In summary, I really think the infrastructure needs to be modified to allow more modularity in terms of when to run which method on which dataset and for decoupling metrics from everything else.
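
A minimal sketch of the caching logic from point 3, assuming each method is identified by a Docker image digest and each dataset by a file on disk (the file names and helper functions below are hypothetical):

import hashlib
import json
from pathlib import Path

CACHE_FILE = Path("run_cache.json")  # hypothetical record of completed runs

def run_key(image_digest: str, dataset_path: Path) -> str:
    """Uniquely identify one (method image, dataset file) combination."""
    h = hashlib.sha256()
    h.update(image_digest.encode())
    h.update(dataset_path.read_bytes())
    return h.hexdigest()

def needs_run(image_digest: str, dataset_path: Path) -> bool:
    """Return True only for combinations that have not been run before."""
    done = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    return run_key(image_digest, dataset_path) not in done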

1.3 M cell mouse brain dataset

What is the dataset?
Describe it briefly and include a citation.

Where is the data located?
Include a link to the publicly available data.

Which tasks could it be used for?
If the dataset is to be used for a task that is not yet included in the code base, use the "propose a new task" issue template instead.

Add MARS to label projection

What is the method?
MARS is a meta-learning approach for identifying and annotating cell types. MARS overcomes the heterogeneity of cell types by transferring latent cell representations across multiple datasets. MARS uses deep learning to learn a cell embedding function as well as a set of landmarks in the cell embedding space.

Where is the code located?
https://github.com/snap-stanford/mars

Which task(s) could it be used for?
Label projection

Data integration / batch removal task

Describe the problem concisely.
Data integration task after the recent [benchmarking paper](https://www.biorxiv.org/content/10.1101/2020.05.22.111161v2).

Propose datasets
All datasets used in the paper can be used. Quite a few are already on Figshare ([here](https://figshare.com/articles/dataset/Benchmarking_atlas-level_data_integration_in_single-cell_genomics_-integration_task_datasets_Immune_and_pancreas/12420968)).

Propose methods
Classical methods are already out there: MNN, Seurat v3, scVI, BBKNN, CONOS, etc.

Propose metrics
14 metrics are in the paper already, these would be a good starting point.
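
One of the metric families in that paper is the average silhouette width (ASW); a minimal sketch of a batch ASW, assuming an AnnData object adata with the integrated embedding in adata.obsm["X_emb"] and batch labels in adata.obs["batch"] (both key names are assumptions):

from sklearn.metrics import silhouette_score

# Silhouette of the batch labels in the integrated embedding;
# values near zero indicate well-mixed batches.
batch_asw = silhouette_score(adata.obsm["X_emb"], adata.obs["batch"])
print(f"Batch ASW: {batch_asw:.3f}")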

Trajectory Inference comparison task

Describe the problem concisely.
Comparison of trajectory inference methods as performed by the Saeys group in [this paper](http://www.nature.com/articles/s41587-019-0071-9).

Propose datasets
Several simulated and real datasets are available in the dynverse package.

Propose methods
72 methods are named in the paper: PAGA, Slingshot, Monocle v3, DPT, etc.

Propose metrics
Dynverse includes a comparison of TI results after mapping the results to a standardized framework for topology comparison.

Scale data for logistic regression

/home/travis/virtualenv/python3.6.7/lib/python3.6/site-packages/sklearn/linear_model/_logistic.py:764: ConvergenceWarning: lbfgs failed to converge (status=1):

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:

    https://scikit-learn.org/stable/modules/preprocessing.html

Please also refer to the documentation for alternative solver options:

    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Suggested fix: use sklearn.preprocessing.StandardScaler(), fit on X_train only, to scale the data before fitting the logistic regression.
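
A minimal sketch of that fix using a scikit-learn pipeline, so the scaler is fit on the training data only (X_train, y_train and X_test are placeholders for the task's feature matrix and labels):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The scaler is fit on X_train inside the pipeline and re-applied at predict time.
# with_mean=False avoids densifying sparse inputs; drop it for dense arrays.
clf = make_pipeline(StandardScaler(with_mean=False), LogisticRegression())
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)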

label_propagation: hide ground truth labels

I feel as though the ground-truth labels should not just be visible to everyone doing the task. Instead, they should be set to np.nan for cells where adata.obs['is_train'] == False. That way we can be sure that nobody trains their model on more than the training data (particularly important for hyperparameter optimization, which we can't control otherwise). That would, however, require the metrics to load the ground-truth labels from somewhere else.
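
A minimal sketch of that masking, assuming an AnnData object adata whose ground-truth labels live in an obs column (here hypothetically called "celltype"):

import numpy as np

masked = adata.copy()
test_cells = ~masked.obs["is_train"].astype(bool)
# Hide the ground truth for test cells; "celltype" is a hypothetical column name.
masked.obs.loc[test_cells, "celltype"] = np.nan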

Can we hide objects in GitHub?

Predicting gene expression from chromatin accessibility

Describe the problem concisely.
Include references to papers where the task is attempted.

The reference includes over 50 methods for calculating target genes from chromatin accessibility and benchmarks them by correlating predictions with gene expression, specifically in figure 2b. The authors released the software in the GitHub repo; two of the typical models are shown in figure 2c and in the manual gene score section.

Propose datasets

Propose methods

  • 52 known models measured by peak or read distance to TSS with different constraints
  • Linear or tree-based regressor with K nearest neighbor cells

Propose metrics

  • Given the ATAC-seq profile for a cell, predict the expression for that cell (MSE over all cells; sketched below).
  • Given the ATAC-seq profile for a cell type, predict the expression for each cell type (average Spearman correlation per cluster); this is similar to the concept of a multi-modal trajectory (see fig. 5a in the SHARE-seq paper).
  • For routine single-cell ATAC-seq and RNA-seq: given the ATAC-seq profiles of a cell's K nearest neighbors, predict the average expression of those K nearest cells.
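
A minimal sketch of the first two metrics, using simulated placeholder matrices in place of real predicted and measured expression:

import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import mean_squared_error

# Placeholder inputs: true and predicted expression (cells x genes) plus cluster labels.
rng = np.random.default_rng(0)
expr_true = rng.poisson(2.0, size=(100, 50)).astype(float)
expr_pred = expr_true + rng.normal(0, 0.5, size=expr_true.shape)
clusters = rng.integers(0, 4, size=100)

# Metric 1: MSE over all cells.
mse = mean_squared_error(expr_true, expr_pred)

# Metric 2: Spearman correlation of cluster-averaged profiles, averaged over clusters.
rhos = []
for c in np.unique(clusters):
    mean_true = expr_true[clusters == c].mean(axis=0)
    mean_pred = expr_pred[clusters == c].mean(axis=0)
    rhos.append(spearmanr(mean_true, mean_pred).correlation)
print(f"MSE: {mse:.3f}, mean per-cluster Spearman: {np.mean(rhos):.3f}")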

Metric - Dataset mapping

We discussed today that it may be useful to have a mapping between metrics and datasets for a task so that you can run a subset of metrics on a particular dataset. E.g., a denoising dataset may have a ground truth assigned, so that you can build a simple metric to test if the data corresponds to the ground truth. This may however not be the case for other datasets.

The alternative would be to use a new task for these datasets, but that might make the task names a bit less clear.

When should we normalize the data?

Options:

  1. Download raw counts, preprocess adata.X in openproblems/data, keep raw counts in adata.layers["counts"].
  2. Download raw counts, pass raw counts to methods. Provide normalization recipes in openproblems/utils.py.
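
A minimal sketch of option 1, assuming an AnnData object adata that still holds raw counts in adata.X and using scanpy's standard normalization:

import scanpy as sc

# Keep the raw counts, then normalize the working matrix in place.
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)  # library-size normalization
sc.pp.log1p(adata)                            # log(1 + x) transform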

Zebrafish data fails to evaluate due to dense data in `obsm`

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the master branch of openproblems.

https://travis-ci.com/github/singlecellopenproblems/SingleCellOpenProblems/jobs/376995264#L910

File "evaluate.py", line 34, in evaluate_task
    result.append(evaluate_dataset(task, dataset))
  File "evaluate.py", line 23, in evaluate_dataset
    r = evaluate_method(task, adata.copy(), method)
  File "/home/travis/virtualenv/python3.6.7/lib/python3.6/site-packages/anndata/_core/anndata.py", line 1449, in copy
    obsm=self.obsm.copy(),
  File "/home/travis/virtualenv/python3.6.7/lib/python3.6/site-packages/anndata/_core/aligned_mapping.py", line 87, in copy
    d[k] = v.copy()

MemoryError: Unable to allocate 2.45 GiB for an array with shape (26022, 25313) and data type float32
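
One possible mitigation, sketched under the assumption that the offending obsm entries are large dense numpy arrays with mostly zero entries, would be to store them as scipy sparse matrices before the data reaches adata.copy():

import numpy as np
from scipy import sparse

# Convert large dense arrays in .obsm to CSR so copying does not allocate
# multi-GiB dense buffers; the size threshold is an arbitrary placeholder.
for key, value in adata.obsm.items():
    if isinstance(value, np.ndarray) and value.size > 1e7:
        adata.obsm[key] = sparse.csr_matrix(value)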

Use tox for testing

WARNING: Testing via this command is deprecated and will be removed in a future version. Users looking for a generic test entry point independent of test runner are encouraged to use tox.

Pancreas data is dense and contains empty genes

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the master branch of openproblems.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Minimal code sample (that we can copy&paste without having any data)

import openproblems
adata = openproblems.tasks.label_projection.datasets.pancreas_batch()
type(adata.X)
# <class 'numpy.ndarray'>  (dense instead of sparse)
adata.X.sum(axis=0).min()
# 0.0  (at least one gene has zero total counts)
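
A minimal sketch of one way to address both problems (whether this belongs in the loader or in a shared preprocessing step is an open design choice; scanpy and scipy are assumed to be available):

import scanpy as sc
from scipy import sparse

# Drop genes with zero total counts and store the expression matrix sparsely.
sc.pp.filter_genes(adata, min_counts=1)
adata.X = sparse.csr_matrix(adata.X)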
