broadinstitute / cell-health

Predicting Cell Health with Morphological Profiles

License: MIT License

R 0.44% Shell 0.07% Jupyter Notebook 48.69% HTML 50.44% Python 0.36% TeX 0.01%
carpenter-lab morphological-profiling crispr cell-painting cancer 2015-07-01-cell-health

cell-health's Introduction

Predicting cell health phenotypes using image-based morphology profiling

Gregory P. Way, Maria Kost-Alimova, Tsukasa Shibue, William F. Harrington, Stanley Gill, Federica Piccioni, Tim Becker, Hamdah Shafqat-Abbasi, William C. Hahn, Anne E. Carpenter, Francisca Vazquez, Shantanu Singh

2020

Summary

Cell health can be altered by genetic and chemical perturbations. An increased understanding of these perturbation mechanisms is directly relevant for drug discovery and personalized medicine. Here and in an accompanying paper, we present two novel cell imaging assays that together measure 70 different aspects of cell health, such as proliferation, apoptosis, and cell cycle stalling. However, these assays require expensive reagents and do not scale well. Therefore, we also developed a machine learning solution to predict cell health readouts directly from a separate assay, known as Cell Painting. In contrast to the Cell Health assays, Cell Painting is inexpensive, high-throughput, and unbiased (reagents are not targeted). We predict many cell health indicators with high performance, but other readouts could not be predicted. We validated our predictions by using orthogonal readouts and by applying the models to a large set of 1,500 drugs from the Drug Repurposing Hub. Cell health predictions for drugs can be browsed at https://broad.io/cell-health-app. We confirmed mitotic arrest, reactive oxygen species, and DNA damage in G1 cell cycle based phenotypes via PLK, proteasome, and aurora kinase/tubulin inhibition, respectively. In the future, we can use this approach to determine the cell health consequences of any perturbation in cells. We conducted this project using open science principles with open data and open source code.

The following repository stores a complete analysis pipeline using Cell Painting data to predict readouts from the Cell Health assays.

We first developed the customized microscopy assays we collectively call "Cell Health". The Cell Health assays comprise two reagent panels: "Cell cycle" and "viability". Together, these two panels use reagents that mark different cell health phenotypes.

| Assay/Dye | Phenotype | Panel |
| --- | --- | --- |
| Caspase 3/7 | Apoptosis | Viability |
| DRAQ7 | Cell death | Viability |
| CellROX | Reactive oxygen species | Viability |
| EdU | Cellular proliferation | Cell cycle |
| Hoechst | DNA content | Cell cycle |
| pH3 | Cell division | Cell cycle |
| gH2AX | DNA damage | Cell cycle |

We hypothesized that we can use unbiased and high dimensional Cell Painting profiles to predict cell health readouts.

Approach

This overview figure outlines the Cell Health assays, the Cell Painting assay, and our machine learning approach.

Data processing and modeling approach. (a) Example images and workflow from the Cell Health assays. We apply a series of manual gating strategies (see Methods) to isolate cell subpopulations and to generate cell health readouts for each perturbation. (top) In the “Cell Cycle” panel, in each nucleus we measure Hoechst, EdU, PH3, and gH2AX. (bottom) In the “Cell Viability” panel, we capture digital phase contrast images and measure Caspase 3/7, DRAQ7, and CellROX. (b) Example Cell Painting image across five channels, plus a merged representation across channels. The image is cropped from a larger image and shows ES2 cells. Below are the steps applied in an image-based profiling pipeline, after features have been extracted from each cell’s image. (c) Modeling approach where we fit 70 different regression models using CellProfiler features derived from Cell Painting images to predict Cell Health readouts.

Data

Access

All data are publicly available.

Cell Painting

| Data | Level | Location | Notes |
| --- | --- | --- | --- |
| Images | 1 | Image Data Resource (IDR) | Accession idr0080 |
| SQLite file (single cell profiles) | 2 | NIH Figshare https://doi.org/10.35092/yhjc.9995672 | 0.download-data/data (see 0.download-data/README.md) |
| Aggregated profiles with well information (metadata) | 3 | 1.generate-profiles/data/profiles | suffix: <PLATE>_augmented.csv.gz |
| Normalized aggregated profiles with metadata | 4a | 1.generate-profiles/data/profiles | suffix: <PLATE>_normalized.csv.gz |
| Normalized and feature selected aggregated profiles with metadata | 4b | 1.generate-profiles/data/profiles | suffix: <PLATE>_normalized_feature_select.csv.gz |
| Consensus profiles | 5 | 1.generate-profiles/data/consensus | Perturbation profiles created by summarizing replicates |

Cell Health

| Data | Level | Location | Notes |
| --- | --- | --- | --- |
| Cell health readouts | Raw | 1.generate-profiles/data/raw | Per cell health panel (cell cycle and viability) per cell line |
| Cell health readouts | Normalized | 1.generate-profiles/data/labels/normalized_cell_health_labels.tsv | |
| Cell health signatures | Consensus | 1.generate-profiles/data/consensus | |

Drug Repurposing Hub

We apply all of our trained cell health models to Cell Painting data from the Drug Repurposing Hub. These data are available at https://doi.org/10.5281/zenodo.3928744.

Summary

We collected Cell Painting measurements using CRISPR perturbations. The experiment targeted 59 genes, which included 119 unique guides (~2 per gene), across 3 cell lines. The cell lines included A549, ES2, and HCC44.

| Cell Line | Primary Site |
| --- | --- |
| A549 | Lung cancer |
| ES2 | Ovarian cancer |
| HCC44 | Lung cancer |

About 60% of all CRISPR guides were reproducible. This is consistent with previous genetic perturbations (Rohban et al. 2017). It is important to note that we are not actually interested in the CRISPR treatment specifically, but instead, just its corresponding readout in each cell health assay.

CRISPR Correlation

Median pairwise Pearson correlation of CRISPR guide replicate profiles (y axis) compared against median pairwise Pearson correlation of CRISPR guides targeting the same gene or construct (x axis). We removed biological replicates when calculating the same-gene correlations. The three different cell lines (A549, ES2, and HCC44) are shown in different colors and in different facets of the figure. We generated the profiles by median aggregating CellProfiler measurements for all single cells within each well of a Cell Painting experiment (see Methods for more processing details). The text labels represent the proportion of gene and guide profiles with “strong phenotypes”. In other words, these profiles had replicate correlations greater than 95% of non-replicate pairwise Pearson correlations in the particular cell line. The dotted red line represents this 95% cutoff in the null distribution, and the blue dotted line is y = x, which shows a strong consistency across CRISPR guide constructs.
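
The replicate audit described in this caption can be sketched in plain Python. This is a minimal illustration and not the cytominer_scripts implementation: the `profiles` layout, the helper names (`pearson`, `percentile`, `audit_replicates`), and the toy values are all assumptions made for the example.

```python
from itertools import combinations
from statistics import median


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def percentile(values, q):
    """Linear-interpolation percentile, with q in [0, 100]."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def audit_replicates(profiles, null_quantile=95):
    """profiles maps a guide name to its list of replicate profiles.

    Returns the median replicate correlation per guide, the null threshold,
    and the fraction of guides whose replicate correlation exceeds it.
    """
    replicate_median = {
        guide: median(pearson(a, b) for a, b in combinations(reps, 2))
        for guide, reps in profiles.items()
    }
    # Null distribution: correlations between profiles of different guides
    null = [
        pearson(a, b)
        for g1, g2 in combinations(profiles, 2)
        for a in profiles[g1]
        for b in profiles[g2]
    ]
    threshold = percentile(null, null_quantile)
    strong = sum(r > threshold for r in replicate_median.values()) / len(replicate_median)
    return replicate_median, threshold, strong


# Toy data: three hypothetical guides, two replicates each, four features
toy_profiles = {
    "guide_A": [[1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 2.9, 4.2]],
    "guide_B": [[4.0, 3.0, 2.0, 1.0], [3.9, 3.2, 2.1, 0.8]],
    "guide_C": [[1.0, -1.0, 1.0, -1.0], [0.9, -1.2, 1.1, -0.8]],
}
replicate_median, null_threshold, fraction_strong = audit_replicates(toy_profiles)
```

In the toy data every guide reproduces well, so `fraction_strong` is 1.0. In the real experiment the analogous fraction is what the text labels in the figure report.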

Pipeline

The full analysis pipeline consists of the following steps:

| Order | Module | Description |
| --- | --- | --- |
| 0.download-data | Download cell painting data | Retrieve single cell profiles archived on Figshare |
| 1.generate-profiles | Generate profiles | Generate and process cell painting and cell health assay readouts |
| 2.replicate-reproducibility | Determine replicate reproducibility | Determine the extent to which the CRISPR perturbations result in reproducible signatures |
| 3.train | Train machine learning models to predict cell health assays | Train and visualize regression models using cell painting data to predict cell health assay readouts |
| 4.apply | Apply the models | Apply the trained models to the Drug Repurposing Hub data to predict drug perturbation effect |
| 5.validate-repurposing | Validate the models | Use orthogonal readouts to validate the Drug Repurposing Hub predictions |
| 6.ml-robustness | Interrogate robustness of ML predictions | Assess sample size, feature groups, and cell line holdouts to probe ML robustness |

Each analysis module should be run in order. View each module for specific instructions on how to reproduce results.

analysis-pipeline.sh stores information on how to reproduce all analysis modules.

Computational Environment

We use conda as a package manager. To install conda see instructions.

We recommend installing conda by downloading and executing the .sh file and accepting defaults.

To create the computational environment, run the following:

# Make sure the repo is cloned
conda env create --force --file environment.yml
conda activate cell-health

Machine Learning Approach

We performed the following approach:

  1. Split data into 85% training and 15% test sets.
  2. Normalized data using the EMPTY controls in each plate (moderated z-score).
  3. Selected optimal hyperparameters using 5-fold cross-validation.
  4. Trained elastic net regression models to predict each of the 70 cell health assay readouts, independently.
  5. Trained using shuffled data as well.
  6. Reported performance on training and test sets.
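
The steps above can be sketched end to end with the standard library only. This is a hedged illustration under several assumptions, not the code in 3.train: the data are synthetic, the elastic net is a small coordinate descent written for this example, the `alphas` grid is arbitrary, the shuffled baseline permutes training labels (the repository may shuffle differently), and the control-based normalization step is skipped because the synthetic features are already standardized.

```python
import random


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty (the L1 part of the update)."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def fit_elastic_net(X, y, alpha, l1_ratio=0.5, n_iter=100):
    """Coordinate descent on 1/(2n)*||y - Xw - b||^2
    + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1 - l1_ratio)*||w||^2."""
    n, p = len(X), len(X[0])
    columns = [[row[j] for row in X] for j in range(p)]
    weights, intercept = [0.0] * p, sum(y) / n
    residual = [yi - intercept for yi in y]
    for _ in range(n_iter):
        for j, column in enumerate(columns):
            rho = sum(c * (r + c * weights[j]) for c, r in zip(column, residual)) / n
            scale = sum(c * c for c in column) / n + alpha * (1 - l1_ratio)
            new_weight = soft_threshold(rho, alpha * l1_ratio) / scale
            residual = [r - c * (new_weight - weights[j]) for c, r in zip(column, residual)]
            weights[j] = new_weight
        shift = sum(residual) / n  # unpenalized intercept update
        intercept += shift
        residual = [r - shift for r in residual]
    return weights, intercept


def predict(X, weights, intercept):
    return [intercept + sum(w * v for w, v in zip(weights, row)) for row in X]


def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def select_alpha(X, y, alphas, n_folds=5):
    """Return the alpha with the lowest total held-out squared error."""
    folds = [list(range(i, len(X), n_folds)) for i in range(n_folds)]
    best_alpha, best_error = None, float("inf")
    for alpha in alphas:
        error = 0.0
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(X)) if i not in held]
            w, b = fit_elastic_net([X[i] for i in train], [y[i] for i in train], alpha)
            preds = predict([X[i] for i in held_out], w, b)
            error += sum((y[i] - p) ** 2 for i, p in zip(held_out, preds))
        if error < best_error:
            best_alpha, best_error = alpha, error
    return best_alpha


rng = random.Random(0)
# Synthetic stand-in: 100 profiles x 5 features; only feature 0 carries signal
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(100)]
y = [2.0 * row[0] + rng.gauss(0, 0.1) for row in X]

# Step 1: random 85% / 15% split
indices = list(range(len(X)))
rng.shuffle(indices)
n_train = round(0.85 * len(indices))
train_idx, test_idx = indices[:n_train], indices[n_train:]
X_train, y_train = [X[i] for i in train_idx], [y[i] for i in train_idx]
X_test, y_test = [X[i] for i in test_idx], [y[i] for i in test_idx]

# Steps 3, 4, and 6: choose alpha by 5-fold CV, fit, score on the test set
alpha = select_alpha(X_train, y_train, alphas=[0.01, 0.1, 1.0])
weights, intercept = fit_elastic_net(X_train, y_train, alpha)
test_r2 = r_squared(y_test, predict(X_test, weights, intercept))

# Step 5: negative control, refit after shuffling the training labels
y_shuffled = y_train[:]
rng.shuffle(y_shuffled)
w_null, b_null = fit_elastic_net(X_train, y_shuffled, alpha)
shuffled_r2 = r_squared(y_test, predict(X_test, w_null, b_null))
```

In practice this would be repeated once per cell health readout, so that each of the 70 targets gets its own independently tuned model and its own shuffled baseline.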

Results

Regression Model Performance

Initial results indicate that many of the cell health phenotypes can be predicted with high performance using our approach. However, there are many cell line specific differences.

Test set model performance of predicting 70 cell health readouts with independent regression models. Performance for each phenotype is shown, sorted by decreasing R2 performance. The bars are colored based on the primary measurement metadata (see Supplementary Table S3), and they represent performance aggregated across the three cell lines. The points represent cell line specific performance. Points falling below -1 are truncated to -1 on the x axis. See 3.train for a full depiction.
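
For readers wondering how a test set R2 can fall below -1, here is a small worked example. The function names and the numbers are illustrative assumptions; only the -1 floor comes from the caption above.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination; negative when worse than predicting the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def truncate_for_plot(score, floor=-1.0):
    """Clip very poor scores so they stay visible on the plot axis."""
    return max(score, floor)


observed = [1.0, 2.0, 3.0, 4.0]
good = r_squared(observed, [1.1, 1.9, 3.2, 3.8])       # close predictions
mean_only = r_squared(observed, [2.5, 2.5, 2.5, 2.5])  # always predicts the mean
reversed_fit = r_squared(observed, [4.0, 3.0, 2.0, 1.0])  # anti-correlated predictions
plotted = truncate_for_plot(reversed_fit)
```

A model that always predicts the mean scores exactly 0, and a model that is systematically wrong can score arbitrarily far below that, which is why the figure clips points at -1.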

Model Interpretation

Because we used elastic net regression models, we can readily interpret the model coefficients. The input features were derived from CellProfiler and represent different measurements of cell morphology. Shown below is a summary of coefficients from all 70 cell health models. Many different categories of cell morphology features contribute to predicting various facets of cell health.

The importance of each class of Cell Painting features in predicting 70 cell health readouts. Each square represents the mean absolute value of model coefficients weighted by test set R2 across every model. The features are broken down by compartment (Cells, Cytoplasm, and Nuclei), channel (AGP, Nucleus, ER, Mito, Nucleolus/Cyto RNA), and feature group (AreaShape, Neighbors, Channel Colocalization, Texture, Radial Distribution, Intensity, and Granularity). For a complete description of all features, see the handbook: http://cellprofiler-manual.s3.amazonaws.com/CellProfiler-3.0.0/index.html Dark gray squares indicate “not applicable”, meaning either that there are no features in the class or the features did not survive an initial preprocessing step. Note that for improved visualization we multiplied the actual model coefficient value by 100.

Application to Drug Repurposing Hub

We applied the trained models to Cell Painting data from the Drug Repurposing Hub (Corsello et al. 2017). These data represent ~1,500 compound perturbations in ~6 dose points in A549 cells.

Collapsing the Drug Repurposing Hub Cell Painting data into UMAP coordinates, we observed many associated Cell Health predictions. For example, predicted G1 Cell Count and predicted ROS had clear gradients in UMAP space. However, the relationship is not exactly one-to-one. The proteasome inhibitors (MG-132 and Bortezomib) are known to induce ROS, while PLK inhibitors are known to induce cell death by blocking mitosis entry. A single PLK inhibitor (HMN-214) showed a strong dose relationship with predicted G1 count.

Validating Cell Health models applied to Cell Painting data from the Drug Repurposing Hub. (a) The results of the dose alignment between the PRISM assay and the Drug Repurposing Hub data. This view indicates that there was not a one-to-one matching between perturbation doses. (b) Comparing viability estimates from the PRISM assay to the predicted number of live cells in the Drug Repurposing Hub. The PRISM assay estimates viability by measuring barcoded A549 cells after an incubation period. (c) Drug Repurposing Hub profiles stratified by G1 cell count and ROS predictions. Bortezomib and MG-132 are proteasome inhibitors and are used as positive controls; DMSO is a negative control. We also highlight all PLK inhibitors in the dataset. (d) HMN-214 is an example of a PLK inhibitor that shows a strong dose response for G1 cell count predictions. (e) Tubulin and aurora kinase inhibitors are predicted to have a high number of gH2AX spots in G1 cells compared to other compounds and controls. (f) Barasertib (AZD1152) is an aurora kinase inhibitor that is predicted to have a strong dose response for the number of gH2AX spots in G1 cells.

Drug Repurposing Hub: Exploratory Tool

We applied all predictions and present them in an easy-to-navigate webapp at https://broad.io/cell-health-app

Citation

If you use data, results, or insights from this repository, please consider citing our publication:

Gregory P. Way, Maria Kost-Alimova, Tsukasa Shibue, William F. Harrington, Stanley Gill, Federica Piccioni, Tim Becker, Hamdah Shafqat-Abbasi, William C. Hahn, Anne E. Carpenter, Francisca Vazquez, and Shantanu Singh. Predicting cell health phenotypes using image-based morphology profiling. Molecular Biology of the Cell 2021 32:9, 995-1005. DOI: https://doi.org/10.1091/mbc.E20-12-0784.

If you use the Cell Painting images, please also consider citing our public data set on IDR:

Gregory P. Way, Maria Kost-Alimova, Tsukasa Shibue, William F. Harrington, Stanley Gill, Federica Piccioni, Tim Becker, Hamdah Shafqat-Abbasi, William C. Hahn, Anne E. Carpenter, Francisca Vazquez, and Shantanu Singh. Cell health phenotypes can be predicted from unbiased image-based morphology readouts. Image Data Resource 2020, screen 2701:accession 0080. https://doi.org/10.17867/10000153.

If you use the single cell profiles of CellProfiler features, please also consider citing our public dataset on Figshare:

Gregory P. Way, Maria Kost-Alimova, Tsukasa Shibue, William F. Harrington, Stanley Gill, Federica Piccioni, Tim Becker et al. (2019): Cell Health - Cell Painting Single Cell Profiles. The NIH Figshare Archive. Dataset. https://doi.org/10.35092/yhjc.9995672.v5

cell-health's People

Contributors

cells2numbers, gwaybio, shntnu

cell-health's Issues

Change classification binarization

Currently, we only calculate the median and separate high from low values to represent 1 vs. 0 for all cell health classification models. We should design a better way to binarize, perhaps using Otsu thresholding or something similar.
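
To make the comparison concrete, here is a minimal sketch of median binarization next to a one-dimensional Otsu threshold. The issue only names Otsu thresholding as one option, so the function names, the exhaustive search over cut points, and the toy readout values are assumptions for illustration.

```python
from statistics import median


def otsu_threshold(values):
    """Return the cut point that maximizes between-class variance."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    best_score, best_cut = -1.0, ordered[0]
    running = 0.0
    for i in range(1, n):
        running += ordered[i - 1]
        if ordered[i] == ordered[i - 1]:
            continue  # no valid cut between identical values
        w_low, w_high = i / n, (n - i) / n
        mean_low, mean_high = running / i, (total - running) / (n - i)
        score = w_low * w_high * (mean_low - mean_high) ** 2
        if score > best_score:
            best_score = score
            best_cut = (ordered[i - 1] + ordered[i]) / 2
    return best_cut


def binarize(values, threshold):
    """Label values above the threshold as 1 and the rest as 0."""
    return [1 if v > threshold else 0 for v in values]


# Toy readout: mostly low values plus two clear outliers
readout = [0.1, 0.2, 0.15, 0.3, 0.25, 0.2, 0.1, 2.5, 2.8]
median_labels = binarize(readout, median(readout))
otsu_cut = otsu_threshold(readout)
otsu_labels = binarize(readout, otsu_cut)
```

On this toy readout the median split labels four samples as high even though only two stand apart, while the Otsu cut isolates the two outliers. That is the behavior the issue is asking for when a readout is dominated by a few strong responders.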

More Morpheus Heatmaps

  • Cell Health Target Variables in each cell line independently
  • Cell Painting data across samples

Predicting Cell Health Dyes

In #67 I added visualizations that demonstrate model performance on cell health features when predicting certain dyes. I provide results and interpretations below.

Regression Performance - Training vs. Testing

performance_summary_rsquared_assay

Lots of cell health phenotypes captured with specific dyes are predicted consistently well. For example, DRAQ7-based phenotypes are predicted well, as are many gH2Ax-based phenotypes. pH3-based phenotypes are not predicted well. Also, as a negative control, the CRISPR efficiencies are not predicted well in training or in testing. Cell Painting will not capture CRISPR efficiency, and, therefore, should not be a predictable phenotype.

Comparing Regression to Classification Models

performance_summary_assay_classification_vs_regression

Interestingly, some phenotypes that cannot be predicted with regression have good performance with a classification model. This suggests that some phenotypes are more susceptible to noisy intermediate values, and that outliers are more informative of biology (see the example of vb_percent_all_late_apoptosis below).

The specific phenotypes are not visualized in this figure, but, in order of AUC (decreasing performance):

'vb_percent_all_late_apoptosis'
'cc_g2_ph3_pos_high_n_spots_h2ax_mean'
'vb_percent_all_apoptosis'
'cc_cc_high_n_spots_h2ax_mean'
'cc_g1_high_n_spots_h2ax_mean'
'vb_percent_all_early_apoptosis'
'cc_edu_pos_high_n_spots_h2ax_mean'
'cc_g2_ph3_neg_n_spots_mean'
'cc_mitosis_ph3_neg_n_spots_mean'

Classifying All Late Apoptosis (Example)

vb_percent_all_late_apoptosis_dist

vb_percent_all_late_apoptosis_performance

Summary

Certain dyes are predicted better than others, and some phenotypes can be predicted much better with classification models than with regression models.

Meeting 12/16/2019 - Notes

  • Genes have annotations to function relating to specific cell health phenotypes (#53 (comment))
  • There are RNAseq data for these perturbations too. Can we compare predictive performance? Maybe certain cell health models are better with either datatype? What about combined?
    • We may also already have L1000 data for these perturbations (#89)
  • ES2 Cell Line has superior performance in CRISPR experiments. It has high CAS9 expression and behaves nicely (#80 (comment))
  • There may be room for various improvements in model building
    • Adjust for single cell penetrance (only build profiles with cells with infections)
    • There may be a copy number effect here for CRISPR knockouts. Could adjust scores by some sort of copy number effect.
  • Perform leave-one-cell-line-out experiments - this will tell us how well models might generalize to new cell lines.
  • Perform CCA on X and Y matrices and look at loadings to determine clearly which cell painting features are aligned with cell health phenotypes
    • This is more informative than looking at specific feature names (#81)

Increase Viability Performance

The regression models do not seem to predict viability variables well. Explore why this is the case and see if we can engineer features that may improve performance.

Cell IDs

I need to add a cell_id column. Currently, I am relying on the index number and this is very easy to break. Needs fixing!

Select Interesting Compounds to Validate

We will use the Repurposing Hub dashboard to select specific compounds to validate in the lab.

We will select two kinds of compounds:

  1. Compounds that are predicted to have well characterized effects (Positive Controls)
  2. Compounds that are predicted to have unexpected effects that could help to define MOA

We will use this issue to document the molecular validation approach (which can then later be transferred to a methods section)

Where do the DMSO treated profiles land in Repurposing UMAP?

In #83, I add code that performs this update.

Summary

There appear to be substantial plate effects in the Repurposing Hub data, at least in UMAP space and using consensus DMSO data.

DMSO Figure

umap_repurposing_cell_painting_dose_consensus

There are at least three distinct groupings of DMSO controls.

DMSO by Well

umap_repurposing_cell_painting_dose_consensus_dmso

The clustering appears to be driven by plate layout to some extent.

Standard Deviation of Cell Health model scores for DMSO and Compounds

dmso_vs_compound_standard_deviation

Should we expect a certain amount of variation across cell health model output scores?
As expected, every cell health model has higher standard deviation in compounds compared to DMSO consensus (except two models that output the same score no matter what). Also, the amount of variance in the model outputs is directly associated with the performance. Note that only models with test set Rsquared > 0 were used in scaling. All models with test set Rsquared < 0 are shown with more transparency.

Next Steps

  • Visualized above are consensus profiles, where do DMSO's land in well-level profiles?
  • CMAP used two positive control compounds (below), where do they land in the UMAP space?
    • BRD-K50691590-001-02-2 (Bortezomib)
    • BRD-K60230970-001-10-0 (MG-132 {Proteasome Inhibitor})
  • Determine if we need to use whitening to adjust plate effects?
    • Alternatively, since we are ultimately interested in the outputs of the cell health models, should we allow for the profiles to vary like this?

cc @AnneCarpenter @shntnu @hkhawar

Reprocess image-based profiles with `robustize_mad`

Currently, I am using robustize to normalize profiles (see here).

However, we noticed that this function is slightly different than the cytominer robustize. See broadinstitute/lincs-cell-painting#3 (comment). Since then, I've implemented the cytominer robustize in cytomining/pycytominer#72.

We observed a slight downtick in replicate reproducibility in #112 comparing pycytominer to cytominer profiles. The cytominer profiles were originally calculated using cytominer_scripts/normalize.R which defaults to robustize in the archive.

I will reprocess all image-based profiles with the updated robustize_mad pycytominer profiles.
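
For context, a MAD-based robust normalization of a single feature can be sketched as below. This shows the general idea only: the 1.4826 consistency constant and the `epsilon` guard are assumptions, and the pycytominer and cytominer implementations may differ in these details.

```python
from statistics import median


def robustize_mad(values, epsilon=1e-18):
    """Center on the median and scale by the consistency-scaled MAD."""
    center = median(values)
    mad = median(abs(v - center) for v in values) * 1.4826
    # epsilon guards against division by zero for constant features
    return [(v - center) / (mad + epsilon) for v in values]


# One feature measured in five wells, with a single extreme value
feature = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled = robustize_mad(feature)
```

Because both the center and the scale are medians, the extreme value barely affects how the other wells are scaled, unlike a mean and standard deviation z-score.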

High Audit `null_threshold`?

As I am reproducing this analysis, I noticed particularly high null_threshold values across cell lines. Specifically:

| | plate_map_name | null_threshold | fraction_strong | Metadata_cell_line |
| --- | --- | --- | --- | --- |
| 0 | DEPENDENCIES1_HCC44 | 0.356182 | 0.491525 | HCC44 |
| 1 | DEPENDENCIES1_ES2 | 0.435187 | 0.506098 | ES2 |
| 2 | DEPENDENCIES1_A549 | 0.700114 | 0.341463 | A549 |

I have a couple of questions about how this null_threshold was generated.

First, I see that audit.R from cytominer_scripts was called four different times in analysis_log.sh. Is it safe to assume the later lines overwrite the previous calls? Or am I looking at this wrong?

Second (and if the later lines overwrite), could the reason the null_thresholds are so large b/c so many metadata variables are included in the -p (--group_by) flag of the audit.R call? (line 146 here). Would this then "lower" the amount of randomization and thus not adequately represent a null? Am I reading this function correctly?

Cell health cell count model is correlated to CellProfiler cc_cc_n_objects but does not use the feature in the model

I start by testing if viability readouts in the PRISM assay for A549 are similar to cell health model estimates. The DepMap data are accessed here.

The data ☝️ are available under CC BY 4.0 and are distributed (with processing code) in #104. I also make sure to attribute Corsello et al. 2019 in a README file included in #104 (in data/ folder)

In this module, I first process A549 viability data from the Cancer Dependency Map. Next, I merge the viability data with the cell health model scores applied to the same perturbations in A549.

Important Notes

  • Not all doses measured are consistent between the two datasets. I recoded doses to a range of 1-7 as we did previously, and I calculate the Spearman correlation between the two viability estimates.
  • In talking with the DepMap folks (Anup, Jacob, and Nick) at CLUE office hours (this was very useful btw) it seems there are some caveats to this data:
    • It is a PRISM screen, so it will not be as accurate as A549-specific CellTiter-Glo readouts
    • The CTG assay data exist somewhere (probably) but where it exists is not open source
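
A rough sketch of the dose recoding and the Spearman comparison is shown below. The recoding rule used here (rank each compound's distinct doses from lowest to highest) is an assumption, since the issue does not spell out the exact scheme, and all dose and viability values are made up for illustration.

```python
def rank(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


def recode_doses(doses):
    """Map each distinct dose to its 1-based level (lowest dose -> 1)."""
    levels = {dose: level for level, dose in enumerate(sorted(set(doses)), start=1)}
    return [levels[d] for d in doses]


# Two hypothetical dose series for one compound that do not match exactly
prism_doses = [0.04, 0.12, 0.37, 1.11, 3.33, 10.0, 20.0]
painting_doses = [0.05, 0.1, 0.4, 1.0, 3.0, 10.0, 25.0]
prism_levels = recode_doses(prism_doses)
painting_levels = recode_doses(painting_doses)

# Made-up viability estimates aligned by recoded dose level
prism_viability = [1.0, 0.95, 0.9, 0.7, 0.4, 0.2, 0.1]
predicted_live_cells = [0.8, 0.82, 0.6, 0.5, 0.1, -0.3, -0.2]
rho = spearman(prism_viability, predicted_live_cells)
```

Recoding to levels lets the two datasets be joined even though the measured concentrations differ, and Spearman correlation only asks whether the two estimates order the perturbations the same way.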

Results

We see high Spearman correlation between the two viability estimates (which is pretty cool!)

viability_results

cc @AnneCarpenter @shntnu

Reprocess Cell Painting Features

Because we're using a machine learning approach that has feature selection embedded, it will be useful to skip the feature selection steps of the cytominer pipeline.

I will also add the whitening step.

Documenting Roadblocks

I will note here a couple of things that I could not get through while trying to reproduce this analysis

  1. Call to process.sh in analysis_log.sh. (note that broadinstitute/cytominer_scripts#22 has deprecated process.sh in favor of collate.R, but a simple replace did not work)
  2. Reference to the LUAD dataset in backend here and reference to the ORF dataset in backend here did not resolve. Maybe I missed something here about pulling in this external data as well!

BRAF1 and BRAF

Both gene IDs are included in the Metadata_gene_name field. BRAF1 is a synonym of BRAF. Should we rename BRAF1 in the dataset? This is an important step since we are performing several steps in aggregate across genes and guides.
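
One way to handle the rename is a small synonym map applied before any per-gene aggregation. This is only a sketch: the map holds the single BRAF1 to BRAF entry discussed here, and the guide names are hypothetical.

```python
from collections import Counter

# Only the BRAF1 -> BRAF entry comes from this issue; extend as needed
GENE_SYNONYMS = {"BRAF1": "BRAF"}


def harmonize_gene_names(gene_names, synonyms=GENE_SYNONYMS):
    """Replace known synonyms with their canonical gene symbol."""
    return [synonyms.get(name, name) for name in gene_names]


# Hypothetical Metadata_gene_name values, one per guide
metadata_gene_name = ["BRAF1", "BRAF", "MYC", "MYC"]
before = Counter(metadata_gene_name)
after = Counter(harmonize_gene_names(metadata_gene_name))
```

Without the rename, the two BRAF guides would be counted as one guide each for two different genes, which would distort any same-gene correlation.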

From this experiment, `ITGAV` and `MYC` are the highest correlating genes across guides

A question arose about selecting CRISPR guides to use as positive controls in a cell painting experiment.

Based on the limited data:

guide_correlation

it appears that the most reproducible profiles are from CRISPR guides targeting ITGAV, KIF11, MYC, POLR2D, and PSMA1.

I generated a morpheus heatmap comparing aggregate correlation (calculating mean well profiles) for these genes (and a negative control LacZ).

crispr_reproducible_genes

The results seem to indicate that many of these reproducible profiles across guides are also highly similar within cell line.

Check Ploidy of Cell Lines

CRISPR efficacy (reproducibility) is different across cell lines. Check if it is related to cell line ploidy.

Uploading Image Files to IDR and BBBC

Uploading the images into the public domain is a very important part of the research process. I will upload image files to the Image Data Resource and add URL and metadata information to the Broad Bioimage Benchmark Collection.

We will use this issue to outline the required steps. First, I will need to restore the image files from aws glacier storage. @shntnu - can you link the most recent resources?


Edit: the data are now available at https://idr.openmicroscopy.org/webclient/?show=screen-2701

Data Processing and Machine Learning Approaches

This issue will discuss the machine learning approach to predict cell health labels using cell painting features.

Goal

What is the extent we can predict certain cell health outcomes? The cell health outcomes are described in feature_mapping_annotated.

Data

We have cell painting and cell health readout data for the same three cell lines (A549, ES2, and HCC44). In each data type and using CRISPR, collaborators knocked down a total of 59 genes and controls using 119 different guides.

Cell Painting

Cell painting data were acquired across these guides and cell lines. There were about 6 replicates per guide. Because we cannot map wells between experiments, and can only compare at the condition level, we collapsed replicate guides into median profiles.

This resulted in a 357 x 247 (profiles by features) matrix (119 guides * 3 cell lines).
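
The median collapse can be sketched as follows. The row layout and the guide names are assumptions for illustration; the real profiles carry many more features and metadata columns.

```python
from collections import defaultdict
from statistics import median


def consensus_profiles(rows):
    """rows: iterable of (cell_line, guide, feature_vector) replicate profiles.

    Returns {(cell_line, guide): feature-wise median across replicates}.
    """
    grouped = defaultdict(list)
    for cell_line, guide, features in rows:
        grouped[(cell_line, guide)].append(features)
    return {
        key: [median(column) for column in zip(*replicates)]
        for key, replicates in grouped.items()
    }


# Hypothetical replicate profiles with two features each
replicate_rows = [
    ("A549", "MYC-1", [1, 10]),
    ("A549", "MYC-1", [2, 20]),
    ("A549", "MYC-1", [9, 30]),
    ("ES2", "MYC-1", [0, 1]),
    ("ES2", "MYC-1", [2, 3]),
]
consensus = consensus_profiles(replicate_rows)
```

Each (cell line, guide) pair yields one row, which is how 119 guides across 3 cell lines give 357 consensus profiles.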

Cell Health

Cell health assay readouts were collected by other collaborators and represent 72 different cell health readouts. We also median collapsed measurements across guides. There were about 4 replicates per guide.

The final cell health target matrix was 357 x 72.

Training and Testing Split

In #22, we split the data randomly into 85% training and 15% testing sets.

Machine Learning

Our goal is to assess how well the cell painting features can capture the signal of each of the 72 cell health outputs.

We approached this machine learning problem in three different ways:

  1. Raw cell health measurements
  2. Transform cell health measurements to a scale between 0 and 1
  3. Binarize cell health labels into high/low
  • I am currently using kmeans to find two clusters in each of the 72 cell health labels independently.

The first two approaches require a regression approach, while the third can be approached as a classification problem.
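
Approaches 2 and 3 can be illustrated with a short standard-library sketch. The function names, the initialization at the minimum and maximum, and the toy values are assumptions, and the code assumes each readout has at least two distinct values.

```python
def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def two_means_labels(values, n_iter=100):
    """1-D k-means with k=2; returns 1 for the high cluster and 0 for the low."""
    low_center, high_center = min(values), max(values)
    for _ in range(n_iter):
        labels = [1 if abs(v - high_center) < abs(v - low_center) else 0 for v in values]
        high = [v for v, lab in zip(values, labels) if lab == 1]
        low = [v for v, lab in zip(values, labels) if lab == 0]
        new_low, new_high = sum(low) / len(low), sum(high) / len(high)
        if (new_low, new_high) == (low_center, high_center):
            break  # assignments are stable
        low_center, high_center = new_low, new_high
    return labels


# Toy cell health readout across six perturbations
readout = [0.1, 0.2, 0.15, 2.5, 2.8, 0.3]
scaled = min_max_scale(readout)
labels = two_means_labels(readout)
```

The scaled values stay continuous and feed a regression model, while the cluster labels turn the same readout into a binary classification target.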

Cell Health Dyes

The dyes used to generate cell health labels correspond to the following:

| Dye | Target |
| --- | --- |
| Hoechst | DNA Dye |
| EdU | S Phase |
| pH3 | Mitosis |
| gH2AX | DNA Damage |

Cleaning Cell Health Assay Label Data

I will document data cleaning issues for processing the Cell Health assay readout labels here

  • G1/S feature is only captured in A549
  • There are many duplicate rows in the cell cycle spreadsheets (e.g. A549 B9 Empty is duplicated)
  • A549 has some additional guides not found in ES2 and HCC44. What are they?
    • MAP4K1_ACTN4
    • CDK4_CYP27B1-2F
    • Tsai_HEK293_4
    • Tsai_VEGFA-1
    • Tsai_VEGFA_2
    • Wang_K562_22_Nongenic-1
    • Wang_K562_22_Nongenic-2
    • Note that A549 has 83 EMPTY guides, while ES2 and HCC44 have 112 each (i.e. maybe they are also controls)

Add Functionality to Permute Test Set

Currently, the test set is permuted once and performance metrics reported on this single permutation. As @jccaicedo suggested in a recent checkin, it would be great to generate a distribution of permutations to compare performance against (and simulate p value)
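
A minimal sketch of what such a permutation distribution could look like is below. The function names, the add-one correction, and the toy values are assumptions; the sketch permutes the observed test labels and compares each permuted R2 against the real one.

```python
import random


def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def permutation_p_value(y_true, y_pred, n_permutations=1000, seed=0):
    """Empirical p value: how often does a permuted test set score at least as well?"""
    rng = random.Random(seed)
    observed = r_squared(y_true, y_pred)
    permuted = list(y_true)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(permuted)
        if r_squared(permuted, y_pred) >= observed:
            at_least_as_good += 1
    # Add-one correction keeps the estimate away from exactly zero
    return observed, (at_least_as_good + 1) / (n_permutations + 1)


# Toy test set where the predictions track the observed values closely
y_test = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y_pred = [1.2, 1.9, 3.1, 4.3, 4.8, 6.1, 7.2, 7.9, 9.1, 9.8]
observed_r2, p_value = permutation_p_value(y_test, y_pred)
```

The smallest p value this estimate can return is 1 / (n_permutations + 1), so the number of permutations sets the resolution of the test.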

Get Sample Images

We have access to Cell Painting Images

We need to look at them! Will be good for presentations.

Unfortunately, we no longer have cell health images.

What is "Chr2" in `Metadata_gene_name`?

I assume that it means "Chromosome 2"? Is this correct? So there was a CRISPR guide to target anywhere on the gene? Or is this some kind of control?

I see that the number of CRISPR constructs targeting Chr2 is very high.

| gene | num constructs |
| --- | --- |
| Chr2 | 144 |
| PSMA1 | 48 |
| ORC4 | 48 |
| POLR2D | 48 |
| PPIB | 48 |
| LacZ | 24 |
| Luc | 24 |
| ... | ... |

cc @shntnu

Not all Models are Fit

Based on #41 - 416 models are saved, but there are a total of 420 models.

This means that 4 models cannot be fit. Track this down.

Regression Performance in Test sets

Based on #39 I am getting seemingly strange values for R2 and MSE in the test set for regression performance. This may be a bug, and, if so, will be important to track down!
