broadinstitute / cell-health

Predicting Cell Health with Morphological Profiles

License: MIT License

R 0.44% Shell 0.07% Jupyter Notebook 48.69% HTML 50.44% Python 0.36% TeX 0.01%

carpenter-lab morphological-profiling crispr cell-painting cancer 2015-07-01-cell-health

cell-health's Introduction

Predicting cell health phenotypes using image-based morphology profiling

Gregory P. Way, Maria Kost-Alimova, Tsukasa Shibue, William F. Harrington, Stanley Gill, Federica Piccioni, Tim Becker, Hamdah Shafqat-Abbasi, William C. Hahn, Anne E. Carpenter, Francisca Vazquez, Shantanu Singh

2020

Summary
Approach
Data
- Access
- Summary
Pipeline
- Computational Environment
Machine Learning Approach
Results
Citation

Summary

Cell health can be altered by genetic and chemical perturbations. An increased understanding of these perturbation mechanisms is directly relevant for drug discovery and personalized medicine. Here and in an accompanying paper, we present two novel cell imaging assays that together measure 70 different aspects of cell health, such as proliferation, apoptosis, and cell cycle stalling. However, these assays require expensive reagents and do not scale well. Therefore, we also developed a machine learning solution to predict cell health readouts directly from a separate assay, known as Cell Painting. In contrast to the Cell Health assays, Cell Painting is inexpensive, high-throughput, and unbiased (reagents are not targeted). We predict many cell health indicators with high performance, but other readouts could not be predicted. We validated our predictions by using orthogonal readouts and by applying the models to a large set of 1,500 drugs from the Drug Repurposing Hub. Cell health predictions for drugs can be browsed at https://broad.io/cell-health-app. We confirmed mitotic arrest, reactive oxygen species, and DNA damage in G1 cell cycle based phenotypes via PLK, proteasome, and aurora kinase/tubulin inhibition, respectively. In the future, we can use this approach to determine the cell health consequences of any perturbation in cells. We conducted this project using open science principles with open data and open source code.

The following repository stores a complete analysis pipeline using Cell Painting data to predict readouts from the Cell Health assays.

We first developed the customized microscopy assays that we collectively call "Cell Health". The Cell Health assays comprise two different reagent panels: "Cell cycle" and "viability". Together, these two panels use reagents that mark different cell health phenotypes.

Assay/Dye	Phenotype	Panel
Caspase 3/7	Apoptosis	Viability
DRAQ7	Cell death	Viability
CellROX	Reactive oxygen species	Viability
EdU	Cellular proliferation	Cell cycle
Hoechst	DNA content	Cell cycle
pH3	Cell division	Cell cycle
gH2Ax	DNA damage	Cell cycle

We hypothesized that we can use unbiased and high dimensional Cell Painting profiles to predict cell health readouts.

Approach

This overview figure outlines the Cell Health assays, the Cell Painting assay, and our machine learning approach.

Data processing and modeling approach. (a) Example images and workflow from the Cell Health assays. We apply a series of manual gating strategies (see Methods) to isolate cell subpopulations and to generate cell health readouts for each perturbation. (top) In the “Cell Cycle” panel, in each nucleus we measure Hoechst, EdU, PH3, and gH2AX. (bottom) In the “Cell Viability” panel, we capture digital phase contrast images and measure Caspase 3/7, DRAQ7, and CellROX. (b) Example Cell Painting image across five channels, plus a merged representation across channels. The image is cropped from a larger image and shows ES2 cells. Below are the steps applied in an image-based profiling pipeline, after features have been extracted from each cell’s image. (c) Modeling approach where we fit 70 different regression models using CellProfiler features derived from Cell Painting images to predict Cell Health readouts.

Data

Access

All data are publicly available.

Cell Painting

Data	Level	Location	Notes
Images	1	Image Data Resource (IDR)	Accession `idr0080`
SQLite file (single cell profiles )	2	NIH Figshare https://doi.org/10.35092/yhjc.9995672	`0.download-data/data` (see `0.download-data/README.md`)
Aggregated profiles with well information (metadata)	3	1.generate-profiles/data/profiles	suffix: `<PLATE>_augmented.csv.gz`
Normalized aggregated profiles with metadata	4a	1.generate-profiles/data/profiles	suffix: `<PLATE>_normalized.csv.gz`
Normalized and feature selected aggregated profiles with metadata	4b	1.generate-profiles/data/profiles	suffix: `<PLATE>_normalized_feature_select.csv.gz`
Consensus profiles	5	1.generate-profiles/data/consensus	Perturbation profiles created by summarizing replicates

Cell Health

Data	Level	Location	Notes
Cell health readouts	Raw	1.generate-profiles/data/raw	Per cell health panel (cell cycle and viability) per cell line
Cell health readouts	Normalized	1.generate-profiles/data/labels/normalized_cell_health_labels.tsv
Cell health signatures	Consensus	1.generate-profiles/data/consensus

Drug Repurposing Hub

We apply all of our trained cell health models to Cell Painting data from the Drug Repurposing Hub. These data are available at https://doi.org/10.5281/zenodo.3928744.

Summary

We collected Cell Painting measurements using CRISPR perturbations. The experiment targeted 59 genes, which included 119 unique guides (~2 per gene), across 3 cell lines. The cell lines included A549, ES2, and HCC44.

Cell Line	Primary Site
A549	Lung cancer
ES2	Ovarian cancer
HCC44	Lung cancer

About 60% of all CRISPR guides were reproducible. This is consistent with previous genetic perturbations (Rohban et al. 2017). It is important to note that we are not actually interested in the CRISPR treatment specifically, but instead, just its corresponding readout in each cell health assay.

Median pairwise Pearson correlation of CRISPR guide replicate profiles (y axis) compared against median pairwise Pearson correlation of CRISPR guides targeting the same gene or construct (x axis). We removed biological replicates when calculating the same-gene correlations. The three different cell lines (A549, ES2, and HCC44) are shown in different colors and in different facets of the figure. We generated the profiles by median aggregating CellProfiler measurements for all single cells within each well of a Cell Painting experiment (see Methods for more processing details). The text labels represent the proportion of gene and guide profiles with “strong phenotypes”. In other words, these profiles had replicate correlations greater than 95% of non-replicate pairwise Pearson correlations in the particular cell line. The dotted red line represents this 95% cutoff in the null distribution and the blue dotted line is y = x, which shows a strong consistency across CRISPR guide constructs.
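The "strong phenotype" definition in the caption can be sketched in a few lines. This is a minimal illustration in plain Python, not the repository's audit code: the function names, the input layout (a dict of guide name to replicate feature vectors), and the simple empirical quantile are all assumptions made for the example.

```python
import itertools
import math
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def replicate_audit(profiles, quantile=0.95):
    """Hypothetical helper: profiles maps guide name -> list of replicate vectors.

    Returns (median replicate correlation per guide, null threshold), where
    the null threshold is an empirical quantile of non-replicate correlations.
    """
    labeled = [(guide, vec) for guide, reps in profiles.items() for vec in reps]
    same, different = {}, []
    for (g1, v1), (g2, v2) in itertools.combinations(labeled, 2):
        r = pearson(v1, v2)
        if g1 == g2:
            same.setdefault(g1, []).append(r)
        else:
            different.append(r)
    different.sort()
    cut_index = min(len(different) - 1, int(quantile * len(different)))
    null_threshold = different[cut_index]
    medians = {g: statistics.median(rs) for g, rs in same.items()}
    return medians, null_threshold


toy_profiles = {
    "guide_a": [[1, 2, 3, 4], [1.1, 2.1, 2.9, 4.2]],
    "guide_b": [[4, 3, 2, 1], [4.2, 2.9, 2.2, 1.0]],
    "guide_c": [[1, 3, 2, 4], [4, 1, 3, 2]],
}
medians, null_threshold = replicate_audit(toy_profiles)
strong = {guide for guide, r in medians.items() if r > null_threshold}
```

A guide counts as having a strong phenotype here when its median replicate correlation exceeds the null threshold; in the toy data, `guide_c` has inconsistent replicates and does not pass.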

Pipeline

The full analysis pipeline consists of the following steps:

Order	Module	Description
0.download-data	Download cell painting data	Retrieve single cell profiles archived on Figshare
1.generate-profiles	Generate profiles	Generate and process cell painting and cell health assay readouts
2.replicate-reproducibility	Determine replicate reproducibility	Determine the extent to which the CRISPR perturbations result in reproducible signatures
3.train	Train machine learning models to predict cell health assays	Train and visualize regression models using cell painting data to predict cell health assay readouts
4.apply	Apply the models	Apply the trained models to the Drug Repurposing Hub data to predict drug perturbation effect
5.validate-repurposing	Validate the models	Use orthogonal readouts to validate the Drug Repurposing Hub predictions
6.ml-robustness	Interrogate robustness of ML predictions	Assess sample size, feature groups, and cell line holdouts to probe ML robustness

Each analysis module should be run in order. View each module for specific instructions on how to reproduce results.

analysis-pipeline.sh stores information on how to reproduce all analysis modules.

Computational Environment

We use conda as a package manager. To install conda see instructions.

We recommend installing conda by downloading and executing the .sh file and accepting defaults.

To create the computational environment, run the following:

# Make sure the repo is cloned
conda env create --force --file environment.yml
conda activate cell-health

Machine Learning Approach

We performed the following approach:

Split data into 85% training and 15% test sets.
Normalized data using the EMPTY controls in each plate (moderated z-score).
Selected optimal hyperparameters using 5-fold cross-validation.
Trained elastic net regression models to predict each of the 70 cell health assay readouts, independently.
Trained the same models using shuffled data as a baseline.
Reported performance on training and test sets.
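The split and the shuffled baseline from the steps above can be sketched as follows. This is an illustration in plain Python under stated assumptions: the helper names are hypothetical, and shuffling each feature column independently is one common way to build such a baseline, not necessarily the exact procedure used in `3.train`. Model fitting (elastic net with cross-validation) is deliberately left out.

```python
import random


def train_test_split(rows, test_fraction=0.15, seed=0):
    """Randomly partition rows into training and test sets."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = round(test_fraction * len(rows))
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


def shuffle_features(matrix, seed=0):
    """Permute every feature column independently (assumed baseline).

    This breaks the link between features and targets while keeping each
    feature's marginal distribution unchanged.
    """
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*matrix)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]


# 357 consensus profiles (119 guides x 3 cell lines), represented by their ids.
profile_ids = list(range(357))
train_ids, test_ids = train_test_split(profile_ids)

toy_matrix = [[1, 10], [2, 20], [3, 30], [4, 40]]
shuffled_matrix = shuffle_features(toy_matrix)
```

A model trained on the shuffled matrix should perform near chance; a real model that does not clearly beat it has likely learned nothing useful.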

Results

Regression Model Performance

Initial results indicate that many of the cell health phenotypes can be predicted with high performance using our approach. However, there are many cell line specific differences.

Test set model performance of predicting 70 cell health readouts with independent regression models. Performance for each phenotype is shown, sorted by decreasing R2 performance. The bars are colored based on the primary measurement metadata (see Supplementary Table S3), and they represent performance aggregated across the three cell lines. The points represent cell line specific performance. Points falling below -1 are truncated to -1 on the x axis. See 3.train for a full depiction.
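For readers wondering how a score can fall below -1, here is a small sketch of the coefficient of determination and the truncation described in the caption. The function names are illustrative and are not taken from the repository.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    The value is 1 for perfect predictions, 0 for always predicting the
    mean of y_true, and negative when predictions are worse than the mean.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def truncate_for_plot(score, floor=-1.0):
    """Clip very poor scores to the floor so they stay on the plot axis."""
    return max(score, floor)


perfect = r_squared([1, 2, 3], [1, 2, 3])
mean_only = r_squared([1, 2, 3], [2, 2, 2])
reversed_order = r_squared([1, 2, 3], [3, 2, 1])
```

Here `reversed_order` evaluates to -3, which would be drawn at -1 after truncation.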

Model Interpretation

Because we used linear elastic net regression models, we can readily interpret the model coefficients. The input features were derived from CellProfiler and represent different measurements of cell morphology. The figure summarizes the coefficients from all 70 cell health models. Many different categories of cell morphology features contribute to cell health predictions.

The importance of each class of Cell Painting features in predicting 70 cell health readouts. Each square represents the mean absolute value of model coefficients weighted by test set R2 across every model. The features are broken down by compartment (Cells, Cytoplasm, and Nuclei), channel (AGP, Nucleus, ER, Mito, Nucleolus/Cyto RNA), and feature group (AreaShape, Neighbors, Channel Colocalization, Texture, Radial Distribution, Intensity, and Granularity). For a complete description of all features, see the handbook: http://cellprofiler-manual.s3.amazonaws.com/CellProfiler-3.0.0/index.html Dark gray squares indicate “not applicable”, meaning either that there are no features in the class or the features did not survive an initial preprocessing step. Note that for improved visualization we multiplied the actual model coefficient value by 100.

Application to Drug Repurposing Hub

We applied the trained models to Cell Painting data from the Drug Repurposing Hub (Corsello et al. 2017). These data represent ~1,500 compound perturbations in ~6 dose points in A549 cells.

Collapsing the Drug Repurposing Hub Cell Painting data into UMAP coordinates, we observed many associated Cell Health predictions. For example, predicted G1 Cell Count and predicted ROS had clear gradients in UMAP space. However, the relationship is not exactly one-to-one. The proteasome inhibitors (MG-132 and Bortezomib) are known to induce ROS, while PLK inhibitors are known to induce cell death by blocking mitosis entry. A single PLK inhibitor (HMN-214) showed a strong dose relationship with predicted G1 count.

Validating Cell Health models applied to Cell Painting data from the Drug Repurposing Hub. (a) The results of the dose alignment between the PRISM assay and the Drug Repurposing Hub data. This view indicates that there was not a one-to-one matching between perturbation doses. (b) Comparing viability estimates from the PRISM assay to the predicted number of live cells in the Drug Repurposing Hub. The PRISM assay estimates viability by measuring barcoded A549 cells after an incubation period. (c) Drug Repurposing Hub profiles stratified by G1 cell count and ROS predictions. Bortezomib and MG-132 are proteasome inhibitors and are used as positive controls; DMSO is a negative control. We also highlight all PLK inhibitors in the dataset. (d) HMN-214 is an example of a PLK inhibitor that shows strong dose response for G1 cell count predictions. (e) Tubulin and aurora kinase inhibitors are predicted to have high Number of gH2AX spots in G1 cells compared to other compounds and controls. (f) Barasertib (AZD1152) is an aurora kinase inhibitor that is predicted to have a strong dose response for Number of gH2AX spots in G1 cells predictions.

Drug Repurposing Hub: Exploratory Tool

We applied all predictions and present them in an easy-to-navigate webapp at https://broad.io/cell-health-app

Citation

If you use data, results, or insights from this repository, please consider citing our publication:

Gregory P. Way, Maria Kost-Alimova, Tsukasa Shibue, William F. Harrington, Stanley Gill, Federica Piccioni, Tim Becker, Hamdah Shafqat-Abbasi, William C. Hahn, Anne E. Carpenter, Francisca Vazquez, and Shantanu Singh. Predicting cell health phenotypes using image-based morphology profiling. Molecular Biology of the Cell 2021 32:9, 995-1005. DOI: https://doi.org/10.1091/mbc.E20-12-0784.

If you use the Cell Painting images, please also consider citing our public data set on IDR:

Gregory P. Way, Maria Kost-Alimova, Tsukasa Shibue, William F. Harrington, Stanley Gill, Federica Piccioni, Tim Becker, Hamdah Shafqat-Abbasi, William C. Hahn, Anne E. Carpenter, Francisca Vazquez, and Shantanu Singh. Cell health phenotypes can be predicted from unbiased image-based morphology readouts. Image Data Resource 2020, screen 2701:accession 0080. https://doi.org/10.17867/10000153.

If you use the single cell profiles of CellProfiler features, please also consider citing our public dataset on Figshare:

Gregory P. Way, Maria Kost-Alimova, Tsukasa Shibue, William F. Harrington, Stanley Gill, Federica Piccioni, Tim Becker et al. (2019): Cell Health - Cell Painting Single Cell Profiles. The NIH Figshare Archive. Dataset. https://doi.org/10.35092/yhjc.9995672.v5

cell-health's Issues

Change classification binarization

Currently, we only calculate the median and separate out high vs. low values to symbolize 1 vs. 0 for all cell health classification models. We should design a better way to binarize. Perhaps using otsu thresholding or something similar.
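To make the comparison concrete, here is a rough sketch of both options in plain Python: the current median split and an Otsu-style threshold computed directly on the sorted values (no histogram). The function names are hypothetical and this is not code from the repository.

```python
import statistics


def binarize_by_median(values):
    """Current approach: 1 for values above the median, 0 otherwise."""
    cutoff = statistics.median(values)
    return [1 if v > cutoff else 0 for v in values]


def otsu_threshold(values):
    """Return the cut point that maximizes between-class variance."""
    ordered = sorted(values)
    n = len(ordered)
    best_cut, best_score = ordered[0], -1.0
    for i in range(1, n):
        if ordered[i] == ordered[i - 1]:
            continue  # no valid cut between identical values
        low, high = ordered[:i], ordered[i:]
        weight = (len(low) / n) * (len(high) / n)
        score = weight * (statistics.fmean(low) - statistics.fmean(high)) ** 2
        if score > best_score:
            best_score = score
            best_cut = (ordered[i - 1] + ordered[i]) / 2
    return best_cut


readout = [0.1, 0.2, 0.15, 0.12, 0.18, 0.9, 0.95]
median_labels = binarize_by_median(readout)
cut = otsu_threshold(readout)
otsu_labels = [1 if v > cut else 0 for v in readout]
```

On this toy readout the median split labels 0.2 as "high" even though it sits with the low group, while the Otsu-style cut lands in the gap between 0.2 and 0.9.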

More Morpheus Heatmaps

Cell Health Target Variables in each cell line independently
Cell Painting data across samples

Predicting Cell Health Dyes

In #67 I added visualizations that demonstrate model performance on cell health features when predicting certain dyes. I provide results and interpretations below.

Regression Performance - Training vs. Testing

Lots of cell health phenotypes captured with specific dyes are predicted consistently well. For example, DRAQ7-based phenotypes are predicted well, as are many gH2Ax-based phenotypes. pH3-based phenotypes are not predicted well. Also, as a negative control, the CRISPR efficiencies are not predicted well in training or in testing. Cell Painting will not capture CRISPR efficiency, and, therefore, should not be a predictable phenotype.

Comparing Regression to Classification Models

Interestingly, some phenotypes that cannot be predicted with regression have good performance with a classification model. This suggests that some phenotypes are more susceptible to noisy intermediate values, and that outliers are more informative of biology (see the example of vb_percent_all_late_apoptosis below).

The specific phenotypes are not visualized in this figure, but, in order of AUC (decreasing performance):

'vb_percent_all_late_apoptosis'
'cc_g2_ph3_pos_high_n_spots_h2ax_mean'
'vb_percent_all_apoptosis'
'cc_cc_high_n_spots_h2ax_mean'
'cc_g1_high_n_spots_h2ax_mean'
'vb_percent_all_early_apoptosis'
'cc_edu_pos_high_n_spots_h2ax_mean'
'cc_g2_ph3_neg_n_spots_mean'
'cc_mitosis_ph3_neg_n_spots_mean'

Classifying All Late Apoptosis (Example)

Summary

Certain dyes are predicted better than others, and some phenotypes can be predicted much better with classification models than with regression models.

Check L1000 for CRISPR profiles

It is possible that many of these perturbations were also measured with L1000

Meeting 12/16/2019 - Notes

Genes have annotations to function relating to specific cell health phenotypes (#53 (comment))
There are RNAseq data for these perturbations too. Can we compare predictive performance? Maybe certain cell health models are better with either datatype? What about combined?
- We may also already have L1000 data for these perturbations (#89)
ES2 Cell Line has superior performance in CRISPR experiments. It has high CAS9 expression and behaves nicely (#80 (comment))
There may be room for various improvements in model building
- Adjust for single cell penetrance (only build profiles with cells with infections)
- There may be a copy number effect here for CRISPR knockouts. Could adjust scores by some sort of copy number effect.
Perform leave-one-cell-line-out experiments - this will tell us how well models might generalize to new cell lines.
Perform CCA on X and Y matrices and look at loadings to determine which cell painting features are aligned with cell health phenotypes
- This is more informative than looking at specific feature names (#81)

Complete Single Cell Upload

In #71, I introduced the download module. To speed development, I will merge #71 and make a note here that I need to fix the download and upload links once everything is deposited on figshare.

dependent on cytomining/pycytominer#51

Consensus UMAP PDF does not specify which model

There needs to be a title added to specify which cell health feature is being visualized: https://github.com/broadinstitute/cell-health/blob/master/4.apply/figures/repurposing_hub_umaps_consensus.pdf

Increase Viability Performance

The regression models do not seem to predict viability variables well. Explore why this is the case and see if we can engineer features that may improve performance.

Apply models to repurposing cell painting set

We can use this as a positive control. Analysis: observe if there are any compounds that cause known cell health phenotypes. Do the models predict these impacts?

How many "independent" cell health phenotype signals can be predicted well?

It appears that highly correlated phenotypes can be predicted well. How many independent cell health phenotype signals can we predict?

Cell IDs

I need to add a cell_id column. Currently, I am relying on the index number and this is very easy to break. Needs fixing!

Select Interesting Compounds to Validate

We will use the Repurposing Hub dashboard to select specific compounds to validate in the lab.

We will select two kinds of compounds:

Compounds that are predicted to have well characterized effects (Positive Controls)
Compounds that are predicted to have unexpected effects that could help to define MOA

We will use this issue to document the molecular validation approach (which can then later be transferred to a methods section)

Where do the DMSO treated profiles land in Repurposing UMAP?

In #83, I add code that performs this update.

Summary

There appear to be substantial plate effects in the Repurposing Hub data, at least in UMAP space and using consensus DMSO data.

DMSO Figure

There are at least three distinct groupings of DMSO controls.

DMSO by Well

The clustering appears to be driven by plate layout to some extent.

Standard Deviation of Cell Health model scores for DMSO and Compounds

Should we expect a certain amount of variation across cell health model output scores?
As expected, every cell health model has higher standard deviation in compounds compared to DMSO consensus (except two models that output the same score no matter what). Also, the amount of variance in the model outputs is directly associated with the performance. Note that only models with test set Rsquared > 0 were used in scaling. All models with test set Rsquared < 0 are shown with more transparency.

Next Steps

Visualized above are consensus profiles. Where do the DMSO controls land in well-level profiles?
CMAP used two positive control compounds (below), where do they land in the UMAP space?
- BRD-K50691590-001-02-2 (Bortezomib)
- BRD-K60230970-001-10-0 (MG-132 {Proteasome Inhibitor})
Determine if we need to use whitening to adjust plate effects?
- Alternatively, since we are ultimately interested in the outputs of the cell health models, should we allow for the profiles to vary like this?

cc @AnneCarpenter @shntnu @hkhawar

Investigate specific CRISPR genes

Similar to #52 - Do any of the genes we CRISPR targeted produce known cell health phenotypes? Do we predict these with our models?

What is the correlation across cell health labels?

Chatting with Masha today, we decided that it would be helpful to visualize the correlational structure of the cell health labels themselves. I will perform a morpheus analysis but use the cell health label data

Train Cell-Health Models with intersection of Repurposing Hub Features

Since one of the primary goals is to apply models to the repurposing set, it will be helpful to subset cell-health data to the intersection of features captured in the repurposing set compounds.

Machine Learning Covariate Features

I need to consider adding covariate features to the model:

Plate
Cell Line

Add Metadata Files

After merging #91 , I need to add all the metadata files associated with each plate

Consider Trimming Top Values by some Threshold

While normalizing by median may be robust to outliers, it may be worth considering trimming high values by some sort of standard deviation threshold.

Reprocess image-based profiles with `robustize_mad`

Currently, I am using robustize to normalize profiles (see here).

However, we noticed that this function is slightly different than the cytominer robustize. See broadinstitute/lincs-cell-painting#3 (comment). Since then, I've implemented the cytominer robustize in cytomining/pycytominer#72.

We observed a slight downtick in replicate reproducibility in #112 comparing pycytominer to cytominer profiles. The cytominer profiles were originally calculated using cytominer_scripts/normalize.R which defaults to robustize in the archive.

I will reprocess all image-based profiles with the updated robustize_mad pycytominer profiles.
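For context, the general idea behind a MAD-based robust normalization looks roughly like this. It is a generic sketch, not the pycytominer or cytominer implementation: the function name, the `reference` argument (for example, control wells), and the `epsilon` guard are assumptions made for the example.

```python
import statistics


def robust_z_score(values, reference=None, epsilon=1e-18):
    """Center by the median and scale by the MAD of a reference population.

    The factor 1.4826 makes the MAD comparable to a standard deviation
    for normally distributed data. epsilon avoids division by zero.
    """
    reference = list(values if reference is None else reference)
    center = statistics.median(reference)
    mad = statistics.median(abs(v - center) for v in reference)
    scale = 1.4826 * mad + epsilon
    return [(v - center) / scale for v in values]


feature = [1, 2, 3, 4, 100]
scores = robust_z_score(feature)
against_controls = robust_z_score([10], reference=[1, 2, 3])
```

Because the median and MAD ignore the magnitude of extreme values, the outlier (100) does not distort the scores of the other wells the way a mean and standard deviation would.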

High Audit `null_threshold`?

As I am reproducing this analysis, I noticed particularly high null_threshold values across cell lines. Specifically:

	plate_map_name	null_threshold	fraction_strong	Metadata_cell_line
0	DEPENDENCIES1_HCC44	0.356182	0.491525	HCC44
1	DEPENDENCIES1_ES2	0.435187	0.506098	ES2
2	DEPENDENCIES1_A549	0.700114	0.341463	A549

I have a couple of questions about how this null_threshold was generated.

First, I see that audit.R from cytominer_scripts was called four different times in analysis_log.sh. Is it safe to assume the later lines overwrite the previous calls? Or am I looking at this wrong?

Second (and if the later lines overwrite), could the null_thresholds be so large because so many metadata variables are included in the -p (--group_by) flag of the audit.R call? (line 146 here). Would this then "lower" the amount of randomization and thus not adequately represent a null? Am I reading this function correctly?

Cell health cell count model is correlated to CellProfiler cc_cc_n_objects but does not use the feature in the model

I start by testing if viability readouts in the PRISM assay for A549 are similar to cell health model estimates. The DepMap data are accessed here.

The data ☝️ are available under CC BY 4.0 and are distributed (with processing code) in #104. I also make sure to attribute Corsello et al. 2019 in a README file included in #104 (in the data/ folder).

In this module, I first process A549 viability data from the Cancer Dependency Map. Next, I merge the viability data with the cell health model scores applied to the same perturbations in A549.

Important Notes

Not all doses measured are consistent between the two datasets. I recoded doses to a range of 1-7 as we did previously, and I calculate the Spearman correlation between the two viability estimates.
In talking with the DepMap folks (Anup, Jacob, and Nick) at CLUE office hours (this was very useful btw) it seems there are some caveats to this data:
- It is a PRISM screen, so it will not be as accurate as A549-specific CellTiter-Glo readouts
- The CTG assay data exist somewhere (probably) but where it exists is not open source
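Since the comparison relies on Spearman correlation, here is a minimal rank-based sketch in plain Python. The function names are illustrative, and the dose recoding step is not shown because its exact rule is not described here.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when one input is constant
    return cov / (sx * sy)


monotonic = spearman([1, 2, 3, 4], [1, 4, 9, 100])
inverse = spearman([1, 2, 3], [3, 2, 1])
```

Spearman only uses rank order, so it tolerates the two viability estimates being on different scales, which matters when the assays and dose levels do not match exactly.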

Results

We see high Spearman correlation between the two viability estimates (which is pretty cool!)

cc @AnneCarpenter @shntnu

Add README to audit module

Across all models, are certain cell painting features more explanatory than others?

Exploring model coefficients across all models, what does this distribution look like?

Reprocess Cell Painting Features

Because we're using a machine learning approach that has feature selection embedded, it will be useful to skip the feature selection steps of the cytominer pipeline.

I will also add the whitening step.

Documenting Roadblocks

I will note here a couple of things that I could not get through while trying to reproduce this analysis

Call to process.sh in analysis_log.sh. (note that broadinstitute/cytominer_scripts#22 has deprecated process.sh in favor of collate.R, but a simple replace did not work)
Reference to the LUAD dataset in backend here and reference to the ORF dataset in backend here did not resolve. Maybe I missed something here about pulling in this external data as well!

Link each compound to its page in the repurposing hub

@gwaygenomics
I couldn't find a way to link to a compound in https://clue.io/repurposing-app

But you could ask them :) https://clue.io/office-hours

Add AUC to ROC and PR curves plots

It would be helpful to be able to quickly scan what the AUCs are for each curve (real and shuffled, training and testing)

BRAF1 and BRAF

Both gene IDs are included in the Metadata_gene_name field. BRAF1 is a synonym of BRAF. Should we rename BRAF1 in the dataset? This is important since we perform several steps in aggregate across genes and guides.

Add UMAP to dashboard

It would be very helpful to add UMAP to the dashboard explorer. Can do at the same time as #87

From this experiment, `ITGAV` and `MYC` are the highest correlating genes across guides

A question arose about selecting CRISPR guides to use as positive controls in a cell painting experiment.

Based on the limited data:

it appears that the most reproducible profiles are from CRISPR guides targeting ITGAV, KIF11, MYC, POLR2D, and PSMA1.

I generated a morpheus heatmap comparing aggregate correlation (calculating mean well profiles) for these genes (and a negative control LacZ).

The results seem to indicate that many of these reproducible profiles across guides are also highly similar within cell line.

Check Ploidy of Cell Lines

CRISPR efficacy (reproducibility) is different across cell lines. Check if it's related to cell line ploidy.

Generating Heterogeneity Features

A couple weeks ago @cells2numbers mentioned that we can use the heterogeneity features tested in @mrohban's nature communications paper for cell health predictions.

@cells2numbers did you also mention that you could generate these features using cytominer? Is this the appropriate resource (in cytominergallery)?

Uploading Image Files to IDR and BBBC

Uploading the images into the public domain is a very important part of the research process. I will upload image files to the Image Data Resource and add URL and metadata information to the Broad Bioimage Benchmark Collection.

We will use this issue to outline the required steps. First, I will need to restore the image files from aws glacier storage. @shntnu - can you link the most recent resources?

edit

data now available: https://idr.openmicroscopy.org/webclient/?show=screen-2701

Data Processing and Machine Learning Approaches

This issue will discuss the machine learning approach to predict cell health labels using cell painting features.

Goal

What is the extent we can predict certain cell health outcomes? The cell health outcomes are described in feature_mapping_annotated.

Data

We have cell painting and cell health readout data for the same three cell lines (A549, ES2, and HCC44). In each data type and using CRISPR, collaborators knocked out a total of 59 genes and controls using 119 different guides.

Cell Painting

Cell painting data were acquired across these guides and cell lines. There were about 6 replicates per guide. Because we cannot map wells between experiments, and can only compare at the condition level, we collapsed replicate guides into median profiles.

This resulted in a 357 x 247 (profiles by features) matrix (119 guides * 3 cell lines).
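Collapsing replicates into median profiles can be sketched as below. The row layout (cell line, guide, feature vector) and the function name are assumptions for illustration; the repository works with data frames rather than plain lists.

```python
import statistics
from collections import defaultdict


def median_consensus(rows):
    """Collapse replicate profiles into one median profile per condition.

    rows: iterable of (cell_line, guide, feature_vector) tuples.
    Returns a dict keyed by (cell_line, guide).
    """
    groups = defaultdict(list)
    for cell_line, guide, features in rows:
        groups[(cell_line, guide)].append(features)
    return {
        key: [statistics.median(column) for column in zip(*vectors)]
        for key, vectors in groups.items()
    }


replicates = [
    ("A549", "guide_1", [1.0, 10.0]),
    ("A549", "guide_1", [3.0, 30.0]),
    ("A549", "guide_1", [2.0, 50.0]),
    ("ES2", "guide_1", [5.0, 5.0]),
]
consensus = median_consensus(replicates)
```

Each (cell line, guide) pair ends up as a single row, which is how 119 guides across 3 cell lines yield 357 profiles.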

Cell Health

Cell health assay readouts were collected by other collaborators and represent 72 different cell health readouts. We also median collapsed measurements across guides. There were about 4 replicates per guide.

The final cell health target matrix was 357 x 72.

Training and Testing Split

In #22, we split the data randomly into 85% training and 15% testing sets.

Machine Learning

Our goal is to assess how well the cell painting features can capture the signal of each of the 72 cell health outputs.

We approached this machine learning problem in three different ways:

Raw cell health measurements
Transform cell health measurements to a scale between 0 and 1
Binarize cell health labels into high/low

I am currently using kmeans to find two clusters in each of the 72 cell health labels independently.

The first two approaches require a regression approach, while the third can be approached as a classification problem.
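The second and third target transformations can be sketched in plain Python. The min-max scaling and the two-cluster k-means below (Lloyd's algorithm on one-dimensional data, initialized at the minimum and maximum) are illustrative assumptions, not the notebook code.

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def two_means_labels(values, max_iter=100):
    """k-means with k=2 on 1-D data; label 1 marks the higher cluster."""
    low_center, high_center = min(values), max(values)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            1 if abs(v - high_center) < abs(v - low_center) else 0
            for v in values
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        low = [v for v, lab in zip(values, labels) if lab == 0]
        high = [v for v, lab in zip(values, labels) if lab == 1]
        if low:
            low_center = statistics.fmean(low)
        if high:
            high_center = statistics.fmean(high)
    return labels


readout = [0.1, 0.2, 0.15, 0.9, 0.95]
scaled = min_max_scale([2, 4, 6])
high_low = two_means_labels(readout)
```

The scaled values stay continuous and suit regression, while the cluster labels turn the same readout into a binary classification target.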

[Hotfix] Analyze Audits Notebook

Remove cell 8 from the notebook (it is redundant)

refs #4

Add MOA annotations to dashboard

MOA annotations are key to interpretation

Cell Health Dyes

The dyes used to generate cell health labels correspond to the following:

Dye	Target
Hoechst	DNA Dye
EdU	S Phase
pH3	Mitosis
gH2AX	DNA Damage

How were guides selected?

Efficacy of replicate correlation across guides could be inflated based on biased selection

Reprocess with Pycytominer - Compress Output

Compression is supported in cytomining/pycytominer#40 - make sure to compress output next iteration

Add single cell cell painting data to online resource

Cleaning Cell Health Assay Label Data

I will document data cleaning issues for processing the Cell Health assay readout labels here

G1/S feature is only captured in A549
There are many duplicate rows in the cell cycle spreadsheets (e.g. A549 B9 Empty is duplicated)
A549 has some additional guides not found in ES2 and HCC44. What are they?
- MAP4K1_ACTN4
- CDK4_CYP27B1-2F
- Tsai_HEK293_4
- Tsai_VEGFA-1
- Tsai_VEGFA_2
- Wang_K562_22_Nongenic-1
- Wang_K562_22_Nongenic-2
- Note that A549 has 83 EMPTY guides, while ES2 and HCC44 have 112 each (i.e. maybe they are also controls)

Change Colors of Target Variables in Performance Summary

The colors are confusing. Change!

Add Functionality to Permute Test Set

Currently, the test set is permuted once and performance metrics reported on this single permutation. As @jccaicedo suggested in a recent checkin, it would be great to generate a distribution of permutations to compare performance against (and simulate p value)
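One way to build such a permutation distribution is sketched below. The helper names, the choice of R² as the metric, and the add-one correction are assumptions for illustration only.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def permutation_p_value(y_true, y_pred, metric, n_permutations=1000, seed=0):
    """Compare the observed score against scores on permuted targets.

    Returns (observed score, empirical p value). The add-one correction
    keeps the p value above zero for a finite number of permutations.
    """
    rng = random.Random(seed)
    observed = metric(y_true, y_pred)
    shuffled = list(y_true)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if metric(shuffled, y_pred) >= observed:
            at_least_as_good += 1
    return observed, (at_least_as_good + 1) / (n_permutations + 1)


targets = [float(v) for v in range(20)]
predictions = [v + 0.1 for v in targets]
observed_r2, p_value = permutation_p_value(targets, predictions, r_squared)
```

With many permutations, the fraction of permuted scores that match or beat the observed score gives a simulated p value instead of a single, possibly unrepresentative, shuffled comparison.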

[Hotfix] Typo in x axis title of CRISPR replicate guide correlation

targetting -> targeting

Get Sample Images

We have access to Cell Painting Images

We need to look at them! Will be good for presentations.

Unfortunately, we no longer have cell health images.

What is "Chr2" in `Metadata_gene_name`?

I assume that it means "Chromosome 2"? Is this correct? So there was a CRISPR guide to target anywhere on the gene? Or is this some kind of control?

I see that the number of CRISPR constructs targeting Chr2 is very high.

gene	num constructs
Chr2	144
PSMA1	48
ORC4	48
POLR2D	48
PPIB	48
LacZ	24
Luc	24
...	...

cc @shntnu
