broadinstitute / lincs-cell-painting
Processed Cell Painting Data for the LINCS Drug Repurposing Project
License: BSD 3-Clause "New" or "Revised" License
While talking to Mattias, I noticed that there are some gaps in explaining how we get to the consensus data.
spherized_profiles in the profiler folder makes more sense, I think.

Sorry for all these suggestions at once. I don't know what is on your plate (haha) right now @gwaygenomics, so I'm unsure how to move forward with these suggestions.
In #34, I added preliminary cell count files. However, they included an extra column and were tab-separated (see #34 (comment)).
In 2141da9 I removed the cell count files in order to process them more consistently. The files are currently being generated, and this issue will be closed once they are added back.
While working on CPJUMP-Stain2, @shntnu and I observed that the proportion of compounds with a strong signal (percent strong metric) was similar if the analysis was performed with individual channels or across all channels. We wanted to find out if this behavior was seen in other datasets as well.
I chose one of the platemaps (H-BIOA-002-1) from BBBC022 and computed the channel-wise correlation values. Based on the results below, it looks like BBBC022 also behaves similarly.
Performing this experiment with a larger dataset, such as LINCS, may help answer whether the above plots are technical artifacts or if this behavior is consistent across datasets.
Hello! Where can I find which markers/stains each of the 5 image channels correspond to?
See #46
Save file as GCT after this line
Use https://github.com/cytomining/pycytominer/blob/master/pycytominer/write_gct.py
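pycytominer's write_gct is the intended tool here. For orientation, below is a minimal sketch of the simpler GCT v1.2 layout in plain Python (file name, feature names, and values are made up for illustration); prefer the pycytominer helper for real data:

```python
import csv

def write_gct_v12(features, sample_ids, matrix, path):
    """Write a minimal GCT v1.2 file.

    features: row (feature) names
    sample_ids: column (profile) identifiers
    matrix: rows of floats, shape len(features) x len(sample_ids)
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        # Version line, then "<rows>\t<cols>" dimensions line
        handle.write("#1.2\n")
        handle.write(f"{len(features)}\t{len(sample_ids)}\n")
        writer.writerow(["NAME", "Description", *sample_ids])
        for name, row in zip(features, matrix):
            writer.writerow([name, "na", *row])

write_gct_v12(
    ["Cells_AreaShape_Area", "Nuclei_Intensity_MeanIntensity_DNA"],
    ["SQ00014813_A01", "SQ00014813_A02"],
    [[0.1, -0.2], [1.5, 0.7]],
    "example.gct",
)
```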
We concluded that it is ok to have Level 3-5 data on GitHub, although we will cite clue.io as the primary source and references for this data, similar to this README by @gwaygenomics
The following data will eventually be made available on clue.io/morphology
Level 1-5 + connectivity file of
@shntnu - I remember you mentioning that we have another batch of Cell Painting data for this project. Can you point me to where this data lives?
We should work towards getting this data on here and processed. @sMyn42 is looking for a good dataset to test different batch effect correction tools (e.g. Harmony) to extend his Summer Research project. Having the second batch on here will help!
I had previously settled on using InChIKey14 as the common field for mapping across different repurposing hub versions (#13), partly due to the success in manually mapping three compounds across all the versions (#11 (comment)). Also, since only 45/1514 compounds (#11 (comment)) from the repurposing profiles dataset do not map to any broad_ids in the most recent repurposing hub version (20200324), this approach may be the most effective.

But given #17, it may be worth repeating this pipeline with the full InChIKey as the common field for merging, since InChIKey does uniquely identify stereoisomers. My current assumption is that many more than 45 compounds from the repurposing profiles dataset will fail to map to the most recent broad_ids, but I believe it will be useful to know the actual number.
@gwaygenomics I can begin by creating a new PR that modifies the mapping code (2.map-broad_id.ipynb), and perhaps you could re-run the rest of the pipeline to generate a table similar to #11 (comment)?
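For concreteness, here is a minimal pandas sketch of the trade-off: InChIKey14 is the first hyphen-delimited block of the full key and collapses stereoisomers, while the full InChIKey keeps them distinct. The keys and broad_ids below are fabricated placeholders, not real compounds:

```python
import pandas as pd

def inchikey14(inchikey: str) -> str:
    # The first 14-character block of an InChIKey hashes the molecular
    # skeleton only; stereoisomers share it
    return inchikey.split("-")[0]

# Two hypothetical stereoisomers: same skeleton block, different full keys
profiles = pd.DataFrame({
    "InChIKey": ["AAAABBBBCCCCDD-AAAAAAAASA-N", "AAAABBBBCCCCDD-BBBBBBBBSA-N"],
    "broad_id": ["BRD-X00000001", "BRD-X00000002"],
})
profiles["InChIKey14"] = profiles["InChIKey"].apply(inchikey14)

# Merging on InChIKey14 would collapse these two compounds into one group;
# merging on the full InChIKey keeps them separate
n_groups_14 = profiles["InChIKey14"].nunique()
n_groups_full = profiles["InChIKey"].nunique()
```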
In the cell health project, I noticed some strange behavior with a specific plate.

The plate is SQ00015221, coming from plate map C-7161-01-LM6-011. The offending features seem to be based on Correlation_RWC.

We include the plate in this repo, but I am adding a note here that we should revisit. I have a sneaking suspicion that the issue stems from the missing values and zero issue noted in cytomining/pycytominer#79 and described in cytomining/cytominergallery#62.
In the past, I've noticed that some users of the data struggle to access data in git lfs. We need to add downloading (and perhaps submodule setup) instructions to the README of this repo.
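As a starting point for those README instructions, the standard git lfs download flow is sketched below (this assumes git-lfs is already installed via your package manager):

```shell
# One-time setup: register the lfs filters with git
git lfs install

# Clone the repo; lfs-tracked profile files are fetched automatically
git clone https://github.com/broadinstitute/lincs-cell-painting.git
cd lincs-cell-painting

# If a clone contains only lfs pointer files, download the real content
git lfs pull
```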
Given that we create a single CSV file for the spherized profiles in this notebook, it will be easiest to compute the consensus in the same notebook.

The output should be stored at lincs-cell-painting/spherized_profiles/consensus and be named:

- 2016_04_01_a549_48hr_batch1_dmso_spherized_profiles_with_input_normalized_by_dmso_consensus_median.csv.gz
- 2016_04_01_a549_48hr_batch1_dmso_spherized_profiles_with_input_normalized_by_whole_plate_consensus_median.csv.gz
- 2016_04_01_a549_48hr_batch1_dmso_spherized_profiles_with_input_normalized_by_dmso_consensus_modz.csv.gz
- 2016_04_01_a549_48hr_batch1_dmso_spherized_profiles_with_input_normalized_by_whole_plate_consensus_modz.csv.gz

i.e., a median and a modz consensus for each of the two Batch 1 files in this directory, and the same for Batch 2 (2017_12_05_Batch2).
My current plan is to compute both median and MODZ consensus signatures.
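The median consensus is essentially a groupby over the replicate metadata. A toy pandas sketch (column names are illustrative; MODZ, a replicate-correlation-weighted average, is more involved and not shown):

```python
import pandas as pd

# Toy spherized profiles: two replicate wells per perturbation, two features
profiles = pd.DataFrame({
    "Metadata_broad_sample": ["BRD-A", "BRD-A", "BRD-B", "BRD-B"],
    "Cells_Feature_1": [0.1, 0.3, -1.0, -1.2],
    "Nuclei_Feature_2": [2.0, 2.2, 0.5, 0.7],
})

# Median consensus: collapse replicate wells into one signature per perturbation
consensus_median = (
    profiles
    .groupby("Metadata_broad_sample", as_index=False)
    .median(numeric_only=True)
)
```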
Unifying documentation for this step from #4 (comment) and #34 (comment)
Ideally, I should have added a conda environment (and setup instructions) in #7 - but alas, I didn't!

In #13 @niranjchandrasekaran introduced rdkit as a dependency (which inspired the creation of this issue). It looks like it exists in conda-forge, so it should not be a problem.

@niranjchandrasekaran - what version did you use?
We'd ideally like to make all single cell SQLite files publicly available. As @shntnu noted to me in a separate email, the lab has a process in place to accomplish this, which is great!
To summarize the plan that @shntnu outlined:
However, this step has two blocking tasks:
The big lift here is unarchiving the data
All we need to do here is copy the unarchived SQLite to s3://cellpainting-gallery/cpg0004-lincs/broad/workspace/backend/${batch}.
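A sketch of that copy step with the AWS CLI (this assumes credentials are configured and that ${batch} is set to a batch name such as 2016_04_01_a549_48hr_batch1; the local backend/ path is illustrative):

```shell
batch=2016_04_01_a549_48hr_batch1

# Recursively copy the unarchived SQLite backend for one batch
aws s3 cp "backend/${batch}/" \
  "s3://cellpainting-gallery/cpg0004-lincs/broad/workspace/backend/${batch}/" \
  --recursive
```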
In #12 we used InChIKey14 to map broad_ids, and in #11 we discussed why this is important.

While processing some data, I noticed that InChIKey14s do not map uniquely to MOAs and targets. I guess this is not surprising, given that drugs are often used for different indications in various clinical phases, but it is worth documenting here! It is dangerous to use InChIKey14s to map directly to MOA/target.

For example, InChIKey14 KTEIFNKAUNYNJU maps to two MOA/target pairs. However, it looks like the full InChIKey does map uniquely. I didn't comprehensively explore this.

@niranjchandrasekaran - maybe I missed this, but was there a reason to use InChIKey14 instead of the full InChIKey?
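A quick way to surface such collisions with pandas. Only the KTEIFNKAUNYNJU skeleton and the two MOAs come from this dataset; the full-key suffixes below are fabricated for illustration:

```python
import pandas as pd

# Toy annotation table: one InChIKey14 shared by two compounds with
# different MOAs (suffixes after the first hyphen are made up)
annotations = pd.DataFrame({
    "InChIKey": ["KTEIFNKAUNYNJU-AAAAAAAASA-N", "KTEIFNKAUNYNJU-BBBBBBBBSA-N"],
    "moa": ["diuretic", "mucolytic agent"],
})
annotations["InChIKey14"] = annotations["InChIKey"].str.split("-").str[0]

# Flag any InChIKey14 that maps to more than one MOA
collisions = (
    annotations.groupby("InChIKey14")["moa"]
    .nunique()
    .loc[lambda counts: counts > 1]
)
```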
It is very useful for researchers to be able to browse a heat map with a dendrogram attached (or some other representation - it's hard!) and look at relationships among the samples (drugs in this case).
It would therefore be great to create such a visualization in sharable format, for Cell Painting profiles and also for L1000 profiles, to compare them qualitatively.
@shntnu - we should add an open source license to this repo before adding profiles. Have we thought about which license we should apply here?
The profiles deposited in #34 do not include whitening normalization. Previously (see #4 (comment)), I elected to leave the whitened data to a future data upload because of this caveat:

Pycytominer currently does have a whiten implementation, and I applied it to the two 4a profiles in a test case. The test case did not go smoothly, so it is likely I will need to tinker with the pycytominer implementation a bit (hard to estimate how long the delay will be).

@shntnu also notes in #4 (comment):

Going forward, we will very likely produce at least two different Level 4a profiles

- whole-well z-scored
- DMSO z-scored

because depending on the layout, one might be better than the other.

We will then produce corresponding 4b (normalized, feature selected) versions of the two 4a profiles.

We will also produce corresponding 4w (normalized and whitened) versions of the two 4a profiles.
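For intuition, here is a minimal numpy sketch of a whitening (ZCA/"sphering") transform fit on a background population such as DMSO wells. This is an illustration of the technique, not pycytominer's implementation; the epsilon regularizer guards against tiny eigenvalues:

```python
import numpy as np

def zca_whiten(profiles, background, epsilon=1e-6):
    """ZCA-whiten `profiles` using statistics estimated from `background`.

    profiles: (n, d) array to transform
    background: (m, d) array, e.g. DMSO control wells
    """
    mu = background.mean(axis=0)
    cov = np.cov(background - mu, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rotate, rescale each eigen-direction to unit variance, rotate back
    transform = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + epsilon)) @ eigvecs.T
    return (profiles - mu) @ transform

# Correlated toy "DMSO" data; after whitening its covariance is ~identity
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.5]])
dmso = rng.normal(size=(500, 3)) @ mixing
whitened = zca_whiten(dmso, dmso)
```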
@gwaygenomics Were MOA and target columns left out intentionally? These columns are not present:

- Metadata_moa
- Metadata_target

and other cool stuff you'd find in `*_augmented.csv` files
Hello @gwaygenomics
I noticed that there is a discrepancy in some mappings from Broad ID to MOA.
For instance in repurposing_info_long.tsv, the Broad ID: BRD-K66035042-001-10-1 maps to the MOA: mucolytic agent.
While in repurposing_info_external_moa_map_resolved.tsv, the same Broad ID maps to the MOA: diuretic. Which .tsv file is correct? Thank you!
Something is wrong with plate SQ00015049. We successfully processed all other plates except this one. Below is the error:
Now processing... Plate: SQ00015049
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
cursor, statement, parameters, context
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
cursor.execute(statement, parameters)
sqlite3.DatabaseError: database disk image is malformed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "profile.py", line 56, in <module>
ap = AggregateProfiles(sql_file=sql_file, strata=strata, operation=aggregate_method)
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/pycytominer/aggregate.py", line 86, in __init__
self.load_image()
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/pycytominer/aggregate.py", line 118, in load_image
self.image_df = pd.read_sql(sql=image_query, con=self.conn)
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/pandas/io/sql.py", line 438, in read_sql
chunksize=chunksize,
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/pandas/io/sql.py", line 1218, in read_query
result = self.execute(*args)
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/pandas/io/sql.py", line 1087, in execute
return self.connectable.execute(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 976, in execute
return self._execute_text(object_, multiparams, params)
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1151, in _execute_text
parameters,
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1288, in _execute_context
e, statement, parameters, cursor, context
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1482, in _handle_dbapi_exception
sqlalchemy_exception, with_traceback=exc_info[2], from_=e
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
raise exception
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
cursor, statement, parameters, context
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.DatabaseError: (sqlite3.DatabaseError) database disk image is malformed
[SQL: select TableNumber, ImageNumber, Image_Metadata_Plate, Image_Metadata_Well from image]
(Background on this error at: http://sqlalche.me/e/4xp6)
Juan asked this:
I need to process the LINCS dataset to proceed with the plan we discussed for LUAD. I'm going to need access to the images, which is a lot of data! Here is my plan to make it efficient, and I would like to get your feedback and recommendations:
Does this make sense? Do you have any recommendations for me before moving forward?
By the way, the compression should take roughly 4 hours per plate (pessimistic estimate), and can be run in parallel, with multiple plates per machine (one per CPU). So using 30 cheap instances (with 4 cores each) in spot mode should do the trick in one day, including the operator's time :)
I am working towards processing all Drug Repurposing data and adding the results in this repository. The cell health project (https://github.com/broadinstitute/cell-health) now requires that the data are uniformly processed, documented, and made available here.
I will outline below the necessary steps required to get the data and processing pipelines uploaded.
- Decide on the robustize_mad normalization strategy, which will also require a decision on whole-plate or DMSO-specific normalization
- Update the 4.apply module in cell-health
- Make the lincs-cell-painting profile repository a submodule of the cell-health project

In #73 @michaelbornholdt fixed the workflow diagram for processing the LINCS cell painting profiles.
We have since realized that this workflow diagram is incorrect.
@michaelbornholdt - are you able to adjust the diagram and file a pull request to correct the figure? We will need to do this before we submit the manuscript (which will be soon!)
I am working on this now.
A couple of cross-references to track the history of DVC discussions:
This is a reminder to tackle soon.
I don't know what I am doing wrong, but when I download some non-spherized consensus data and run enrichment over it, the results are awful. From past experience I know that this is due to extremely high correlation values introduced by the normalization.

I thought we had fixed this... maybe not.
Use the cytominer-eval library. An example (https://github.com/jump-cellpainting/develop-computational-pipeline/issues/4#issuecomment-693006903) is pasted below:
After installing with:
pip install git+https://github.com/cytomining/cytominer-eval@56bd9e545d4ce5dea8c2d3897024a4eb241d06db
This now works:
import pandas as pd
from cytominer_eval import evaluate
from pycytominer.cyto_utils import infer_cp_features
file = "https://github.com/broadinstitute/lincs-cell-painting/raw/master/profiles/2016_04_01_a549_48hr_batch1/SQ00014813/SQ00014813_normalized_feature_select_dmso.csv.gz"
df = pd.read_csv(file)
features = infer_cp_features(df)
meta_features = infer_cp_features(df, metadata=True)
replicate_groups = ["Metadata_broad_sample", "Metadata_mg_per_ml"]
evaluate(
    profiles=df,
    features=features,
    meta_features=meta_features,
    replicate_groups=replicate_groups,
    operation="percent_strong",
    percent_strong_quantile=0.95
)
# Output: 0.32598039215686275
operation="grit" and operation="precision_recall" are also implemented (see https://github.com/cytomining/cytominer-eval/blob/master/cytominer_eval/evaluate.py for details).
Create 1571x1571 matrix of connectivities between compounds. Details tbd.
In #63 I add spherized profiles for batch 1 and batch 2 LINCS data. Here is a birds-eye view of what the profiles look like:
It is hard to determine from this view exactly how much the profile quality has improved. The DMSO profiles are still distributed widely in the UMAP space, but many compounds form distinct islands. It also doesn't look too different (at least at a cursory glance) from a non-spherized (level 4 profiles) LINCS dataset (see here).
Batch 2 profiles are potentially more interesting. We see distinct islands separated by cell type! This is expected, but also quite exciting. I do not have UMAP coordinates with level 4 profiles.
We need to finalize authors for the version 1 release of this repository.
The authors should be those involved in the LINCS Cell Painting data creation. This means individuals involved in:
I know that this is a complex task, but its complexity matches its importance. Once we define authors we can ensure proper attribution to all papers that use this data. @shntnu - please help me with this :)
Once we define authors, I will also:
We have made substantial progress in pycytominer since the version 0.1 release. We need to update the environment.yml file and update the profiling pipeline to account for this change.
Currently, all profiles are normalized with mad_robustize. We can use this repository to systematically evaluate whether one strategy is better than another.
As noted in #4 (comment)
The default in cytominer_scripts/normalize.R is robustize. I assume that I should continue using this method.
Yes. Rationale: mostly empirical – robustize resulted in higher (compared to standardize) replicate correlations of Level 4 across a few experiments we tested this in.
We get annoying file diff triggers when reprocessing the pipeline, even if nothing in the file changes. This is important to fix so that we can isolate the actual changes that result from reprocessing output data.

As @shntnu notes in #48, the reason the gzip files trigger positive diffs is an embedded timestamp.
The way to remove the timestamp from the file is to pass the --no-name (-n) flag to the gzip command. See http://linuxcommand.org/lc3_man_pages/gzip1.html
Fortunately, it looks like pandas-dev/pandas#33398 has added the ability to pass args to pandas' gzip compression. This improvement will be included in pandas version 1.1, which is scheduled for an Aug 1 release.
For the pandas or python option, the solution should ideally live in pycytominer. I've created a stub for this at cytomining/pycytominer#83
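Until then, the same effect is available from Python's gzip module directly: mtime=0 omits the varying timestamp from the gzip header, mirroring gzip --no-name, so repeated writes of identical content produce identical bytes. A sketch (the DataFrame contents are illustrative):

```python
import gzip
import io

import pandas as pd

df = pd.DataFrame({"Metadata_Plate": ["SQ00014813"], "Cells_Feature_1": [0.5]})

def to_deterministic_gzip_csv(frame: pd.DataFrame) -> bytes:
    # mtime=0 pins the gzip header timestamp, so output is reproducible
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb", mtime=0) as handle:
        handle.write(frame.to_csv(index=False).encode("utf-8"))
    return buffer.getvalue()

first = to_deterministic_gzip_csv(df)
second = to_deterministic_gzip_csv(df)
# identical bytes -> no spurious file diffs when reprocessing
```

With pandas >= 1.1, passing compression={"method": "gzip", "mtime": 0} to to_csv should forward the same option to gzip.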
A potentially fun analysis would be to evaluate how the different moa/target annotations updated over time (in different CLUE drugs/samples versions) influence moa/target recall.
Essentially, we would set up an eval framework (I imagine there is a traditional moa/target recall eval) where we use the same input profiles and alter the moa/target information as it has been updated over time.
If we see improvement over time this tells us that annotations are improving and, potentially, that there is even more room to improve categorization.
I am following up #7 with an additional notebook to create a simple, basic mapping file with only a handful of columns. This includes creating a pert_id column, which is a 13-character subset of the full 22-character broad_id column. The additional 9 characters contain batch info about the compound. More details about this procedure are here: #5 (comment)

In generating this data, I noticed that 16 perturbations (by pert_id) contain conflicting information (by pert_iname, moa, or target). I paste all of the conflicting info below:
  | pert_id | pert_iname | moa | target
---|---|---|---|---
0 | BRD-A03204438 | allopregnanolone | GABA receptor positive allosteric modulator | GABRA1
1 | BRD-A03204438 | pregnanolone | GABA receptor positive allosteric modulator | nan
2 | BRD-K05674516 | sofosbuvir | RNA polymerase inhibitor | nan
3 | BRD-K05674516 | PSI-7976 | HCV inhibitor | nan
4 | BRD-K17498618 | betaxolol | adrenergic receptor antagonist | ADRB1
5 | BRD-K17498618 | cisatracurium | acetylcholine receptor antagonist | CHRNA2
6 | BRD-K20672254 | pyrantel-tartrate | acetylcholine receptor agonist | CHRNA1
7 | BRD-K20672254 | pyrantel-pamoate | neuromuscular blocker | nan
8 | BRD-K25650355 | physostigmine-salicylate | acetylcholinesterase inhibitor | nan
9 | BRD-K25650355 | physostigmine | cholinesterase inhibitor | ACHE
10 | BRD-K29713308 | mebhydrolin | antihistamine | nan
11 | BRD-K29713308 | mebhydroline-1,5-naphtalenedisulfonate | nan | nan
12 | BRD-K35952844 | calcium-gluceptate | nan | nan
13 | BRD-K35952844 | sodium-glucoheptonate | nan | nan
14 | BRD-K41260949 | valproic-acid | HDAC inhibitor | ABAT
15 | BRD-K41260949 | divalproex-sodium | benzodiazepine receptor agonist | ALDH5A1
16 | BRD-K66035042 | mannitol-D | diuretic | nan
17 | BRD-K66035042 | sorbitol | mucolytic agent | nan
18 | BRD-K71013094 | neomycin-sulfate | bacterial 30S ribosomal subunit inhibitor | nan
19 | BRD-K71013094 | neomycin | bacterial 30S ribosomal subunit inhibitor | CXCR4
20 | BRD-K79450420 | INCB-024360 | indoleamine 2,3-dioxygenase inhibitor | IDO1
21 | BRD-K79450420 | epacadostat | indoleamine 2,3-dioxygenase inhibitor | IDO1
22 | BRD-K87202646 | isoniazid | FABI inhibitor | CYP1A2
23 | BRD-K87202646 | pasiniazid | cyclooxygenase inhibitor | nan
24 | BRD-K93632104 | salicylic-acid | cyclooxygenase inhibitor | AKR1C1
25 | BRD-K93632104 | sodium-salicylate | prostanoid receptor antagonist | ASIC3
26 | BRD-K97799481 | theophylline | adenosine receptor antagonist | nan
27 | BRD-K97799481 | aminophylline | adenosine receptor antagonist | ADORA1
28 | BRD-K97799481 | oxtriphylline | adenosine receptor antagonist | ADORA1
29 | BRD-M55114534 | pyrvinium | androgen receptor antagonist | nan
30 | BRD-M55114534 | pyrvinium-pamoate | androgen receptor antagonist | AR
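The pert_id derivation and the conflict check above can be sketched with pandas. The rows are taken from the table, but the second broad_id suffix is made up for illustration:

```python
import pandas as pd

# pert_id is the first 13 characters of the 22-character broad_id
samples = pd.DataFrame({
    "broad_id": ["BRD-K66035042-001-10-1", "BRD-K66035042-003-02-9"],
    "pert_iname": ["mannitol-D", "sorbitol"],
    "moa": ["diuretic", "mucolytic agent"],
})
samples["pert_id"] = samples["broad_id"].str[:13]

# A pert_id "conflicts" when its rows disagree on pert_iname or moa
conflicts = (
    samples.groupby("pert_id")[["pert_iname", "moa"]]
    .nunique()
    .loc[lambda frame: (frame > 1).any(axis=1)]
)
```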
I am leaning towards doing this. To work toward reprocessing, we need to accomplish the following:
What headaches will an updated pycytominer resolve?
- epsilon in spherize()

Rerunning the pipeline will also enable us to migrate from git lfs to dvc.
At the profiling check-in today, we discussed our strategy for adding (and evaluating) profiles in this project.
This issue supersedes #4
In this issue, I will discuss results of step 3 outlined in #22 (comment)
Note that this is copied and pasted from a notebook that will be added in a future pull request. Details in this notebook will guide our discussion of the results.
We have previously processed all of the Drug Repurposing Hub Cell Painting data using cytominer, an R-based image-based profiling tool. In this repo, we reprocess the data with pycytominer which, as the name connotes, is a Python-based image-based profiling tool.
We include all processing scripts and present the pycytominer profiles in this open source repository. The repository represents a unified bioinformatics pipeline applied to all Cell Painting Drug Repurposing Profiles. In this notebook, we compare the resulting output data between the processing pipelines for the two tools: Cytominer and pycytominer.
We output several metrics comparing the two approaches.
In all cases, we calculate the element-wise absolute value difference between pycytominer and cytominer profiles.
In addition, we confirm alignment of the following metadata columns:
Other metadata columns are not expected to be aligned. For example, we have updated MOA and Target information in the pycytominer version.
Image-based profiling results in the following output data levels. We do not compare all data levels in this notebook.
Data | Level | Comparison
---|---|---
Images | Level 1 | NA
SQLite File (single cell profiles) | Level 2 | NA
Aggregated Profiles with Well Information (metadata) | Level 3 | Yes
Normalized Aggregated Profiles with Metadata | Level 4a | Yes
Normalized and Feature Selected Aggregated Profiles with Metadata | Level 4b | Yes
Perturbation Profiles created Summarizing Replicates | Level 5 | No
MB said:
I have found a “error” in the Lincs dataset and I was wondering if you guys knew of this and if there needs to be some fixing of the pycyto pipeline? I am analyzing the Level 5 consensus data from here. When running the cyto eval functions on this data, I noticed some very high correlations. They come from this one feature (Nuclei_AreaShape_MedianRadius) that is 10^13 times larger than the others. The image shows a scatter plot of two samples which have a 1.000 similarity but are different compounds.
This is almost definitely because the MAD of these features is zero in DMSO (at least for the plates that those compounds come from).
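To make the failure mode concrete, here is a toy sketch of a MAD-based robust z-score. This mirrors, but is not, pycytominer's mad_robustize: when the control MAD is exactly zero, the division explodes, and a small epsilon keeps the output finite:

```python
import numpy as np

def mad_robustize(values, controls, epsilon=0.0):
    """Robust z-score sketch: (x - median(controls)) / (MAD(controls) + epsilon)."""
    center = np.median(controls)
    mad = np.median(np.abs(controls - center))
    return (values - center) / (mad + epsilon)

# A feature constant at its median in DMSO has MAD == 0
dmso = np.array([5.0, 5.0, 5.0, 5.0, 5.1])
treated = np.array([5.2, 6.0])

unsafe = mad_robustize(treated, dmso)              # divides by zero -> inf
safe = mad_robustize(treated, dmso, epsilon=1e-6)  # huge, but finite
```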
Add drop_outliers to https://github.com/broadinstitute/lincs-cell-painting/blob/master/profiles/profile_cells.py

We have additional information for each compound assayed in the Drug Repurposing Hub Cell Painting Dataset.
There are at least four files on AWS that could all work as a reference to describe compound metadata.
File Name | Columns
---|---
pert_info.txt | pert_id, pert_iname, pert_type, moa
pert_iname_moa.txt | pert_iname, moa, source, url, support, num_sources
pert_id_to_iname.txt | pert_id, pert_iname, pert_type
pert_iname_moa_aggregated.txt | pert_iname, moa, pert_iname_modified
Below I summarize each of the files.
@FloHu noticed that the original names of the features are carried through after spherizing.
We might not want to do that, since it can confuse people actually using the consensus data later on. I think the features should just be named feature_1 ... feature_x.
The spherize notebook documentation should be improved.
The notebook says
Here, we load in all normalized profiles (level 4a) data across all plates and apply a spherize transform using the DMSO profiles as the background distribution.
but it should say
Here, we load in all normalized profiles (level 4a) data across all plates, apply the standard set of feature selection operations, and then apply a spherize transform using the DMSO profiles as the background distribution.
This is a very easy fix, and a good beginner issue!
Hi @gwaygenomics and team,
We've worked on some cell profiling tools and would be interested in trying them on this dataset. Unfortunately, I am having trouble downloading the profile data. Could you share some pointers to help with that?

At the moment I have tried the dvc get command line tools (I am new to it but quite excited by the concept 😊), but I am probably doing something wrong (I tried this on Windows 10 from a PowerShell terminal).
Thanks for your help,
Kind regards,
Benoit
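For reference, a dvc get invocation generally looks like the sketch below; the tracked-file path and output name are placeholders, not real paths in this repo:

```shell
# Fetch one dvc-tracked file from a git repo without cloning it
# (assumes dvc is installed, e.g. pip install "dvc[s3]")
dvc get https://github.com/broadinstitute/lincs-cell-painting \
  <path-to-tracked-file> -o <local-output>
```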
I have encountered a perhaps significant hurdle in adding Cell Painting Repurposing Hub profiles to this repo.

There are broad ids (pert_ids) in the profile data that are absent from the updated MOA information.

For example, in one plate (SQ00014814) the following pert_ids are present (with annotations) in the profile data, but are absent in the repurposing MOA files in this repo:
['BRD-A69275535',
'BRD-A69636825',
'BRD-A69815203',
'BRD-A72309220',
'BRD-A72390365',
'BRD-A74980173',
'BRD-A82156122',
'BRD-K50691590',
'BRD-K68164687',
'BRD-K71480163',
'BRD-K81258678',
'BRD-K81957469']
Given that these pert_ids have annotations in the cytominer-derived profiles, this indicates that the pert_ids have changed somewhere.

Before I pursue this issue, I was wondering if there are any known solutions or datasets that map old to updated pert_ids. cc @shntnu @niranjchandrasekaran

Perhaps @jrsacher also has insight here. Josh, I scanned the CLUE and DepMap resources and was not able to find a map. I also checked the deprecated_broad_id column, and I was able to recover 3 of the profiles (['BRD-K50691590', 'BRD-K50691590', 'BRD-K81258678']).

Any insights or pointers here would be greatly appreciated!
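The deprecated_broad_id recovery step can be sketched with pandas. The missing pert_ids below come from the list above, but the current broad_ids, deprecated suffixes, and MOAs are fabricated for illustration:

```python
import pandas as pd

# pert_ids present in the profiles but missing from the updated moa files
missing = pd.Series(["BRD-K50691590", "BRD-K81258678", "BRD-A69275535"])

# Toy slice of repurposing metadata (values illustrative, not real mappings)
hub = pd.DataFrame({
    "broad_id": ["BRD-K99999991-001-01-1", "BRD-K99999992-001-01-1"],
    "deprecated_broad_id": ["BRD-K50691590-001-01-1", "BRD-K81258678-001-01-1"],
    "moa": ["moa_a", "moa_b"],
})
# Trim the deprecated broad_id down to its 13-character pert_id
hub["deprecated_pert_id"] = hub["deprecated_broad_id"].str[:13]

# Missing pert_ids recoverable via the deprecated_broad_id column
recovered = missing[missing.isin(hub["deprecated_pert_id"])]
```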
I am in the process of adding level 3-5 profiles to this repo (using git lfs). I will use this issue to document various questions I have about the process.

Should we also keep the cytominer profiles here? We should consider the pycytominer-based profiles less mature (and therefore less stable). The cytominer profiles are the ones that were originally computed.
/home/ubuntu/bucket/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/backend/2016_04_01_a549_48hr_batch1
@gwaygenomics it might be useful to define the folder structure of the data. Here's what I am thinking: cytomining/profiling-handbook#54 (comment) - but feel free to propose alternatives.
This was our thinking when we defined the folder structure (transcribed by @gwaygenomics)
We will upload image files to the Image Data Resource and add URL and metadata information to the Broad Bioimage Benchmark Collection.
We will use this issue to outline the required steps.
From IDR:
All files should be in tab-delimited text format.
Templates are provided but can be modified to suit your experiment.
Add or remove columns from the templates as necessary.
@gwaygenomics Did you have a processed data file for cell health?
I am working on step 4 of #5 (comment) and came across two discrepancies between the samples and drugs files. They are likely very minor and can easily be resolved, but I am noting them here for completeness.

When comparing pert_iname between the two files (drugs and samples), every single pert_iname entry in the drugs file is found in the samples file. However, two pert_iname entries are found in the samples file and not in the drugs file.

The two pert_iname entries are:
- YM-298198-desmethyl
- golgicide-A
YM-298198-desmethyl

The compound YM-298198-desmethyl is missing from the drugs file, but the entry YM-298198 is present. YM-298198-desmethyl is a derivative of YM-298198, and therefore has a different structure.

I think it is safe to duplicate the YM-298198 entry in the drugs file and make its pert_iname YM-298198-desmethyl.
golgicide-A

This appears to be, simply, an issue of capitalization. See below:

Note the exact same smiles string but different broad_ids (because of different purities and vendors).

I will rename the first entry in the samples file for BRD-A57886255-001-02-9 to have a lower-case golgicide-A.
Two pert_iname entries had conflicts; the proposed solutions will remedy them. Note that I will include the Jupyter notebook that performs this adjustment in the repo.