broadinstitute / lincs-cell-painting
Processed Cell Painting Data for the LINCS Drug Repurposing Project
License: BSD 3-Clause "New" or "Revised" License
While talking to Mattias, I noticed that there are some gaps in explaining how we get to the consensus data.
spherized_profiles in the profiler folder makes more sense, I think.

Sorry for all these suggestions at once. I don't know what is on your plate (haha) right now @gwaygenomics, so I'm unsure how to move forward with these suggestions.
In #34, I added preliminary cell count files. However, they included an extra column and were tab-separated (see #34 (comment)).
In 2141da9 I removed the cell count files in order to process them more consistently. The files are currently being generated, and this issue will be closed once they are added back.
While working on CPJUMP-Stain2, @shntnu and I observed that the proportion of compounds with a strong signal (percent strong metric) was similar if the analysis was performed with individual channels or across all channels. We wanted to find out if this behavior was seen in other datasets as well.
I chose one of the platemaps (H-BIOA-002-1) from BBBC022 and computed the channel-wise correlation values. Based on the results below, it looks like BBBC022 also behaves similarly.
Performing this experiment with a larger dataset, such as LINCS, may help answer whether the above plots are technical artifacts or if this behavior is consistent across datasets.
Hello! Where can I find which markers/stains each of the 5 image channels correspond to?
See #46
Save file as GCT after this line
Use https://github.com/cytomining/pycytominer/blob/master/pycytominer/write_gct.py
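pycytominer's write_gct is the intended tool here. For orientation, below is a minimal sketch of the simpler GCT v1.2 layout in plain Python (file name, feature names, and values are made up for illustration); prefer the pycytominer helper for real data:

```python
import csv

def write_gct_v12(features, sample_ids, matrix, path):
    """Write a minimal GCT v1.2 file.

    features: row (feature) names
    sample_ids: column (profile) identifiers
    matrix: rows of floats, shape len(features) x len(sample_ids)
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        # Version line, then "<rows>\t<cols>" dimensions line
        handle.write("#1.2\n")
        handle.write(f"{len(features)}\t{len(sample_ids)}\n")
        writer.writerow(["NAME", "Description", *sample_ids])
        for name, row in zip(features, matrix):
            writer.writerow([name, "na", *row])

write_gct_v12(
    ["Cells_AreaShape_Area", "Nuclei_Intensity_MeanIntensity_DNA"],
    ["SQ00014813_A01", "SQ00014813_A02"],
    [[0.1, -0.2], [1.5, 0.7]],
    "example.gct",
)
```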
We concluded that it is ok to have Level 3-5 data on GitHub, although we will cite clue.io as the primary source and references for this data, similar to this README by @gwaygenomics
The following data will eventually be made available on clue.io/morphology
Level 1-5 + connectivity file of
@shntnu - I remember you mentioning that we have another batch of Cell Painting data for this project. Can you point me to where this data lives?
We should work towards getting this data on here and processed. @sMyn42 is looking for a good dataset to test different batch effect correction tools (e.g. Harmony) to extend his Summer Research project. Having the second batch on here will help!
I had previously settled on using InChIKey14 as the common field for mapping across different repurposing hub versions (#13), partly due to the success in manually mapping three compounds across all the versions (#11 (comment)). Also, since only 45/1514 compounds (#11 (comment)) from the repurposing profiles dataset do not map to any broad_ids in the most recent repurposing hub version (20200324), this approach may be the most effective.

But given #17, it may be worth repeating this pipeline with the full InChIKey as the common field for merging, since InChIKey does uniquely identify stereoisomers. My current assumption is that many more than 45 compounds from the repurposing profiles dataset will fail to map to the most recent broad_ids, but I believe it will be useful to know the actual number.
@gwaygenomics I can begin by creating a new PR that modifies the mapping code (2.map-broad_id.ipynb), and perhaps you could re-run the rest of the pipeline to generate a table similar to #11 (comment)?
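For concreteness, here is a minimal pandas sketch of the trade-off: InChIKey14 is the first hyphen-delimited block of the full key and collapses stereoisomers, while the full InChIKey keeps them distinct. The keys and broad_ids below are fabricated placeholders, not real compounds:

```python
import pandas as pd

def inchikey14(inchikey: str) -> str:
    # The first 14-character block of an InChIKey hashes the molecular
    # skeleton only; stereoisomers share it
    return inchikey.split("-")[0]

# Two hypothetical stereoisomers: same skeleton block, different full keys
profiles = pd.DataFrame({
    "InChIKey": ["AAAABBBBCCCCDD-AAAAAAAASA-N", "AAAABBBBCCCCDD-BBBBBBBBSA-N"],
    "broad_id": ["BRD-X00000001", "BRD-X00000002"],
})
profiles["InChIKey14"] = profiles["InChIKey"].apply(inchikey14)

# Merging on InChIKey14 would collapse these two compounds into one group;
# merging on the full InChIKey keeps them separate
n_groups_14 = profiles["InChIKey14"].nunique()
n_groups_full = profiles["InChIKey"].nunique()
```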
In the cell health project, I noticed some strange behavior with a specific plate.

The plate is SQ00015221, coming from plate map C-7161-01-LM6-011. The offending features seem to be based on Correlation_RWC.

We include the plate in this repo, but I am adding a note here that we should revisit. I have a sneaking suspicion that the issue stems from the missing values and zero issue noted in cytomining/pycytominer#79 and described in cytomining/cytominergallery#62.
In the past, I've noticed that some users of the data struggle to access data in git lfs. We need to add downloading (and perhaps submodule setup) instructions to the README of this repo.
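As a starting point for those README instructions, the standard git lfs download flow is sketched below (this assumes git-lfs is already installed via your package manager):

```shell
# One-time setup: register the lfs filters with git
git lfs install

# Clone the repo; lfs-tracked profile files are fetched automatically
git clone https://github.com/broadinstitute/lincs-cell-painting.git
cd lincs-cell-painting

# If a clone contains only lfs pointer files, download the real content
git lfs pull
```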
Given that we create a single CSV file for the spherized profiles in this notebook, it will be easiest to compute the consensus in the same notebook.

The output should be stored at lincs-cell-painting/spherized_profiles/consensus and be named:

- 2016_04_01_a549_48hr_batch1_dmso_spherized_profiles_with_input_normalized_by_dmso_consensus_median.csv.gz
- 2016_04_01_a549_48hr_batch1_dmso_spherized_profiles_with_input_normalized_by_whole_plate_consensus_median.csv.gz
- 2016_04_01_a549_48hr_batch1_dmso_spherized_profiles_with_input_normalized_by_dmso_consensus_modz.csv.gz
- 2016_04_01_a549_48hr_batch1_dmso_spherized_profiles_with_input_normalized_by_whole_plate_consensus_modz.csv.gz

i.e., a median and a modz consensus for each of the two Batch 1 files in this directory, and the same for Batch 2 (2017_12_05_Batch2).
My current plan is to compute both median and MODZ consensus signatures.
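The median consensus is essentially a groupby over the replicate metadata. A toy pandas sketch (column names are illustrative; MODZ, a replicate-correlation-weighted average, is more involved and not shown):

```python
import pandas as pd

# Toy spherized profiles: two replicate wells per perturbation, two features
profiles = pd.DataFrame({
    "Metadata_broad_sample": ["BRD-A", "BRD-A", "BRD-B", "BRD-B"],
    "Cells_Feature_1": [0.1, 0.3, -1.0, -1.2],
    "Nuclei_Feature_2": [2.0, 2.2, 0.5, 0.7],
})

# Median consensus: collapse replicate wells into one signature per perturbation
consensus_median = (
    profiles
    .groupby("Metadata_broad_sample", as_index=False)
    .median(numeric_only=True)
)
```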
Unifying documentation for this step from #4 (comment) and #34 (comment)
Ideally, I should have added a conda environment (and setup instructions) in #7 - but alas, I didn't!

In #13 @niranjchandrasekaran introduced rdkit as a dependency (which inspired the creation of this issue). It looks like it exists in conda-forge, so it should not be a problem.

@niranjchandrasekaran - what version did you use?
We'd ideally like to make all single cell SQLite files publicly available. As @shntnu noted to me in a separate email, the lab has a process in place to accomplish this, which is great!
To summarize the plan that @shntnu outlined:
However, this step has two blocking tasks:
The big lift here is unarchiving the data
All we need to do here is copy the unarchived SQLite to s3://cellpainting-gallery/cpg0004-lincs/broad/workspace/backend/${batch}.
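A sketch of that copy step with the AWS CLI (this assumes credentials are configured and that ${batch} is set to a batch name such as 2016_04_01_a549_48hr_batch1; the local backend/ path is illustrative):

```shell
batch=2016_04_01_a549_48hr_batch1

# Recursively copy the unarchived SQLite backend for one batch
aws s3 cp "backend/${batch}/" \
  "s3://cellpainting-gallery/cpg0004-lincs/broad/workspace/backend/${batch}/" \
  --recursive
```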
In #12 we used InChIKey14 to map broad_ids, and in #11 we discussed why this is important.

While processing some data, I noticed that InChIKey14s do not map uniquely to MOAs and targets. I guess this is not surprising, given that drugs are often used for different indications in various clinical phases, but it is worth documenting here! It is dangerous to use InChIKey14s to map directly to MOA/target.

For example, InChIKey14 KTEIFNKAUNYNJU maps to two MOA/target pairs. However, it looks like the full InChIKey does map uniquely. I didn't comprehensively explore this.

@niranjchandrasekaran - maybe I missed this, but was there a reason to use InChIKey14 instead of the full InChIKey?
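A quick way to surface such collisions with pandas. Only the KTEIFNKAUNYNJU skeleton and the two MOAs come from this dataset; the full-key suffixes below are fabricated for illustration:

```python
import pandas as pd

# Toy annotation table: one InChIKey14 shared by two compounds with
# different MOAs (suffixes after the first hyphen are made up)
annotations = pd.DataFrame({
    "InChIKey": ["KTEIFNKAUNYNJU-AAAAAAAASA-N", "KTEIFNKAUNYNJU-BBBBBBBBSA-N"],
    "moa": ["diuretic", "mucolytic agent"],
})
annotations["InChIKey14"] = annotations["InChIKey"].str.split("-").str[0]

# Flag any InChIKey14 that maps to more than one MOA
collisions = (
    annotations.groupby("InChIKey14")["moa"]
    .nunique()
    .loc[lambda counts: counts > 1]
)
```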
It is very useful for researchers to be able to browse a heat map with a dendrogram attached (or some other representation - it's hard!) and look at relationships among the samples (drugs in this case).
It would therefore be great to create such a visualization in sharable format, for Cell Painting profiles and also for L1000 profiles, to compare them qualitatively.
@shntnu - we should add an open source license to this repo before adding profiles. Have we thought about which license we should apply here?
The profiles deposited in #34 do not include whitening normalization. Previously (see #4 (comment)), I elected to leave the whitened data to a future data upload because of this caveat:

Pycytominer currently does have a whiten implementation, and I applied it to the two 4a profiles in a test case. The test case did not go smoothly, so it is likely I will need to tinker with the pycytominer implementation a bit (hard to estimate how long the delay will be).

@shntnu also notes in #4 (comment):

Going forward, we will very likely produce at least two different Level 4a profiles

- whole-well z-scored
- DMSO z-scored

because depending on the layout, one might be better than the other.

We will then produce corresponding 4b (normalized, feature selected) versions of the two 4a profiles.

We will also produce corresponding 4w (normalized and whitened) versions of the two 4a profiles.
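For intuition, here is a minimal numpy sketch of a whitening (ZCA/"sphering") transform fit on a background population such as DMSO wells. This is an illustration of the technique, not pycytominer's implementation; the epsilon regularizer guards against tiny eigenvalues:

```python
import numpy as np

def zca_whiten(profiles, background, epsilon=1e-6):
    """ZCA-whiten `profiles` using statistics estimated from `background`.

    profiles: (n, d) array to transform
    background: (m, d) array, e.g. DMSO control wells
    """
    mu = background.mean(axis=0)
    cov = np.cov(background - mu, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rotate, rescale each eigen-direction to unit variance, rotate back
    transform = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + epsilon)) @ eigvecs.T
    return (profiles - mu) @ transform

# Correlated toy "DMSO" data; after whitening its covariance is ~identity
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.5]])
dmso = rng.normal(size=(500, 3)) @ mixing
whitened = zca_whiten(dmso, dmso)
```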
@gwaygenomics Were MOA and target columns left out intentionally? These columns are not present:

- Metadata_moa
- Metadata_target

and other cool stuff you'd find in `*_augmented.csv` files
Hello @gwaygenomics
I noticed that there is a discrepancy in some mappings from Broad ID to MOA.
For instance in repurposing_info_long.tsv, the Broad ID: BRD-K66035042-001-10-1 maps to the MOA: mucolytic agent.
While in repurposing_info_external_moa_map_resolved.tsv, the same Broad ID maps to the MOA: diuretic. Which .tsv file is correct? Thank you!
Something is wrong with plate SQ00015049. We successfully processed all other plates except this one. Below is the error:
Now processing... Plate: SQ00015049
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
cursor, statement, parameters, context
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
cursor.execute(statement, parameters)
sqlite3.DatabaseError: database disk image is malformed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "profile.py", line 56, in <module>
ap = AggregateProfiles(sql_file=sql_file, strata=strata, operation=aggregate_method)
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/pycytominer/aggregate.py", line 86, in __init__
self.load_image()
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/pycytominer/aggregate.py", line 118, in load_image
self.image_df = pd.read_sql(sql=image_query, con=self.conn)
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/pandas/io/sql.py", line 438, in read_sql
chunksize=chunksize,
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/pandas/io/sql.py", line 1218, in read_query
result = self.execute(*args)
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/pandas/io/sql.py", line 1087, in execute
return self.connectable.execute(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 976, in execute
return self._execute_text(object_, multiparams, params)
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1151, in _execute_text
parameters,
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1288, in _execute_context
e, statement, parameters, cursor, context
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1482, in _handle_dbapi_exception
sqlalchemy_exception, with_traceback=exc_info[2], from_=e
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
raise exception
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
cursor, statement, parameters, context
File "/home/ubuntu/miniconda3/envs/lincs/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.DatabaseError: (sqlite3.DatabaseError) database disk image is malformed
[SQL: select TableNumber, ImageNumber, Image_Metadata_Plate, Image_Metadata_Well from image]
(Background on this error at: http://sqlalche.me/e/4xp6)
Juan asked this:
I need to process the LINCS dataset to proceed with the plan we discussed for LUAD. I'm going to need access to the images, which is a lot of data! Here is my plan to make it efficient, and I would like to get your feedback and recommendations:
Does this make sense? Do you have any recommendations for me before moving forward?
By the way, the compression should take roughly 4 hours per plate (pessimistic estimate), and can be run in parallel, with multiple plates per machine (one per CPU). So using 30 cheap instances (with 4 cores each) in spot mode should do the trick in one day, including the operator's time :)
I am working towards processing all Drug Repurposing data and adding the results in this repository. The cell health project (https://github.com/broadinstitute/cell-health) now requires that the data are uniformly processed, documented, and made available here.
I will outline below the necessary steps required to get the data and processing pipelines uploaded.
- Decide on the robustize_mad normalization strategy, which will also require a decision on whole-plate or DMSO-specific normalization
- Update the 4.apply module in cell-health
- Make the lincs-cell-painting profile repository a submodule of the cell-health project

In #73 @michaelbornholdt fixed the workflow diagram for processing the LINCS cell painting profiles.
We have since realized that this workflow diagram is incorrect.
@michaelbornholdt - are you able to adjust the diagram and file a pull request to correct the figure? We will need to do this before we submit the manuscript (which will be soon!)
I am working on this now.
A couple of cross-references to track the history of DVC discussions:
This is a reminder to tackle soon.
I don't know what I am doing wrong, but when I download some non-spherized consensus data and run enrichment over it, the results are awful. From past experience I know that this is due to extremely high correlation values introduced by the normalization.

I thought we had fixed this... maybe not.
Use the cytominer-eval library. An example (https://github.com/jump-cellpainting/develop-computational-pipeline/issues/4#issuecomment-693006903) is pasted below:
After installing with:
pip install git+https://github.com/cytomining/cytominer-eval@56bd9e545d4ce5dea8c2d3897024a4eb241d06db
This now works:
import pandas as pd
from cytominer_eval import evaluate
from pycytominer.cyto_utils import infer_cp_features
file = "https://github.com/broadinstitute/lincs-cell-painting/raw/master/profiles/2016_04_01_a549_48hr_batch1/SQ00014813/SQ00014813_normalized_feature_select_dmso.csv.gz"
df = pd.read_csv(file)
features = infer_cp_features(df)
meta_features = infer_cp_features(df, metadata=True)
replicate_groups = ["Metadata_broad_sample", "Metadata_mg_per_ml"]
evaluate(
    profiles=df,
    features=features,
    meta_features=meta_features,
    replicate_groups=replicate_groups,
    operation="percent_strong",
    percent_strong_quantile=0.95
)
# Output: 0.32598039215686275
operation="grit" and operation="precision_recall" are also implemented (see https://github.com/cytomining/cytominer-eval/blob/master/cytominer_eval/evaluate.py for details).
Create 1571x1571 matrix of connectivities between compounds. Details tbd.
In #63 I add spherized profiles for batch 1 and batch 2 LINCS data. Here is a birds-eye view of what the profiles look like:
It is hard to determine from this view exactly how much the profile quality has improved. The DMSO profiles are still distributed widely in the UMAP space, but many compounds form distinct islands. It also doesn't look too different (at least at a cursory glance) from a non-spherized (level 4 profiles) LINCS dataset (see here).
Batch 2 profiles are potentially more interesting. We see distinct islands separated by cell type! This is expected, but also quite exciting. I do not have UMAP coordinates with level 4 profiles.
We need to finalize authors for the version 1 release of this repository.
The authors should be those involved in the LINCS Cell Painting data creation. This means individuals involved in:
I know that this is a complex task, but its complexity matches its importance. Once we define authors we can ensure proper attribution to all papers that use this data. @shntnu - please help me with this :)
Once we define authors, I will also:
We have made substantial progress in pycytominer since the version 0.1 release. We need to update the environment.yml file and update the profiling pipeline to account for this change.
Currently, all profiles are normalized with mad_robustize. We can use this repository to systematically evaluate whether one strategy is better than another.
As noted in #4 (comment)
The default in cytominer_scripts/normalize.R is robustize. I assume that I should continue using this method.
Yes. Rationale: mostly empirical – robustize resulted in higher (compared to standardize) replicate correlations of Level 4 across a few experiments we tested this in.
We get annoying file diff triggers when reprocessing the pipeline, even if nothing in the file changes. This is important to fix so that we can isolate the actual changes that result from reprocessing output data.

As @shntnu notes in #48, the reason the gzip files trigger positive diffs is an embedded timestamp.
The way to remove the timestamp from the file is to pass the --no-name (-n) flag to the gzip command. See http://linuxcommand.org/lc3_man_pages/gzip1.html
Fortunately, it looks like pandas-dev/pandas#33398 has added the ability to pass args to pandas' gzip compression. This improvement will be included in pandas version 1.1, which is scheduled for an Aug 1 release.
For the pandas or python option, the solution should ideally live in pycytominer. I've created a stub for this at cytomining/pycytominer#83
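Until then, the same effect is available from Python's gzip module directly: mtime=0 omits the varying timestamp from the gzip header, mirroring gzip --no-name, so repeated writes of identical content produce identical bytes. A sketch (the DataFrame contents are illustrative):

```python
import gzip
import io

import pandas as pd

df = pd.DataFrame({"Metadata_Plate": ["SQ00014813"], "Cells_Feature_1": [0.5]})

def to_deterministic_gzip_csv(frame: pd.DataFrame) -> bytes:
    # mtime=0 pins the gzip header timestamp, so output is reproducible
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb", mtime=0) as handle:
        handle.write(frame.to_csv(index=False).encode("utf-8"))
    return buffer.getvalue()

first = to_deterministic_gzip_csv(df)
second = to_deterministic_gzip_csv(df)
# identical bytes -> no spurious file diffs when reprocessing
```

With pandas >= 1.1, passing compression={"method": "gzip", "mtime": 0} to to_csv should forward the same option to gzip.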
A potentially fun analysis would be to evaluate how the different moa/target annotations updated over time (in different CLUE drugs/samples versions) influence moa/target recall.
Essentially, we would set up an eval framework (I imagine there is a traditional moa/target recall eval) where we use the same input profiles and alter the moa/target information as it has been updated over time.
If we see improvement over time this tells us that annotations are improving and, potentially, that there is even more room to improve categorization.
I am following up #7 with an additional notebook to create a simple, basic mapping file with only a handful of columns. This includes creating a pert_id column, which is a 13-character subset of the full 22-character broad_id column. The additional 9 characters contain batch info about the compound. More details about this procedure are here: #5 (comment)

In generating this data, I noticed that 16 perturbations (by pert_id) contain conflicting information (by pert_iname, moa, or target). I paste all of the conflicting info below:
  | pert_id | pert_iname | moa | target
---|---|---|---|---
0 | BRD-A03204438 | allopregnanolone | GABA receptor positive allosteric modulator | GABRA1
1 | BRD-A03204438 | pregnanolone | GABA receptor positive allosteric modulator | nan
2 | BRD-K05674516 | sofosbuvir | RNA polymerase inhibitor | nan
3 | BRD-K05674516 | PSI-7976 | HCV inhibitor | nan
4 | BRD-K17498618 | betaxolol | adrenergic receptor antagonist | ADRB1
5 | BRD-K17498618 | cisatracurium | acetylcholine receptor antagonist | CHRNA2
6 | BRD-K20672254 | pyrantel-tartrate | acetylcholine receptor agonist | CHRNA1
7 | BRD-K20672254 | pyrantel-pamoate | neuromuscular blocker | nan
8 | BRD-K25650355 | physostigmine-salicylate | acetylcholinesterase inhibitor | nan
9 | BRD-K25650355 | physostigmine | cholinesterase inhibitor | ACHE
10 | BRD-K29713308 | mebhydrolin | antihistamine | nan
11 | BRD-K29713308 | mebhydroline-1,5-naphtalenedisulfonate | nan | nan
12 | BRD-K35952844 | calcium-gluceptate | nan | nan
13 | BRD-K35952844 | sodium-glucoheptonate | nan | nan
14 | BRD-K41260949 | valproic-acid | HDAC inhibitor | ABAT
15 | BRD-K41260949 | divalproex-sodium | benzodiazepine receptor agonist | ALDH5A1
16 | BRD-K66035042 | mannitol-D | diuretic | nan
17 | BRD-K66035042 | sorbitol | mucolytic agent | nan
18 | BRD-K71013094 | neomycin-sulfate | bacterial 30S ribosomal subunit inhibitor | nan
19 | BRD-K71013094 | neomycin | bacterial 30S ribosomal subunit inhibitor | CXCR4
20 | BRD-K79450420 | INCB-024360 | indoleamine 2,3-dioxygenase inhibitor | IDO1
21 | BRD-K79450420 | epacadostat | indoleamine 2,3-dioxygenase inhibitor | IDO1
22 | BRD-K87202646 | isoniazid | FABI inhibitor | CYP1A2
23 | BRD-K87202646 | pasiniazid | cyclooxygenase inhibitor | nan
24 | BRD-K93632104 | salicylic-acid | cyclooxygenase inhibitor | AKR1C1
25 | BRD-K93632104 | sodium-salicylate | prostanoid receptor antagonist | ASIC3
26 | BRD-K97799481 | theophylline | adenosine receptor antagonist | nan
27 | BRD-K97799481 | aminophylline | adenosine receptor antagonist | ADORA1
28 | BRD-K97799481 | oxtriphylline | adenosine receptor antagonist | ADORA1
29 | BRD-M55114534 | pyrvinium | androgen receptor antagonist | nan
30 | BRD-M55114534 | pyrvinium-pamoate | androgen receptor antagonist | AR
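The pert_id derivation and the conflict check above can be sketched with pandas. The rows are taken from the table, but the second broad_id suffix is made up for illustration:

```python
import pandas as pd

# pert_id is the first 13 characters of the 22-character broad_id
samples = pd.DataFrame({
    "broad_id": ["BRD-K66035042-001-10-1", "BRD-K66035042-003-02-9"],
    "pert_iname": ["mannitol-D", "sorbitol"],
    "moa": ["diuretic", "mucolytic agent"],
})
samples["pert_id"] = samples["broad_id"].str[:13]

# A pert_id "conflicts" when its rows disagree on pert_iname or moa
conflicts = (
    samples.groupby("pert_id")[["pert_iname", "moa"]]
    .nunique()
    .loc[lambda frame: (frame > 1).any(axis=1)]
)
```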
I am leaning towards doing this. To work toward reprocessing, we need to accomplish the following:
What headaches will an updated pycytominer resolve?
- epsilon in spherize()

Rerunning the pipeline will also enable us to migrate from git lfs to dvc.
At the profiling check-in today, we discussed our strategy for adding (and evaluating) profiles in this project.
This issue supersedes #4
In this issue, I will discuss results of step 3 outlined in #22 (comment)
Note that this is copied and pasted from a notebook that will be added in a future pull request. Details in this notebook will guide our discussion of the results.
We have previously processed all of the Drug Repurposing Hub Cell Painting data using cytominer, an R-based image-based profiling tool. In this repo, we reprocess the data with pycytominer which, as the name connotes, is a Python-based image-based profiling tool.
We include all processing scripts and present the pycytominer profiles in this open source repository. The repository represents a unified bioinformatics pipeline applied to all Cell Painting Drug Repurposing Profiles. In this notebook, we compare the resulting output data between the processing pipelines for the two tools: Cytominer and pycytominer.
We output several metrics comparing the two approaches.
In all cases, we calculate the element-wise absolute value difference between pycytominer and cytominer profiles.
In addition, we confirm alignment of the following metadata columns:
Other metadata columns are not expected to be aligned. For example, we have updated MOA and Target information in the pycytominer version.
Image-based profiling results in the following output data levels. We do not compare all data levels in this notebook.
Data | Level | Comparison
---|---|---
Images | Level 1 | NA
SQLite File (single cell profiles) | Level 2 | NA
Aggregated Profiles with Well Information (metadata) | Level 3 | Yes
Normalized Aggregated Profiles with Metadata | Level 4a | Yes
Normalized and Feature Selected Aggregated Profiles with Metadata | Level 4b | Yes
Perturbation Profiles created Summarizing Replicates | Level 5 | No
MB said:
I have found a “error” in the Lincs dataset and I was wondering if you guys knew of this and if there needs to be some fixing of the pycyto pipeline? I am analyzing the Level 5 consensus data from here. When running the cyto eval functions on this data, I noticed some very high correlations. They come from this one feature (Nuclei_AreaShape_MedianRadius) that is 10^13 times larger than the others. The image shows a scatter plot of two samples which have a 1.000 similarity but are different compounds.
This is almost definitely because the MAD of these features is zero in DMSO (at least for the plates that those compounds come from).
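To make the failure mode concrete, here is a toy sketch of a MAD-based robust z-score. This mirrors, but is not, pycytominer's mad_robustize: when the control MAD is exactly zero, the division explodes, and a small epsilon keeps the output finite:

```python
import numpy as np

def mad_robustize(values, controls, epsilon=0.0):
    """Robust z-score sketch: (x - median(controls)) / (MAD(controls) + epsilon)."""
    center = np.median(controls)
    mad = np.median(np.abs(controls - center))
    return (values - center) / (mad + epsilon)

# A feature constant at its median in DMSO has MAD == 0
dmso = np.array([5.0, 5.0, 5.0, 5.0, 5.1])
treated = np.array([5.2, 6.0])

unsafe = mad_robustize(treated, dmso)              # divides by zero -> inf
safe = mad_robustize(treated, dmso, epsilon=1e-6)  # huge, but finite
```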
Add drop_outliers to https://github.com/broadinstitute/lincs-cell-painting/blob/master/profiles/profile_cells.py

We have additional information for each compound assayed in the Drug Repurposing Hub Cell Painting Dataset.
There are at least four files on AWS that could all work as a reference to describe compound metadata.
File Name | Columns
---|---
pert_info.txt | pert_id, pert_iname, pert_type, moa
pert_iname_moa.txt | pert_iname, moa, source, url, support, num_sources
pert_id_to_iname.txt | pert_id, pert_iname, pert_type
pert_iname_moa_aggregated.txt | pert_iname, moa, pert_iname_modified
Below I summarize each of the files.
@FloHu noticed that the original names of the features are carried through after spherizing.
We might not want to do that, since it can confuse people actually using the consensus data later on. I think the features should just be named feature_1 ... feature_x.
The spherize notebook documentation should be improved.
The notebook says
Here, we load in all normalized profiles (level 4a) data across all plates and apply a spherize transform using the DMSO profiles as the background distribution.
but it should say
Here, we load in all normalized profiles (level 4a) data across all plates, apply the standard set of feature selection operations, and then apply a spherize transform using the DMSO profiles as the background distribution.
This is a very easy fix, and a good beginner issue!
Hi @gwaygenomics and team,
We've worked on some cell profiling tools and would be interested in trying them on this dataset. Unfortunately, I am having trouble downloading the profile data. Could you share some pointers to help with that?

At the moment I have tried the dvc get command line tools (I am new to it but quite excited by the concept 😊), but I am probably doing something wrong (I tried this on Windows 10 from a PowerShell terminal).
Thanks for your help,
Kind regards,
Benoit
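For reference, a dvc get invocation generally looks like the sketch below; the tracked-file path and output name are placeholders, not real paths in this repo:

```shell
# Fetch one dvc-tracked file from a git repo without cloning it
# (assumes dvc is installed, e.g. pip install "dvc[s3]")
dvc get https://github.com/broadinstitute/lincs-cell-painting \
  <path-to-tracked-file> -o <local-output>
```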
I have encountered a perhaps significant hurdle in adding Cell Painting Repurposing Hub profiles to this repo.

There are broad ids (pert_ids) in the profile data that are absent from the updated MOA information.

For example, in one plate (SQ00014814) the following pert_ids are present (with annotations) in the profile data, but are absent in the repurposing MOA files in this repo:
['BRD-A69275535',
'BRD-A69636825',
'BRD-A69815203',
'BRD-A72309220',
'BRD-A72390365',
'BRD-A74980173',
'BRD-A82156122',
'BRD-K50691590',
'BRD-K68164687',
'BRD-K71480163',
'BRD-K81258678',
'BRD-K81957469']
Given that these pert_ids have annotations in the cytominer-derived profiles, this indicates that the pert_ids have changed somewhere.

Before I pursue this issue, I was wondering if there are any known solutions or datasets that map old to updated pert_ids. cc @shntnu @niranjchandrasekaran

Perhaps @jrsacher also has insight here. Josh, I scanned the CLUE and DepMap resources and was not able to find a map. I also checked the deprecated_broad_id column, and I was able to recover 3 of the profiles (['BRD-K50691590', 'BRD-K50691590', 'BRD-K81258678']).

Any insights or pointers here would be greatly appreciated!
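The deprecated_broad_id recovery step can be sketched with pandas. The missing pert_ids below come from the list above, but the current broad_ids, deprecated suffixes, and MOAs are fabricated for illustration:

```python
import pandas as pd

# pert_ids present in the profiles but missing from the updated moa files
missing = pd.Series(["BRD-K50691590", "BRD-K81258678", "BRD-A69275535"])

# Toy slice of repurposing metadata (values illustrative, not real mappings)
hub = pd.DataFrame({
    "broad_id": ["BRD-K99999991-001-01-1", "BRD-K99999992-001-01-1"],
    "deprecated_broad_id": ["BRD-K50691590-001-01-1", "BRD-K81258678-001-01-1"],
    "moa": ["moa_a", "moa_b"],
})
# Trim the deprecated broad_id down to its 13-character pert_id
hub["deprecated_pert_id"] = hub["deprecated_broad_id"].str[:13]

# Missing pert_ids recoverable via the deprecated_broad_id column
recovered = missing[missing.isin(hub["deprecated_pert_id"])]
```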
I am in the process of adding level 3-5 profiles to this repo (using git lfs). I will use this issue to document various questions I have about the process.

Should we also keep the cytominer profiles here? We should consider the pycytominer-based profiles less mature (and therefore less stable). The cytominer profiles are the ones that were originally computed.
/home/ubuntu/bucket/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/backend/2016_04_01_a549_48hr_batch1
@gwaygenomics it might be useful to define the folder structure of the data. Here's what I am thinking: cytomining/profiling-handbook#54 (comment) - but feel free to propose alternatives.
This was our thinking when we defined the folder structure (transcribed by @gwaygenomics)
We will upload image files to the Image Data Resource and add URL and metadata information to the Broad Bioimage Benchmark Collection.
We will use this issue to outline the required steps.
From IDR:
All files should be in tab-delimited text format.
Templates are provided but can be modified to suit your experiment.
Add or remove columns from the templates as necessary.
@gwaygenomics Did you have a processed data file for cell health?
I am working on step 4 of #5 (comment) and came across two discrepancies between the samples and drugs files. They are likely very minor and can easily be resolved, but I am noting them here for completeness.

When comparing pert_iname between the two files (drugs and samples), every single pert_iname entry in the drugs file is found in the samples file. However, two pert_iname entries are found in the samples file and not in the drugs file.

The two pert_iname entries are:
- YM-298198-desmethyl
- golgicide-A
YM-298198-desmethyl

The compound YM-298198-desmethyl is missing from the drugs file, but the entry YM-298198 is present. YM-298198-desmethyl is a derivative of YM-298198, and therefore has a different structure.

I think it is safe to duplicate the YM-298198 entry in the drugs file and make its pert_iname YM-298198-desmethyl.
golgicide-A

This appears to be, simply, an issue of capitalization. See below:

Note the exact same smiles string but different broad_ids (because of different purities and vendors).

I will rename the first entry in the samples file for BRD-A57886255-001-02-9 to have a lower-case golgicide-A.
Two pert_iname entries had conflicts; the proposed solutions will remedy them. Note that I will include the Jupyter notebook that performs this adjustment in the repo.