bunnech / cellot

Learning Single-Cell Perturbation Responses using Neural Optimal Transport

License: BSD 3-Clause "New" or "Revised" License

Python 96.34% Shell 3.66%

cellot's People

Contributors

bunnech, stefangstark

cellot's Issues

How to set the target parameter if I want to predict the outcome of multiple drugs?

Hi,
In the sciplex dataset in OOD mode, how should I set the target parameter in the config file if I want to predict the outcome of multiple drugs? (1) Should I list them, e.g. target: drug A, drug B, drug C to predict the outcome of 3 drugs separately? (2) Or should I train a new model for every drug I want to predict?

Best,
Hope to receive your reply.
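
If option (2) applies and each drug needs its own model, the runs can be scripted. The following is a minimal sketch, assuming the train.py flags quoted in other reports on this page carry over. The helper name train_command, the drug list, and the results_root layout are illustrative. The task config for OOD mode is not given in this thread, so the sciplex3.yaml path is only a stand-in.

```python
import shlex


def train_command(
    drug,
    task_config="./configs/tasks/sciplex3.yaml",  # stand-in; OOD config path is not given in the thread
    model_config="./configs/models/cellot.yaml",
    results_root="./results/scrna-sciplex3",  # hypothetical output layout
):
    """Build the argument list for one training run with a single target drug."""
    outdir = f"{results_root}/drug-{drug}/model-cellot"
    return [
        "python", "./scripts/train.py",
        "--outdir", outdir,
        "--config", task_config,
        "--config", model_config,
        "--config.data.target", drug,
    ]


# Illustrative subset; replace with the drugs of interest.
drugs = ["ruxolitinib", "trametinib"]
for drug in drugs:
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(train_command(drug)))
```

Each printed line can be pasted into a shell or the list passed to subprocess.run. Whether a single run can accept several targets is not confirmed by anything in this thread.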

Generating predictions using CellOT

Hi!

I wasn't able to find a way to generate predicted expression values using a trained CellOT model. Do you have a script to do so?

Thanks,
Yan

Scanpy/anndata error.

With the following run command:

python3 ./scripts/train.py --outdir ./results/scrna-sciplex3/drug-ruxolitinib/model-cellot --config ./configs/tasks/sciplex3.yaml --config ./configs/models/cellot.yaml --config.data.target ruxolitinib

I am getting the following error:

Traceback (most recent call last):
  File "/home/centos/anaconda3/envs/cellot/lib/python3.9/site-packages/anndata/_io/utils.py", line 177, in func_wrapper
    return func(elem, *args, **kwargs)
  File "/home/centos/anaconda3/envs/cellot/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 527, in read_group
    EncodingVersions[encoding_type].check(
  File "/home/centos/anaconda3/envs/cellot/lib/python3.9/enum.py", line 408, in __getitem__
    return cls._member_map_[name]
KeyError: 'dict'

During handling of the above exception, another exception occurred:


I believe this is because the dataset was created with a newer version of scanpy, while cellot is trying to read it with an older version (scanpy==1.8.1). Upgrading scanpy to a newer version fixes this particular issue, but I believe it causes new errors.

I would appreciate more help in this regard.
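
When a version mismatch is suspected, it helps to report the installed versions alongside the traceback. This is a small sketch using only the standard library. The package list is an assumption based on the modules that appear in the tracebacks on this page (the error above is raised inside anndata's reader).

```python
from importlib import metadata


def installed_versions(packages=("anndata", "scanpy", "h5py")):
    """Return a dict mapping each package to its version, or None if not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```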

The R2 metric

Hi!
In ./scripts/evaluate.py, the r2 between the observed and predicted gene expression is calculated using "pd.Series.corr(mut, mui)". However, this function only returns the Pearson correlation coefficient (PCC). Is there anything wrong?
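
To make the distinction concrete, here is a plain-Python sketch that compares the Pearson coefficient, its square, and the coefficient of determination on toy data. The function names and the toy vectors are illustrative only. pandas' Series.corr defaults to method="pearson", so pearson_r stands in for that call. Which of these quantities the paper intends is not stated in this thread.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def coefficient_of_determination(observed, predicted):
    """R^2 defined as 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


observed = [1.0, 2.0, 3.0, 4.0]
predicted = [3.0, 5.0, 7.0, 9.0]  # perfectly correlated with observed, but biased (2 * x + 1)

r = pearson_r(observed, predicted)
print("Pearson r:", r)
print("r squared:", r ** 2)
print("1 - SS_res / SS_tot:", coefficient_of_determination(observed, predicted))
```

On this toy data the correlation is 1 up to rounding, while the coefficient of determination is negative, so the two notions of "R2" can disagree sharply when predictions are biased.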

Unable to reproduce the results of Sciplex3

Hello, I trained and evaluated the performance of CellOT on the sciplex3 dataset using the following command (without changing any configuration):

python ./scripts/train.py --outdir ./results/sciplex3/drug-trametinib/model-cellot --config ./configs/tasks/sciplex3.yaml --config ./configs/models/cellot.yaml --config.data.target trametinib
python ./scripts/evaluate.py --outdir ./results/sciplex3/drug-trametinib/model-cellot --setting iid --where data_space

However, I did not obtain the results reported in the article. These are the results of my reproduction:

1000,all,l2-means,4.904272556304932
1000,all,l2-stds,4.1161394119262695
1000,all,r2-means,0.6526695314139299
1000,all,r2-stds,0.9440400841747668
1000,all,r2-pairwise_feat_corrs,0.43102186295905737
1000,all,l2-pairwise_feat_corrs

Training time

I was hoping to use CellOT on full scRNA-seq data and was wondering what the training times for that should look like and if there is any way to accelerate training. I'm currently running scGen to get the autoencoder embeddings and I'm getting predicted runtimes of 594hrs on 1 GPU for 20k genes in 3k cells and 8hrs for 1k genes in 3k cells.

Thank you!

Error 4i evaluate.py

Hi,

I'm trying to run the simple 4i tutorial, but evaluate.py crashes halfway. Unfortunately, without more information on what each script is doing, it's difficult to troubleshoot this by oneself.

I'm running python /pathto/cellot/scripts/evaluate.py --outdir /pathto/scripts/cellot_run/ --setting iid --where data_space

Traceback (most recent call last):
  File "/pathto/miniconda3/envs/cellot/lib/python3.9/site-packages/ml_collections/config_dict/config_dict.py", line 883, in __getitem__
    field = self._fields[key]
KeyError: 'data'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/pathto/miniconda3/envs/cellot/lib/python3.9/site-packages/ml_collections/config_dict/config_dict.py", line 807, in __getattr__
    return self[attribute]
  File "/pathto/miniconda3/envs/cellot/lib/python3.9/site-packages/ml_collections/config_dict/config_dict.py", line 889, in __getitem__
    raise KeyError(self._generate_did_you_mean_message(key, str(e)))
KeyError: "'data'"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/pathto/cellot/scripts/evaluate.py", line 183, in <module>
    app.run(main)
  File "/pathto/miniconda3/envs/cellot/lib/python3.9/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/pathto/miniconda3/envs/cellot/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/pathto/cellot/scripts/evaluate.py", line 173, in main
    evals = pd.DataFrame(
  File "/pathto/miniconda3/envs/cellot/lib/python3.9/site-packages/pandas/core/frame.py", line 563, in __init__
    data = list(data)
  File "/pathto/cellot/scripts/evaluate.py", line 62, in compute_evaluations
    for ncells, nfeatures, treated, imputed in iterator:
  File "/pathto/cellot/scripts/evaluate.py", line 118, in iterate_feature_slices
    _, treateddf, imputed = load_conditions(
  File "/pathto/cellot/cellot/utils/evaluate.py", line 271, in load_conditions
    embedding = read_embedding_context(
  File "/pathto/cellot/cellot/utils/evaluate.py", line 159, in read_embedding_context
    if "ae_emb" in config.data:
  File "/pathto/miniconda3/envs/cellot/lib/python3.9/site-packages/ml_collections/config_dict/config_dict.py", line 809, in __getattr__
    raise AttributeError(e)
AttributeError: "'data'"

Training finished without errors and the output directory /pathto/scripts/cellot_run/ looks like this:

config.yaml

cache:
last.pt  model.pt  scalars  status

Any insights would be appreciated,
Best,
M

Typo in setup.py

Hi guys,

long_description=open('READme.md').read() in setup.py throws an error. Changing READme.md to README.md fixes it.
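
A slightly more defensive way to load the long description is sketched below. The helper name read_long_description is hypothetical and not part of the repository. Note that on case-insensitive filesystems a misspelled name such as READme.md can still resolve, which may explain why the typo goes unnoticed on some machines and fails on others.

```python
from pathlib import Path


def read_long_description(root, filename="README.md"):
    """Return the README text, failing with a descriptive message if it is missing."""
    path = Path(root) / filename
    if not path.is_file():
        present = sorted(p.name for p in Path(root).iterdir())
        raise FileNotFoundError(
            f"{filename} not found in {root}; files present: {present}"
        )
    return path.read_text(encoding="utf-8")


# Possible use inside setup.py (shown as a comment so this sketch has no side effects):
# long_description = read_long_description(Path(__file__).parent)
```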

How can we train a model with multiple targets in the ood mode?

Hi, thanks for this great work.
I can run the sciplex3 dataset with one target in OOD mode, but this is time-consuming because sciplex3 has 188 drugs in total. Can we train the model with multiple targets at once?

Best
Hope to receive your reply.

GPU acceleration

Hi,

Thanks so much for the great tool! I was wondering if there's a way to use GPUs using the current training framework or if you were planning on adding GPU support in the near future?

Thanks!
Yan

Question about the groupby parameter?

Hi,
In the lupuspatients dataset, you set the groupby parameter to condition. In the sciplex3 dataset, you set it to [cell_type, condition].

(1) In the lupuspatients dataset, why not set the groupby parameter to [cell_type, condition], given that 8 cell types exist?
(2) Under what circumstances should the groupby parameter be set to condition, and when to [cell_type, condition]?

Best.

OOD train test split question

Hi!

I am using CellOT in the OOD setting and I was wondering why you were using a test_size > 0 for split_cell_data_train_test in the ood setting? My thinking is that in OOD you would want to use all the data you can that's outside of the holdout group. Is it for evaluating the performance of the non-ood tasks or maybe something else I am missing?

Thank you so much!

The architecture of ICNN

In the code, I see that the forward function of ICNN is defined like this:
```
def forward(self, x):
    z = self.sigma(0.2)(self.A[0](x))
    z = z * z

    for W, A in zip(self.W[:-1], self.A[1:-1]):
        z = self.sigma(0.2)(W(z) + A(x))

    y = self.W[-1](z) + self.A[-1](x)

    return y
```

I think there are two places that are inconsistent with the formula in the article.
(i) Why is z = z * z applied in the first layer?
(ii) Why is no non-negative activation function added to the last layer?

Thank you!
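
One way to probe question (i) numerically is to check whether squaring the activation output keeps the first layer convex in its pre-activation. This is a scalar sketch, assuming that self.sigma(0.2) constructs a leaky ReLU with negative slope 0.2 (the snippet above does not show what sigma is). All function names here are illustrative. It does not settle how the code relates to the formula in the article.

```python
def leaky_relu(a, negative_slope=0.2):
    """Scalar leaky ReLU; the slope 0.2 is an assumption taken from sigma(0.2)."""
    return a if a >= 0 else negative_slope * a


def squared_leaky_relu(a):
    """Mirror of 'z = sigma(...)' followed by 'z = z * z' for a scalar pre-activation."""
    z = leaky_relu(a)
    return z * z


def is_midpoint_convex(f, points, tol=1e-12):
    """Check f((a + b) / 2) <= (f(a) + f(b)) / 2 for every pair of sample points."""
    return all(
        f((a + b) / 2) <= (f(a) + f(b)) / 2 + tol
        for a in points
        for b in points
    )


grid = [i / 10 for i in range(-50, 51)]
print("midpoint convex on grid:", is_midpoint_convex(squared_leaky_relu, grid))
print("non-negative on grid:", min(squared_leaky_relu(a) for a in grid) >= 0)
```

On this grid the squared activation passes the midpoint test and is non-negative. A grid check is only evidence, not a proof.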

Obtain mapping inside training dataset

Hi!
Thank you for this interesting tool!

Rather than predicting unseen perturbation outcomes, I am interested in just finding a mapping between wild-type and perturbed cells in my training dataset. Is there a way to get this explicitly from the model? Or do I have to transport() my wild-type cells and find the closest perturbed cell for each? Thank you!
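
The second option in the question (transport, then nearest neighbour) can be sketched in plain Python. The function name nearest_indices and the toy coordinates are illustrative. The transported points are assumed to have been computed already, and this says nothing about whether the model exposes a mapping directly.

```python
import math


def nearest_indices(transported, perturbed):
    """For each transported point, return the index of the closest perturbed point."""
    matches = []
    for point in transported:
        distances = [math.dist(point, candidate) for candidate in perturbed]
        matches.append(min(range(len(perturbed)), key=distances.__getitem__))
    return matches


# Toy 2-D example: two transported wild-type cells, three observed perturbed cells.
transported = [[0.9, 1.1], [4.8, 5.2]]
perturbed = [[5.0, 5.0], [1.0, 1.0], [9.0, 9.0]]
print(nearest_indices(transported, perturbed))  # prints [1, 0]
```

This brute-force search costs one distance per pair of cells, and several wild-type cells may be matched to the same perturbed cell, so the result is not a one-to-one pairing.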

Missing plots data in the results folder

After training the cellot, cae-4i, random, identity and scgen-4i models on the 4i data, I tried running the plot.py script with the command python scripts/plot.py and got a file-not-found error when plotting UMAP and KNN_MMD.
Screenshot 2024-03-15 at 1 28 28 AM
This indicates that umap.csv and knn_enrichment.csv are missing from evals_iid_data_space in the results folder. Could you please tell me whether anything is wrong? Is my command for running the plot script correct?
Thanks.

Failed to open the file "hvg-train-only.h5ad" in scrna-crossspecies

When I try to use anndata to read the file "hvg-train-only.h5ad" in scrna-crossspecies, it fails, but I can successfully read the file 4i-melanoma_cell_lines-8h.h5ad in cl-8h.
The error message is as follows:
AnnDataReadError: Above error raised while reading key '/layers' of type <class 'h5py._hl.group.Group'> from /.

I don't know why. Looking forward to your reply, thanks.

Crossspecies: unable to reproduce results for CellOT

Hello: I'm trying to reproduce results for cross species for CellOT (using rats as starting point), but ran into an issue.

The steps I tried are as follows:

  • Get the embedding from model-scgen as follows: --outdir ./results/scrna-crossspecies/mode-iid/model-scgen --config ./configs/tasks/crossspecies.yaml --config ./configs/models/scgen.yaml
  • Use the embedding from scgen and apply to CellOT: --outdir ./results/scrna-crossspecies/mode-iid/model-cellot --config ./configs/tasks/crossspecies.yaml --config ./configs/models/cellot.yaml --config.data.ae_emb.path ./results/scrna-crossspecies/mode-iid/model-scgen

Once the results were stored, I evaluated with the following: --outdir results/scrna-crossspecies/mode-iid/model-cellot --n_markers 50 --setting iid --where data_space

The results I get are:

'mmd': 0.4460518822912073, 'l2': 15.850724, 'r2': 0.4929862534733026

What have I done wrong? For reference, I got the following for identity, which seems to make more sense:
'mmd': 0.20872688110292562, 'l2': 11.169688, 'r2': 0.7255046739934895

Thank you!

Any example for o.o.s?

Hi, as mentioned in README.md

All scripts to reproduce the experiments in the i.i.d. (independent-and-identically-distributed), o.o.s. (out-of-sample), and o.o.d. (out-of-distribution) setting can be found in scripts/submit

I tried to find ./configs/tasks/oos-lupuspatients.yaml in scripts/submit/oos-lupuspatients.sh, but only found lupuspatients.yaml and lupuspatients-ood.yaml.

So is there any example for o.o.s?

Thanks.

AssertionError

During the training phase, I ran python ./scripts/train.py --outdir ./results/4i/drug-cisplatin/model-cellot --config ./configs/tasks/4i.yaml --config ./configs/models/cellot.yaml --config.data.target cisplatin and it terminated with an AssertionError.
Screenshot 2024-01-19 at 3 36 58 PM

4i dataset

After reading the provided preprocessed version of the 4i data (Screenshot1, Screenshot2, Screenshot3, Screenshot4), I have not figured out the source/target distribution there.
I notice that the data are indexed by the drug and cell original, but nothing is labeled as source/target.

Please see the attached screenshots.
Screenshot1: my code snippet.
Screenshot2 and Screenshot3: the data obs and var; as you can see, it is indexed by drug as rows and cell original as columns.
Screenshot4: a UMAP of the data filtered by Trametinib, which I could not filter by source vs. target.

I also found in lines 71 to 93 of https://github.com/bunnech/cellot/blob/main/cellot/data/cell.py
that you are labeling the data as source and target, but I am not sure how you do that. I thought the data were already labeled.

I would really appreciate any explanation.
Thank you

2nd Generating prediction after the model is trained.

Dear Author,

I have taken an interest in the CellOT package and find it interesting. After trying it for a while, I can't find a function to generate predictions from the trained model.

For example, I want to use a different split for testing and make predictions on that split instead of a random split.

Is there such a function?

Best regards,

Rom Uddamvathanak

Request to add commands to run Crossspecies and GBM Dataset

The script in scripts/submit/iid.sh doesn't seem to have any command to run the crossspecies dataset. When I tried running

python ./scripts/train.py --outdir ./results/scrna-crossspecies/model-cellot --config ./configs/tasks/crossspecies.yaml --config ./configs/models/cellot.yaml 

it returns the following result:

Traceback (most recent call last):
  File "[repo_name]/./scripts/train.py", line 80, in <module>
    main(sys.argv)
  File "[repo_name]/./scripts/train.py", line 64, in main
    train(outdir, config)
  File "[repo_name]/cellot/train/train.py", line 129, in train_cellot
    gl = compute_loss_g(f, g, source).mean()
  File "[repo_name]/cellot/models/cellot.py", line 102, in compute_loss_g
    transport = g.transport(source)
  File "[repo_name]/cellot/networks/icnns.py", line 97, in transport
    (output,) = autograd.grad(
  File "/state/partition1/llgrid/pkg/anaconda/python-LLM-2023b/lib/python3.10/site-packages/torch/autograd/__init__.py", line 288, in grad
    grad_outputs_ = _make_grads(t_outputs, grad_outputs_, is_grads_batched=is_grads_batched)
  File "/state/partition1/llgrid/pkg/anaconda/python-LLM-2023b/lib/python3.10/site-packages/torch/autograd/__init__.py", line 71, in _make_grads
    raise RuntimeError("Mismatch in shape: grad_output["
RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([256, 1]) and output[0] has a shape of torch.Size([256, 1, 1]).

A similar outcome is produced when I run the following for the GBM dataset.

python ./scripts/train.py --outdir ./results/scrna-gbm/model-cellot --config ./configs/tasks/gbm.yaml --config ./configs/models/cellot.yaml 

online methods

Hi, can you point me to the 'Online Methods' referenced in the publication? Thanks!

UMAP not found when plotting

Here is what I did after running all models (including evaluation part) on 4i/cisplatin:

python3 ./scripts/plot.py --evaldir results/4i/drug-cisplatin/

The first task (plotting marginals) went fine, but the next thing (plotting umaps) gave the following error:

Plotting UMAPS.
Traceback (most recent call last):
  File "[repo_name]/./scripts/plot.py", line 359, in <module>
    app.run(main)
  File "/state/partition1/llgrid/pkg/anaconda/python-LLM-2023b/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/state/partition1/llgrid/pkg/anaconda/python-LLM-2023b/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "[repo_name]/./scripts/plot.py", line 336, in main
    plot_umaps(config_plotting, evaldir, outdir, setting, where)
  File "[repo_name]/./scripts/plot.py", line 110, in plot_umaps
    umaps[model] = load_single_umap(evaldir / f"model-{model}", setting, where)
  File "[repo_name]/./scripts/plot.py", line 60, in load_single_umap
    umaps = pd.read_csv(expdir / f"evals_{setting}_{where}" / "umap.csv", index_col=0)
  File "/state/partition1/llgrid/pkg/anaconda/python-LLM-2023b/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 948, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/state/partition1/llgrid/pkg/anaconda/python-LLM-2023b/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 611, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/state/partition1/llgrid/pkg/anaconda/python-LLM-2023b/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1448, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/state/partition1/llgrid/pkg/anaconda/python-LLM-2023b/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1705, in _make_engine
    self.handles = get_handle(
  File "/state/partition1/llgrid/pkg/anaconda/python-LLM-2023b/lib/python3.10/site-packages/pandas/io/common.py", line 863, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: 'results/4i/drug-cisplatin/model-cellot/evals_iid_data_space/umap.csv'

Did I miss something (during the evaluation part) that (supposedly) produces the umap? The only things produced during evaluation are the imputed file and the evals file with all the metrics. Thanks!
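
Before calling the plot script, it can help to list which per-model files it will look for and which are absent. This is a sketch built on the path pattern visible in the traceback above (model-{model}/evals_{setting}_{where}/umap.csv). The helper name missing_eval_files is hypothetical, and knn_enrichment.csv is included only because another report on this page names it. The sketch does not show which step is supposed to create these files.

```python
from pathlib import Path


def missing_eval_files(
    evaldir,
    models,
    setting="iid",
    where="data_space",
    filenames=("umap.csv", "knn_enrichment.csv"),
):
    """List expected per-model evaluation files that do not exist on disk."""
    missing = []
    for model in models:
        evals = Path(evaldir) / f"model-{model}" / f"evals_{setting}_{where}"
        missing.extend(
            evals / name for name in filenames if not (evals / name).is_file()
        )
    return missing


for path in missing_eval_files("results/4i/drug-cisplatin", ["cellot"]):
    print("missing:", path)
```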
