Comments (115)

miquelduranfrigola avatar miquelduranfrigola commented on June 10, 2024 2

OK, I think it's all good ideas. Should we talk about it on Thursday?

from ersilia.

GemmaTuron avatar GemmaTuron commented on June 10, 2024 2

you are right @Inyrkz ..
Get as far as you can with the different options, and on Thursday we will discuss them thoroughly to adopt the best one. The notebooks you are preparing are very useful; we'll discuss them all.
If you get to a point where you cannot proceed further, start looking into the Virtual Libraries (Mollib), as this will be our next model to work on.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024 1

Hi @miquelduranfrigola, @GemmaTuron,

I'll start with "scaffold morphing" and also take note of the model's performance.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024 1

I've been able to download the model. My internet is much better now.

pytorch_model.bin: 100%|██████████| 349M/349M [02:51<00:00, 2.03MB/s] 
generation_config.json: 100%|██████████| 132/132 [00:00<00:00, 3.84kB/s]
tokenizer.json: 100%|██████████| 46.3k/46.3k [00:00<00:00, 159kB/s]
SAFEDoubleHeadsModel(
  (transformer): GPT2Model(
    (wte): Embedding(1880, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
...

from ersilia.

GemmaTuron avatar GemmaTuron commented on June 10, 2024 1

Thanks @Inyrkz for the update. Good work, we will use tomorrow's meeting to discuss how to proceed with this. Push the code so I can have a look and prepare!
Thanks

from ersilia.

miquelduranfrigola avatar miquelduranfrigola commented on June 10, 2024 1

On a quick look, I think the "core" can be obtained with MurckoScaffolds. But please double-check

from ersilia.

miquelduranfrigola avatar miquelduranfrigola commented on June 10, 2024 1

Thanks @Inyrkz - awesome stuff. This is really going in the right direction.
Can you please take a look at this page? https://portal.valencelabs.com/datamol/post/generate-scaffolds-iBUTqU8Im9N2zCM

from ersilia.

GemmaTuron avatar GemmaTuron commented on June 10, 2024 1

Also, @Inyrkz regarding the output of MolFrags:

I think we might not be getting more MolFrags because we are applying it to a MurckoScaffold. What if we do it the other way around, MolFrags and then MurckoScaffold?

from ersilia.

GemmaTuron avatar GemmaTuron commented on June 10, 2024 1

Links:
https://pubs.acs.org/doi/10.1021/acs.jcim.0c00296
https://rdkit.org/docs/source/rdkit.Chem.Scaffolds.rdScaffoldNetwork.html
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4080744/
https://www.zbh.uni-hamburg.de/en/forschung/amd/datasets/brics.html

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024 1

I've removed the duplicates. Now we only have 7 outputs for the 4 input SMILES.
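
For reference, a minimal sketch of one way duplicates can be removed from generated SMILES, assuming RDKit is available. The helper name `unique_smiles` and the toy inputs are illustrative and not taken from the notebook. It compares canonical SMILES, so two different spellings of the same molecule count as one.

```python
from rdkit import Chem


def unique_smiles(smiles_list):
    """Drop repeated molecules, comparing canonical SMILES rather than raw strings."""
    seen = set()
    unique = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:  # skip strings RDKit cannot parse
            continue
        canonical = Chem.MolToSmiles(mol)
        if canonical not in seen:
            seen.add(canonical)
            unique.append(canonical)
    return unique


# Toy inputs (hypothetical): ethanol and benzene, each written in two ways
deduplicated = unique_smiles(["CCO", "OCC", "c1ccccc1", "C1=CC=CC=C1"])
print(deduplicated)
```

Comparing raw strings instead would keep "CCO" and "OCC" as two separate outputs even though they are the same molecule.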

The model fails on some SMILES inputs. In this notebook I use only one SMILES as input, and the model couldn't generate any new molecule for any of the four core structures extracted.

I even tried converting the SMILES to a SAFE before passing it to the model. It didn't work.

I'm not sure why.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024 1

I haven't prepared the Dockerfile yet. I'll work on that.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024 1

Okay, pairs and triplets. So I just do

all_combinations = combinations_2 + combinations_3

I'll experiment to get answers to the questions.
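
A minimal sketch of how the pairs and triplets could be built with the standard library, assuming the items being combined are fragment SMILES. The `fragments` list below is hypothetical; only the names `combinations_2`, `combinations_3` and `all_combinations` come from the line above.

```python
from itertools import combinations

# Hypothetical fragment SMILES; the real ones would come from the input molecule
fragments = ["[1*]CC", "[2*]O", "[3*]c1ccccc1", "[4*]N"]

combinations_2 = list(combinations(fragments, 2))  # all unordered pairs
combinations_3 = list(combinations(fragments, 3))  # all unordered triplets
all_combinations = combinations_2 + combinations_3

print(len(combinations_2), len(combinations_3), len(all_combinations))
```

The number of combinations grows quickly with the number of fragments, which matters given the generation times discussed in this thread.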

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024 1

Here's what I've observed so far (I'll update this list).

  • With n_trials set to 1, n_samples_per_trial=12 generates up to 12 sample outputs. It may give an empty list for some side chains, maybe because n_trials is 1.
  • When n_trials is set to 5, for example, we can get up to 60 sample outputs for each side chain passed to the model. This takes longer to run. Also, for side chains that returned an empty list with the previous parameters, adjusting n_trials gives a non-empty list of about 6 elements.
  • Increasing n_trials increases the running time.
  • For some side chains, increasing n_trials still gives an empty list.
  • It's a bit faster when we reduce n_samples_per_trial, but the running time mostly depends on n_trials. So there is a trade-off between n_trials and n_samples_per_trial: generating more samples takes longer.
  • The maximum number of samples we can get for a side chain is n_trials * n_samples_per_trial.
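
The last point can be written as a one-line calculation. This is only a sketch of the arithmetic; the helper name `max_samples` is made up for illustration and is not part of the SAFE API.

```python
def max_samples(n_trials: int, n_samples_per_trial: int) -> int:
    """Upper bound on the number of molecules generated for one side chain."""
    return n_trials * n_samples_per_trial


print(max_samples(1, 12))  # 12
print(max_samples(5, 12))  # 60
```

This is an upper bound only: as noted in the list, the model can return fewer molecules, or none at all, for some side chains.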

I hope this helps.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024 1

@GemmaTuron, You're welcome.

I'm yet to test it with Ersilia.
I want to make sure the code is okay before using it in my presentation tomorrow.

It took over 30 mins to run it on my system.

from ersilia.

GemmaTuron avatar GemmaTuron commented on June 10, 2024 1

Thanks for the update @GemmaTuron & @miquelduranfrigola

I corrected the typos in the mol_gen.py script.

I'm a bit confused about the path issue. The code will only require the input and output files in the main.py. It won't require any other file.

input_file = "data/my_molecules (copy).csv"
output_file = "data/results.csv"

The lines above were just for testing. This is what the main code will look like.

input_file = sys.argv[1]
output_file = sys.argv[2]

Is there any path issue to modify here?

When I was testing with the hard-coded files instead of the files passed through argv, I was having issues with the paths, that's all!
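
A minimal sketch of reading the two paths from the command line only, with a usage message when they are missing. The helper name `parse_paths` and the example file names are assumptions for illustration; in main.py the list passed in would be sys.argv.

```python
def parse_paths(argv):
    """Return (input_file, output_file) taken from a command-line argument list."""
    if len(argv) != 3:
        raise SystemExit(f"Usage: python {argv[0]} <input_file> <output_file>")
    return argv[1], argv[2]


# In main.py this would be: input_file, output_file = parse_paths(sys.argv)
input_file, output_file = parse_paths(["main.py", "data/input.csv", "data/results.csv"])
print(input_file, output_file)
```

With this shape there are no hard-coded test paths left in the script, so the same code runs unchanged wherever the input and output files live.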

from ersilia.

GemmaTuron avatar GemmaTuron commented on June 10, 2024 1

@Inyrkz

Please:

  • Update the metadata.json
  • Correct the issues on the function names in mol_gen.py
  • Correct the main.py to use the sys.argv inputs only

Then, to solve the wandb dependency issue, install safe-mol without dependencies and manually install each package. Specify the versions you got when installing safe outside ersilia

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024 1

@GemmaTuron, this is what the metadata.json file looks like. Is this okay?

{
    "Identifier": "eos8bhe",
    "Slug": "scaffold-morphing",
    "Status": "In progress",
    "Title": "safe",
    "Description": "The context discusses a novel notation system called Sequential Attachment-based Fragment Embedding (SAFE) that improves upon traditional molecular string representations like SMILES. SAFE reframes SMILES strings as an unordered sequence of interconnected fragment blocks while maintaining compatibility with existing SMILES parsers. This streamlines complex molecular design tasks by facilitating autoregressive generation under various constraints. The effectiveness of SAFE is demonstrated by training a GPT2-like model on a dataset of 1.1 billion SAFE representations that exhibited versatile and robust optimization performance for molecular design.",
    "Mode": "Pretrained",
    "Task": ["Generation"],
    "Input": ["Compound"],
    "Input Shape": "Single",
    "Output": ["Compound"],
    "Output Type": ["String"],
    "Output Shape": "List",
    "Interpretation": "Model generates new molecules from input molecule by replacing core structures of input molecule.",
    "Tag": "Compound Generation",
    "Publication": "https://arxiv.org/pdf/2310.10773.pdf",
    "Source Code": "https://github.com/datamol-io/safe/tree/main",
    "License": "CC BY 4.0"
}

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024 1

@GemmaTuron,

I've opened a pull request.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

The model works locally.

This is the code sample

import safe

ibuprofen = "CC(Cc1ccc(cc1)C(C(=O)O)C)C"

# SMILES -> SAFE -> SMILES translation
try:
    ibuprofen_sf = safe.encode(ibuprofen)  # c12ccc3cc1.C3(C)C(=O)O.CC(C)C2
    ibuprofen_smi = safe.decode(ibuprofen_sf, canonical=True)  # CC(C)Cc1ccc(C(C)C(=O)O)cc1
except safe.EncoderError:
    pass
except safe.DecoderError:
    pass

ibuprofen_tokens = list(safe.split(ibuprofen_sf))
print(ibuprofen_tokens)

This is the output

['c', '1', '2', 'c', 'c', 'c', '3', 'c', 'c', '1', '.', 'C', '3', '(', 'C', ')', 'C', '(', '=', 'O', ')', 'O', '.', 'C', 'C', '(', 'C', ')', 'C', '2']

from ersilia.

GemmaTuron avatar GemmaTuron commented on June 10, 2024

Hi @Inyrkz !

Good, you can load and run the basic SAFE commands. Datamol is usually very well organised, so hopefully downstream tasks will also be working nicely.
This model has many many functionalities, as you can see in the graph they show in their repo. I need you to:

  • Look at the Documentation (particularly molecular design) and play around with the different functions. You can use their examples or we can come up with new ones as well: https://safe-docs.datamol.io/stable/tutorials/design-with-safe.html
  • Think of the best design for this. Should we build one single model that does everything? Is this possible? Or should we incorporate each of the SAFE functionalities as independent models?
    Then we can start incorporating the model(s). As you play with the different possibilities for molecular design, please take note of performance; if anything already seems very slow, it would be good to know!
    This model is really exciting and we will be using it soon in chem-sampler

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

Thank you for sharing the documentation. I'll check out other functions. It's an interesting model

from ersilia.

miquelduranfrigola avatar miquelduranfrigola commented on June 10, 2024

Hi both,

100% agree with @GemmaTuron - let's check their notebooks and try to reproduce one of them. In my opinion, the most interesting is "scaffold morphing" as shown in the tutorial. Let's start with that. If that works, then we can actually create multiple models in the hub, for example: safe-scaffold-morphing, safe-scaffold-decoration, etc. But let's go step by step and start with one case.

To me, the most crucial part now is to reproduce the tutorial to show that the model is not too heavy and take note of performance, as Gemma suggests.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

I've been stuck here (model downloading). My internet connection is poor. I'll try again later tonight

# Load pre-trained model
designer = sf.SAFEDesign.load_default(verbose=True)
designer.model
pytorch_model.bin:   0%|          | 0.00/349M [00:00<?, ?B/s]Error while downloading from https://cdn-lfs-us-1.huggingface.co/repos/87/f1/87f1a96893e9890db129b18dcc1e87833c610379b09e91b9d21277d8242d6205/d243f31837ec16c8d6653b8b18aa6225469fa6cbae6057ccaf93b46af46ca8a8?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27pytorch_model.bin%3B+filename%3D%22pytorch_model.bin%22%3B&response-content-type=application%2Foctet-stream&Expires=1705232429&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcwNTIzMjQyOX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzg3L2YxLzg3ZjFhOTY4OTNlOTg5MGRiMTI5YjE4ZGNjMWU4NzgzM2M2MTAzNzliMDllOTFiOWQyMTI3N2Q4MjQyZDYyMDUvZDI0M2YzMTgzN2VjMTZjOGQ2NjUzYjhiMThhYTYyMjU0NjlmYTZjYmFlNjA1N2NjYWY5M2I0NmFmNDZjYThhOD9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoifV19&Signature=P7Zc5wO4gMBRzwflroJDW%7EKGK-qBU1%7E7PUl5U65t14T9fuvT5c-VSXFTYiEaah1r6hlVdn9YPcCzxa8dhiQTWdcpgtQpVjSwcmp-VLLapsFx2HJQt-QEDDyAamQgZ0d2ezbwnpErg2ObpoVCz0ta4M%7ErBr2SuWdx3IAac-exi4EZ2G40HZDSuis7c7s8Id46LdEg3Qsdb1RbsVOoie752ZoT2mHnamaN9SFG5j%7EQe2OG0OEbS4To%7Eb-cXovbtRJ-7ScBY1keLzlqbGhywArqDryKQRUd2ldZlbzmPGvJXjcEMpNP6R2E9gTLiBSZi-YLGh1Lxxf0taqU0g2YD9ocLA__&Key-Pair-Id=KCD77M1F0VK2B: HTTPSConnectionPool(host='cdn-lfs-us-1.huggingface.co', port=443): Read timed out.
Trying to resume download...

While waiting, I'll study the documentation.

from ersilia.

GemmaTuron avatar GemmaTuron commented on June 10, 2024

Hi Ini, sorry to hear that.
Not great, because it means the model will be pretty large in the Hub as well. Let me try it on my end to see how long it takes.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

This table shows how long it takes to execute the original code.

Task                         Execution time (seconds)
Load model (first time)      134 (depends on internet speed)
Load model                   55
De novo generation           193.1
Scaffold generation          156.1
Scaffold morphing            54.9
Super structure generation   32.9
Motif extension              25.7
Linker generation            79.8
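
For anyone wanting to collect timings like these, here is a minimal sketch using only the standard library. The `timed` helper and the stand-in workload are illustrative; the actual SAFE calls that were measured are not shown in this thread.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(task_name, results):
    """Store the wall-clock duration of the enclosed block under task_name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[task_name] = time.perf_counter() - start


timings = {}
with timed("Example task", timings):
    sum(range(1_000_000))  # stand-in for a model call such as scaffold morphing

for task, seconds in timings.items():
    print(f"{task}: {seconds:.1f} s")
```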

from ersilia.

GemmaTuron avatar GemmaTuron commented on June 10, 2024

Ok, so the model is fast but it is just heavy, so downloading time will be the issue. That is helpful Ini!
Let's focus on the scaffold morphing. As we discussed in the meeting, it would be good to play around with it using structures that we know best. @miquelduranfrigola I suggested running a first test with the OSM compounds.
The OSM project has been growing in recent years, but basically, there is a chemical series of interest (series 4) with a known core (triazolopyrazine). Could we try to use the scaffold morphing with that core? (in the docs they provide examples with side chains, and mention it can be done with cores instead but give no examples) - could you try to look into the code to see how we could do this?
And then compare the scaffold morphing to the decorator.
As always, we can start working with Notebooks, and once we have the results and decide exactly what to incorporate in the hub, we can work on the final model repo structure. I'll approve for the moment so you have a repo to put the code in.

from ersilia.

GemmaTuron avatar GemmaTuron commented on June 10, 2024

/approve

from ersilia.

github-actions avatar github-actions commented on June 10, 2024

New Model Repository Created! 🎉

@Inyrkz ersilia model repository has been successfully created and is available at:

🔗 ersilia-os/eos8bhe

Next Steps ⭐

Now that your new model repository has been created, you are ready to start contributing to it!

Here are some brief starter steps for contributing to your new model repository:

Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository

  • 🍴 Get started by creating a fork of your new model repository - docs
  • 👯 Clone your forked repository - docs
  • ✏️ Make edits to your new forked model repository - docs - Edits might include:
    • Updating the README.md file to accurately describe your model
    • Add source code for your model
    • Adding documentation for your model
  • 🚀 Open a Pull Request from your forked repository to the original repository. This will allow you to bring your local changes into the new ersilia model repository that was just created! - docs

Additional Resources 📚

If you have any questions, please feel free to open an issue and get support from the community!

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

Yes, the model is heavy to download.

I'll start working with the OSM compounds, and see how the scaffold morphing performs.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

Ok, so the model is fast but it is just heavy, so downloading time will be the issue. That is helpful Ini! Let's focus on the scaffold morphing. As we discussed in the meeting, it would be good to play around with it using structures that we know best. @miquelduranfrigola I suggested running a first test with the OSM compounds. The OSM project has been growing in recent years, but basically, there is a chemical series of interest (series 4) with a known core (triazolopyrazine). Could we try to use the scaffold morphing with that core? (in the docs they provide examples with side chains, and mention it can be done with cores instead but give no examples) - could you try to look into the code to see how we could do this? And then compare the scaffold morphing to the decorator. As always, we can start working with Notebooks, and once we have the results and decide exactly what to incorporate in the hub, we can work on the final model repo structure. I'll approve for the moment so you have a repo to put the code in.

I'm still going through the documentation. I found this.

compute_side_chains(mol, core, label_by_index=False) [¶](https://safe-docs.datamol.io/stable/api/safe.html#safe.utils.compute_side_chains)
Compute the side chain of a molecule given a core

Finding the side chains

The algorithm to find the side chains from core assumes that the core we get as input has attachment points. Those attachment points are never considered as part of the query, rather they are used to define the attachment points on the side chains. Removing the attachment points from the core is exactly the same as keeping them.

import datamol as dm
from safe.utils import compute_side_chains

mol = "CC1=C(C(=NO1)C2=CC=CC=C2Cl)C(=O)NC3C4N(C3=O)C(C(S4)(C)C)C(=O)O"
core0 = "CC1(C)CN2C(CC2=O)S1"
core1 = "CC1(C)SC2C(-*)C(=O)N2C1-*"
core2 = "CC1N2C(SC1(C)C)C(N)C2=O"
side_chain = compute_side_chains(core=core0, mol=mol)
dm.to_image([side_chain, core0, mol])
Therefore on the above, core0 and core1 are equivalent for the molecule mol, but core2 is not.

Do you think we are to convert the core structure to a side chain?
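
To make the idea of side chains with attachment points concrete, here is a minimal sketch that uses plain RDKit (Chem.ReplaceCore) rather than SAFE's own function. The toy molecule and core are made up for illustration and are not from the documentation; the two functions may differ in how they label attachment points.

```python
from rdkit import Chem

# Toy example (hypothetical): ethyl and hydroxyl groups on a benzene core
mol = Chem.MolFromSmiles("CCc1ccccc1O")
core = Chem.MolFromSmiles("c1ccccc1")

# ReplaceCore removes the atoms matching the core and marks each cut
# with a dummy atom (*), which plays the role of an attachment point
side_chains = Chem.ReplaceCore(mol, core)

side_chain_smiles = [
    Chem.MolToSmiles(frag) for frag in Chem.GetMolFrags(side_chains, asMols=True)
]
print(side_chain_smiles)
```

Each printed fragment carries one dummy atom showing where it was attached to the core, which is the kind of input the side-chain-based design functions expect.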

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

Also, from this text in the documentation

Scaffold Morphing
In scaffold morphing, we wish to replace a scaffold by another one in a molecule. The process requires as input that the user provides either the side chains or the input molecules and the core

This means we need two inputs, the side chains (or input molecules) and the core. In the example they gave, they didn't input any core. I don't know if they are using any default core. I'm trying to find that part of the documentation (for the pretrained model)

from ersilia.

miquelduranfrigola avatar miquelduranfrigola commented on June 10, 2024

OK this is extremely promising. Thanks @Inyrkz.

I will inspect the model too and get back to you with suggestions. For now, let's focus on placing the code accordingly in the model template structure and, meanwhile, we can come up with ideas.

On a related note, as previously discussed, it is likely that we will use this model in different modalities (scaffold morphing, linker, etc.). Therefore, I would change the slug to something like: safe-scaffold-morphing. Probably, in the near future, we'll do safe-linker, safe-de-novo etc. I hope this makes sense.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

@GemmaTuron @miquelduranfrigola
Both scaffold morphing and scaffold decoration require using side chains. To compute the side chains using the compute_side_chains(mol, core, label_by_index=False) function, we need both the SMILES (mol) and the core structure (core0). I don't know how we are going to extract the core structure from the SAFE string after converting the SMILES to SAFE.

My code can't run without errors until this is resolved. I'll push what I've written so far.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

This is the link to the notebook

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

I found some documentation here. I'll try this.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

This is the function I wrote.

from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def extract_core_structure(molecule):
    # Convert SMILES to RDKit molecule
    mol = Chem.MolFromSmiles(molecule)

    # Identify the core structure using the Murcko Scaffold
    core = MurckoScaffold.GetScaffoldForMol(mol)

    # Convert the core structure back to SMILES
    core_smiles = Chem.MolToSmiles(core)

    return core_smiles

mol = "CC1=C(C(=NO1)C2=CC=CC=C2Cl)C(=O)NC3C4N(C3=O)C(C(S4)(C)C)C(=O)O"
core_structure = extract_core_structure(mol)
print("Core Structure:", core_structure)

This is the output: Core Structure: O=C(NC1C(=O)N2CCSC12)c1conc1-c1ccccc1

Do you think this core structure is correct?
I'll try to visualize it in marvinjs.

from ersilia.

GemmaTuron avatar GemmaTuron commented on June 10, 2024

Thanks Ini! I think this is still too large. Maybe we can try recursively applying the MurckoScaffold to this structure once more to see if it trims it down?
Otherwise, exploring the MolFragments (I've left some code on the other model request issue) can be useful.
Finally, we can add the binding sites with RDKit as well (check the other issue where I left an example).

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

from ersilia.

miquelduranfrigola avatar miquelduranfrigola commented on June 10, 2024

Thanks both. Good stuff.
Unfortunately, I don't think recursively using Murcko will further prune the molecule (I agree it is too big). The MolFragments sounds like a reasonable second step. So: Murcko+MolFragments sounds good.
One question that will arise is: of all fragments, which one is the core? This question is very difficult to answer, so for now, let's just take the largest fragment generated (for example, the one with the highest number of atoms). We may have to explore this manually, but as a starting point I think it makes sense. Opinions?
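
A minimal sketch of the "take the largest fragment" rule. Since the exact MolFragments code lives in another issue and is not shown here, this sketch swaps in RDKit's BRICS decomposition (one of the links shared earlier in the thread) as the fragmentation step; that choice is an assumption, not the method agreed on above.

```python
from rdkit import Chem
from rdkit.Chem import BRICS

mol = Chem.MolFromSmiles("CC1=C(C(=NO1)C2=CC=CC=C2Cl)C(=O)NC3C4N(C3=O)C(C(S4)(C)C)C(=O)O")

# BRICSDecompose returns a set of fragment SMILES; dummy atoms mark the broken bonds
fragments = sorted(BRICS.BRICSDecompose(mol))


def heavy_atom_count(smiles):
    """Count real heavy atoms, ignoring hydrogens and dummy attachment atoms."""
    frag = Chem.MolFromSmiles(smiles)
    return sum(1 for atom in frag.GetAtoms() if atom.GetAtomicNum() > 1)


# Largest fragment by heavy-atom count, used here as a first guess for the core
largest_fragment = max(fragments, key=heavy_atom_count)
print(fragments)
print("Largest fragment:", largest_fragment)
```

Whether the largest fragment is chemically the "core" still needs manual inspection, as noted above.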

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

I tried this code

mol = "CC1=C(C(=NO1)C2=CC=CC=C2Cl)C(=O)NC3C4N(C3=O)C(C(S4)(C)C)C(=O)O"
mol = Chem.MolFromSmiles(mol)
core = Chem.Scaffolds.MurckoScaffold.GetScaffoldForMol(mol)
fragments = Chem.GetMolFrags(core, asMols=True)
for i in fragments:
    core_smiles = Chem.MolToSmiles(i)
    print(core_smiles)

and I got this output
O=C(NC1C(=O)N2CCSC12)c1conc1-c1ccccc1

I was expecting more than one value from the GetMolFrags() function.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

I tried this code to get the molecule with the highest number of atoms.

mol = "CC1=C(C(=NO1)C2=CC=CC=C2Cl)C(=O)NC3C4N(C3=O)C(C(S4)(C)C)C(=O)O"
mol = Chem.MolFromSmiles(mol)
core = Chem.Scaffolds.MurckoScaffold.GetScaffoldForMol(mol)
fragments = Chem.GetMolFrags(core, asMols=True)

# Select the largest fragment based on the number of atoms
largest_fragment = max(fragments, key=lambda x: x.GetNumAtoms())

# Convert the largest fragment back to SMILES
core_smiles = Chem.MolToSmiles(largest_fragment)
print(core_smiles)

I got this output
O=C(NC1C(=O)N2CCSC12)c1conc1-c1ccccc1

I think that's because GetMolFrags() returns only one fragment here. I'm going to try this on another sample molecule.
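
A short sketch of why only one fragment comes back: Chem.GetMolFrags splits a molecule into its disconnected pieces and does not break any bonds, so a single connected scaffold always yields exactly one fragment. The sodium acetate input is a made-up example of a genuinely disconnected SMILES.

```python
from rdkit import Chem

# A single connected molecule: one fragment, however large it is
connected = Chem.MolFromSmiles("O=C(NC1C(=O)N2CCSC12)c1conc1-c1ccccc1")
n_connected = len(Chem.GetMolFrags(connected, asMols=True))

# Two disconnected parts (the dot separates them): two fragments
disconnected = Chem.MolFromSmiles("CC(=O)[O-].[Na+]")
n_disconnected = len(Chem.GetMolFrags(disconnected, asMols=True))

print(n_connected, n_disconnected)
```

Breaking a connected scaffold into smaller pieces therefore needs a method that actually cuts bonds, such as a fragmentation scheme or a scaffold network.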

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

Using this molecule mol = "N#Cc1ccc(-c2nnc3cncc(OCCc4ccc(F)c(F)c4)n23)cc1"
I get this as the output for the two code cells: c1ccc(CCOc2cncc3nnc(-c4ccccc4)n23)cc1

I'm guessing this is still too large.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

This molecule molecule = "FC1=CC=C(CCOC2=CN=CC3=NN=C(N23)C2=CC=C(C(OC3CC3)=C2)C2=CC(OC3CC3)=C(Cl)C=C2)C=C1F"
gives this as output c1ccc(CCOc2cncc3nnc(-c4ccc(-c5cccc(OC6CC6)c5)c(OC5CC5)c4)n23)cc1

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

Thanks @Inyrkz - awesome stuff. This is really going in the right direction. Can you please take a look at this page? https://portal.valencelabs.com/datamol/post/generate-scaffolds-iBUTqU8Im9N2zCM

@miquelduranfrigola this looks interesting. I'll go through it.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

Also, @Inyrkz regarding the output of MolFrags:

I think we might not be getting more MolFrags because we are applying it to a MurckoScaffold. What if we do it the other way around, MolFrags and then MurckoScaffold?

@GemmaTuron okay, I'll try this to see the difference.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

@GemmaTuron

Using this code

molecule = "CC1=C(C(=NO1)C2=CC=CC=C2Cl)C(=O)NC3C4N(C3=O)C(C(S4)(C)C)C(=O)O"
mol = Chem.MolFromSmiles(molecule)
fragments = Chem.GetMolFrags(mol, asMols=True)

for i in fragments:
    print("Fragment:",  Chem.MolToSmiles(i))
    core = Chem.Scaffolds.MurckoScaffold.GetScaffoldForMol(i)
    print("MurckoScaffold:", Chem.MolToSmiles(core))

This is the result I get

Fragment: Cc1onc(-c2ccccc2Cl)c1C(=O)NC1C(=O)N2C1SC(C)(C)C2C(=O)O
MurckoScaffold: O=C(NC1C(=O)N2CCSC12)c1conc1-c1ccccc1

It's still only one fragment generated from the input SMILES. But this time, the fragment and core are different.

Using this molecule molecule = "FC1=CC=C(CCOC2=CN=CC3=NN=C(N23)C2=CC=C(C(OC3CC3)=C2)C2=CC(OC3CC3)=C(Cl)C=C2)C=C1F" we get

Fragment: Fc1ccc(CCOc2cncc3nnc(-c4ccc(-c5ccc(Cl)c(OC6CC6)c5)c(OC5CC5)c4)n23)cc1F
MurckoScaffold: c1ccc(CCOc2cncc3nnc(-c4ccc(-c5cccc(OC6CC6)c5)c(OC5CC5)c4)n23)cc1

Only the halogen substituents are removed: the two fluorine atoms on one terminal ring and the chlorine atom on the other.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

@miquelduranfrigola

It's a great article. Thanks, it was helpful. The diagram in the article shows how the scaffolds are broken down from level 4 to level 1 to get the main core structure.

They showed two different ways of extracting the scaffold. We can use the RDKit or DataMol package.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

I used the code from the article on the previously mentioned molecule. It still gives the same result.

I also tried to apply the code recursively. Theoretically, this should work like they showed in the diagram, but it doesn't. It still gives the same scaffold as the result. It doesn't break it down any further. The same thing applies to other molecules.

The notebook is here.

from ersilia.

GemmaTuron avatar GemmaTuron commented on June 10, 2024

I see, it is a challenging issue!
I also have another suggestion, to ask the user to pass a subset of molecules from the same family, and then use the FindMCS function from rdkit to find the core (maybe you can try to get a few molecules from the OSM dataset and see what the MCS is for those @Inyrkz )
What do you think @miquelduranfrigola ?

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

So instead of getting just one molecule and finding the core structure of the molecule, the user will pass a subset of related molecules (because they share a common core structure). Then we use the FindMCS() function to extract the common substructure that represents the core of the chemical series.
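
A minimal sketch of the FindMCS idea, assuming RDKit is available. The three molecules are simple toy examples (monosubstituted benzenes), not OSM compounds, chosen so that the shared core is easy to see.

```python
from rdkit import Chem
from rdkit.Chem import rdFMCS

# Toy "series" (hypothetical): toluene, phenol and chlorobenzene share a benzene ring
smiles_list = ["Cc1ccccc1", "Oc1ccccc1", "Clc1ccccc1"]
mols = [Chem.MolFromSmiles(smi) for smi in smiles_list]

# Maximum common substructure across all input molecules
result = rdFMCS.FindMCS(mols)
print(result.smartsString, result.numAtoms, result.numBonds)

# The MCS is returned as SMARTS, which can be turned into a query molecule
core = Chem.MolFromSmarts(result.smartsString)
print(all(m.HasSubstructMatch(core) for m in mols))
```

For a real chemical series the result depends on which molecules the user passes, so the extracted core is only as good as the chosen subset.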

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

This is the notebook for it.

from ersilia.

miquelduranfrigola avatar miquelduranfrigola commented on June 10, 2024

OK, thanks @Inyrkz and @GemmaTuron, this is all very interesting. I would strongly try to use only one molecule for now. In this particular case, I'd rather recommend the user to just pass the core. At the end of the day, if they want to morph a particular core, then why shouldn't they pass the core directly? I think this is reasonable.

So, in practical terms, I would do the following:

  • DataMol-based scaffold detection up to level 3 (or 4).
  • Take the largest scaffold resulting from layer 3 (or 4).
  • Use this scaffold as the "core".

So, in case the user already inputs a core or scaffold, it will be OK and, in case they input a molecule with side chains, we'll do our "best effort" to break it up. In my opinion, this is more than reasonable.

Would you agree?

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

Hi @miquelduranfrigola,

That will be fine if we want the users to input just the core. We'll convert the core they input to a side chain and then convert it to a SAFE before passing it to the model (or convert it to a SAFE first before converting it to a side chain).

If the user inputs a molecule, the DataMol code can only get up to level 4 (which is one large scaffold).

Should I continue with using the scaffold from level 4?

from ersilia.

GemmaTuron avatar GemmaTuron commented on June 10, 2024

I would not go with the strategy of large scaffolds, as the molecules obtained will not be that significant.
We can also try what happens if we pass the core as a SMILES string (without the attachment points *) and simply try to create them with RDKit, so the user only needs to input the core SMILES. Would that work?

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

@GemmaTuron, that could work for scaffold decoration. We just focus on converting the core molecule to a side chain, and pass that to the model.

I don't think it will work for scaffold morphing. The users will have to input both the molecule and the core.

from ersilia.

miquelduranfrigola avatar miquelduranfrigola commented on June 10, 2024

Thanks @GemmaTuron - completely agree with exploring Virtual Libraries (Mollib) meanwhile

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

The rdScaffoldNetwork: The Scaffold Network Implementation in RDKit works. It breaks the scaffolds into different levels. The number of scaffolds depends on the molecule size.

This is the notebook implementation. In some molecules, the attachment points are added.

We need to figure out which scaffold will be our core structure, especially for the last code cell in the notebook.

from ersilia.

GemmaTuron avatar GemmaTuron commented on June 10, 2024

Thanks Ini, that is very interesting.
To start with, I would discard the fragments that do not show attachment points (*) so that at least the core we select already has them.
We could then filter out by MW, can you add the MW to the structures you are getting so we can establish a cut-off (probably a high and low, we don't want things that are too small or too large)... and then the question remains of whether we want only 1 core per molecule or we would pass up to three cores per molecule, for example.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

@GemmaTuron, Okay.
I tried using the FindMCS function to find the maximum common substructure from the generated list, but it didn't work.

I'll try this.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

@GemmaTuron, @miquelduranfrigola

The results are promising. Filtering by the molecular weight will help a lot. It seems the range of 70-80 is good.

Now, the question remains of whether we want only 1 core per molecule or we would pass up to three cores per molecule.

from ersilia.

GemmaTuron avatar GemmaTuron commented on June 10, 2024

Hi @Inyrkz !

Great, I suggest being a bit less restrictive: molecular weight 60-100.
Then, maybe we can select the core with fewer heteroatoms (fewer carbons) so that it is a more interesting scaffold.
With this, we could proceed. By the way, the output of the model should also include the core that we have used.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

  • Filter by molecular weight within the range of 60-100.
  • Then filter for the core with fewer heteroatoms (count the number of 'c' in the strings).

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

This result is promising.

One last step. We need to decide which scaffold we want to use, probably based on the number of attachment points.
Or we could work with all of them.

from ersilia.

miquelduranfrigola avatar miquelduranfrigola commented on June 10, 2024

Awesome stuff, @Inyrkz . Results are indeed very promising.

A few considerations:

  • The filtering within the range 60-100 makes sense to me, but let's come up with a fallback strategy in case no scaffolds fall within this range. I would say: if scaffolds exist within this range, filter out the rest. If they don't exist, take the one that is closer to this range.
  • Choosing based on number of attachment points makes sense to me. Using all of them too. Let's do a few tests and have a look at the outputs!
  • Let's get rid of scaffolds that are "generic" - in your previous results, that would be scaffolds 2, 4, 6, 8, 10, 12

Thanks again!

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

@miquelduranfrigola, I have added another condition to get the scaffold that is closer to the range of 60 - 100 as a fallback strategy. The generic scaffolds have been excluded.
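
The weight filter with its fallback could be sketched roughly as follows. This is only an illustration of the rule discussed above, not the code from the notebook: the function name `select_scaffolds`, the (SMILES, molecular weight) tuple layout, and the example labels and weights are all assumptions made for the example.

```python
def select_scaffolds(scaffolds, low=60.0, high=100.0):
    """Keep scaffolds whose weight lies in [low, high]; else return the closest one.

    `scaffolds` is a list of (smiles, molecular_weight) tuples (assumed layout).
    """
    # Only consider scaffolds that carry at least one attachment point (*).
    candidates = [(smi, mw) for smi, mw in scaffolds if "*" in smi]
    if not candidates:
        return []

    in_range = [(smi, mw) for smi, mw in candidates if low <= mw <= high]
    if in_range:
        return in_range

    # Fallback: the distance is 0 inside the range, otherwise the gap to the nearest bound.
    def distance(item):
        mw = item[1]
        return max(low - mw, mw - high, 0.0)

    return [min(candidates, key=distance)]


# Hypothetical placeholder labels and weights, for illustration only (not real SMILES).
example = [("[1*]A", 45.0), ("[1*]B", 72.5), ("C", 80.0), ("[1*]D", 130.0)]
print(select_scaffolds(example))
print(select_scaffolds([("[1*]A", 45.0), ("[1*]D", 130.0)]))
```

The first call keeps only the in-range scaffold that has an attachment point; the second has nothing in range and falls back to the scaffold nearest to the 60-100 window.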

I'll be bringing all the functions together to do the scaffold morphing.

  • Get Input SMILES
  • Generate the core structures from the input SMILES.
  • Compute the side chain using the input molecule and the core structure.
  • Pass the side chain to the scaffold-morphing model
  • Get the results.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

This notebook shows what running the main.py file will look like with these as the input SMILES

smiles
CC(C)(C)c1nc(c(s1)-c1ccnc(N)n1)-c1cccc(NS(=O)(=O)c2c(F)cccc2F)c1F
CN(C)c1cccc(c1)C(=O)Nc1ccc(C)c(NC(=O)c2ccc(O)cc2)c1
Cc1ccc(Cl)c(Nc2ccccc2C(O)=O)c1Cl
N[C@@H](Cc1ccc(O)c(O)c1)C(O)=O

I'll create another notebook to show the result of only one SMILES as input.

from ersilia.

miquelduranfrigola avatar miquelduranfrigola commented on June 10, 2024

OK @Inyrkz this is going in the right direction. Could you add a method to keep, for each molecule, only unique molecules? I see repeated molecules in your previous notebook. Getting unique molecules can be done via 1. indexing them with InChIKeys and 2. getting the unique set.
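
A minimal sketch of that two-step deduplication, assuming RDKit is installed. The helper names `inchikey_from_smiles` and `unique_molecules` are made up for this example; `Chem.MolFromSmiles` and `Chem.MolToInchiKey` are the RDKit calls assumed to do the actual work.

```python
def inchikey_from_smiles(smiles):
    """Return the InChIKey of a SMILES string, or None if RDKit cannot parse it."""
    from rdkit import Chem  # imported lazily so unique_molecules can be used with other keys

    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToInchiKey(mol) if mol is not None else None


def unique_molecules(smiles_list, key_fn=inchikey_from_smiles):
    """Keep the first SMILES seen for each key, preserving input order."""
    seen = {}
    for smi in smiles_list:
        key = key_fn(smi)
        # Skip molecules without a key (e.g. unparsable SMILES) and repeated keys.
        if key is not None and key not in seen:
            seen[key] = smi
    return list(seen.values())
```

Calling `unique_molecules(generated_smiles)` on the list produced for one input molecule would then return one SMILES per distinct InChIKey. Indexing by InChIKey rather than by the raw string means two different SMILES spellings of the same structure count as one molecule.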

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

@miquelduranfrigola, Thanks for catching the repetition. I'll address it.

from ersilia.

miquelduranfrigola avatar miquelduranfrigola commented on June 10, 2024

Thanks @Inyrkz - I will investigate it in detail in preparation for our meeting tomorrow.
Meanwhile, have you already prepared the Dockerfile, run.sh etc for this model?

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

I’ve adjusted the get_side_chain_pairs function. It gives different pairs of side chains.
I tested the code on this molecule CC(C)(C)c1nc(c(s1)-c1ccnc(N)n1)-c1cccc(NS(=O)(=O)c2c(F)cccc2F)c1F (one of the molecules that wasn’t working before.)

Not all the side chains combo works, only a few do. This approach is better so that we generate at least one side chain that works.

Here is the notebook.

It takes about 7 minutes to generate new molecules for one SMILES.

I want to try all four SMILES.

from ersilia.

miquelduranfrigola avatar miquelduranfrigola commented on June 10, 2024

Thanks @Inyrkz

The bit of information saying that "not all side chains combo works" is relevant. Actually, we might even want to try to do this with 1 side chain. So, in general, instead of generating pairs, we might want to generate all combinations of, let's say, up to 3 elements.

As an example, consider the following, where "a", "b", "c", "d" would be four side chains.

Here are the combinations for the example list ['a', 'b', 'c', 'd']:

Combinations of 1 element:

('a',)
('b',)
('c',)
('d',)

Combinations of 2 elements:

('a', 'b')
('a', 'c')
('a', 'd')
('b', 'c')
('b', 'd')
('c', 'd')

Combinations of 3 elements:

('a', 'b', 'c')
('a', 'b', 'd')
('a', 'c', 'd')
('b', 'c', 'd')

This demonstrates how to generate all possible combinations of 1, 2, and 3 elements from a given list. You can use the same approach for any list of elements.

This code would generate all combinations

from itertools import combinations

# Example list
example_list = ['a', 'b', 'c', 'd']

# Generating all combinations of 1, 2, and 3 elements
combinations_1 = list(combinations(example_list, 1))
combinations_2 = list(combinations(example_list, 2))
combinations_3 = list(combinations(example_list, 3))

all_combinations = combinations_1 + combinations_2 + combinations_3

What do you think?

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

This is great. Thanks for the code sample.

I can try this. The only problem is it may take longer for the model to run.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

@miquelduranfrigola, I've adjusted the get_side_chain_pairs() function to use all combinations.

The problem is that the scaffold-morphing model gives this error.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[53], line 1
----> 1 generated_smiles = designer.scaffold_morphing(
      2     side_chains=side_chain_pairs[0],
      3     n_samples_per_trial=12,
      4     n_trials=1,
      5     sanitize=True,
      6     do_not_fragment_further=False,
      7     random_seed=100,
      8     )
     10 print(generated_smiles)

File ~/anaconda3/envs/safe/lib/python3.9/site-packages/safe/sample.py:172, in SAFEDesign.scaffold_morphing(self, side_chains, mol, core, n_samples_per_trial, n_trials, sanitize, do_not_fragment_further, random_seed, **kwargs)
    137 def scaffold_morphing(
    138     self,
    139     side_chains: Optional[Union[dm.Mol, str, List[Union[str, dm.Mol]]]] = None,
   (...)
    147     **kwargs,
    148 ):
    149     """Perform scaffold morphing decoration using the pretrained SAFE model
    150 
    151     For scaffold morphing, we try to replace the core by a new one. If the side_chains are provided, we use them.
   (...)
    169         kwargs: any argument to provide to the underlying generation function
...
--> 316     raise ValueError("empty range for randrange() (%d, %d, %d)" % (istart, istop, width))
    318 # Non-unit step argument supplied.
    319 istep = int(step)

ValueError: empty range for randrange() (1, 1, 0)

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

@miquelduranfrigola ,

This is the input of the get_side_chain_pairs() function.

[1*]c1cccc(NS(=O)(=O)c2c(F)cccc2F)c1F.[2*]C(C)(C)C.[3*]c1ccnc(N)n1

This is the output.

['[1*]c1cccc(NS(=O)(=O)c2c(F)cccc2F)c1F', '[1*]C(C)(C)C', '[1*]c1ccnc(N)n1', '[1*]c1cccc(NS(=O)(=O)c2c(F)cccc2F)c1F.[2*]C(C)(C)C', '[1*]c1cccc(NS(=O)(=O)c2c(F)cccc2F)c1F.[2*]c1ccnc(N)n1', '[1*]C(C)(C)C.[2*]c1ccnc(N)n1', '[1*]c1cccc(NS(=O)(=O)c2c(F)cccc2F)c1F.[2*]C(C)(C)C.[3*]c1ccnc(N)n1']

I've noticed that the model crashes for single side chains like these: '[1*]c1cccc(NS(=O)(=O)c2c(F)cccc2F)c1F', '[1*]C(C)(C)C', '[1*]c1ccnc(N)n1'.

It has to be at least a pair.
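
Since single side chains crash the model, one option is to generate only pairs and triplets. The sketch below is an assumption about how that could look, not the actual get_side_chain_pairs() code: the function name `side_chain_combinations` is hypothetical, and it assumes each dot-separated fragment carries exactly one numbered attachment point.

```python
import re
from itertools import combinations

# Matches numbered attachment points such as [1*] or [12*].
ATTACHMENT = re.compile(r"\[\d+\*\]")


def side_chain_combinations(side_chains, min_size=2, max_size=3):
    """Return dot-joined combinations of side chains with attachment points renumbered."""
    fragments = side_chains.split(".")
    results = []
    for size in range(min_size, max_size + 1):
        for combo in combinations(fragments, size):
            # Renumber so each combination uses [1*], [2*], ... in order.
            renumbered = [
                ATTACHMENT.sub(f"[{i}*]", frag) for i, frag in enumerate(combo, start=1)
            ]
            results.append(".".join(renumbered))
    return results


side_chains = "[1*]c1cccc(NS(=O)(=O)c2c(F)cccc2F)c1F.[2*]C(C)(C)C.[3*]c1ccnc(N)n1"
for combo in side_chain_combinations(side_chains):
    print(combo)
```

For the input above this yields the three pairs and the one triplet from the output list, and skips the three single side chains that trigger the randrange() error.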

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

It takes 4 minutes for the model to run predictions for 4 input SMILES on Google Colab (GPU)

from ersilia.

miquelduranfrigola avatar miquelduranfrigola commented on June 10, 2024

OK @Inyrkz thanks - GPU is a bit faster then. Noted.
Let's then do pairs and triplets.
One question: does the n_trials parameter help in generating more molecules?
Another question: does reducing the number of molecules per trial increase speed?

from ersilia.

miquelduranfrigola avatar miquelduranfrigola commented on June 10, 2024

Thanks @Inyrkz

This is useful. We will have to accept that this is a slow model.
Let's do n_trials=10 and n_samples_per_trial=10, if you agree?
Can we try this config for a few molecules, keeping an eye on:

  • Time (seconds) per molecule
  • Number of molecules
  • Uniqueness of molecules
  • Quality of molecules: for this, just please paste images of the generated compounds (including the input)

Almost there!
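
The first three checks in the list above could be collected with a small harness like the one below. It is a sketch under assumptions: `benchmark` and `fake_generate` are hypothetical names, `fake_generate` only stands in for the real scaffold-morphing call, and uniqueness is counted on raw strings here (InChIKeys would be more robust).

```python
import time


def benchmark(generate, smiles):
    """Time one call to `generate` and summarise the molecules it returns."""
    start = time.perf_counter()
    generated = list(generate(smiles))
    elapsed = time.perf_counter() - start
    unique = set(generated)
    return {
        "input": smiles,
        "seconds": round(elapsed, 2),
        "n_generated": len(generated),
        "n_unique": len(unique),
        "uniqueness": len(unique) / len(generated) if generated else 0.0,
    }


# Stand-in generator; replace with the real scaffold-morphing call.
def fake_generate(smiles):
    return [smiles + "C", smiles + "C", smiles + "N"]


print(benchmark(fake_generate, "CCO"))
```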

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

Alright,

I'll try n_trials=10 and n_samples_per_trial=10.

Yup, almost there!

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

This table shows how long it takes to execute the original code.

| Task | Execution time on my system | Execution time on Colab with GPU |
| --- | --- | --- |
| Side chain 1 | 3m46s | 1m46s |
| Side chain 2 | 1m56s | 1m34s |
| Side chain 3 | 2m2s | 1m23s |
| Side chain 4 | 3m33s | 1m29s |
| Total time | 24 mins (using a loop) | 6 mins (using a loop) |
| Average generation time per side chain | 2m49s | 1m33s |
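
As a quick check of the averages in the table, the per-side-chain timings can be converted to seconds and averaged. The helper names `to_seconds` and `format_duration` are made up for this sketch; the timings are the ones reported above.

```python
import re


def to_seconds(text):
    """Convert a duration such as '3m46s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m(\d+)s", text).groups()
    return int(minutes) * 60 + int(seconds)


def format_duration(total_seconds):
    """Format a number of seconds as '<minutes>m<seconds>s', dropping fractions."""
    minutes, seconds = divmod(int(total_seconds), 60)
    return f"{minutes}m{seconds}s"


local = ["3m46s", "1m56s", "2m2s", "3m33s"]
colab = ["1m46s", "1m34s", "1m23s", "1m29s"]

for name, runs in (("local", local), ("colab", colab)):
    mean = sum(to_seconds(r) for r in runs) / len(runs)
    print(name, format_duration(mean))
```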

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

Here is the notebook

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

| Input SMILES | No. of core structures generated | No. of side chain pairs | Execution time on Colab with GPU (sec) | No. of outputs generated | Notebook |
| --- | --- | --- | --- | --- | --- |
| CC(C)(C)c1nc(c(s1)-c1ccnc(N)n1)-c1cccc(NS(=O)(=O)c2c(F)cccc2F)c1F | 4 | 13 | 1296 | 224 | link |
| CN(C)c1cccc(c1)C(=O)Nc1ccc(C)c(NC(=O)c2ccc(O)cc2)c1 | 2 | 5 | 563 | 200 | link |

from ersilia.

miquelduranfrigola avatar miquelduranfrigola commented on June 10, 2024

Thanks, @Inyrkz - these numbers look very reasonable; ~200 molecules generated is sufficient. Also, on a quick look, I like the molecules generated according to the notebooks. In my opinion, we are ready to wrap up - @GemmaTuron , what do you think?

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

@GemmaTuron

I've made an initial update to the main.py script.
These are the two files main.py and mol_gen.py

from ersilia.

GemmaTuron avatar GemmaTuron commented on June 10, 2024

Hi @Inyrkz !

Thanks for this, it looks good, does it work fine with Ersilia?
I cannot have a deep dive into it this afternoon but will do tomorrow, thanks for the work.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

The run.sh file works.

I've been trying to test it with Ersilia but I keep getting a PingError. My internet connection is bad. I'll try again at night to see if it gets better.

Git LFS initialized.
17:02:22 | DEBUG    | Git LFS has been activated
17:02:47 | ERROR    | Ersilia exception class:
PingError

Detailed error:
No internet connection. Internet connection is required for downloading models from GitHub repositories.

Hints:
Make sure that your computer is connected to the internet and try again. 


🚨🚨🚨 Something went wrong with Ersilia 🚨🚨🚨

Error message:

Ersilia exception class:
PingError

Detailed error:
No internet connection. Internet connection is required for downloading models from GitHub repositories.

Hints:
Make sure that your computer is connected to the internet and try again. 


If this error message is not helpful, open an issue at:
 - https://github.com/ersilia-os/ersilia
Or feel free to reach out to us at:
 - hello[at]ersilia.io

If you haven't, try to run your command in verbose mode (-v in the CLI)
 - You will find the console log file in: /home/affiah/eos/current.log

from ersilia.

miquelduranfrigola avatar miquelduranfrigola commented on June 10, 2024

Thanks @Inyrkz let us know when you have better connection.

from ersilia.

GemmaTuron avatar GemmaTuron commented on June 10, 2024

@Inyrkz

I've had a look at the code and it seems fine, to be able to run it within Ersilia you'll need a few edits:

  1. add the metadata information
  2. Resolve the file path issue. I would not hardcode these paths now that we are already at the implementation stage (I am talking about the input and output files in main.py).

from ersilia.

GemmaTuron avatar GemmaTuron commented on June 10, 2024

Also, line 96 on mol_gen, spotted a typo:
core_structures = self.extract_core_structure(i) should be core_structures = self._extract_core_structure(i)
and other functions as well!

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

Thanks for the update @GemmaTuron & @miquelduranfrigola

I corrected the typos in the mol_gen.py script.

I'm a bit confused about the path issue. The code will only require the input and output files in the main.py. It won't require any other file.

input_file = "data/my_molecules (copy).csv"
output_file = "data/results.csv"

The lines above were just for testing. This is what the main code will look like.

input_file = sys.argv[1]
output_file = sys.argv[2]

Is there any path issue to modify here?
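
For reference, a main.py that takes both paths only from the command line might look roughly like this. It is a sketch, not the actual script: the helper names `read_smiles` and `write_results` are hypothetical, the input is assumed to be a one-column CSV with a "smiles" header, and the generation step is replaced by a placeholder that echoes the input.

```python
import csv
import sys


def read_smiles(path):
    """Read SMILES from a one-column CSV file with a header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header
        return [row[0] for row in reader if row]


def write_results(path, rows, header):
    """Write a header line followed by the result rows."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


def main(argv):
    input_file, output_file = argv[1], argv[2]
    smiles = read_smiles(input_file)
    # Placeholder: echo the input; the real script would call the generator here.
    write_results(output_file, [[s] for s in smiles], ["smiles"])


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv)
```

With this layout nothing in the script depends on where the data folder lives; run.sh only has to forward its two arguments.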

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

Also, my system freezes for hours when testing with ersilia. This usually happens when it gets to the Attempting to delete BentoML part.

I'm trying to set up ersilia on another system, so I can do the testing with the system.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

I'm stuck here

16:57:29 | DEBUG    | Activation done
16:57:29 | DEBUG    | Previous command successfully run inside eos8bhe conda environment
16:57:29 | DEBUG    | Now trying to establish symlinks
16:57:29 | DEBUG    | BentoML location is /Users/ini-abasiaffiah/bentoml/repository/eos8bhe/20240212165727_4E41AB
16:57:29 | DEBUG    | Ersilia Bento location is /Users/ini-abasiaffiah/eos/repository/eos8bhe/20240212165727_4E41AB
16:57:29 | DEBUG    | Building symlinks between /Users/ini-abasiaffiah/eos/repository/eos8bhe/20240212165727_4E41AB and /Users/ini-abasiaffiah/bentoml/repository/eos8bhe/20240212165727_4E41AB
16:57:29 | DEBUG    | Creating model symlink bundle artifacts > dest
16:57:29 | DEBUG    | Creating model_install_commands.sh symlink dest <> bundle
16:57:29 | INFO     | Could not create symbolic link from /Users/ini-abasiaffiah/eos/dest/eos8bhe/data.h5 to /Users/ini-abasiaffiah/eos/isaura/lake/eos8bhe_public.h5
16:57:29 | DEBUG    | Run file found in framework: /Users/ini-abasiaffiah/eos/repository/eos8bhe/20240212165727_4E41AB/eos8bhe/artifacts/framework/run.sh
16:57:29 | DEBUG    | Run commandlines on eos8bhe
16:57:29 | DEBUG    | which python > /var/folders/md/k05hbtgj6zs3jprxfsjbs1540000gn/T/ersilia-i4oy4n5b/tmp.txt
16:57:30 | DEBUG    | Activating base environment
16:57:30 | DEBUG    | Current working directory: /Users/ini-abasiaffiah/ersilia
16:57:30 | DEBUG    | Running bash /var/folders/md/k05hbtgj6zs3jprxfsjbs1540000gn/T/ersilia-fl8xyer9/script.sh 2>&1 | tee -a /var/folders/md/k05hbtgj6zs3jprxfsjbs1540000gn/T/ersilia-w754m208/command_outputs.log
# conda environments:
#
base                     /Users/ini-abasiaffiah/anaconda3
eos8bhe               *  /Users/ini-abasiaffiah/anaconda3/envs/eos8bhe
eosbase-bentoml-0.11.0-py310     /Users/ini-abasiaffiah/anaconda3/envs/eosbase-bentoml-0.11.0-py310
ersilia                  /Users/ini-abasiaffiah/anaconda3/envs/ersilia

16:57:30 | DEBUG    | # conda environments:
#
base                     /Users/ini-abasiaffiah/anaconda3
eos8bhe               *  /Users/ini-abasiaffiah/anaconda3/envs/eos8bhe
eosbase-bentoml-0.11.0-py310     /Users/ini-abasiaffiah/anaconda3/envs/eosbase-bentoml-0.11.0-py310
ersilia                  /Users/ini-abasiaffiah/anaconda3/envs/ersilia


16:57:30 | DEBUG    | Activation done
16:57:30 | DEBUG    | Python executable: /Users/ini-abasiaffiah/anaconda3/envs/eos8bhe/bin/python
16:57:30 | DEBUG    | Conda is needed
16:57:30 | DEBUG    | Checking if model needs to be integrated to a tool
Traceback (most recent call last):
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/connectionpool.py", line 468, in _make_request
    self._validate_conn(conn)
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1097, in _validate_conn
    conn.connect()
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/connection.py", line 642, in connect
    sock_and_verified = _ssl_wrap_socket_and_match_hostname(
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/connection.py", line 783, in _ssl_wrap_socket_and_match_hostname
    ssl_sock = ssl_wrap_socket(
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 471, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(sock, context, tls_in_tls, server_hostname)
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 515, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/ssl.py", line 1104, in _create
    self.do_handshake()
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/ssl.py", line 1375, in do_handshake
    self._sslobj.do_handshake()
TimeoutError: [Errno 60] Operation timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/connectionpool.py", line 845, in urlopen
    retries = retries.increment(
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/util/retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/util/util.py", line 39, in reraise
    raise value
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/connectionpool.py", line 791, in urlopen
    response = self._make_request(
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/connectionpool.py", line 492, in _make_request
    raise new_e
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/connectionpool.py", line 470, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/connectionpool.py", line 371, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Read timed out. (read timeout=None)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/bin/ersilia", line 8, in <module>
    sys.exit(cli())
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/Users/ini-abasiaffiah/ersilia/ersilia/cli/commands/__init__.py", line 22, in wrapper
    return func(*args, **kwargs)
  File "/Users/ini-abasiaffiah/ersilia/ersilia/cli/commands/fetch.py", line 89, in fetch
    _fetch(mf, model_id)
  File "/Users/ini-abasiaffiah/ersilia/ersilia/cli/commands/fetch.py", line 12, in _fetch
    mf.fetch(model_id)
  File "/Users/ini-abasiaffiah/ersilia/ersilia/hub/fetch/fetch.py", line 228, in fetch
    self._fetch(model_id)
  File "/Users/ini-abasiaffiah/ersilia/ersilia/hub/fetch/fetch.py", line 225, in _fetch
    self._fetch_not_from_dockerhub(model_id=model_id)
  File "/Users/ini-abasiaffiah/ersilia/ersilia/hub/fetch/fetch.py", line 137, in _fetch_not_from_dockerhub
    self._content()
  File "/Users/ini-abasiaffiah/ersilia/ersilia/hub/fetch/fetch.py", line 106, in _content
    cg = CardGetter(self.model_id, self.config_json)
  File "/Users/ini-abasiaffiah/ersilia/ersilia/hub/fetch/actions/content.py", line 14, in __init__
    self.mc = ModelCard(config_json=config_json)
  File "/Users/ini-abasiaffiah/ersilia/ersilia/hub/content/card.py", line 738, in __init__
    self.ac = AirtableCard(config_json=config_json)
  File "/Users/ini-abasiaffiah/ersilia/ersilia/hub/content/card.py", line 676, in __init__
    AirtableInterface.__init__(self, config_json=config_json)
  File "/Users/ini-abasiaffiah/ersilia/ersilia/db/hubdata/interfaces.py", line 13, in __init__
    self.api_key = self._get_read_only_airtable_api_key()
  File "/Users/ini-abasiaffiah/ersilia/ersilia/db/hubdata/interfaces.py", line 24, in _get_read_only_airtable_api_key
    r = requests.get(url)
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/requests/adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Read timed out. (read timeout=None)

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

Is there a way to test with ersilia on codespace?

from ersilia.

GemmaTuron avatar GemmaTuron commented on June 10, 2024

Hi @Inyrkz

You can build a codespace from your repo and try run.sh, but I assume you already did this and it worked?

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

Yes, the run.sh file works well.

from ersilia.

GemmaTuron avatar GemmaTuron commented on June 10, 2024

OK, I'm running it and I'll let you know the outcome.

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

Alright

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024
  • I've updated the metadata.json file
  • I've fixed the function names in the mol_gen.py file
  • The main.py only uses sys.argv

I want to install safe-mol==0.1.4 without any dependency

These are the other packages from the pyproject.toml file

keywords = ["safe", "smiles", "de novo", "design", "molecules"]
dependencies = [
    "tqdm",
    "loguru",
    "typer",
    "universal_pathlib",
    "datamol",
    "numpy",
    "torch>=2.0",
    "transformers",
    "datasets",
    "tokenizers",
    "accelerate",
    "evaluate",
    "wandb",
    "huggingface-hub",
    "rdkit"
]

from ersilia.

GemmaTuron avatar GemmaTuron commented on June 10, 2024

Hi @Inyrkz !

great that it works now, can you open a PR?

from ersilia.

GemmaTuron avatar GemmaTuron commented on June 10, 2024

@Inyrkz can you check why the docker upload is failing currently?

from ersilia.

Inyrkz avatar Inyrkz commented on June 10, 2024

Okay, how do I check that?

from ersilia.
