Comments (115)
OK, I think it's all good ideas. Should we talk about it on Thursday?
from ersilia.
you are right @Inyrkz ..
Get as far as you can with the different options and on Thursday we will discuss them thoroughly to adopt the best one - the notebooks you are preparing are very useful, we'll discuss them all.
If you get to a point where you cannot proceed further, start looking into the Virtual Libraries (Mollib) as this will be our next model to work on
from ersilia.
Hi @miquelduranfrigola, @GemmaTuron,
I'll start with the "scaffold morphing." And also take note of the performance of the model.
from ersilia.
I've been able to download the model. My internet is much better now.
pytorch_model.bin: 100%|██████████| 349M/349M [02:51<00:00, 2.03MB/s]
generation_config.json: 100%|██████████| 132/132 [00:00<00:00, 3.84kB/s]
tokenizer.json: 100%|██████████| 46.3k/46.3k [00:00<00:00, 159kB/s]
SAFEDoubleHeadsModel(
(transformer): GPT2Model(
(wte): Embedding(1880, 768)
(wpe): Embedding(1024, 768)
(drop): Dropout(p=0.1, inplace=False)
(h): ModuleList(
(0-11): 12 x GPT2Block(
(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): GPT2Attention(
(c_attn): Conv1D()
(c_proj): Conv1D()
(attn_dropout): Dropout(p=0.1, inplace=False)
(resid_dropout): Dropout(p=0.1, inplace=False)
)
(ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): GPT2MLP(
(c_fc): Conv1D()
(c_proj): Conv1D()
(act): NewGELUActivation()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
...
from ersilia.
Thanks @Inyrkz for the update. Good work, we will use tomorrow's meeting to discuss how to proceed with this. Push the code so I can have a look and prepare!
Thanks
from ersilia.
On a quick look, I think the "core" can be obtained with MurckoScaffolds. But please double-check
from ersilia.
Thanks @Inyrkz - awesome stuff. This is really going in the right direction.
Can you please take a look at this page? https://portal.valencelabs.com/datamol/post/generate-scaffolds-iBUTqU8Im9N2zCM
from ersilia.
Also, @Inyrkz regarding the output of MolFrags:
I think we might not be getting more MolFrags because we are applying it to a MurckoScaffold. What if we do it the other way around, MolFrags and then MurckoScaffold?
from ersilia.
Links:
https://pubs.acs.org/doi/10.1021/acs.jcim.0c00296
https://rdkit.org/docs/source/rdkit.Chem.Scaffolds.rdScaffoldNetwork.html
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4080744/
https://www.zbh.uni-hamburg.de/en/forschung/amd/datasets/brics.html
from ersilia.
I've removed the duplicates. Now we only have 7 outputs for the 4 input SMILES.
The model fails on some SMILES inputs. In this notebook, I use only one SMILES as input, and the model couldn't generate any new molecule for any of the four core structures extracted.
I even tried converting the SMILES to a SAFE before passing it to the model. It didn't work.
I'm not sure why.
from ersilia.
I haven't prepared the Dockerfile yet. I'll work on that.
from ersilia.
Okay, pairs and triplets. So I just do
all_combinations = combinations_2 + combinations_3
I'll experiment to get answers to the questions.
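A minimal sketch of building the pairs and triplets with the standard library, assuming the items being combined are held in a plain list. The helper name pairs_and_triplets and the placeholder labels are hypothetical and only stand in for the real inputs.

```python
from itertools import combinations


def pairs_and_triplets(items):
    """Return every 2-item and 3-item combination of `items` as one list."""
    combinations_2 = list(combinations(items, 2))
    combinations_3 = list(combinations(items, 3))
    return combinations_2 + combinations_3


# Hypothetical placeholder labels, not real fragments or side chains.
items = ["A", "B", "C", "D"]
all_combinations = pairs_and_triplets(items)
print(len(all_combinations))  # 6 pairs + 4 triplets = 10
```

Note that the number of combinations grows quickly with the number of items, which matters if each combination is passed to the model.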
from ersilia.
Here's what I've observed so far (I'll update this list).
- The n_samples_per_trial=12 setting will generate 12 sample outputs when n_trials is set to 1. It may give an empty list for some side chains. Maybe it's because n_trials is 1.
- When n_trials is set to 5, for example, we could get up to 60 sample outputs for each side chain passed to the model. This takes longer to run. Also, for side chains that return an empty list as output (using the previous parameters), we get a non-empty list when we adjust the n_trials hyperparameter. Instead of an empty list, we get about 6 elements.
- Increasing n_trials increases the running time.
- For some side chains, increasing n_trials will still give an empty list.
- It's a bit faster when we reduce n_samples_per_trial. But this mostly depends on n_trials. So it will be a trade-off between the n_trials and n_samples_per_trial parameters to get more samples or reduce generation time. It will take longer to generate more samples.
- The maximum number of samples we can get for a side chain is n_trials * n_samples_per_trial.
I hope this helps.
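The upper bound from the last observation can be written down as a quick check. This is only the arithmetic, not a call into the model; max_samples is a hypothetical helper name, and the actual number of samples returned can be lower (even zero), as noted above.

```python
def max_samples(n_trials, n_samples_per_trial):
    """Upper bound on the samples returned for one side chain."""
    return n_trials * n_samples_per_trial


# The two settings discussed above.
for n_trials, n_samples_per_trial in [(1, 12), (5, 12)]:
    bound = max_samples(n_trials, n_samples_per_trial)
    print(f"n_trials={n_trials}, n_samples_per_trial={n_samples_per_trial}: at most {bound}")
```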
from ersilia.
@GemmaTuron, You're welcome.
I'm yet to test it with Ersilia.
I want to make sure the code is okay before using it in my presentation tomorrow.
It took over 30 mins to run it on my system.
from ersilia.
Thanks for the update @GemmaTuron & @miquelduranfrigola
I corrected the typos in the mol_gen.py script.
I'm a bit confused about the path issue. The code will only require the input and output files in main.py. It won't require any other file.
input_file = "data/my_molecules (copy).csv"
output_file = "data/results.csv"
The lines above were just for testing. This is what the main code will look like.
input_file = sys.argv[1]
output_file = sys.argv[2]
Is there any path issue to modify here?
When I was testing with the specified files instead of the files passed through argv, it was having issues with the paths, that's all!
from ersilia.
Please:
- Update the metadata.json
- Correct the issues on the function names in mol_gen.py
- Correct the main.py to use the sys.argv inputs only
Then, to solve the wandb dependency issue, install safe-mol without dependencies and manually install each package. Specify the versions you got when installing safe outside ersilia
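For the main.py point, here is a rough sketch of a script that takes its paths only from sys.argv. It assumes a CSV input with a header row and SMILES in the first column; the function names and the generate callable are hypothetical placeholders, not the actual contents of main.py or mol_gen.py.

```python
import csv
import sys


def read_smiles(input_file):
    """Read the first column of a CSV file, skipping the header row."""
    with open(input_file, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header
        return [row[0] for row in reader if row]


def write_results(output_file, rows):
    """Write one row per input: the input SMILES followed by its outputs."""
    with open(output_file, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)


def run(input_file, output_file, generate):
    """`generate` is a hypothetical callable: SMILES -> list of new SMILES."""
    rows = [[smi] + list(generate(smi)) for smi in read_smiles(input_file)]
    write_results(output_file, rows)
    return rows


if __name__ == "__main__" and len(sys.argv) == 3:
    # Paths come only from the command line, with no hardcoded test files.
    # The lambda is a stand-in for the real generation function.
    run(sys.argv[1], sys.argv[2], generate=lambda smi: [])
```

Keeping the file handling in small functions makes it easy to test with temporary files without editing the paths in the script.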
from ersilia.
@GemmaTuron, this is what the metadata.json file looks like. Is this okay?
{
"Identifier": "eos8bhe",
"Slug": "scaffold-morphing",
"Status": "In progress",
"Title": "safe",
"Description": "The context discusses a novel notation system called Sequential Attachment-based Fragment Embedding (SAFE) that improves upon traditional molecular string representations like SMILES. SAFE reframes SMILES strings as an unordered sequence of interconnected fragment blocks while maintaining compatibility with existing SMILES parsers. This streamlines complex molecular design tasks by facilitating autoregressive generation under various constraints. The effectiveness of SAFE is demonstrated by training a GPT2-like model on a dataset of 1.1 billion SAFE representations that exhibited versatile and robust optimization performance for molecular design.",
"Mode": "Pretrained",
"Task": ["Generation"],
"Input": ["Compound"],
"Input Shape": "Single",
"Output": ["Compound"],
"Output Type": ["String"],
"Output Shape": "List",
"Interpretation": "Model generates new molecules from input molecule by replacing core structures of input molecule.",
"Tag": "Compound Generation",
"Publication": "https://arxiv.org/pdf/2310.10773.pdf",
"Source Code": "https://github.com/datamol-io/safe/tree/main",
"License": "CC BY 4.0"
}
from ersilia.
I've opened a pull request.
from ersilia.
The model works locally.
This is the code sample
import safe

ibuprofen = "CC(Cc1ccc(cc1)C(C(=O)O)C)C"

# SMILES -> SAFE -> SMILES translation
try:
    ibuprofen_sf = safe.encode(ibuprofen)  # c12ccc3cc1.C3(C)C(=O)O.CC(C)C2
    ibuprofen_smi = safe.decode(ibuprofen_sf, canonical=True)  # CC(C)Cc1ccc(C(C)C(=O)O)cc1
except safe.EncoderError:
    pass
except safe.DecoderError:
    pass

ibuprofen_tokens = list(safe.split(ibuprofen_sf))
print(ibuprofen_tokens)
This is the output
['c', '1', '2', 'c', 'c', 'c', '3', 'c', 'c', '1', '.', 'C', '3', '(', 'C', ')', 'C', '(', '=', 'O', ')', 'O', '.', 'C', 'C', '(', 'C', ')', 'C', '2']
from ersilia.
Hi @Inyrkz !
Good, you can load and run the basic SAFE commands. Datamol is usually very well organised, so hopefully downstream tasks will also be working nicely.
This model has many many functionalities, as you can see in the graph they show in their repo. I need you to:
- Look at the Documentation (particularly molecular design) and play around with the different functions. You can use their examples or we can come up with new ones as well: https://safe-docs.datamol.io/stable/tutorials/design-with-safe.html
- Think of the best design for this. Should we do, one single model that does everything? Is this possible? Or incorporate each of the safe functionalities as independent models?
Then we can start incorporating the model(s) - please as you play with the different possibilities for molecular design, take a note of performance, if anything seems already very slow would be good to know!
This model is really exciting and we will be using it soon in chem-sampler
from ersilia.
Thank you for sharing the documentation. I'll check out other functions. It's an interesting model
from ersilia.
Hi both,
100% agree with @GemmaTuron - let's check their notebooks and try to reproduce one of them. In my opinion, the most interesting is "scaffold morphing" as shown in the tutorial. Let's start with that. If that works, then we can actually create multiple models in the hub, for example: safe-scaffold-morphing, safe-scaffold-decoration, etc. But let's go step by step and start with one case.
To me, the most crucial part now is to reproduce the tutorial to show that the model is not too heavy and take note of performance, as Gemma suggests.
from ersilia.
I've been stuck here (model downloading). My internet connection is poor. I'll try again later tonight
# Load pre-trained model
designer = sf.SAFEDesign.load_default(verbose=True)
designer.model
pytorch_model.bin: 0%| | 0.00/349M [00:00<?, ?B/s]Error while downloading from https://cdn-lfs-us-1.huggingface.co/repos/87/f1/87f1a96893e9890db129b18dcc1e87833c610379b09e91b9d21277d8242d6205/d243f31837ec16c8d6653b8b18aa6225469fa6cbae6057ccaf93b46af46ca8a8?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27pytorch_model.bin%3B+filename%3D%22pytorch_model.bin%22%3B&response-content-type=application%2Foctet-stream&Expires=1705232429&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcwNTIzMjQyOX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzg3L2YxLzg3ZjFhOTY4OTNlOTg5MGRiMTI5YjE4ZGNjMWU4NzgzM2M2MTAzNzliMDllOTFiOWQyMTI3N2Q4MjQyZDYyMDUvZDI0M2YzMTgzN2VjMTZjOGQ2NjUzYjhiMThhYTYyMjU0NjlmYTZjYmFlNjA1N2NjYWY5M2I0NmFmNDZjYThhOD9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoifV19&Signature=P7Zc5wO4gMBRzwflroJDW%7EKGK-qBU1%7E7PUl5U65t14T9fuvT5c-VSXFTYiEaah1r6hlVdn9YPcCzxa8dhiQTWdcpgtQpVjSwcmp-VLLapsFx2HJQt-QEDDyAamQgZ0d2ezbwnpErg2ObpoVCz0ta4M%7ErBr2SuWdx3IAac-exi4EZ2G40HZDSuis7c7s8Id46LdEg3Qsdb1RbsVOoie752ZoT2mHnamaN9SFG5j%7EQe2OG0OEbS4To%7Eb-cXovbtRJ-7ScBY1keLzlqbGhywArqDryKQRUd2ldZlbzmPGvJXjcEMpNP6R2E9gTLiBSZi-YLGh1Lxxf0taqU0g2YD9ocLA__&Key-Pair-Id=KCD77M1F0VK2B: HTTPSConnectionPool(host='cdn-lfs-us-1.huggingface.co', port=443): Read timed out.
Trying to resume download...
While waiting, I'll study the documentation.
from ersilia.
Hi Ini, sorry to hear that.
Not great because it means the model will be pretty large in the Hub as well. Let me try it on my end to see how long it takes.
from ersilia.
This table shows how long it takes to execute the original code.
Tasks | Execution Time (seconds) |
---|---|
Load model (first time) | 134 (depends on internet speed) |
Load model | 55 |
De novo generation | 193.1 |
Scaffold generation | 156.1 |
Scaffold morphing | 54.9 |
Super structure generation | 32.9 |
Motif extension | 25.7 |
Linker generation | 79.8 |
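For reference, timings like these can be collected with a small wrapper around each call. This is a generic sketch using only the standard library; the timed helper is hypothetical and the summed range is just a stand-in for a model call such as scaffold morphing.

```python
import time


def timed(func, *args, **kwargs):
    """Run `func` and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; replace with the generation call being measured.
result, seconds = timed(sum, range(1_000_000))
print(f"Execution time: {seconds:.3f} s")
```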
from ersilia.
Ok, so the model is fast but it is just heavy, so downloading time will be the issue. That is helpful Ini!
Let's focus on the scaffold morphing. As we discussed in the meeting it would be good to play around with it using structures that we know best. @miquelduranfrigola I suggested running a first test with the OSM compounds.
The OSM project has been growing in the last years, but basically, there is a chemical series of interest (series 4) with a known core (triazolopyrazine). Could we try to use the scaffold morphing with that core? (in the docs they provide examples with side chains, and mention it can be done with cores instead but no examples) - could you try to look into the code to see how we could do this?
And then compare the scaffold morphing to the decorator.
As always, we can start working with Notebooks and once we have the results and decide what to exactly incorporate in the hub, we can work on the final model repo structure. I'll approve for the moment so you have a repo to put the code in
from ersilia.
/approve
from ersilia.
New Model Repository Created! 🎉
@Inyrkz ersilia model repository has been successfully created and is available at:
Next Steps ⭐
Now that your new model repository has been created, you are ready to start contributing to it!
Here are some brief starter steps for contributing to your new model repository:
Note: Many of the bullet points below will have extra links if this is your first time contributing to a GitHub repository
- 🍴 Get started by creating a fork of your new model repository - docs
- 👯 Clone your forked repository - docs
- ✏️ Make edits to your new forked model repository - docs - Edits might include:
  - Updating the README.md file to accurately describe your model
  - Adding source code for your model
  - Adding documentation for your model
- 🚀 Open a Pull Request from your forked repository to the original repository. This will allow you to bring your local changes into the new ersilia model repository that was just created! - docs
Additional Resources 📚
If you have any questions, please feel free to open an issue and get support from the community!
from ersilia.
Yes, the model is heavy to download.
I'll start working with the OSM compounds, and see how the scaffold morphing performs.
from ersilia.
I'm still going through the documentation. I found this.
compute_side_chains(mol, core, label_by_index=False) [¶](https://safe-docs.datamol.io/stable/api/safe.html#safe.utils.compute_side_chains)
Compute the side chain of a molecule given a core
Finding the side chains
The algorithm to find the side chains from core assumes that the core we get as input has attachment points. Those attachment points are never considered as part of the query, rather they are used to define the attachment points on the side chains. Removing the attachment points from the core is exactly the same as keeping them.
mol = "CC1=C(C(=NO1)C2=CC=CC=C2Cl)C(=O)NC3C4N(C3=O)C(C(S4)(C)C)C(=O)O"
core0 = "CC1(C)CN2C(CC2=O)S1"
core1 = "CC1(C)SC2C(-*)C(=O)N2C1-*"
core2 = "CC1N2C(SC1(C)C)C(N)C2=O"
side_chain = compute_side_chain(core=core0, mol=mol)
dm.to_image([side_chain, core0, mol])
Therefore on the above, core0 and core1 are equivalent for the molecule mol, but core2 is not.
Do you think we are to convert the core structure to a side chain?
from ersilia.
Also, from this text in the documentation
Scaffold Morphing
In scaffold morphing, we wish to replace a scaffold by another one in a molecule. The process requires as input that the user provides either the side chains or the input molecules and the core
This means we need two inputs, the side chains (or input molecules) and the core. In the example they gave, they didn't input any core. I don't know if they are using any default core. I'm trying to find that part of the documentation (for the pretrained model)
from ersilia.
OK this is extremely promising. Thanks @Inyrkz.
I will inspect the model too and get back to you with suggestions. For now, let's focus on placing the code accordingly in the model template structure and, meanwhile, we can come up with ideas.
On a related note, as previously discussed, it is likely that we will use this model in different modalities (scaffold morphing, linker, etc.). Therefore, I would change the slug to something like: safe-scaffold-morphing. Probably, in the near future, we'll do safe-linker, safe-de-novo, etc. I hope this makes sense.
from ersilia.
@GemmaTuron @miquelduranfrigola
Both scaffold morphing and scaffold decoration require using side chains. To compute the side chains using the compute_side_chains(mol, core, label_by_index=False) function, we need both the SMILES (mol) and the core structure (core0). I don't know how we are going to extract the core structure from the SAFE after converting the SMILES to SAFE.
My code can't run without errors until this is resolved. I'll push what I've written so far.
from ersilia.
This is the link to the notebook
from ersilia.
I found some documentation here. I'll try this.
from ersilia.
This is the function I wrote.
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def extract_core_structure(molecule):
    # Convert SMILES to RDKit molecule
    mol = Chem.MolFromSmiles(molecule)
    # Identify the core structure using the Murcko Scaffold
    core = MurckoScaffold.GetScaffoldForMol(mol)
    # Convert the core structure back to SMILES
    core_smiles = Chem.MolToSmiles(core)
    return core_smiles

mol = "CC1=C(C(=NO1)C2=CC=CC=C2Cl)C(=O)NC3C4N(C3=O)C(C(S4)(C)C)C(=O)O"
core_structure = extract_core_structure(mol)
print("Core Structure:", core_structure)
This is the output: Core Structure: O=C(NC1C(=O)N2CCSC12)c1conc1-c1ccccc1
Do you think this core structure is correct?
I'll try to visualize it in marvinjs.
from ersilia.
Thanks Ini! I think this is still too large. Maybe we can try recursively applying the MurckoScaffold to this structure once more to see if it trims it down?
Otherwise, exploring the MolFragments (I've left some code on the other model request issue) can be useful.
Finally, we can add the binding sites with RDKit as well (check the other issue where I left an example)
from ersilia.
Thanks both. Good stuff.
Unfortunately, I don't think recursively using Murcko will further prune the molecule (I agree it is too big). The MolFragments sounds like a reasonable second step. So: Murcko+MolFragments sounds good.
One question that will arise is: of all fragments, which one is the core? This question is very difficult to answer, so for now, let's just take the largest fragment generated (for example, the one with the highest number of atoms). We may have to explore this manually, but as a starting point I think it makes sense. Opinions?
from ersilia.
I tried this code
mol = "CC1=C(C(=NO1)C2=CC=CC=C2Cl)C(=O)NC3C4N(C3=O)C(C(S4)(C)C)C(=O)O"
mol = Chem.MolFromSmiles(mol)
core = Chem.Scaffolds.MurckoScaffold.GetScaffoldForMol(mol)
fragments = Chem.GetMolFrags(core, asMols=True)
for i in fragments:
    core_smiles = Chem.MolToSmiles(i)
    print(core_smiles)
and I got this output
O=C(NC1C(=O)N2CCSC12)c1conc1-c1ccccc1
I was expecting more than one value from the GetMolFrags() function.
from ersilia.
I tried this code to get the molecule with the highest number of atoms.
mol = "CC1=C(C(=NO1)C2=CC=CC=C2Cl)C(=O)NC3C4N(C3=O)C(C(S4)(C)C)C(=O)O"
mol = Chem.MolFromSmiles(mol)
core = Chem.Scaffolds.MurckoScaffold.GetScaffoldForMol(mol)
fragments = Chem.GetMolFrags(core, asMols=True)
# Select the largest fragment based on the number of atoms
largest_fragment = max(fragments, key=lambda x: x.GetNumAtoms())
# Convert the largest fragment back to SMILES
core_smiles = Chem.MolToSmiles(largest_fragment)
print(core_smiles)
I got this output
O=C(NC1C(=O)N2CCSC12)c1conc1-c1ccccc1
I think it's because the GetMolFrags() function gives only one fragment. I'm going to try this on another sample molecule.
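A short check of why only one fragment comes back: GetMolFrags splits a molecule into its disconnected components, so a single connected scaffold yields exactly one fragment. The sketch below assumes RDKit is installed; the ethanol/benzene mixture is an arbitrary example chosen only because it has two components.

```python
from rdkit import Chem

# One connected scaffold (taken from the outputs in this thread).
single = Chem.MolFromSmiles("c1ccc(CCOc2cncc3nnc(-c4ccccc4)n23)cc1")
# Two disconnected components separated by "." in the SMILES.
mixture = Chem.MolFromSmiles("CCO.c1ccccc1")

n_single = len(Chem.GetMolFrags(single, asMols=True))
n_mixture = len(Chem.GetMolFrags(mixture, asMols=True))
print("connected scaffold:", n_single)
print("two-component mixture:", n_mixture)
```

Breaking a connected molecule into substructures needs a different approach, such as the scaffold network or BRICS resources linked in this thread.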
from ersilia.
Using this molecule mol = "N#Cc1ccc(-c2nnc3cncc(OCCc4ccc(F)c(F)c4)n23)cc1"
I get this as the output for the two code cells: c1ccc(CCOc2cncc3nnc(-c4ccccc4)n23)cc1
I'm guessing this is still too large.
from ersilia.
This molecule molecule = "FC1=CC=C(CCOC2=CN=CC3=NN=C(N23)C2=CC=C(C(OC3CC3)=C2)C2=CC(OC3CC3)=C(Cl)C=C2)C=C1F"
gives this as output c1ccc(CCOc2cncc3nnc(-c4ccc(-c5cccc(OC6CC6)c5)c(OC5CC5)c4)n23)cc1
from ersilia.
Thanks @Inyrkz - awesome stuff. This is really going in the right direction. Can you please take a look at this page? https://portal.valencelabs.com/datamol/post/generate-scaffolds-iBUTqU8Im9N2zCM
@miquelduranfrigola this looks interesting. I'll go through it.
from ersilia.
Also, @Inyrkz regarding the output of MolFrags:
I think we might not be getting more MolFrags because we are applying it to a MurckoScaffold. What if we do it the other way around, MolFrags and then MurckoScaffold?
@GemmaTuron okay, I'll try this to see the difference.
from ersilia.
Using this code
molecule = "CC1=C(C(=NO1)C2=CC=CC=C2Cl)C(=O)NC3C4N(C3=O)C(C(S4)(C)C)C(=O)O"
mol = Chem.MolFromSmiles(molecule)
fragments = Chem.GetMolFrags(mol, asMols=True)
for i in fragments:
    print("Fragment:", Chem.MolToSmiles(i))
    core = Chem.Scaffolds.MurckoScaffold.GetScaffoldForMol(i)
    print("MurckoScaffold:", Chem.MolToSmiles(core))
This is the result I get
Fragment: Cc1onc(-c2ccccc2Cl)c1C(=O)NC1C(=O)N2C1SC(C)(C)C2C(=O)O
MurckoScaffold: O=C(NC1C(=O)N2CCSC12)c1conc1-c1ccccc1
It's still only one fragment generated from the input SMILES. But this time, the fragment and core are different.
Using this molecule molecule = "FC1=CC=C(CCOC2=CN=CC3=NN=C(N23)C2=CC=C(C(OC3CC3)=C2)C2=CC(OC3CC3)=C(Cl)C=C2)C=C1F"
we get
Fragment: Fc1ccc(CCOc2cncc3nnc(-c4ccc(-c5ccc(Cl)c(OC6CC6)c5)c(OC5CC5)c4)n23)cc1F
MurckoScaffold: c1ccc(CCOc2cncc3nnc(-c4ccc(-c5cccc(OC6CC6)c5)c(OC5CC5)c4)n23)cc1
It's only the Fluorine atom that is removed on both ends of the fragment.
from ersilia.
It's a great article. Thanks, it was helpful. The diagram in the article shows how the scaffolds are broken down from level 4 to level 1 to get the main core structure.
They showed two different ways of extracting the scaffold. We can use the RDKit or DataMol package.
from ersilia.
I used the code from the article on the previously mentioned molecule. It still gives the same result.
I also tried to apply the code recursively. Theoretically, this should work like they showed in the diagram, but it doesn't. It still gives the same scaffold as the result. It doesn't break it down any further. The same thing applies to other molecules.
The notebook is here.
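The behaviour described above can be reproduced in a few lines. A Murcko scaffold already consists only of ring systems and the linkers between them, so a second pass should find nothing left to strip. This sketch assumes RDKit is installed; murcko_smiles is a hypothetical helper name, and the molecule is one already used in this thread.

```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold


def murcko_smiles(smiles):
    """Return the canonical SMILES of the Murcko scaffold of `smiles`."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(MurckoScaffold.GetScaffoldForMol(mol))


molecule = "N#Cc1ccc(-c2nnc3cncc(OCCc4ccc(F)c(F)c4)n23)cc1"
once = murcko_smiles(molecule)
twice = murcko_smiles(once)
print("first pass: ", once)
print("second pass:", twice)
print("unchanged by second pass:", once == twice)
```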
from ersilia.
I see, it is a challenging issue!
I also have another suggestion, to ask the user to pass a subset of molecules from the same family, and then use the FindMCS function from rdkit to find the core (maybe you can try to get a few molecules from the OSM dataset and see what the MCS is for those @Inyrkz )
What do you think @miquelduranfrigola ?
from ersilia.
So instead of getting just one molecule and finding the core structure of the molecule, the user will pass a subset of related molecules (because they share a common core structure). Then we use the FindMCS() function to extract the common substructure that represents the core of the chemical series.
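A minimal sketch of the FindMCS idea, assuming RDKit is installed. The three molecules are toy examples (toluene, phenol, aniline) chosen only because their shared core is obvious; they are not OSM compounds, and default FindMCS settings are used.

```python
from rdkit import Chem
from rdkit.Chem import rdFMCS

# Toy series sharing a benzene ring; stand-ins for a real chemical series.
smiles_list = ["Cc1ccccc1", "Oc1ccccc1", "Nc1ccccc1"]
mols = [Chem.MolFromSmiles(smi) for smi in smiles_list]

result = rdFMCS.FindMCS(mols)
print("atoms in common substructure:", result.numAtoms)
print("SMARTS:", result.smartsString)

# The SMARTS can be turned back into a query molecule for matching.
core_query = Chem.MolFromSmarts(result.smartsString)
print(all(m.HasSubstructMatch(core_query) for m in mols))
```

The result is a SMARTS pattern rather than a SMILES, so an extra step would be needed before passing it on as a core.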
from ersilia.
This is the notebook for it.
from ersilia.
OK, thanks @Inyrkz and @GemmaTuron, this is all very interesting. I would strongly try to use only one molecule for now. In this particular case, I'd rather recommend the user to just pass the core. At the end of the day, if they want to morph a particular core, then why shouldn't they pass the core directly? I think this is reasonable.
So, in practical terms, I would do the following:
- DataMol-based scaffold detection up to level 3 (or 4).
- Take the largest scaffold resulting from layer 3 (or 4).
- Use this scaffold as the "core".
So, in case the user already inputs a core or scaffold, it will be OK and, in case they input a molecule with side chains, we'll do our "best effort" to break it up. In my opinion, this is more than reasonable.
Would you agree?
from ersilia.
That will be fine if we want the users to input just the core. We'll convert the core they input to a side chain and then convert it to a SAFE before passing it to the model (or convert it to a SAFE first before converting it to a side chain).
If the user inputs a molecule, the DataMol code can only get up to level 4 (which is one large scaffold).
Should I continue with using the scaffold from level 4?
from ersilia.
I would not go with the strategy of large scaffolds as the molecules obtained will not be that significant.
We can also try what happens if we pass the core as a SMILES string (without the attachment points * ) and we simply try to create them with RDKit, so the user only needs to input the core smiles -- would that work?
from ersilia.
@GemmaTuron, that could work for scaffold decoration. We just focus on converting the core molecule to a side chain, and pass that to the model.
I don't think it will work for scaffold morphing. The users will have to input both the molecule and the core.
from ersilia.
Thanks @GemmaTuron - completely agree with exploring Virtual Libraries (Mollib) meanwhile
from ersilia.
The rdScaffoldNetwork (the Scaffold Network implementation in RDKit) works. It breaks the scaffolds into different levels. The number of scaffolds depends on the molecule size.
This is the notebook implementation. In some molecules, the attachment points are added.
We need to figure out which scaffold will be our core structure, especially for the last code cell in the notebook.
from ersilia.
Thanks Ini, that is very interesting.
To start with, I would discard the fragments that do not show attachment points (*) so that at least the core we select already has them.
We could then filter out by MW, can you add the MW to the structures you are getting so we can establish a cut-off (probably a high and low, we don't want things that are too small or too large)... and then the question remains of whether we want only 1 core per molecule or we would pass up to three cores per molecule, for example.
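The first filter can be sketched as a plain string check on the scaffold SMILES. This is an assumption-laden shortcut: the helper names are hypothetical, the candidate strings are illustrative, and a substring check cannot tell an attachment point from any other use of * in a SMILES string, so it only makes sense once other uses of * have been ruled out.

```python
def has_attachment_point(smiles):
    """True if the SMILES string contains at least one '*' dummy atom."""
    return "*" in smiles


def keep_with_attachment_points(scaffolds):
    """Drop scaffolds that show no attachment point."""
    return [smi for smi in scaffolds if has_attachment_point(smi)]


# Illustrative candidates, not actual notebook output.
candidates = ["c1ccccc1", "*c1ccccc1", "*c1ccc(*)cc1"]
kept = keep_with_attachment_points(candidates)
print(kept)
```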
from ersilia.
@GemmaTuron, Okay.
I tried using the FindMCS function to find the maximum common substructure from the generated list, but it didn't work.
I'll try this.
from ersilia.
@GemmaTuron, @miquelduranfrigola
The results are promising. Filtering by the molecular weight will help a lot. It seems the range of 70-80 is good.
Now, the question remains of whether we want only 1 core per molecule or we would pass up to three cores per molecule.
from ersilia.
Hi @Inyrkz !
Great, I suggest being a bit less restrictive: molecular weight 60-100.
And then, maybe we can select the core with fewer heteroatoms (fewer carbons) so that it is a more interesting scaffold.
With this, we could proceed. By the way, in the output of the model we should also return the core that we used.
from ersilia.
- Filter by molecular weight within the range of 60-100
- Then filter by less heteroatoms core (count the number of 'c' in the strings)
from ersilia.
This result is promising.
One last step. We need to decide which scaffold we want to use, probably based on the number of attachment points.
Or we could work with all of them.
from ersilia.
Awesome stuff, @Inyrkz . Results are indeed very promising.
A few considerations:
- The filtering within the range 60-100 makes sense to me, but let's come up with a fallback strategy in case no scaffolds fall within this range. I would say: if scaffolds exist within this range, filter out the rest. If they don't exist, take the one that is closest to this range.
- Choosing based on the number of attachment points makes sense to me. Using all of them too. Let's do a few tests and have a look at the outputs!
- Let's get rid of scaffolds that are "generic" - in your previous results, that would be scaffolds 2, 4, 6, 8, 10, 12
Thanks again!
from ersilia.
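The selection rule discussed above (keep only fragments with an attachment point, prefer molecular weights of 60-100, otherwise fall back to the nearest one) could be sketched roughly as follows. This is a minimal sketch, not the code from the notebook: the function name `select_cores` is hypothetical, and the molecular weights are assumed to be computed beforehand (for example with RDKit's `Descriptors.MolWt`). Removal of "generic" scaffolds is not covered here.

```python
def select_cores(scaffolds, low=60.0, high=100.0):
    """Pick candidate cores from (smiles, molecular_weight) pairs.

    Hypothetical helper: weights are assumed to be precomputed.
    """
    # Keep only fragments that carry at least one attachment point (*).
    candidates = [(smi, mw) for smi, mw in scaffolds if "*" in smi]
    if not candidates:
        return []

    # Preferred case: one or more scaffolds fall inside the MW window.
    in_range = [(smi, mw) for smi, mw in candidates if low <= mw <= high]
    if in_range:
        return in_range

    # Fallback: take the single scaffold closest to the window.
    def distance_to_window(item):
        mw = item[1]
        return low - mw if mw < low else mw - high

    return [min(candidates, key=distance_to_window)]
```

Returning a list in both cases keeps the caller simple, whether one core or several cores per molecule are passed on.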
@miquelduranfrigola, I have added another condition to get the scaffold that is closer to the range of 60 - 100 as a fallback strategy. The generic scaffolds have been excluded.
I'll be bringing all the functions together to do the scaffold morphing.
- Get Input SMILES
- Generate the core structures from the input SMILES.
- Compute the side chain using the input molecule and the core structure.
- Pass the side chain to the scaffold-morphing model
- Get the results.
from ersilia.
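The steps listed above can be wired together along these lines. This is only a sketch: `extract_cores`, `get_side_chains` and `morph` are placeholder callables standing in for the real functions (core extraction, side-chain computation and the scaffold-morphing model), and the record layout is an assumption. Each record includes the core that was used, as requested earlier in the thread.

```python
def scaffold_morphing_pipeline(smiles, extract_cores, get_side_chains, morph):
    """Run the morphing steps for one input SMILES.

    The three callables are placeholders supplied by the caller.
    """
    results = []
    for core in extract_cores(smiles):
        # Side chains are what remains of the molecule once the core is removed.
        side_chains = get_side_chains(smiles, core)
        if not side_chains:
            continue  # nothing to decorate for this core
        for generated in morph(side_chains):
            # Keep the core alongside each generated molecule.
            results.append({"input": smiles, "core": core, "generated": generated})
    return results
```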
This notebook shows what running the main.py file will look like with these as the input SMILES:
smiles
CC(C)(C)c1nc(c(s1)-c1ccnc(N)n1)-c1cccc(NS(=O)(=O)c2c(F)cccc2F)c1F
CN(C)c1cccc(c1)C(=O)Nc1ccc(C)c(NC(=O)c2ccc(O)cc2)c1
Cc1ccc(Cl)c(Nc2ccccc2C(O)=O)c1Cl
N[C@@H](Cc1ccc(O)c(O)c1)C(O)=O
I'll create another notebook to show the result of only one SMILES as input.
from ersilia.
OK @Inyrkz this is going in the right direction. Could you add a method to keep, for each molecule, only unique molecules? I see repeated molecules in your previous notebook. Getting unique molecules can be done via 1. indexing them with InChIKeys and 2. getting the unique set.
from ersilia.
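The deduplication suggested above (index by a key, then keep the unique set) might look like the sketch below. The helper name `unique_by_key` is hypothetical. The key function is passed in so that the example stays free of dependencies; with RDKit one could build it from `Chem.MolFromSmiles` and `Chem.MolToInchiKey`, assuming the RDKit build has InChI support.

```python
def unique_by_key(smiles_list, key_fn):
    """Keep the first SMILES seen for each key, preserving input order.

    key_fn maps a SMILES string to an identifier such as an InChIKey;
    it may return None for molecules that cannot be parsed.
    """
    seen = set()
    unique = []
    for smi in smiles_list:
        key = key_fn(smi)
        if key is None or key in seen:
            continue  # skip unparseable or already-seen molecules
        seen.add(key)
        unique.append(smi)
    return unique
```

Keying on InChIKeys rather than raw strings matters because the same molecule can be written as several different SMILES.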
@miquelduranfrigola, Thanks for catching the repetition. I'll address it.
from ersilia.
Thanks @Inyrkz - I will investigate it in detail in preparation for our meeting tomorrow.
Meanwhile, have you already prepared the Dockerfile, run.sh, etc. for this model?
from ersilia.
I’ve adjusted the get_side_chain_pairs function. It now gives different pairs of side chains.
I tested the code on this molecule: CC(C)(C)c1nc(c(s1)-c1ccnc(N)n1)-c1cccc(NS(=O)(=O)c2c(F)cccc2F)c1F (one of the molecules that wasn’t working before).
Not all side-chain combinations work, only a few do. This approach is better because it lets us generate at least one side-chain combination that works.
Here is the notebook.
It takes about 7 minutes to generate new molecules for one SMILES.
I want to try all four SMILES.
from ersilia.
Thanks @Inyrkz
The bit of information saying that "not all side chains combo works" is relevant. Actually, we might even want to try to do this with 1 side chain. So, in general, instead of generating pairs, we might want to generate all combinations of, let's say, up to 3 elements.
As an example, consider the following, where "a", "b", "c", "d" would be four side chains.
Here are the combinations for the example list ['a', 'b', 'c', 'd']:
Combinations of 1 element:
('a',)
('b',)
('c',)
('d',)
Combinations of 2 elements:
('a', 'b')
('a', 'c')
('a', 'd')
('b', 'c')
('b', 'd')
('c', 'd')
Combinations of 3 elements:
('a', 'b', 'c')
('a', 'b', 'd')
('a', 'c', 'd')
('b', 'c', 'd')
This demonstrates how to generate all possible combinations of 1, 2, and 3 elements from a given list. You can use the same approach for any list of elements.
This code would generate all combinations
from itertools import combinations
# Example list
example_list = ['a', 'b', 'c', 'd']
# Generating all combinations of 1, 2, and 3 elements
combinations_1 = list(combinations(example_list, 1))
combinations_2 = list(combinations(example_list, 2))
combinations_3 = list(combinations(example_list, 3))
all_combinations = combinations_1 + combinations_2 + combinations_3
What do you think?
from ersilia.
This is great. Thanks for the code sample.
I can try this. The only problem is it may take longer for the model to run.
from ersilia.
@miquelduranfrigola, I've adjusted the get_side_chain_pairs() function to use all combinations.
The problem is that the scaffold-morphing model gives this error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[53], line 1
----> 1 generated_smiles = designer.scaffold_morphing(
      2     side_chains=side_chain_pairs[0],
      3     n_samples_per_trial=12,
      4     n_trials=1,
      5     sanitize=True,
      6     do_not_fragment_further=False,
      7     random_seed=100,
      8 )
     10 print(generated_smiles)
File ~/anaconda3/envs/safe/lib/python3.9/site-packages/safe/sample.py:172, in SAFEDesign.scaffold_morphing(self, side_chains, mol, core, n_samples_per_trial, n_trials, sanitize, do_not_fragment_further, random_seed, **kwargs)
    137 def scaffold_morphing(
    138     self,
    139     side_chains: Optional[Union[dm.Mol, str, List[Union[str, dm.Mol]]]] = None,
   (...)
    147     **kwargs,
    148 ):
    149     """Perform scaffold morphing decoration using the pretrained SAFE model
    150
    151     For scaffold morphing, we try to replace the core by a new one. If the side_chains are provided, we use them.
   (...)
    169         kwargs: any argument to provide to the underlying generation function
...
--> 316     raise ValueError("empty range for randrange() (%d, %d, %d)" % (istart, istop, width))
    318 # Non-unit step argument supplied.
    319 istep = int(step)
ValueError: empty range for randrange() (1, 1, 0)
from ersilia.
This is the input of the get_side_chain_pairs() function.
[1*]c1cccc(NS(=O)(=O)c2c(F)cccc2F)c1F.[2*]C(C)(C)C.[3*]c1ccnc(N)n1
This is the output.
['[1*]c1cccc(NS(=O)(=O)c2c(F)cccc2F)c1F', '[1*]C(C)(C)C', '[1*]c1ccnc(N)n1', '[1*]c1cccc(NS(=O)(=O)c2c(F)cccc2F)c1F.[2*]C(C)(C)C', '[1*]c1cccc(NS(=O)(=O)c2c(F)cccc2F)c1F.[2*]c1ccnc(N)n1', '[1*]C(C)(C)C.[2*]c1ccnc(N)n1', '[1*]c1cccc(NS(=O)(=O)c2c(F)cccc2F)c1F.[2*]C(C)(C)C.[3*]c1ccnc(N)n1']
I've noticed that the model crashes for single side chains like these: '[1*]c1cccc(NS(=O)(=O)c2c(F)cccc2F)c1F', '[1*]C(C)(C)C', '[1*]c1ccnc(N)n1'.
The input has to be at least a pair.
from ersilia.
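Since single side chains crash the model, the combinations could be restricted to pairs and triplets. The sketch below is not the actual get_side_chain_pairs() implementation: `side_chain_combinations` is a hypothetical name, and it assumes the side chains arrive as one dot-separated SMILES string in which every fragment has exactly one numbered attachment point, as in the input shown above. Attachment points are renumbered from 1 within each combination, matching the output listed earlier.

```python
import re
from itertools import combinations


def side_chain_combinations(side_chains, min_size=2, max_size=3):
    """Build dot-joined side-chain combinations of min_size..max_size fragments."""
    fragments = side_chains.split(".")
    results = []
    for size in range(min_size, max_size + 1):
        for combo in combinations(fragments, size):
            # Renumber the attachment points 1..n inside this combination.
            renumbered = [
                re.sub(r"\[\d+\*\]", f"[{i}*]", fragment)
                for i, fragment in enumerate(combo, start=1)
            ]
            results.append(".".join(renumbered))
    return results
```

With the three-fragment input above and the default sizes, this yields the three pairs followed by the full triplet, and no single side chains.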
It takes 4 minutes for the model to run predictions for 4 input SMILES on Google Colab (GPU)
from ersilia.
OK @Inyrkz thanks - GPU is a bit faster then. Noted.
Let's then do pairs and triplets.
One question: does the n_trials parameter help in generating more molecules?
Another question: does reducing the number of molecules per trial increase speed?
from ersilia.
Thanks @Inyrkz
This is useful. We will have to accept that this is a slow model.
Let's do n_trials=10 and n_samples_per_trial=10, if you agree?
Can we try this config for a few molecules, keeping an eye on:
- Time (seconds) per molecule
- Number of molecules
- Uniqueness of molecules
- Quality of molecules: for this, just please paste images of the generated compounds (including the input)
Almost there!
from ersilia.
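The first three checks in the list above (time, number of molecules, uniqueness) could be collected with a small helper like this one. It is a sketch under assumptions: `benchmark_generation` is a hypothetical name, and `generate` stands in for whatever callable wraps the scaffold-morphing model and returns SMILES strings. Uniqueness is counted on raw strings here; counting on InChIKeys would be stricter.

```python
import time


def benchmark_generation(generate, smiles):
    """Time one generation call and summarise its output."""
    start = time.perf_counter()
    molecules = list(generate(smiles))
    elapsed = time.perf_counter() - start

    unique = set(molecules)
    return {
        "seconds": elapsed,
        "n_generated": len(molecules),
        "n_unique": len(unique),
        # Fraction of generated molecules that are distinct strings.
        "uniqueness": len(unique) / len(molecules) if molecules else 0.0,
    }
```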
Alright,
I'll try n_trials=10 and n_samples_per_trial=10.
Yup, almost there!
from ersilia.
This table shows how long it takes to execute the original code.
Tasks | Execution time on my system | Execution time on Colab with GPU |
---|---|---|
Side chain 1 | 3m46s | 1m46s |
Side chain 2 | 1m56s | 1m34s |
Side chain 3 | 2m2s | 1m23s |
Side chain 4 | 3m33s | 1m29s |
Total time | 24 mins (using a loop) | 6 mins (using a loop) |
Average generation time per side chain | 2m49s | 1m33s |
from ersilia.
Here is the notebook
from ersilia.
Input SMILES | No. of core structures generated | No. of side-chain pairs | Execution time on Colab with GPU (sec) | No. of outputs generated | Notebook |
---|---|---|---|---|---|
CC(C)(C)c1nc(c(s1)-c1ccnc(N)n1)-c1cccc(NS(=O)(=O)c2c(F)cccc2F)c1F | 4 | 13 | 1296 | 224 | link |
CN(C)c1cccc(c1)C(=O)Nc1ccc(C)c(NC(=O)c2ccc(O)cc2)c1 | 2 | 5 | 563 | 200 | link |
from ersilia.
Thanks, @Inyrkz - these numbers look very reasonable; ~200 molecules generated is sufficient. Also, on a quick look, I like the molecules generated according to the notebooks. In my opinion, we are ready to wrap up - @GemmaTuron , what do you think?
from ersilia.
I've made an initial update to the main.py script.
These are the two files: main.py and mol_gen.py
from ersilia.
Hi @Inyrkz !
Thanks for this, it looks good, does it work fine with Ersilia?
I cannot have a deep dive into it this afternoon but will do tomorrow, thanks for the work.
from ersilia.
The run.sh file works.
I've been trying to test it with Ersilia but I keep getting a PingError. My internet connection is bad. I'll try again at night to see if it gets better.
Git LFS initialized.
17:02:22 | DEBUG | Git LFS has been activated
17:02:47 | ERROR | Ersilia exception class:
PingError
Detailed error:
No internet connection. Internet connection is required for downloading models from GitHub repositories.
Hints:
Make sure that your computer is connected to the internet and try again.
🚨🚨🚨 Something went wrong with Ersilia 🚨🚨🚨
Error message:
Ersilia exception class:
PingError
Detailed error:
No internet connection. Internet connection is required for downloading models from GitHub repositories.
Hints:
Make sure that your computer is connected to the internet and try again.
If this error message is not helpful, open an issue at:
- https://github.com/ersilia-os/ersilia
Or feel free to reach out to us at:
- hello[at]ersilia.io
If you haven't, try to run your command in verbose mode (-v in the CLI)
- You will find the console log file in: /home/affiah/eos/current.log
from ersilia.
Thanks @Inyrkz let us know when you have better connection.
from ersilia.
I've had a look at the code and it seems fine. To run it within Ersilia you'll need a few edits:
- Add the metadata information
- Solve the path issue for the files. I would not hardcode these now that we are already at the implementation stage (I am talking about the input and output files in main.py)
from ersilia.
Also, line 96 on mol_gen, spotted a typo:
core_structures = self.extract_core_structure(i)
should be core_structures = self._extract_core_structure(i)
and other functions as well!
from ersilia.
Thanks for the update @GemmaTuron & @miquelduranfrigola
I corrected the typos in the mol_gen.py script.
I'm a bit confused about the path issue. The code will only require the input and output files in the main.py. It won't require any other file.
input_file = "data/my_molecules (copy).csv"
output_file = "data/results.csv"
The lines above were just for testing. This is what the main code will look like.
input_file = sys.argv[1]
output_file = sys.argv[2]
Is there any path issue to modify here?
from ersilia.
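To avoid hardcoded paths, the argument handling in a script like main.py could be sketched as below. This is not the actual main.py: `parse_paths` and `read_smiles` are hypothetical helper names, and the input is assumed to be a one-column CSV with a `smiles` header, as in the example input shown earlier in the thread.

```python
import csv


def parse_paths(argv):
    """Return (input_file, output_file) from a command line such as sys.argv."""
    if len(argv) != 3:
        # Fail early with a usage message instead of an IndexError.
        raise SystemExit("usage: python main.py <input.csv> <output.csv>")
    return argv[1], argv[2]


def read_smiles(input_file):
    """Read a one-column CSV with a header row and return the SMILES strings."""
    with open(input_file, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return [row[0] for row in reader if row]
```

In a real script, `parse_paths(sys.argv)` would be called once at the top, so the test paths never need to be edited into the code.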
Also, my system freezes for hours when testing with ersilia. This usually happens when it gets to the "Attempting to delete BentoML" part.
I'm trying to set up ersilia on another system, so I can do the testing with the system.
from ersilia.
I'm stuck here
16:57:29 | DEBUG | Activation done
16:57:29 | DEBUG | Previous command successfully run inside eos8bhe conda environment
16:57:29 | DEBUG | Now trying to establish symlinks
16:57:29 | DEBUG | BentoML location is /Users/ini-abasiaffiah/bentoml/repository/eos8bhe/20240212165727_4E41AB
16:57:29 | DEBUG | Ersilia Bento location is /Users/ini-abasiaffiah/eos/repository/eos8bhe/20240212165727_4E41AB
16:57:29 | DEBUG | Building symlinks between /Users/ini-abasiaffiah/eos/repository/eos8bhe/20240212165727_4E41AB and /Users/ini-abasiaffiah/bentoml/repository/eos8bhe/20240212165727_4E41AB
16:57:29 | DEBUG | Creating model symlink bundle artifacts > dest
16:57:29 | DEBUG | Creating model_install_commands.sh symlink dest <> bundle
16:57:29 | INFO | Could not create symbolic link from /Users/ini-abasiaffiah/eos/dest/eos8bhe/data.h5 to /Users/ini-abasiaffiah/eos/isaura/lake/eos8bhe_public.h5
16:57:29 | DEBUG | Run file found in framework: /Users/ini-abasiaffiah/eos/repository/eos8bhe/20240212165727_4E41AB/eos8bhe/artifacts/framework/run.sh
16:57:29 | DEBUG | Run commandlines on eos8bhe
16:57:29 | DEBUG | which python > /var/folders/md/k05hbtgj6zs3jprxfsjbs1540000gn/T/ersilia-i4oy4n5b/tmp.txt
16:57:30 | DEBUG | Activating base environment
16:57:30 | DEBUG | Current working directory: /Users/ini-abasiaffiah/ersilia
16:57:30 | DEBUG | Running bash /var/folders/md/k05hbtgj6zs3jprxfsjbs1540000gn/T/ersilia-fl8xyer9/script.sh 2>&1 | tee -a /var/folders/md/k05hbtgj6zs3jprxfsjbs1540000gn/T/ersilia-w754m208/command_outputs.log
# conda environments:
#
base /Users/ini-abasiaffiah/anaconda3
eos8bhe * /Users/ini-abasiaffiah/anaconda3/envs/eos8bhe
eosbase-bentoml-0.11.0-py310 /Users/ini-abasiaffiah/anaconda3/envs/eosbase-bentoml-0.11.0-py310
ersilia /Users/ini-abasiaffiah/anaconda3/envs/ersilia
16:57:30 | DEBUG | # conda environments:
#
base /Users/ini-abasiaffiah/anaconda3
eos8bhe * /Users/ini-abasiaffiah/anaconda3/envs/eos8bhe
eosbase-bentoml-0.11.0-py310 /Users/ini-abasiaffiah/anaconda3/envs/eosbase-bentoml-0.11.0-py310
ersilia /Users/ini-abasiaffiah/anaconda3/envs/ersilia
16:57:30 | DEBUG | Activation done
16:57:30 | DEBUG | Python executable: /Users/ini-abasiaffiah/anaconda3/envs/eos8bhe/bin/python
16:57:30 | DEBUG | Conda is needed
16:57:30 | DEBUG | Checking if model needs to be integrated to a tool
Traceback (most recent call last):
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/connectionpool.py", line 468, in _make_request
self._validate_conn(conn)
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1097, in _validate_conn
conn.connect()
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/connection.py", line 642, in connect
sock_and_verified = _ssl_wrap_socket_and_match_hostname(
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/connection.py", line 783, in _ssl_wrap_socket_and_match_hostname
ssl_sock = ssl_wrap_socket(
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 471, in ssl_wrap_socket
ssl_sock = _ssl_wrap_socket_impl(sock, context, tls_in_tls, server_hostname)
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 515, in _ssl_wrap_socket_impl
return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/ssl.py", line 513, in wrap_socket
return self.sslsocket_class._create(
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/ssl.py", line 1104, in _create
self.do_handshake()
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/ssl.py", line 1375, in do_handshake
self._sslobj.do_handshake()
TimeoutError: [Errno 60] Operation timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/connectionpool.py", line 845, in urlopen
retries = retries.increment(
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/util/retry.py", line 470, in increment
raise reraise(type(error), error, _stacktrace)
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/util/util.py", line 39, in reraise
raise value
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/connectionpool.py", line 791, in urlopen
response = self._make_request(
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/connectionpool.py", line 492, in _make_request
raise new_e
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/connectionpool.py", line 470, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/urllib3/connectionpool.py", line 371, in _raise_timeout
raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Read timed out. (read timeout=None)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/bin/ersilia", line 8, in <module>
sys.exit(cli())
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/Users/ini-abasiaffiah/ersilia/ersilia/cli/commands/__init__.py", line 22, in wrapper
return func(*args, **kwargs)
File "/Users/ini-abasiaffiah/ersilia/ersilia/cli/commands/fetch.py", line 89, in fetch
_fetch(mf, model_id)
File "/Users/ini-abasiaffiah/ersilia/ersilia/cli/commands/fetch.py", line 12, in _fetch
mf.fetch(model_id)
File "/Users/ini-abasiaffiah/ersilia/ersilia/hub/fetch/fetch.py", line 228, in fetch
self._fetch(model_id)
File "/Users/ini-abasiaffiah/ersilia/ersilia/hub/fetch/fetch.py", line 225, in _fetch
self._fetch_not_from_dockerhub(model_id=model_id)
File "/Users/ini-abasiaffiah/ersilia/ersilia/hub/fetch/fetch.py", line 137, in _fetch_not_from_dockerhub
self._content()
File "/Users/ini-abasiaffiah/ersilia/ersilia/hub/fetch/fetch.py", line 106, in _content
cg = CardGetter(self.model_id, self.config_json)
File "/Users/ini-abasiaffiah/ersilia/ersilia/hub/fetch/actions/content.py", line 14, in __init__
self.mc = ModelCard(config_json=config_json)
File "/Users/ini-abasiaffiah/ersilia/ersilia/hub/content/card.py", line 738, in __init__
self.ac = AirtableCard(config_json=config_json)
File "/Users/ini-abasiaffiah/ersilia/ersilia/hub/content/card.py", line 676, in __init__
AirtableInterface.__init__(self, config_json=config_json)
File "/Users/ini-abasiaffiah/ersilia/ersilia/db/hubdata/interfaces.py", line 13, in __init__
self.api_key = self._get_read_only_airtable_api_key()
File "/Users/ini-abasiaffiah/ersilia/ersilia/db/hubdata/interfaces.py", line 24, in _get_read_only_airtable_api_key
r = requests.get(url)
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/requests/api.py", line 73, in get
return request("get", url, params=params, **kwargs)
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/Users/ini-abasiaffiah/anaconda3/envs/ersilia/lib/python3.10/site-packages/requests/adapters.py", line 532, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Read timed out. (read timeout=None)
from ersilia.
Is there a way to test with ersilia on codespace?
from ersilia.
Hi @Inyrkz
You can build a codespace from your repo and try run.sh, but I assume you already did this and it worked?
from ersilia.
Yes, the run.sh file works well.
from ersilia.
OK, I'm running it and I'll let you know the outcome.
from ersilia.
Alright
from ersilia.
- I've updated the metadata.json file
- I've fixed the function names in the mol_gen.py file
- The main.py only uses sys.argv
I want to install safe-mol==0.1.4 without any dependencies.
These are the other packages from the pyproject.toml file:
keywords = ["safe", "smiles", "de novo", "design", "molecules"]
dependencies = [
"tqdm",
"loguru",
"typer",
"universal_pathlib",
"datamol",
"numpy",
"torch>=2.0",
"transformers",
"datasets",
"tokenizers",
"accelerate",
"evaluate",
"wandb",
"huggingface-hub",
"rdkit"
]
from ersilia.
Hi @Inyrkz !
great that it works now, can you open a PR?
from ersilia.
@Inyrkz can you check why the docker upload is failing currently?
from ersilia.
Okay, how do I check that?
from ersilia.