xinhaoli74 / molpmofit

Jupyter Notebook 96.31% Python 3.69%

molpmofit's Issues

Ask a question

Hello Authors,

I have questions regarding your publications "Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT" and "SMILES Pair Encoding: A Data-Driven Substructure Tokenization Algorithm for Deep Learning".

I ran into the following problems while reproducing your work and would like to ask for your advice.

1. For the first paper: what is the data augmentation in the uploaded utils.py based on, and how is it determined that the augmented molecules have the same properties as the original molecules? Also, part of this code raises errors for me, and I do not know why.
2. For the second paper: you used SPE to split the molecules into tokens, which performs better than ECFP encoding, but are the resulting substructures accurate from an interpretation standpoint? Also, after the molecules are split this way, in what form are the data fed into the network model?
3. Could you share the complete code?

I hope to hear from you. Thank you very much!

Translated with www.DeepL.com/Translator (free version)

Student Lu

2022.11.02
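On the question of what form the data takes when it reaches the network: a minimal sketch, assuming a plain atom-level tokenizer rather than SPE or the repository's own tokenizer. The regular expression, the special tokens `<unk>` and `<pad>`, and the helper names are all hypothetical and chosen only for illustration. The point is that each SMILES string is split into tokens and each token is replaced by an integer id from a fixed vocabulary, so the model sees a sequence of integers.

```python
import re

# Hypothetical atom-level pattern (an assumption, not the repository's tokenizer):
# bracket atoms, two-letter halogens, two-digit ring closures, then any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into a list of string tokens."""
    return SMILES_TOKEN.findall(smiles)


def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    """Assign an integer id to every token, in order of first appearance."""
    itos = list(specials)
    for tokens in token_lists:
        for token in tokens:
            if token not in itos:
                itos.append(token)
    stoi = {token: index for index, token in enumerate(itos)}
    return itos, stoi


def numericalize(tokens, stoi):
    """Map tokens to ids; tokens missing from the vocabulary become <unk>."""
    return [stoi.get(token, stoi["<unk>"]) for token in tokens]


train = ["CCO", "c1ccccc1Cl", "C[NH3+]"]
itos, stoi = build_vocab(tokenize(s) for s in train)

# "Br" never appeared in the training strings, so it maps to the <unk> id.
ids = numericalize(tokenize("CCBr"), stoi)
print(itos)
print(ids)
```

With SPE the token boundaries would differ (tokens can span several atoms), but the idea of feeding integer ids to an embedding layer stays the same under these assumptions.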

[01_MSPM_Pretraining.ipynb] nan valid_loss when doing learner.lr_find

Thanks for such great work!

I am trying to replicate your work. While running 01_MSPM_Pretraining.ipynb, I found that valid_loss is reported as #na# when executing learner.lr_find().

I am still trying to figure out the cause. Did you encounter this in your experiments?

Prediction on new dataset

I trained a model on experimental data (split into train, valid, and test sets) and want to use it to make predictions on an independent library. For prediction, I used the example script and substituted the new library for the test data. What is the best practice in this case regarding qsar_vocab? In the example, qsar_vocab seems to be built from the train and valid data:

    # Language-model DataBunch; its vocab is built from the augmented train/valid SMILES.
    qsar_vocab = TextLMDataBunch.from_df(
        path, train_aug, valid_aug, bs=bs, tokenizer=tok, chunksize=50000,
        text_cols=0, label_cols=1, max_vocab=60000, include_bos=False)

    # Classifier DataBunch for the test set, reusing that vocab.
    test_data_clas = TextClasDataBunch.from_df(
        path, train, test, bs=bs, tokenizer=tok, chunksize=50000,
        text_cols='smiles', label_cols='label', vocab=qsar_vocab.vocab,
        max_vocab=60000, include_bos=False)

When I now use the new library as the test data, does qsar_vocab, which comes from the experimental library used for training and validation, influence the results? And why does test_data_clas need a reference to the train data?
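One way to get a feel for whether the training vocabulary matters for a new library is to count how many of the library's tokens it does not contain, since a model with a fixed vocabulary can only map such tokens to a single unknown id. The following is a rough sketch on already-tokenized SMILES; `oov_report`, the toy vocabulary, and the toy library are hypothetical, and `xxunk`/`xxpad` merely stand in for the special tokens.

```python
from collections import Counter


def oov_report(train_itos, tokenized_library):
    """Count library tokens that are absent from the training vocabulary.

    Returns a dict {token: count} of unseen tokens and the fraction of all
    token occurrences in the library that are unseen.
    """
    known = set(train_itos)
    counts = Counter(token for tokens in tokenized_library for token in tokens)
    unseen = {token: n for token, n in counts.items() if token not in known}
    total = sum(counts.values())
    fraction = sum(unseen.values()) / total if total else 0.0
    return unseen, fraction


# Toy vocabulary standing in for the one built from train/valid data.
train_itos = ["xxunk", "xxpad", "C", "c", "O", "N", "1", "(", ")", "="]

# Toy "new library", already split into tokens.
new_library = [
    ["C", "C", "O"],
    ["c", "1", "c", "c", "c", "c", "c", "1", "Br"],
    ["C", "[Si]", "C"],
]

unseen, fraction = oov_report(train_itos, new_library)
print(unseen)
print(f"{fraction:.1%} of token occurrences are outside the training vocabulary")
```

If the unseen fraction is near zero, the choice of vocabulary is unlikely to be what drives the predictions; if it is large, many atoms in the new library are effectively invisible to the model.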

`KeyError: '0.encoder.weight'` when running classification task

Good morning, and thanks for the interesting work!
Could you please explain in more detail how a pretrained network can be used for a classification task?

As far as I understand, a user should download at least the ChEMBL_1M_atom models folder and place it in a parent folder inside the project. Then notebook 05_Pretrained_Models.ipynb should be run in full. After that, the user can skip the fine-tuning procedure in notebook 02_MSPM_TS_finetuning.ipynb and go directly to notebook 03_QSAR_Classifcation.ipynb.

I am having trouble running the code in notebook 03_QSAR_Classifcation.ipynb.
Specifically, I keep getting the following error when running
`lm_learner = lm_learner.load_pretrained(*fnames)`:
`KeyError: '0.encoder.weight'`.

Even when running the code as-is on a fresh clone in Google Colab, without any modification, I cannot get rid of this error.
The only packages I installed to make the code run were RDKit and FastAI (v. 1.0.61).

Thank you in advance for your time!
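A possible first diagnostic step, sketched here under assumptions: check which parameter names the weights file actually contains, since the error says the key `0.encoder.weight` was not found. The helper `summarize_checkpoint` is hypothetical, the idea that the state dict may be nested under a `'model'` entry is an assumption about how the file was saved, and the key names in the fake checkpoint are invented for illustration. In practice the dictionary would come from loading the `.pth` file with PyTorch; a plain dict is used here so the sketch runs on its own.

```python
def summarize_checkpoint(checkpoint, wanted="0.encoder.weight"):
    """Report whether `wanted` is present and list encoder-like keys.

    Assumption: the parameters are either at the top level of `checkpoint`
    or nested under a 'model' entry.
    """
    state = checkpoint["model"] if "model" in checkpoint else checkpoint
    keys = list(state.keys())
    candidates = [key for key in keys if key.endswith("encoder.weight")]
    return {
        "has_wanted": wanted in state,
        "candidates": candidates,
        "n_keys": len(keys),
    }


# Fake checkpoint with invented key names, standing in for a loaded .pth file.
fake_checkpoint = {
    "model": {"0.module.encoder.weight": None, "1.layers.0.weight": None},
    "opt": {},
}

report = summarize_checkpoint(fake_checkpoint)
print(report)
```

If `has_wanted` is false but `candidates` is not empty, the file probably holds weights saved from a differently structured model than the one being loaded, which would at least narrow down where the mismatch comes from.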

conda env dependency versioning

Hi,

Could you please push a conda .yml that includes versioning? I'm trying to reproduce your work but I'm running into a range of versioning issues, primarily in fastai (which seems to have had some major refactoring over the last year or so).
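Until a pinned environment file is available, anyone with a working setup could record the versions they have installed. A small sketch using only the standard library (Python 3.8+); the package list is a guess at what the notebooks need, and the names are pip distribution names, which may not match how a conda install registers a package.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed list of relevant packages; adjust to the actual environment.
report = installed_versions(["fastai", "torch", "rdkit", "scikit-learn"])
for name, version in report.items():
    print(f"{name}=={version}" if version else f"{name}: not installed")
```

The printed `name==version` lines can be pasted into an issue or used as a starting point for pinning the environment.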

Conda environment doesn't include sklearn

It looks like molpmofit.yml doesn't include scikit-learn, which is necessary to run the notebooks.

Can you share a notebook showing how to use the ChEMBL_1M Atom model

Hi, thanks for sharing the code and workflow. Could you share how to use the ChEMBL_1M Atom model on a regression task to predict new molecules? I am getting a vocabulary-related error on the BBBP data: it has 48 vocabulary entries, while ChEMBL has 80. I would appreciate your feedback.

Sincerely
Abhik
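A size mismatch like 48 versus 80 usually means the pretrained embedding table has one row per token of the pretraining vocabulary, while the new dataset defines a different vocabulary. One common remedy is to rebuild the table for the new vocabulary by copying rows token by token. The sketch below illustrates that idea with plain Python lists; it is not the fastai implementation, `remap_embedding` is a hypothetical helper, and using the mean row for tokens missing from the pretrained vocabulary is an assumed choice.

```python
def remap_embedding(pretrained_rows, old_itos, new_itos):
    """Build an embedding table ordered by `new_itos`.

    Tokens shared with the pretrained vocabulary reuse their pretrained row;
    tokens the pretrained model never saw get the mean of all pretrained rows.
    """
    old_stoi = {token: index for index, token in enumerate(old_itos)}
    dim = len(pretrained_rows[0])
    mean_row = [
        sum(row[j] for row in pretrained_rows) / len(pretrained_rows)
        for j in range(dim)
    ]
    return [
        list(pretrained_rows[old_stoi[token]]) if token in old_stoi else list(mean_row)
        for token in new_itos
    ]


# Toy pretrained vocabulary with 2-dimensional embeddings.
old_itos = ["xxunk", "C", "O", "N"]
pretrained = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [4.0, 2.0]]

# Toy downstream vocabulary: smaller, reordered, and with one new token.
new_itos = ["xxunk", "O", "[Si]"]

remapped = remap_embedding(pretrained, old_itos, new_itos)
print(remapped)
```

The alternative is to keep the pretraining vocabulary unchanged and tokenize the new dataset against it, which avoids the remapping but sends any unseen token to the unknown id.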

