
molpmofit's People

Contributors

vhorvath, xinhaoli74

molpmofit's Issues

Ask a question

Hello Authors.

Regarding your publications "Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT" and "SMILES Pair Encoding: a Data-Driven Substructure Tokenization Algorithm for Deep Learning":

I have run into the following problems while reproducing your work and would like to ask for your advice.

  1. In the first paper, what is the data augmentation in the uploaded utils.py based on, and how is it established that the augmented molecules have the same properties as the original molecules? Also, part of this code raises errors for me; do you know what the cause might be?

  2. In the second paper, you split molecules into tokens with SPE, which performs better than the ECFP encoding. Are the resulting substructures accurate from an interpretation standpoint? Also, after the molecules are split this way, in what form is the data fed into the network model?

  3. Could you share the complete code?

I hope to hear from you. Thank you very much!

Lu

2022.11.02
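On the second question (what form the data takes once a molecule has been split into tokens), here is a minimal sketch of the usual text-model pipeline: tokenize the SMILES string, build a vocabulary, and pass integer ids to the network. This is not the authors' code. The regular expression, the special tokens `<unk>` and `<pad>`, and the helper names are assumptions chosen for illustration. SPE would differ in the tokenization step, where frequent token pairs become larger substructure tokens, but the ids-in form stays the same.

```python
import re
from collections import Counter

# Hypothetical atom-level pattern: bracket atoms, two-letter halogens,
# two-digit ring closures, then any remaining single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

def tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    return SMILES_TOKEN.findall(smiles)

def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    """Map each token to an integer id, special tokens first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = list(specials) + [tok for tok, _ in counts.most_common()]
    return {tok: i for i, tok in enumerate(itos)}

def numericalize(tokens, stoi):
    """Replace each token by its id; unseen tokens map to <unk>."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

corpus = ["CCO", "c1ccccc1Br", "C[NH3+]"]
token_lists = [tokenize(s) for s in corpus]
stoi = build_vocab(token_lists)
ids = numericalize(tokenize("CCBr"), stoi)  # list of ints fed to the embedding layer
```

The embedding layer of the model then looks up one vector per id, so the network never sees the SMILES characters themselves.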

[01_MSPM_Pretraining.ipynb] nan valid_loss when doing learner.lr_find

Thanks for such great work!

I am trying to replicate your work. While running 01_MSPM_Pretraining.ipynb, I found that valid_loss shows #na# when executing learner.lr_find(), as shown below:

[Screenshot: nan_valid_loss]

I am still trying to figure out the cause. Did you encounter this in your experiments?

Prediction on new dataset

Hello,

I trained a model on experimental data (split into train, valid, and test sets) and want to use it for prediction on an independent library. For prediction, I used the example script and substituted the new library for the test data. I was wondering what the best practice is in this case regarding qsar_vocab. In the example, qsar_vocab seems to be built from the train and valid data:

    qsar_vocab = TextLMDataBunch.from_df(
        path, train_aug, valid_aug, bs=bs, tokenizer=tok,
        chunksize=50000, text_cols=0, label_cols=1,
        max_vocab=60000, include_bos=False)

    test_data_clas = TextClasDataBunch.from_df(
        path, train, test, bs=bs, tokenizer=tok,
        chunksize=50000, text_cols='smiles', label_cols='label',
        vocab=qsar_vocab.vocab, max_vocab=60000, include_bos=False)

When I now use the new library as test data, does qsar_vocab, which comes from the experimental library used for training and validation, influence the results? Why does test_data_clas need a reference to the train data?
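Some context that may help with this question: a trained embedding matrix is indexed by the training vocabulary, so new data generally has to be numericalized with that same vocabulary. Tokens that never appeared in training fall back to the unknown-token id. The sketch below is a plain-Python illustration of that idea, not fastai code. The helper name `encode_with_train_vocab` and the toy vocabulary are hypothetical, and `xxunk` is used on the assumption that the fastai v1 default unknown token applies.

```python
def encode_with_train_vocab(new_token_lists, train_itos, unk="xxunk"):
    """Numericalize a new library with the training vocabulary.

    Returns the encoded sequences and the sorted list of tokens that the
    training vocabulary does not contain.
    """
    stoi = {tok: i for i, tok in enumerate(train_itos)}
    unk_id = stoi[unk]
    encoded, unseen = [], set()
    for tokens in new_token_lists:
        encoded.append([stoi.get(tok, unk_id) for tok in tokens])
        unseen.update(tok for tok in tokens if tok not in stoi)
    return encoded, sorted(unseen)

# Toy training vocabulary (hypothetical) and a tokenized new library.
train_itos = ["xxunk", "xxpad", "C", "O", "N"]
new_library = [["C", "O"], ["C", "Si"]]
encoded, unseen = encode_with_train_vocab(new_library, train_itos)
```

If `unseen` is non-empty for an independent library, those molecules contain tokens the model has no learned representation for, which is worth checking before trusting the predictions.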

`KeyError: '0.encoder.weight'` when running classification task

Good morning, and thanks for the interesting work!
Could you please explain in more detail how a pretrained network can be used for a classification task?

As far as I understand, a user should download at least the ChEMBL_1M_atom models folder and put it into a parent folder inside the project. Then notebook 05_Pretrained_Models.ipynb should be run in full. After that, the user can skip the fine-tuning procedure in notebook 02_MSPM_TS_finetuning.ipynb and go straight to notebook 03_QSAR_Classifcation.ipynb.

I am having trouble running the code in notebook 03_QSAR_Classifcation.ipynb.
Specifically, I keep getting the following error when running
lm_learner = lm_learner.load_pretrained(*fnames):
KeyError: '0.encoder.weight'.

The error persists even when I run the code as-is, without any modification, on a fresh clone in Google Colab.
I only installed RDKit and FastAI (v. 1.0.61) to make the code run.

Thank you in advance for your time!
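A possible first diagnostic for this kind of error: the KeyError means the loaded weights dictionary has no entry named `0.encoder.weight`, so it helps to check where, if anywhere, that key lives in the checkpoint. The sketch below uses plain dictionaries as stand-ins. The helper `locate_key` and the example layouts are hypothetical, and in practice the dictionary would come from loading the weights file with PyTorch.

```python
def locate_key(checkpoint, key, _path=()):
    """Return the tuple of nested dict keys leading to the dict that
    contains `key`, or None if the key is not found anywhere."""
    if not isinstance(checkpoint, dict):
        return None
    if key in checkpoint:
        return _path
    for name, value in checkpoint.items():
        found = locate_key(value, key, _path + (name,))
        if found is not None:
            return found
    return None

# Hypothetical checkpoint layouts, with lists standing in for tensors.
flat = {"0.encoder.weight": [0.1, 0.2]}
wrapped = {"model": {"0.encoder.weight": [0.1, 0.2]}, "opt": {}}
renamed = {"encoder.weight": [0.1, 0.2]}

where_flat = locate_key(flat, "0.encoder.weight")        # top level
where_wrapped = locate_key(wrapped, "0.encoder.weight")  # nested one level down
where_renamed = locate_key(renamed, "0.encoder.weight")  # key absent
```

If the key turns out to be nested or named differently, that points to a mismatch between how the weights file was saved and what the loading code expects, for example a different library version or a different file than intended.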

conda env dependency versioning

Hi,

Could you please push a conda .yml file with pinned versions? I'm trying to reproduce your work but keep running into versioning issues, primarily with fastai (which seems to have undergone major refactoring over the last year or so).

Can you share a notebook showing how to use the ChEMBL_1M Atom model

Hi, thanks for sharing the code and workflow. Could you share how to use the ChEMBL_1M Atom model on a regression task to predict new molecules? I am getting a vocab-related error on my end with the BBBP data: it has a vocabulary of 48 tokens, while the ChEMBL model has 80. I would appreciate your feedback.

Sincerely
Abhik
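For readers hitting the same vocabulary-size mismatch: one common way to reuse pretrained embeddings with a different vocabulary is to build a new embedding matrix sized to the new vocabulary, copy the rows of tokens both vocabularies share, and fill the rest with the mean pretrained row. The sketch below shows that idea with plain lists. It is an illustration under assumptions, not the repository's code: `remap_embeddings` and the toy data are hypothetical, and whether the installed fastai version does an equivalent remapping when given both the weights file and the pretrained vocabulary file should be checked.

```python
def remap_embeddings(old_rows, old_itos, new_itos):
    """Build embedding rows for `new_itos` from pretrained rows.

    Tokens present in the pretrained vocabulary keep their row; tokens
    that are new get the mean of all pretrained rows.
    """
    old_stoi = {tok: i for i, tok in enumerate(old_itos)}
    dim = len(old_rows[0])
    mean_row = [sum(row[d] for row in old_rows) / len(old_rows)
                for d in range(dim)]
    return [
        list(old_rows[old_stoi[tok]]) if tok in old_stoi else list(mean_row)
        for tok in new_itos
    ]

# Toy pretrained vocabulary with 2-dimensional embeddings (hypothetical).
old_itos = ["xxunk", "C", "O"]
old_rows = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]

# The downstream dataset has a different vocabulary, in a different order.
new_itos = ["xxunk", "O", "Si"]
new_rows = remap_embeddings(old_rows, old_itos, new_itos)
```

With this approach the two vocabularies do not need to be the same size; only the embedding dimension has to match the pretrained model.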
