dar19 / mlsf-protocol

This project forked from vktrannguyen/mlsf-protocol

License: MIT License

Shell 4.41% Python 29.71% Awk 0.17% Jupyter Notebook 65.72%

MLSF-protocol

[Figure: protocol workflow]

This repository contains the code (bash scripts, Python scripts, Jupyter notebooks) and the input and output files for our Nature Protocols paper:

Tran-Nguyen, V. K., Junaid, M., Simeon, S. & Ballester, P. J. A practical guide to machine-learning scoring for structure-based virtual screening. Nat. Protoc. (2023)

Here we provide examples for three targets: ACHE (acetylcholinesterase), HMGR (HMG-CoA reductase), and PPARA (peroxisome proliferator-activated receptor alpha). The input and output files for a target are found in the corresponding folder.

Inside each of the three folders ACHE, HMGR, PPARA, you will find the following sub-folders:

  • DEKOIS2.0 sub-folder: input and output files for Section A of this protocol.
  • Own_data sub-folder: input and output files for Sections B, C, D of this protocol. This sub-folder contains:
    • SMILES sub-folder: SMILES strings of the users' own true actives, true inactives and decoys.
    • ChemAxon sub-folder: raw data for Figure 4 in the manuscript and Figures S1-S2 in Supporting Information.
    • MLSF_PETS sub-folder: input and output files related to training-test partitions obtained from the "Pre-Existing Test Set" (PETS) option.
    • MLSF_OTS sub-folder: input and output files related to training-test partitions obtained from the "Own Test Set" (OTS) option.
    • data.xlsx: the master Excel file where all necessary information on the users' own true actives and true inactives is stored.
  • D-test-sets sub-folder: files related to the "Dissimilar" (D) test sets of each target's PETS and OTS (Tables 2, 3).

The code is found in the Protocol_Code folder:

  • bash-commands.zip: zip file containing all bash commands for the compilation of scores from Smina, CNN-Score, RF-Score-VS, IFP (Steps 14, 21, 30, 41).
  • Precision-Recall-curve.ipynb: Jupyter notebook for plotting the precision-recall (PR) curve.
  • EF1-NEF1.sh: bash script for computing the EF1% and NEF1% values.
  • vlookup.awk and potency.sh: necessary files for extracting the potency of true hits among the top 1%-ranked molecules.
  • Morgan-fp-simil.ipynb: Jupyter notebook for computing the similarity of Morgan fingerprints.
  • Compound-clustering_Morgan-fp.ipynb: Jupyter notebook for clustering molecules based on their Morgan fingerprint similarity.
  • Remove_AVE_Python3.py: Python code for splitting a data set into four subsets (training actives, training inactives, test actives, and test inactives) in an unbiased manner.
  • MLSFs.ipynb: Jupyter notebook for training and evaluating target-specific SFs.
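
As an illustration of the metrics that EF1-NEF1.sh reports, here is a minimal Python sketch. It is not the authors' script. It assumes the usual definitions: EF1% is the hit rate among the top 1%-ranked molecules divided by the hit rate in the whole test set, and NEF1% is EF1% divided by the maximum EF1% achievable on that test set. The function name `ef_nef` and the choice to round the size of the top 1% up to a whole molecule are assumptions made for this example.

```python
def ef_nef(ranked_labels, top_percent=1):
    """Return (EF, NEF) at the top `top_percent` % of a ranked hit list.

    ranked_labels: booleans ordered from best to worst score,
    True for a true active, False for an inactive or decoy.
    """
    labels = list(ranked_labels)
    n_total = len(labels)
    n_actives = sum(labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need at least one molecule and one true active")

    # Size of the top fraction, rounded up (assumption: at least one molecule).
    n_top = max(1, (n_total * top_percent + 99) // 100)
    hits_top = sum(labels[:n_top])

    base_rate = n_actives / n_total                # hit rate in the whole set
    ef = (hits_top / n_top) / base_rate            # enrichment factor
    ef_max = (min(n_top, n_actives) / n_top) / base_rate
    return ef, ef / ef_max                         # NEF lies between 0 and 1


if __name__ == "__main__":
    # Hypothetical test set: 200 molecules, 10 actives, 1 active in the top 2.
    example = [True, False] + [True] * 9 + [False] * 189
    print(ef_nef(example))
```

Taking a plain list of booleans keeps the sketch independent of any particular hit-list layout; in practice the labels would come from the "Real_Class" column of a hit list sorted by predicted score.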

ATTENTION: if you use the scripts EF1-NEF1.sh and potency.sh to process the csv hit lists issued by the MLSFs.ipynb Jupyter notebook (in Section D):

  • You must remove the "Predicted_Class" column (while keeping the "Real_Class" column as is) from the hit lists beforehand.
  • The EF1-NEF1.sh code must be slightly modified as follows (you can either edit it using the vi/vim command or open it in a text editor, e.g. Notepad++, and save it after modifying):
#Count the number of true active molecules (true hits) in the whole test set:
A=$(( $(grep -c 'Active' "$hitlistname") - 1 ))
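
The first requirement above, removing the "Predicted_Class" column, can be done by hand in a spreadsheet editor, but a small script may be more convenient when there are many hit lists. The following is a minimal sketch using only the Python standard library. The function name `drop_column` and the file names are hypothetical; only the column names "Predicted_Class" and "Real_Class" come from this README.

```python
import csv


def drop_column(src_path, dst_path, column="Predicted_Class"):
    """Copy a CSV hit list to dst_path, leaving out one named column."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []
        if column not in fieldnames:
            raise KeyError(f"column {column!r} not found in {src_path}")
        kept = [name for name in fieldnames if name != column]
        # Unix line endings, so that bash/awk scripts can read the result.
        writer = csv.DictWriter(dst, fieldnames=kept,
                                extrasaction="ignore", lineterminator="\n")
        writer.writeheader()
        for row in reader:
            writer.writerow(row)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    drop_column("hitlist.csv", "hitlist_no-predicted-class.csv")
```

Writing to a new file rather than editing in place keeps the original notebook output available if something goes wrong.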

All Supporting Information files are found in the Supporting_Information folder:

  • Supporting-Information_MLSFs-SBVS.docx: Tables S1-S5, Figures S1, S2 and other supplementary information as indicated in the manuscript.
  • Supporting-Information_MLSFs-SBVS_DEKOIS-retrieved-actives.xlsx: raw data for Figure 3 in the manuscript.
  • Supporting-Information_MLSFs-SBVS_10-runs.xlsx: virtual screening performance of five learning algorithms across 10 training-test runs on five training-test partitions (full partitions and "Dissimilar" test sets), raw data for Figures 5A, 5C in the manuscript.

Two environments have to be set up in order to run the code of this protocol: DeepCoy-env and protocol-env. The DeepCoy-env environment can be installed by following the DeepCoy authors' instructions (https://github.com/fimrie/DeepCoy). The protocol-env environment can be set up from the protocol-env.yml file provided here, as follows:

conda env create -f protocol-env.yml
conda activate protocol-env

Part of the code for Step 60 (retrieving SMILES strings from PubChem) is provided here, so that you can copy it rather than retype it:

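ipython
# The lines below are typed inside the IPython session.
import pandas as pd
import pubchempy as pcp

# df is not defined in this excerpt: it is assumed to be the pandas DataFrame
# prepared earlier in Step 60, with a 'cid' column of PubChem compound IDs.
def get_smiles(input_cid):
    # Query PubChem for one compound ID and return its canonical SMILES string.
    mol = pcp.Compound.from_cid(int(input_cid))
    return mol.canonical_smiles

# One PubChem request per row; the results are stored in a new 'smiles' column.
df['smiles'] = [get_smiles(cid) for cid in df['cid']]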

The Anaconda installer v4.13.0 is needed to set up these environments. Installation instructions for Anaconda can be found at https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html.

Several parts of the code/scripts/Jupyter notebooks used in this protocol were developed from the original code accessible in the following GitHub repositories:

Programming languages used in this protocol: Python v3.7, Jupyter notebook, Bash.

All other necessary information is included in our protocol.

For further queries, please contact Dr. Viet-Khoa Tran-Nguyen or Dr. Pedro J. Ballester.
