
chembiohtp / enzyhtp

EnzyHTP is a python library that automates the complete life-cycle of enzyme modeling

Home Page: https://enzyhtp-doc.readthedocs.io

License: Other

Python 99.56% Shell 0.43% Rust 0.01%

enzyhtp's People

Contributors

chrisjurich, kleinesmesser, mtremblay0817, seb124, shaoqx, so-dopamine, swordjack, zjygrp

Forkers

rufus-willy

enzyhtp's Issues

A bug in the slurm system on ACCRE

From accre admin

In Slurm we have a Lua script that checks the balance between the cores/memory and the GPU cards a job requests (a user might request most of the cores and memory but only use one GPU card, leaving the other three cards completely useless). This script may have a problem with the option --mem-per-gpu. Can you use the normal option --mem instead? That may solve the issue.

Issues concerning PDB input

In general, PDB2PQR, embedded in the get_protonation() function, can process raw PDB files, which will remove all redundant information, add missing atoms, and determine the protonation status at the same time. However, some cases will cause problems.

Case #1: Biological assembly is not equivalent to the asymmetric unit. (To find out what "biological assembly" and "asymmetric unit" are, go to http://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/biological-assemblies.) For example, 1PKX is supposed to have two chains whereas the file has four chains. Chain A and B are one biological assembly and Chain C and D are another biological assembly. Proposed solution: Download biological assemblies instead of the original PDB file as the initial input file.
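
A minimal sketch of the proposed solution, fetching a biological assembly instead of the asymmetric unit. The function names are hypothetical, and the URL pattern (assembly N of an entry served as <ID>.pdb<N> from files.rcsb.org) is an assumption that should be checked against the current RCSB file-download documentation before relying on it:

```python
import os
from urllib.request import urlretrieve

# Assumption: RCSB serves biological assembly N of an entry as <ID>.pdb<N>
# under this base URL. Verify against the RCSB file download documentation.
RCSB_DOWNLOAD = "https://files.rcsb.org/download"


def assembly_url(pdb_id: str, assembly: int = 1) -> str:
    """Build the (assumed) download URL for one biological assembly."""
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"not a 4-character PDB ID: {pdb_id!r}")
    if assembly < 1:
        raise ValueError("assembly numbers start at 1")
    return f"{RCSB_DOWNLOAD}/{pdb_id.upper()}.pdb{assembly}"


def fetch_assembly(pdb_id: str, assembly: int = 1, out_dir: str = ".") -> str:
    """Download the assembly file and return the local path (needs network)."""
    out_path = os.path.join(out_dir, f"{pdb_id.upper()}_assembly{assembly}.pdb")
    urlretrieve(assembly_url(pdb_id, assembly), out_path)
    return out_path
```

For 1PKX, fetch_assembly("1PKX", 1) and fetch_assembly("1PKX", 2) would then give the two assemblies as separate input files, if the assumed URL pattern holds.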

Case #2: PDB2PQR sometimes deletes the terminal residues (1Q17 as an example). This matters especially at the N-terminus: if the original terminal residue is deleted, PDB2PQR does not convert the next residue into a new N-terminal residue. The N-terminal -NH3 hydrogens are named H1, H2, and H3, whereas the O=C-N-H hydrogen of a mid-chain residue is simply named H. This mismatch causes a fatal error in LEAP (in this case the terminal residue is a GLY):
"FATAL: Atom .R<NGLY 239>.A<H 10> does not have a type."
To resolve this issue, one should first figure out why PDB2PQR deletes the terminal residues. (It may be related to the residue numbering: the 1Q17 chains start indexing residues from negative numbers. However, PDB2PQR does not delete all residues with non-positive numbering.)

Case #3: Multiple residues at one residue site. For example res240 in 1E25: LYS/ALA/GLY/LYS. PDB2PQR will merge them as one "big residue" which will cause problems later in Leap.
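
Sites like this could be flagged before the file reaches PDB2PQR. The sketch below is not part of EnzyHTP; the function name is hypothetical. It relies only on the fixed-column layout of ATOM/HETATM records (residue name in columns 18-20, chain ID in column 22, residue number in columns 23-26, insertion code in column 27):

```python
from collections import defaultdict


def find_mixed_residue_sites(pdb_text: str) -> dict:
    """Return {(chain, resseq, icode): [residue names]} for every site
    that carries more than one residue name."""
    names = defaultdict(set)
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 27:
            continue
        res_name = line[17:20].strip()   # columns 18-20
        chain_id = line[21]              # column 22
        res_seq = int(line[22:26])       # columns 23-26
        i_code = line[26].strip()        # column 27
        names[(chain_id, res_seq, i_code)].add(res_name)
    return {site: sorted(found) for site, found in names.items() if len(found) > 1}
```

A non-empty result means the input needs a decision (for example, keeping one alternative) before protonation.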

Case #4: Original PDB files cause get_protonation() to crash directly. Haven't got a chance to look into the exact reasons. Examples: 1K20, 1NI9, 1WN1.
For example, the 1k20.pdb will give these error messages:

Traceback (most recent call last):
  File "0_MAIN.py", line 21, in <module>
    PDB1.get_protonation() #PDB2PQR has different atom order as leap
  File "/gpfs23/scratch/jiany37/Partial_Charge/trial_3/Class_PDB.py", line 478, in get_protonation
    self._protonation_Fix(out_path, ph=ph)
  File "/gpfs23/scratch/jiany37/Partial_Charge/trial_3/Class_PDB.py", line 522, in _protonation_Fix
    new_stru.protonation_metal_fix(Fix = 1)
  File "/gpfs23/scratch/jiany37/Partial_Charge/trial_3/Class_Structure.py", line 540, in protonation_metal_fix
    metal.get_donor_residue(method = 'INC')
  File "/gpfs23/scratch/jiany37/Partial_Charge/trial_3/Class_Structure.py", line 1598, in get_donor_residue
    self.get_donor_atom(method=method)
  File "/gpfs23/scratch/jiany37/Partial_Charge/trial_3/Class_Structure.py", line 1588, in get_donor_atom
    if dist <= (R_d + R_m):
TypeError: unsupported operand type(s) for +: 'float' and 'NoneType'
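
The TypeError says the right-hand operand of R_d + R_m was None, so one of the radii was never looked up successfully. A guard along the following lines would turn the crash into a clear message. This is only a sketch: the function name and arguments are hypothetical, and it does not reproduce the actual get_donor_atom() code.

```python
def is_donor_contact(dist: float, r_donor, r_metal) -> bool:
    """Return True if the donor atom is within bonding distance of the metal.

    Raises a descriptive error instead of a TypeError when a radius is
    missing (for example, an ion that is not in the radius table).
    """
    if r_donor is None or r_metal is None:
        raise ValueError(
            "missing radius: donor=%r, metal=%r (unsupported element?)"
            % (r_donor, r_metal)
        )
    return dist <= (r_donor + r_metal)
```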

Common questions for some specific functions

  1. How does EnzyHTP generate the force field files for ligands?
  2. How does EnzyHTP determine the output order of the electric field and dipole of each snapshot?
  3. If the Analysis (QM cluster, etc.) is conducted separately from previous MD simulations, what additional parameters should be included in the python code?

Gather designs in different dir_gen.py

There are multiple versions of dir_gen.py & plot-E-BD-dist.py under /Test_files, corresponding to different HTP workflows and directory designs. These should be unified into a module in the new EnzyHTP.

PDB2FF ligand test

  1. If there are no ligands in the PDB file, a leap error will occur (1GBG as an example):

*** Error: tl_getline(): not interactive, use stdio.

  2. An error occurs when using PDB2FF (3WIW):

Traceback (most recent call last):
  File "0_MutaGen_main.py", line 34, in <module>
    main()
  File "0_MutaGen_main.py", line 14, in main
    PDB1.PDB2FF('ff_prep_temp/')
  File "/gpfs23/scratch/jiany37/HTP_trial/03022021_enzyme_workflow-develop_qz/Class_PDB.py", line 845, in PDB2FF
    ligands_pathNchrg = self.stru.build_ligands(lig_dir, ifcharge=1)
  File "/gpfs23/scratch/jiany37/HTP_trial/03022021_enzyme_workflow-develop_qz/Class_Structure.py", line 512, in build_ligands
    net_charge = lig.get_net_charge(method=c_method, ph=ph)
  File "/gpfs23/scratch/jiany37/HTP_trial/03022021_enzyme_workflow-develop_qz/Class_Structure.py", line 1855, in get_net_charge
    mol = next(pybel.readfile('pdb', temp_pdb3_path))
  File "/home/jiany37/Software_installed/openbabel-3-1-1_installed/lib/python3.8/site-packages/openbabel/pybel.py", line 161, in readfile
    raise IOError("No such file: '%s'" % filename)
OSError: No such file: './cache/ligand_temp3.pdb'

  3. 3WIW ligand EPE: the error occurs before the ff can be generated. But pdb2pqr adds one H atom to each N atom and each O(=S) atom, which is obviously not reasonable.

  4. Another issue with PDB2PQR: it cannot process selenocysteine correctly, and the output PDB will skip this residue. Example: 3F3K

  5. Ions that get_protonation does not support: Mn(II), Co(III), Fe(III) (5HIO)

  6. In the leap tip3p ff, the transition metal ion Co is only available as +2. In the future, we need to figure out a way to determine the charge of such ions, because the coordinate section of a PDB file does not carry this information. Most common example: Fe.

Decouple structure module with PDB I/O

Structure class should serve solely as the core data structure of EnzyHTP and should only handle accessing and editing of data. Other parts of functions that use Structure should be binding modules. (e.g.: PDB I/O or structure selection)

sbatch time-out issue

The current version successfully handles the sacct and squeue time-out issue. However, sbatch may also time out, and this has not been solved.

rm_wat method removes all the water molecules #water

The function removes all the water molecules. This could cause problems because some water molecules are conserved across different crystal structures and are essential for the enzymatic process. One option is to remove the waters for docking and add them back after the substrate has been docked.
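
One way to read the remove-then-restore idea is sketched below on raw PDB lines. This is not the rm_wat implementation; the function names and the set of water residue names are assumptions, and restored waters may still clash with the docked substrate and would need a separate check.

```python
WATER_NAMES = {"HOH", "WAT"}  # assumed residue names for water


def split_waters(pdb_lines):
    """Separate water records from everything else, keeping both."""
    kept, waters = [], []
    for line in pdb_lines:
        is_water = (
            line.startswith(("ATOM", "HETATM"))
            and line[17:20].strip() in WATER_NAMES
        )
        (waters if is_water else kept).append(line)
    return kept, waters


def restore_waters(pdb_lines, waters):
    """Put the stored water records back, just before the END record."""
    body = [line for line in pdb_lines if line.strip() != "END"]
    tail = [line for line in pdb_lines if line.strip() == "END"]
    return body + waters + tail
```

The waters returned by split_waters() are held aside during docking and passed to restore_waters() together with the docked structure.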

Fix the bug reading ANISOU record

The current method of dealing with a missing chain ID is to detect residues that are not neighbors in terms of lines in the PDB file.

This works most of the time, since the only non-atom lines in a PDB file are usually TER and END, which fits this approach. However, many PDB files from the PDB database contain ANISOU records that separate every ATOM line, which I think will be the only exception. #76
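
A possible fix is to drop ANISOU records before looking for non-adjacent atom lines. The sketch below illustrates the effect; both function names are hypothetical, and count_atom_gaps() is only a stand-in for the parser's real neighbor check.

```python
def strip_anisou(pdb_lines):
    """Drop ANISOU records so they no longer separate ATOM lines."""
    return [line for line in pdb_lines if not line.startswith("ANISOU")]


def count_atom_gaps(pdb_lines):
    """Count places where non-atom lines sit between two atom lines."""
    gaps = 0
    seen_atom = False
    pending_gap = False
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            if seen_atom and pending_gap:
                gaps += 1
            seen_atom = True
            pending_gap = False
        elif seen_atom:
            pending_gap = True
    return gaps
```

With ANISOU records interleaved, every atom line looks like a break; after strip_anisou() only TER-style separators remain.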

Solvent Engineering

Background: Some hydrolases exhibit a special preference towards organic solvent. For lipase, an organic phase is even desirable because it activates most lipases by 10 to 100-fold. Also, the regioselectivity of the transesterification of a 2-octyl-1,4-dihydroxybenzene butyric acid ester reversed from favoring the 4-position in cyclohexane to the 1-position in acetonitrile (Rubio et al., 1991). Rubio et al. (1991) rationalized the reversal in selectivity according to differences in substrate solvation. Cyclohexane solvates the octyl group well, thus, the ester at the less hindered 4-position reacts. Acetonitrile solvates the octyl group poorly, thus the substrate binds in a manner that places the octyl group within the hydrophobic lipase active site. For this reason, the ester at the 1-position now reacts more rapidly.

Issue: The process of selecting organic solvent for enzyme catalysis is known as "solvent engineering". We might consider having a module in EnzyHTP to assess the stability and kinetics of enzyme variants in different solvent environments.

development need for EnzyTrun

Ranked by importance (low-ranking items can temporarily be replaced manually or removed):

  1. mutation function
  2. MD and Amber interface
  3. Rosetta ddG Interface
  4. Structural alignment function
  5. AlphaFold2 interface
  6. QM interface
  7. SCA interface

The topology error of structures

In rare cases, the old mutator generates a structure with a "topology error": for example, a ring from a TYR circling around the side chain of a TYR:
[image: wrong_mutation]

PyBel inconsistent output

Most of the time using the PyBel (Openbabel 3.1.0) api in enzy_htp to protonate ligand will give a file with "HETATM" lines. (e.g.: test/preparation/test_protonate.py::test_pybel_protonate_pdb_ligand_4CO)
But in the case of 4NEG, PyBel gives "ATOM" lines while parsing the ligand HEZ. (e.g.: test/preparation/test_protonate.py::test_fix_pybel_output)
There is no obvious reason PyBel treats these 2 ligand files differently.

This causes a bug: _fix_pybel_output cannot read the protonated ligand from PyBel in the current version of enzy_htp.

PDB2PQR issue when adding hydrogen atoms

The get_protonation function is affected by the residue name in the input PDB file. For the case I tested, this is specific for histidine. If the initial PDB file names every histidine as HIS, get_protonation will determine the protonation state and change HIS to HIE/HID/HIP accordingly. But if in the input PDB file histidine already has a name of either HIE/HID/HIP, pdb2pqr will only add hydrogen atoms according to their names.

So before running the get_protonation function, it is better to change every HIE/HID/HIP back to HIS just in case. This step can also be the first step of the get_protonation function.
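
The renaming step could look like the sketch below, which works on raw PDB lines. The function name is hypothetical, and the sketch only resets the residue name; hydrogens already present on a pre-protonated histidine would need to be stripped separately.

```python
HIS_VARIANTS = {"HIE", "HID", "HIP"}


def reset_histidine_names(pdb_lines):
    """Rename HIE/HID/HIP back to HIS (residue name is columns 18-20)."""
    out = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and line[17:20] in HIS_VARIANTS:
            line = line[:17] + "HIS" + line[20:]
        out.append(line)
    return out
```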

Add a check for mutation space

If the desired number of non-repeating mutants cannot be reached within the mutation space, this infinite loop shows up:
[image]
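
The check amounts to comparing the requested count with the number of distinct mutants before sampling. A minimal sketch, with a hypothetical function name and mutants represented as plain strings:

```python
import random


def sample_unique_mutants(mutation_space, n, seed=None):
    """Pick n distinct mutants, failing fast if the space is too small."""
    unique = list(dict.fromkeys(mutation_space))  # dedupe, keep order
    if n > len(unique):
        raise ValueError(
            f"requested {n} unique mutants but the mutation space "
            f"only contains {len(unique)}"
        )
    return random.Random(seed).sample(unique, n)
```

Sampling without replacement from the deduplicated space also removes the retry loop entirely, so there is nothing left that can spin forever.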

Atom naming consistency

Another thing I noticed is that the atom names of the new residue generated by PyMol do not follow the classical PDB naming style. I compared them and found that mainly the hydrogen names differ. Fixing this is a good lv. 3 task! I suggest we choose between the 2 plans below:

  1. Assume H naming is the only difference:
    Detect the inconsistency between atom names after the mutation (you can use the standard atom naming information from chemical.residue), change all the H names back (it is simply a matter of where to place the digit), and abort when more than just the H names differ.
  2. Difference in naming in general:
    Still detect the inconsistency between atom names after the mutation (this should be a general method of Structure or Atom).
    Write some one-time code to collect a naming map between our standard atom naming and PyMol's atom naming (this can be achieved by making PyMol generate each residue and recording its atom names), and store it in chemical.
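
Plan 1 could be sketched as follows. The function names are hypothetical, the standard name set would come from chemical.residue in practice, and the sketch assumes the non-standard form carries the digit at the front of the hydrogen name (for example 1HB versus HB1), which should be confirmed against actual PyMol output:

```python
def normalize_h_name(name: str) -> str:
    """Move a leading digit of a hydrogen name to the end (1HB -> HB1)."""
    if len(name) > 1 and name[0].isdigit() and name[1] == "H":
        return name[1:] + name[0]
    return name


def reconcile_atom_names(names, standard_names):
    """Fix hydrogen names and abort if anything else is still non-standard."""
    fixed = [normalize_h_name(n) for n in names]
    unknown = [n for n in fixed if n not in standard_names]
    if unknown:
        raise ValueError(f"non-standard atom names remain: {unknown}")
    return fixed
```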

To-do list for functional group workflow and substrate database

  1. Locate the outside phosphate for NTP -> NDP + Pi and similar reaction.
  2. Locate the inside phosphate for NTP -> NMP + diPi and similar reaction.
  3. The dipole may be largely influenced by the protonation state: eg, Phosphate, Amidine
  4. The dipole may be largely influenced by the chirality: eg, saccharides
  5. A few beta-lactams have a pretty short C-N bond, which makes this amide bond hard to identify.
  6. The chirality of saccharides is really hard. Locate glycosidic bonds more accurately.
  7. Find a way to distinguish the C-O bonds in oxygen glycosidic bonds.
  8. There are some very surprising reactions: 3.11.1.1, 3.5.99.2, 3.7.1.18, 3.7.1.8
  9. Locating the reactive bond for Substrates of amidine hydrolase is hard.
  10. Locate the peptide cleavage position in protease/peptidases.
  11. In general, chirality is usually not reflected in the SMILES string. Many substrates have the wrong chirality.
  12. The high throughput filtering should be improved. There are cases where products are treated as substrates, a category name is treated as a specific substrate name (peptide, peptidyl, -ate for a series of esters). And some seemingly unrelated chemicals are treated as substrates.
  13. Some SMILES strings are not consistent with the substrate name.
  14. Some SMILES strings of peptides lead to chemicals like -NH-CA-[CO-CO]-CA-[NH-NH]-.

Structure parser chain ID issue

If the structure parser has to assign more than 9 new chain IDs, the 10th unique chain ID is 10, which takes two characters and shifts the rest of each line by one column. This causes a problem when reading the resulting PDB.

Example: PDB ID 7tpt. See the screenshot in the attached figure.

Suggested change: Assign chain ID as follows: A-Z, a-z, and 0-9. This way, a maximum of 62 unique 1-digit chain IDs can be generated. And if the input structure has more than 62 unique chains (very extreme case), the get_structure() function can assign an integer starting from 10. But because of the limitation of the PDB format, these chain IDs are not supposed to be printed out by the get_file_str() function.

Besides, the current get_file_str() function outputs chains in the order 0-9, A-Z, and a-z. A better order would consistently be A-Z, a-z, and 0-9. Later integer chain IDs should be ordered as integers rather than as strings.
[image: 7tpt_output_example]
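
The suggested assignment and ordering can be sketched as below. The names are hypothetical and this is not the current get_structure() or get_file_str() code; it only shows the A-Z, a-z, 0-9 sequence followed by integers from 10, plus a sort key that orders integer IDs numerically.

```python
import string
from itertools import count, islice

# 26 + 26 + 10 = 62 single-character chain IDs, in the suggested order.
SINGLE_CHAR_IDS = string.ascii_uppercase + string.ascii_lowercase + string.digits


def chain_id_stream():
    """Yield A-Z, a-z, 0-9, then the integers 10, 11, 12, ..."""
    yield from SINGLE_CHAR_IDS
    yield from count(10)


def first_chain_ids(n):
    """Return the first n chain IDs from the stream."""
    return list(islice(chain_id_stream(), n))


def chain_sort_key(chain_id):
    """Sort single-character IDs first (in stream order), then integers."""
    if isinstance(chain_id, int):
        return (1, chain_id)
    return (0, SINGLE_CHAR_IDS.index(chain_id))
```

Integer IDs are kept as int rather than str, so they sort as 10, 11, ..., 100 and can be recognized and skipped when writing fixed-width PDB output.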

Ligand adding H issue and new unsupported ion record

Bad ligand protonation:

  • SO4^2- is added with two H atoms, making it a neutral molecule when pH=7.
  • EPE in 3WIW
  • U12 in 2GG2. This time it is not only a protonation-state problem: N2 in U12 has 5 covalent bonds ...
  • GLV in 4PXE. Oxoacetic acid becomes hydroxyacetic acid. This could be a tricky one because if there are no H atoms, it is hard for the algorithm to distinguish whether C-O is a double bond or a single bond.
  • SAM in 1rjd. The S+ atom ends up with a valence of 4.
  • FAD in 2YG4
  • TPO in 5TJ3

New ion:

  • MN (Mn+2) in 1K20.

Name of co-crystal molecules:

  • GOL in 1K20
  • EDO in 2VC7

Structure diversity checker

We need a checker that tells us the diversity features of a structure. Such features would be:

  • Contain NCAA?
  • Contain metal?
  • Contain transition metal?
  • Contain missing loop?
  • Contain missing heavyatom?
  • Contain water?
  • Contain ligand?
  • etc...
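
A few of these features can be read directly from HETATM residue names, as in the sketch below. Everything here is an assumption for illustration: the class and function names are hypothetical, the residue-name sets are deliberately incomplete, and NCAA, missing-loop, and missing-heavy-atom checks are left out because they need sequence and template information. Note that a modified residue stored as HETATM would be reported as a ligand by this simple version.

```python
from dataclasses import dataclass

WATER_NAMES = {"HOH", "WAT"}
# Illustrative, incomplete sets of mono-atomic ion residue names.
TRANSITION_METAL_NAMES = {"MN", "FE", "CO", "NI", "CU"}
METAL_NAMES = TRANSITION_METAL_NAMES | {"NA", "K", "MG", "CA", "ZN"}


@dataclass
class DiversityReport:
    has_water: bool
    has_metal: bool
    has_transition_metal: bool
    has_ligand: bool


def check_diversity(pdb_lines) -> DiversityReport:
    """Summarize a structure from the residue names of its HETATM records."""
    het_names = {
        line[17:20].strip() for line in pdb_lines if line.startswith("HETATM")
    }
    return DiversityReport(
        has_water=bool(het_names & WATER_NAMES),
        has_metal=bool(het_names & METAL_NAMES),
        has_transition_metal=bool(het_names & TRANSITION_METAL_NAMES),
        has_ligand=bool(het_names - METAL_NAMES - WATER_NAMES),
    )
```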

Realize Job Queue Management

The ideal way to run various external software by EnzyHTP on a cluster and maximize the use of the resources is:

  • have EnzyHTP main script run as a job manager for heavy jobs on 1 CPU for a longer time
  • submit GPU/CPU job within EnzyHTP.

Support for artificial residues

We originally had a hot fix for this in the branch shaoqz/tmp_art_resi.
Look at including the CONECT lines from the original PDB for cases like lasso (this works for disulfide bonds).

Check current Amber interface

Check whether the interface supports different nmropt settings for the multiple minimizations called in different parts of the workflow, e.g. PDBMin vs. PDBMD.

Protonation function input error

5FLM.pdb is an RNA polymerase structure that contains the nucleotide U. The PDB structure parser recognizes the U as a metal ion in the ligand:

import enzy_htp.structure as struct
import enzy_htp.structure.structure_operation as stru_oper
from enzy_htp.preparation import protonate as prot

sp = struct.PDBParser()

def test_protonate(test_pdb:str):
    # parse the PDB file, then protonate the structure (ligands included) at pH 7.0
    stru = sp.get_structure(test_pdb)
    prot.protonate_stru(stru, 7.0, protonate_ligand=True)


def args():
    pass

def main():
    pdb_list = ["5FLM.pdb", "1F6G.pdb"]
    for pdbs in pdb_list:
        test_protonate(f"../{pdbs}")
        print(f"finish {pdbs}")  # f-string, so the file name is actually printed

if __name__ == "__main__":
    main()

Existing Amber nc files mess up the trajectory result

Pre-existing .rst files will mess up the MD run if the structure is different. The .out file will contain:
ERROR: natom mismatch in inpcrd/restrt and prmtop files!
In old EnzyHTP, this bug surfaces much later when running over files from an incomplete run with a different mutation:

Traceback (most recent call last):
  File "/gpfs52/data/yang_lab/shaoqz/KE-DE/R5/group_2_1/KE-metrics.py", line 130, in <module>
    main()
  File "/gpfs52/data/yang_lab/shaoqz/KE-DE/R5/group_2_1/KE-metrics.py", line 117, in main
    Dipoles = PDB.get_bond_dipole(pdb_obj.qm_cluster_fchk, a1qm, a2qm)
  File "/home/shaoq1/bin/EnzyHTP/Class_PDB.py", line 2837, in get_bond_dipole
    if 'Sum' in line:
Exception: Cannot find bond:25-26

This is caused by using an existing .mdcrd file from the last run with a different mutation.
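
One defensive option is to look for leftover MD files in the working directory before a new run starts. The sketch below is not EnzyHTP code; the function names are hypothetical and the suffix list is an assumption based on the file types mentioned in this issue.

```python
from pathlib import Path

# Assumed suffixes of MD outputs that must not be reused across mutants.
STALE_SUFFIXES = {".rst", ".nc", ".mdcrd"}


def find_stale_md_files(work_dir):
    """List leftover restart/trajectory files in work_dir (non-recursive)."""
    return sorted(
        p for p in Path(work_dir).iterdir()
        if p.is_file() and p.suffix in STALE_SUFFIXES
    )


def clear_stale_md_files(work_dir):
    """Delete the leftover files and return the paths that were removed."""
    removed = find_stale_md_files(work_dir)
    for path in removed:
        path.unlink()
    return removed
```

Whether to delete, move, or simply refuse to start is a design choice; find_stale_md_files() alone is enough to fail early with a clear message.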

Cavity Engineering for Reactive Docking

Following the work of EnzyHTP, one direction is to develop a module that allows reactive docking of a substrate into the cavity of an enzyme mutant.
Background: Expanding the substrate scope of synthetically useful enzymes inevitably involves engineering the size of the active site cavity. Following the work on reactive docking (the computational protocol that docks different substrates to a wild-type enzyme in a pre-reaction complex form), we should build a module that allows docking of a substrate to an enzyme mutant whose wild-type counterpart the substrate does not fit because of a size mismatch. The module, or the "cut-to-fit" strategy, may involve two functions:

  1. Remove the substrate substituents and attempt the docking; if it succeeds, project the removed part onto the enzyme cavity to see which residues should be mutated to fit the original substrate.
  2. Conduct mutations and dock the original substrate.
