chembiohtp / enzyhtp

EnzyHTP is a python library that automates the complete life-cycle of enzyme modeling

Home Page: https://enzyhtp-doc.readthedocs.io

License: Other

Python 99.56% Shell 0.43% Rust 0.01%

enzyhtp's Issues

A bug in the slurm system on ACCRE

From the ACCRE admin:

In Slurm we have a Lua script that checks the balance between cores/memory and GPU cards (in case a user requests most of the cores and memory but only uses one GPU card, leaving the other three cards completely useless). This script may have a problem with the option --mem-per-gpu. Can you use the normal option --mem instead? That may solve the issue.

Address overwriting situations of add/insert children objects of Structure class

Both demands exist.

Issues concerning PDB input

In general, PDB2PQR, embedded in the get_protonation() function, can process raw PDB files, which will remove all redundant information, add missing atoms, and determine the protonation status at the same time. However, some cases will cause problems.

Case #1: Biological assembly is not equivalent to the asymmetric unit. (To find out what "biological assembly" and "asymmetric unit" are, go to http://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/biological-assemblies.) For example, 1PKX is supposed to have two chains whereas the file has four chains. Chain A and B are one biological assembly and Chain C and D are another biological assembly. Proposed solution: Download biological assemblies instead of the original PDB file as the initial input file.

Case #2: PDB2PQR sometimes deletes the terminal residues (1Q17 as an example). Especially at the N terminal, if the original terminal residue is deleted, PDB2PQR will not change the next one into a new N terminal residue. N terminal -NH3 hydrogens have the name of H1, H2, H3 whereas middle residue O=C-N-H hydrogen simply has the name of H, causing LEAP to have a fatal error as (in this case the terminal is a GLY):
"FATAL: Atom .R<NGLY 239>.A<H 10> does not have a type."
To resolve this issue, one should first figure out why PDB2PQR deletes the terminal residues. (Maybe related to the residue numbering? 1Q17 chains start indexing the residue from negative numbers. But PDB2PQR does not delete all residues with non-positive numbering.)

Case #3: Multiple residues at one residue site. For example res240 in 1E25: LYS/ALA/GLY/LYS. PDB2PQR will merge them as one "big residue" which will cause problems later in Leap.

Case #4: Original PDB files cause get_protonation() to crash directly. Haven't got a chance to look into the exact reasons. Examples: 1K20, 1NI9, 1WN1.
For example, the 1k20.pdb will give these error messages:

Traceback (most recent call last):
  File "0_MAIN.py", line 21, in <module>
    PDB1.get_protonation() #PDB2PQR has different atom order as leap
  File "/gpfs23/scratch/jiany37/Partial_Charge/trial_3/Class_PDB.py", line 478, in get_protonation
    self._protonation_Fix(out_path, ph=ph)
  File "/gpfs23/scratch/jiany37/Partial_Charge/trial_3/Class_PDB.py", line 522, in _protonation_Fix
    new_stru.protonation_metal_fix(Fix = 1)
  File "/gpfs23/scratch/jiany37/Partial_Charge/trial_3/Class_Structure.py", line 540, in protonation_metal_fix
    metal.get_donor_residue(method = 'INC')
  File "/gpfs23/scratch/jiany37/Partial_Charge/trial_3/Class_Structure.py", line 1598, in get_donor_residue
    self.get_donor_atom(method=method)
  File "/gpfs23/scratch/jiany37/Partial_Charge/trial_3/Class_Structure.py", line 1588, in get_donor_atom
    if dist <= (R_d + R_m):
TypeError: unsupported operand type(s) for +: 'float' and 'NoneType'

Capability of multi-chain system

The method of generating the MutaFlag only works on the first chain.

Address missing parameter from antechamber

In the Amber interface we should cover the case where antechamber cannot find a good parameter and gives ATTN: NEEDS REVISION. Raise an error for this.
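A minimal sketch of such a check, assuming the flag shows up as a line containing ATTN in the generated parameter file. The exact wording in real output may differ from the text quoted above, so the sketch matches on ATTN only. The names `MissingParameterError`, `find_flagged_parameters`, and `check_frcmod` are hypothetical and not part of EnzyHTP.

```python
from pathlib import Path


class MissingParameterError(Exception):
    """Raised when a parameter file contains entries flagged for revision."""


def find_flagged_parameters(frcmod_text):
    """Return every line of the parameter file text that carries an ATTN flag."""
    return [
        line.rstrip()
        for line in frcmod_text.splitlines()
        if "ATTN" in line.upper()
    ]


def check_frcmod(path):
    """Raise MissingParameterError if the file at `path` has any flagged line."""
    flagged = find_flagged_parameters(Path(path).read_text())
    if flagged:
        raise MissingParameterError(
            f"{len(flagged)} parameter(s) need revision in {path}:\n"
            + "\n".join(flagged)
        )
```

Calling `check_frcmod` right after the parameter file is written would stop the workflow before leap is run with placeholder parameters.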

Support user assigned ligand dir in PDB2FF

Common questions for some specific functions

How does EnzyHTP generate the force field files for ligands?
How does EnzyHTP determine the output order of the electric field and dipole of each snapshot?
If the Analysis (QM cluster, etc.) is conducted separately from previous MD simulations, what additional parameters should be included in the python code?

Gather designs in different dir_gen.py

There are multiple versions of dir_gen.py & plot-E-BD-dist.py under /Test_files corresponding to different HTP workflows and dir designs. These should be unified into a module in the new EnzyHTP.

Fix the messed up atom order from PyMol

`retain_order` seems unnecessary. It makes the saved PDB look weird. You can try looking into the saved PDB.

Originally posted by @shaoqx in #116 (comment)

We need to find a way to deal with re-run

Finding a bug a day into a 4-day run is such a pain.

PDB2FF ligand test

If there are no ligands in the PDB file, a leap error will occur (1GBG as an example):

*** Error: tl_getline(): not interactive, use stdio.

Error occurs using PDB2FF (3WIW):

Traceback (most recent call last):
  File "0_MutaGen_main.py", line 34, in <module>
    main()
  File "0_MutaGen_main.py", line 14, in main
    PDB1.PDB2FF('ff_prep_temp/')
  File "/gpfs23/scratch/jiany37/HTP_trial/03022021_enzyme_workflow-develop_qz/Class_PDB.py", line 845, in PDB2FF
    ligands_pathNchrg = self.stru.build_ligands(lig_dir, ifcharge=1)
  File "/gpfs23/scratch/jiany37/HTP_trial/03022021_enzyme_workflow-develop_qz/Class_Structure.py", line 512, in build_ligands
    net_charge = lig.get_net_charge(method=c_method, ph=ph)
  File "/gpfs23/scratch/jiany37/HTP_trial/03022021_enzyme_workflow-develop_qz/Class_Structure.py", line 1855, in get_net_charge
    mol = next(pybel.readfile('pdb', temp_pdb3_path))
  File "/home/jiany37/Software_installed/openbabel-3-1-1_installed/lib/python3.8/site-packages/openbabel/pybel.py", line 161, in readfile
    raise IOError("No such file: '%s'" % filename)
OSError: No such file: './cache/ligand_temp3.pdb'

3WIW ligand EPE. The error occurs before ff can be generated. But pdb2pqr adds one H atom to each N and O(=S) atoms, which is obviously not reasonable.
Another issue with PDB2PQR: it cannot process selenocysteine correctly and the output PDB will skip this residue. Example: 3F3K
Ions that get_protonation does not support: Mn(II), Co(III), Fe(III) (5HIO)
The transition metal ion Co only has the +2 state in the leap tip3p ff. In the future, we need to figure out a way to determine the charge of these ions because the coordinate section of PDB files does not carry this information. Most common example: Fe.

Make a testing dataset of different types of structures to prepare

Address dead methods in Structure class

Some methods in the code are not working. We should figure out which of them are fundamental to most future development and address those first.

Decouple structure module with PDB I/O

Structure class should serve solely as the core data structure of EnzyHTP and should only handle accessing and editing of data. Other parts of functions that use Structure should be binding modules. (e.g.: PDB I/O or structure selection)

sbatch time-out issue

The current version can successfully handle the sacct and squeue time out issue. But sbatch may also have the time out issue that has not been solved.
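One possible way to cover sbatch as well is a generic retry wrapper around the command call. The sketch below uses only the standard library; `run_with_retry` and its parameters are hypothetical names, not the existing EnzyHTP handling of sacct and squeue.

```python
import subprocess
import time


def run_with_retry(cmd, retries=3, timeout=60, wait=5):
    """Run `cmd` and return its stdout, retrying on timeout or non-zero exit."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout, check=True
            )
            return result.stdout
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError) as err:
            last_error = err
            if attempt < retries:
                time.sleep(wait)  # give the scheduler time to recover
    raise RuntimeError(f"command failed after {retries} attempts: {cmd}") from last_error
```

Unlike sacct and squeue, sbatch is not idempotent: a submission that timed out on the client side may still have been accepted, so a retry should probably first check the queue for the job before submitting again.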

Need to make file writing thread safe

This is the unfinished problem in commit 8e8bf81.
When running multiple mutations in parallel, we need to apply a thread lock while shared files are being written and wait until the write finishes.
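A minimal sketch of the locking idea, assuming the parallel mutations run as threads inside one Python process. The helper `append_line` and the module-level lock are hypothetical names for illustration.

```python
import threading

# One lock shared by every thread that writes to the shared files.
_write_lock = threading.Lock()


def append_line(path, line):
    """Append one line to `path`; only one thread can write at a time."""
    with _write_lock:
        with open(path, "a") as handle:
            handle.write(line + "\n")
```

A `threading.Lock` only protects threads of the same process. If the mutations run as separate processes or separate cluster jobs, a file-based lock would be needed instead.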

rm_wat method removes all the water molecules #water

The function removes all the water molecules. This could cause problems because some water molecules are conserved across different crystal structures and are essential for the enzymatic process. We might figure out a way to remove waters for docking and add them back after the substrate has been docked.

Fix the bug reading ANISOU record

Current method of dealing with missing chain id is to detect residues that are not neighbors in terms of lines in the PDB file.

This works most of the time, since the only non-atom lines in a PDB file are usually TER and END, which fit this case. But in a lot of cases, PDB files from the PDB database contain ANISOU records that separate every ATOM line, which I think will be the only exception. #76
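One way to make the detection robust is to look only at coordinate records and treat TER/END as the chain separators, so that interleaved ANISOU lines no longer break the adjacency. The sketch below is an illustration with a hypothetical `split_chains` helper, not the current parser.

```python
def split_chains(pdb_lines):
    """Group ATOM/HETATM lines into chains, splitting only at TER/END."""
    chains, current = [], []
    for line in pdb_lines:
        record = line[:6].strip()
        if record in ("ATOM", "HETATM"):
            current.append(line)
        elif record in ("TER", "END"):
            if current:
                chains.append(current)
                current = []
        # Any other record (ANISOU, REMARK, ...) is skipped and does not
        # interrupt the chain that is currently being collected.
    if current:
        chains.append(current)
    return chains
```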

Solvent Engineering

Background: Some hydrolases exhibit a special preference towards organic solvent. For lipase, an organic phase is even desirable because it activates most lipases by 10 to 100-fold. Also, the regioselectivity of the transesterification of a 2-octyl-1,4-dihydroxybenzene butyric acid ester reversed from favoring the 4-position in cyclohexane to the 1-position in acetonitrile (Rubio et al., 1991). Rubio et al. (1991) rationalized the reversal in selectivity according to differences in substrate solvation. Cyclohexane solvates the octyl group well, thus, the ester at the less hindered 4-position reacts. Acetonitrile solvates the octyl group poorly, thus the substrate binds in a manner that places the octyl group within the hydrophobic lipase active site. For this reason, the ester at the 1-position now reacts more rapidly.

Issue: The process of selecting organic solvent for enzyme catalysis is known as "solvent engineering". We might consider having a module in EnzyHTP to assess the stability and kinetics of enzyme variants in different solvent environments.

development need for EnzyTrun

Ranked by importance (low-ranking items can temporarily be replaced manually or removed):

mutation function
MD and Amber interface
Rosetta ddG Interface
Structural alignment function
AlphaFold2 interface
QM interface
SCA interface

The topology error of structures

In rare cases, for the old mutator, the mutation will generate structure with the "topology error", that is, for example, a ring from TYR circles around the side chain of TYR:

PyBel inconsistent output

Most of the time using the PyBel (Openbabel 3.1.0) api in enzy_htp to protonate ligand will give a file with "HETATM" lines. (e.g.: test/preparation/test_protonate.py::test_pybel_protonate_pdb_ligand_4CO)
But in the case of 4NEG, PyBel gives "ATOM" lines while parsing the ligand HEZ. (e.g.: test/preparation/test_protonate.py::test_fix_pybel_output)
There is no obvious reason PyBel treats these 2 ligand files differently.

This causes a bug: _fix_pybel_output cannot read the protonated ligand from PyBel in the current version of enzy_htp.

PDB2PQR issue when adding hydrogen atoms

The get_protonation function is affected by the residue name in the input PDB file. For the case I tested, this is specific for histidine. If the initial PDB file names every histidine as HIS, get_protonation will determine the protonation state and change HIS to HIE/HID/HIP accordingly. But if in the input PDB file histidine already has a name of either HIE/HID/HIP, pdb2pqr will only add hydrogen atoms according to their names.

So before running the get_protonation function, it is better to change every HIE/HID/HIP back to HIS just in case. This step can also be the first step of the get_protonation function.
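The renaming step could look like the sketch below, which relies on the fixed-column PDB layout where the residue name occupies columns 18-20. The function name `reset_histidine_names` is hypothetical, and the sketch only renames residues; it does not touch any hydrogens already present in the file.

```python
HIS_VARIANTS = {"HIE", "HID", "HIP"}


def reset_histidine_names(pdb_lines):
    """Rename HIE/HID/HIP to HIS on every ATOM/HETATM line."""
    renamed = []
    for line in pdb_lines:
        # Residue name sits in columns 18-20, i.e. the slice [17:20].
        if line.startswith(("ATOM", "HETATM")) and line[17:20] in HIS_VARIANTS:
            line = line[:17] + "HIS" + line[20:]
        renamed.append(line)
    return renamed
```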

Add a check for mutation space

If the desired number of non-repeating mutants cannot be reached within the mutation space, an infinite loop occurs.
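A possible pre-check is to count the mutation space before sampling and fail early when the request exceeds it. The sketch assumes each mutant carries exactly `n_point` point mutations chosen among `n_positions` sites, with 19 alternative amino acids per site; both function names are hypothetical and the real sampling rules in EnzyHTP may differ.

```python
import math


def mutation_space_size(n_positions, n_point, n_alternatives=19):
    """Number of distinct mutants with exactly `n_point` mutated sites."""
    return math.comb(n_positions, n_point) * n_alternatives ** n_point


def check_mutation_request(n_requested, n_positions, n_point):
    """Return the space size, or raise if more unique mutants are requested."""
    size = mutation_space_size(n_positions, n_point)
    if n_requested > size:
        raise ValueError(
            f"requested {n_requested} unique mutants but only {size} exist"
        )
    return size
```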

Change default setting for minimization

Use CPU for minimization to avoid the error related to http://archive.ambermd.org/202203/0149.html

Atom naming consistency

Another thing I noticed is that the atom names of the new residue generated by PyMol do not follow the classical PDB naming style. I compared them and found that mainly the hydrogen names differ. Fixing this is a good lv. 3 task! I suggest we choose from the 2 plans below:

Assume H naming is the only difference:
Detect the inconsistency between atom names after the mutation (you can use the standard atom naming information from chemical.residue), change all the H names back (it is simply a matter of where to place the digit), and abort when more than just H differs.
Difference in naming in general:
Still detect the inconsistency between atom names after the mutation (this should be a general method of Structure or Atom).
Write some one-time code to collect a naming map between our standard atom naming and PyMol's atom naming (this can be achieved by making PyMol generate each residue and recording the atom names), and store it in chemical.

To-do list for functional group workflow and substrate database

Locate the outside phosphate for NTP -> NDP + Pi and similar reaction.
Locate the inside phosphate for NTP -> NMP + diPi and similar reaction.
The dipole may be largely influenced by the protonation state: eg, Phosphate, Amidine
The dipole may be largely influenced by the chirality: eg, saccharides
A few beta-lactams have a pretty short C-N bond, which makes this amide bond hard to identify.
The chirality of saccharides is really hard. Locate glycosidic bonds more accurately.
Find a way to distinguish the C-O bonds in oxygen glycosidic bonds.
There are some very surprising reactions: 3.11.1.1, 3.5.99.2, 3.7.1.18, 3.7.1.8
Locating the reactive bond for Substrates of amidine hydrolase is hard.
Locate the peptide cleavage position in protease/peptidases.
In general, chirality is usually not reflected in the SMILES string. Many substrates have the wrong chirality.
The high throughput filtering should be improved. There are cases where products are treated as substrates, a category name is treated as a specific substrate name (peptide, peptidyl, -ate for a series of esters). And some seemingly unrelated chemicals are treated as substrates.
Some SMILES strings are not consistent with the substrate name.
Some SMILES strings of peptides lead to chemicals like -NH-CA-[CO-CO]-CA-[NH-NH]-.

Make sure `autoimage` is in new sample protocol

check commit b8283de

Address residue/atom indexing issue in Structure parsing

Either also return a mapper or always keep the consistent index with the input file.

Structure parser chain ID issue

If the structure parser has to assign more than 9 new chain IDs, the 10th unique chain ID is 10, which shifts the later fields on each line by one column and causes a problem when reading the resulting PDB.

Example: PDB ID 7tpt. See the screenshot in the attached figure.

Suggested change: Assign chain ID as follows: A-Z, a-z, and 0-9. This way, a maximum of 62 unique 1-digit chain IDs can be generated. And if the input structure has more than 62 unique chains (very extreme case), the get_structure() function can assign an integer starting from 10. But because of the limitation of the PDB format, these chain IDs are not supposed to be printed out by the get_file_str() function.

Besides, the current get_file_str() function outputs chains in the order 0-9, A-Z, and a-z. A better order would consistently be A-Z, a-z, and 0-9. And later integer chain IDs should be ordered as integers instead of as strings.
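The suggested scheme can be sketched as follows. `assign_chain_ids` and `chain_sort_key` are hypothetical helpers for illustration, not the existing get_structure() or get_file_str() code.

```python
import string

# A-Z, a-z, 0-9: 62 single-character chain IDs in the suggested order.
CHAIN_ID_POOL = string.ascii_uppercase + string.ascii_lowercase + string.digits


def assign_chain_ids(n_chains):
    """Return `n_chains` IDs: single characters first, then integers from 10."""
    ids = list(CHAIN_ID_POOL[:n_chains])
    overflow = max(0, n_chains - len(CHAIN_ID_POOL))
    ids.extend(range(10, 10 + overflow))
    return ids


def chain_sort_key(chain_id):
    """Sort single-character IDs by pool order, then integer IDs numerically."""
    if isinstance(chain_id, int):
        return (1, chain_id)
    return (0, CHAIN_ID_POOL.index(chain_id))
```

Sorting with `sorted(chain_ids, key=chain_sort_key)` gives the A-Z, a-z, 0-9 order and keeps integer IDs in numeric rather than lexical order.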

Support NCAA in protonate_stru

Current engine (PDB2PQR) in protonate_stru does not support modified AA. It will just delete them in the output.

Ligand adding H issue and new unsupported ion record

Bad ligand protonation:

SO4^2- is added with two H atoms, making it a neutral molecule when pH=7.
EPE in 3WIW
U12 in 2GG2. This time it is not only protonation status problem. N2 in U12 has 5 covalent bonds ...
GLV in 4PXE. Oxoacetic acid becomes hydroxyacetic acid. This could be a tricky one because if there are no H atoms, it is hard for the algorithm to distinguish if C-O is a double bond or a single bond.
SAM in 1rjd. The S+ atom is given a valence of 4.
FAD in 2YG4
TPO in 5TJ3

New ion:

MN (Mn+2) in 1K20.

Name of co-crystal molecules:

GOL in 1K20
EDO in 2VC7

cpu_mem cannot parse strings with unit 'G'

If res_setting is a string, cpu_mem has to be an integer or string with pure numbers like '3'. The current version cannot parse strings with trailing unit 'G'.
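A tolerant parser could accept both forms. The sketch below assumes that a bare number means gigabytes and that only G and M suffixes need support; `parse_mem_gb` is a hypothetical helper, not the current cpu_mem handling.

```python
import re


def parse_mem_gb(value):
    """Convert 3, '3', '3G', '3GB' or '2048M' into a number of gigabytes."""
    if isinstance(value, (int, float)):
        return float(value)
    match = re.fullmatch(
        r"\s*(\d+(?:\.\d+)?)\s*([GM]?)B?\s*", value, flags=re.IGNORECASE
    )
    if match is None:
        raise ValueError(f"cannot parse memory setting: {value!r}")
    number, unit = float(match.group(1)), match.group(2).upper()
    # Assumption: values without a unit are already in gigabytes.
    return number / 1024 if unit == "M" else number
```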

Change the way to configure resources when use job manager

With the job manager, jobs in the same workflow should no longer be limited to identical resources.

Structure diversity checker

We need a checker that tells us the diversity features of a structure. Such features would be:

Contain NCAA?
Contain metal?
Contain transition metal?
Contain missing loop?
Contain missing heavyatom?
Contain water?
Contain ligand?
etc...
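A first pass at such a checker could work directly on PDB lines, as in the sketch below. It only covers water, hetero groups, and non-standard residue names in ATOM records; telling an NCAA from a ligand or a metal among HETATM records needs more information than this uses. `diversity_features` is a hypothetical name.

```python
STANDARD_AA = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}
WATER_NAMES = {"HOH", "WAT"}


def diversity_features(pdb_lines):
    """Summarize which kinds of residues appear in the coordinate records."""
    res_names = {"ATOM": set(), "HETATM": set()}
    for line in pdb_lines:
        record = line[:6].strip()
        if record in res_names:
            # Residue name sits in columns 18-20, i.e. the slice [17:20].
            res_names[record].add(line[17:20].strip())
    all_names = res_names["ATOM"] | res_names["HETATM"]
    return {
        "water": bool(all_names & WATER_NAMES),
        "nonstandard_polymer_residues": sorted(
            res_names["ATOM"] - STANDARD_AA - WATER_NAMES
        ),
        "hetero_groups": sorted(res_names["HETATM"] - WATER_NAMES),
    }
```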

Add mutation calculation functions in the enzy_htp/mutation module

Combine experience from the KE_accre scripts on getting unique mutations and relative mutations:

Calculate the overall combination of mutations from given lists of mutaflag lists.
Calculate relative mutations given a reference mutaflag list.

Realize Job Queue Management

The ideal way to run various external software by EnzyHTP on a cluster and maximize the use of the resources is:

have EnzyHTP main script run as a job manager for heavy jobs on 1 CPU for a longer time
submit GPU/CPU job within EnzyHTP.

Add PDB2PQR version check

Support for artificial residues

We originally had a hot fix for this in the branch shaoqz/tmp_art_resi
Look at including CONECT lines from the original PDB for cases like lasso (works for disulfide bonds)

Check current Amber interface

Check whether the interface supports different nmropt settings for the multiple minimizations called in different parts of the workflow, e.g., PDBMin vs PDBMD.

Useful tool to solve the stoichiometry problem

https://doi.org/10.1007/s10822-012-9626-2
HyPPI
https://proteins.plus/

Protonation function input error

For 5FLM.pdb, an RNA polymerase structure that contains the nucleotide U, the PDB structure parser recognizes the U as a metal ion in the ligand.

import enzy_htp.structure as struct
import enzy_htp.structure.structure_operation as stru_oper
from enzy_htp.preparation import protonate as prot

sp = struct.PDBParser()

def test_protonate(test_pdb:str):
    stru = sp.get_structure(test_pdb)
    prot.protonate_stru(stru, 7.0, protonate_ligand=True)


def args():
    pass

def main():
    pdb_list = ["5FLM.pdb", "1F6G.pdb"]
    for pdbs in pdb_list:
        test_protonate(f"../{pdbs}")
        print(f"finish {pdbs}")

if __name__ == "__main__":
    main()

Existing Amber nc files mess up the trajectory result

The pre-existing .rst files will mess up the MD run if the structure is different. In .out there will be:
ERROR: natom mismatch in inpcrd/restrt and prmtop files!
In old EnzyHTP, this bug will be super delayed if running over files from an incomplete run with a different mutation, as:

Traceback (most recent call last):
  File "/gpfs52/data/yang_lab/shaoqz/KE-DE/R5/group_2_1/KE-metrics.py", line 130, in <module>
    main()
  File "/gpfs52/data/yang_lab/shaoqz/KE-DE/R5/group_2_1/KE-metrics.py", line 117, in main
    Dipoles = PDB.get_bond_dipole(pdb_obj.qm_cluster_fchk, a1qm, a2qm)
  File "/home/shaoq1/bin/EnzyHTP/Class_PDB.py", line 2837, in get_bond_dipole
    if 'Sum' in line:
Exception: Cannot find bond:25-26

This is caused by using an existing .mdcrd file from the last run with a different mutation.

Address residue key in Residue class

We need to work out the different scenarios in which the residue key is used and what information it needs to carry in each case.

Disulfide bonds are not correctly annotated in Amber

PDB2PQR can identify the disulfide bonds and change the residue name to CYX but I need to find in their API how to obtain the information of the residue idx of those residues.

ToDo List for issued problems

#11 (comment)
#10 (comment)

Ring containing residue problem checker and modifier

Cavity Engineering for Reactive Docking

Following the work of enzyHTP, one direction is to develop a module that allows reactive docking of substrate into the cavity of an enzyme mutant.
Background: Expanding the substrate scope for synthetically-useful enzymes inevitably involves engineering the size of the active site cavity. Following the work of reactive docking – the computational protocol to dock different substrates to wild-type enzyme in a pre-reaction complex form – we should build a module that allows docking of a substrate to an enzyme mutant when the substrate does not fit the wild-type counterpart because of a size mismatch. The module, or the "cut-to-fit" strategy, may involve two functions:

Remove the substrate substituents and attempt the docking; if it succeeds, project the removed part onto the enzyme cavity to see which residues should be mutated to fit the original substrate.
Conduct mutations and dock the original substrate.

add better support for nmr constraints in Amber interface

Need to support pairwise alignment

Needed for getting aligned indexes for homo-polymers, to automatically sync mutations across chains.
