chembiohtp / enzyhtp
EnzyHTP is a python library that automates the complete life-cycle of enzyme modeling
Home Page: https://enzyhtp-doc.readthedocs.io
License: Other
From accre admin
In slurm we have a lua script that checks the balance of use between cores/memory and GPU cards (in case a user requests most of the cores and memory but uses only one GPU card, which leaves the other three cards completely useless). This script may have a problem with the option --mem-per-gpu. Can you use the normal option --mem instead? That may solve the issue.
Both demands exist.
In general, PDB2PQR, embedded in the get_protonation() function, can process raw PDB files, which will remove all redundant information, add missing atoms, and determine the protonation status at the same time. However, some cases will cause problems.
Case #1: Biological assembly is not equivalent to the asymmetric unit. (To find out what "biological assembly" and "asymmetric unit" are, go to http://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/biological-assemblies.) For example, 1PKX is supposed to have two chains whereas the file has four chains. Chain A and B are one biological assembly and Chain C and D are another biological assembly. Proposed solution: Download biological assemblies instead of the original PDB file as the initial input file.
Case #2: PDB2PQR sometimes deletes the terminal residues (1Q17 as an example). Especially at the N terminal, if the original terminal residue is deleted, PDB2PQR will not change the next one into a new N terminal residue. N terminal -NH3 hydrogens have the name of H1, H2, H3 whereas middle residue O=C-N-H hydrogen simply has the name of H, causing LEAP to have a fatal error as (in this case the terminal is a GLY):
"FATAL: Atom .R<NGLY 239>.A<H 10> does not have a type."
To resolve this issue, one should first figure out why PDB2PQR deletes the terminal residues. (Maybe related to the residue numbering? 1Q17 chains start indexing the residue from negative numbers. But PDB2PQR does not delete all residues with non-positive numbering.)
Case #3: Multiple residues at one residue site. For example res240 in 1E25: LYS/ALA/GLY/LYS. PDB2PQR will merge them as one "big residue" which will cause problems later in Leap.
Case #4: Original PDB files cause get_protonation() to crash directly. Haven't got a chance to look into the exact reasons. Examples: 1K20, 1NI9, 1WN1.
For example, the 1k20.pdb will give these error messages:
Traceback (most recent call last):
  File "0_MAIN.py", line 21, in <module>
    PDB1.get_protonation() #PDB2PQR has different atom order as leap
  File "/gpfs23/scratch/jiany37/Partial_Charge/trial_3/Class_PDB.py", line 478, in get_protonation
    self._protonation_Fix(out_path, ph=ph)
  File "/gpfs23/scratch/jiany37/Partial_Charge/trial_3/Class_PDB.py", line 522, in _protonation_Fix
    new_stru.protonation_metal_fix(Fix = 1)
  File "/gpfs23/scratch/jiany37/Partial_Charge/trial_3/Class_Structure.py", line 540, in protonation_metal_fix
    metal.get_donor_residue(method = 'INC')
  File "/gpfs23/scratch/jiany37/Partial_Charge/trial_3/Class_Structure.py", line 1598, in get_donor_residue
    self.get_donor_atom(method=method)
  File "/gpfs23/scratch/jiany37/Partial_Charge/trial_3/Class_Structure.py", line 1588, in get_donor_atom
    if dist <= (R_d + R_m):
TypeError: unsupported operand type(s) for +: 'float' and 'NoneType'
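The message says the right-hand operand of `R_d + R_m` is None, so the metal radius was never set. One way to surface this earlier is to check the radii before comparing. The sketch below is only an illustration: `within_bonding_distance` and its argument names are hypothetical and not part of Class_Structure.py.

```python
from typing import Optional


def within_bonding_distance(dist: float,
                            r_donor: Optional[float],
                            r_metal: Optional[float]) -> bool:
    """Return True if dist is within the sum of the two radii.

    Raises a descriptive error instead of a bare TypeError when a radius
    is missing (None), e.g. for an element without a tabulated radius.
    """
    if r_donor is None or r_metal is None:
        raise ValueError(
            f"missing radius (donor={r_donor}, metal={r_metal}); "
            "the element probably has no tabulated radius"
        )
    return dist <= (r_donor + r_metal)
```

With a check like this, the failing PDB would at least report which radius is missing rather than crash on the addition.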
The method of generating the MutaFlag only works on the first chain.
In the Amber interface, we should cover the case where antechamber cannot find a good parameter and gives ATTN: NEEDS REVISION.
Raise an error for this.
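A minimal sketch of such a check, assuming the warning ends up as text in the generated parameter file. The function names are hypothetical, and the marker is matched case-insensitively on "ATTN" because the exact wording of the warning may differ between tool versions.

```python
from pathlib import Path


def find_flagged_parameters(param_path: str, marker: str = "ATTN") -> list:
    """Return the lines of a parameter file that contain the warning marker."""
    flagged = []
    for line in Path(param_path).read_text().splitlines():
        if marker.lower() in line.lower():
            flagged.append(line.strip())
    return flagged


def check_parameters(param_path: str) -> None:
    """Raise if any parameter in the file is flagged for revision."""
    flagged = find_flagged_parameters(param_path)
    if flagged:
        raise ValueError(
            f"{len(flagged)} parameter(s) in {param_path} need revision:\n"
            + "\n".join(flagged)
        )
```

Calling `check_parameters` right after parameter generation would stop the workflow before a long MD run starts with placeholder parameters.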
There are multiple versions of dir_gen.py & plot-E-BD-dist.py under /Test_files corresponding to different HTP workflow and dir design. These should be unified into a module in the new EnzyHTP.
`retain_order` does not seem necessary. It makes the saved PDB look weird. You can try looking into the saved PDB.
Originally posted by @shaoqx in #116 (comment)
Finding a bug a day into a 4-day run is such a pain
*** Error: tl_getline(): not interactive, use stdio.
Traceback (most recent call last):
  File "0_MutaGen_main.py", line 34, in <module>
    main()
  File "0_MutaGen_main.py", line 14, in main
    PDB1.PDB2FF('ff_prep_temp/')
  File "/gpfs23/scratch/jiany37/HTP_trial/03022021_enzyme_workflow-develop_qz/Class_PDB.py", line 845, in PDB2FF
    ligands_pathNchrg = self.stru.build_ligands(lig_dir, ifcharge=1)
  File "/gpfs23/scratch/jiany37/HTP_trial/03022021_enzyme_workflow-develop_qz/Class_Structure.py", line 512, in build_ligands
    net_charge = lig.get_net_charge(method=c_method, ph=ph)
  File "/gpfs23/scratch/jiany37/HTP_trial/03022021_enzyme_workflow-develop_qz/Class_Structure.py", line 1855, in get_net_charge
    mol = next(pybel.readfile('pdb', temp_pdb3_path))
  File "/home/jiany37/Software_installed/openbabel-3-1-1_installed/lib/python3.8/site-packages/openbabel/pybel.py", line 161, in readfile
    raise IOError("No such file: '%s'" % filename)
OSError: No such file: './cache/ligand_temp3.pdb'
3WIW ligand EPE. The error occurs before ff can be generated. But pdb2pqr adds one H atom to each N and O(=S) atoms, which is obviously not reasonable.
Another issue with PDB2PQR: it cannot process selenocysteine correctly and the output PDB will skip this residue. Example: 3F3K
Ions that get_protonation does not support: Mn(II), Co(III), Fe(III) (5HIO)
The transition metal ion Co only has a +2 charge state in the leap tip3p ff. In the future, we need to figure out a way to determine the charge of these ions, because the coordinate section of a PDB file does not carry this information. Most common example: Fe.
Some methods in the code are not working. We should figure out which of them are the basis for most future development and address those first.
Structure class should serve solely as the core data structure of EnzyHTP and should only handle accessing and editing of data. Other parts of functions that use Structure should be binding modules. (e.g.: PDB I/O or structure selection)
The current version can successfully handle the sacct and squeue timeout issue. But sbatch may also have a timeout issue, which has not been solved.
the unfinished problem in this commit 8e8bf81
When running multiple mutations in parallel, we need to apply a thread lock while shared files are being written and wait until the write finishes.
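A minimal sketch of the locking idea, assuming the parallel mutations run as threads inside one Python process. The names `append_result` and `run_mutations` are hypothetical. If the mutations run as separate processes or separate cluster jobs, a `threading.Lock` is not enough and an inter-process file lock would be needed instead.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# One lock guards every write to the shared file.
_shared_file_lock = threading.Lock()


def append_result(path: str, line: str) -> None:
    """Append one line to a shared file; only one thread writes at a time."""
    with _shared_file_lock:
        with open(path, "a") as handle:
            handle.write(line + "\n")


def run_mutations(mutations, path: str) -> None:
    """Process mutations in parallel threads that all record to one file."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # list() waits for all workers and re-raises any worker exception
        list(pool.map(lambda m: append_result(path, f"{m} done"), mutations))
```

Because the lock is held for the whole open/write/close sequence, other threads block until the current write has finished.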
The function removes all the water molecules. This could cause problems because some water molecules are conserved across different crystal structures and are essential for the enzymatic process. We might figure out a way to remove waters for docking and add them back after the substrate has been docked.
The current method of dealing with a missing chain ID is to detect residues that are not neighbors in terms of lines in the PDB file.
This works most of the time, since the only non-atom lines in a PDB file are usually TER and END, which fits the case. But in a lot of cases, PDBs from the PDB database contain ANISOU records that separate every ATOM line, which I think will be the only exception. #76
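One possible way around the ANISOU exception is to split chains only on explicit terminator records and ignore every other non-coordinate record. This is a sketch of that idea, not the parser's current implementation; `split_chains` is a hypothetical helper.

```python
def split_chains(pdb_lines):
    """Group ATOM/HETATM lines into chains, closing a chain at TER/END.

    ANISOU and other non-coordinate records are skipped, so they are
    not mistaken for chain breaks.
    """
    chains, current = [], []
    for line in pdb_lines:
        record = line[:6].strip()
        if record in ("ATOM", "HETATM"):
            current.append(line)
        elif record in ("TER", "END", "ENDMDL"):
            if current:
                chains.append(current)
                current = []
        # any other record (ANISOU, REMARK, ...) is ignored
    if current:
        chains.append(current)
    return chains
```

Each returned group can then be given a new chain ID when the input file has none.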
Background: Some hydrolases exhibit a special preference towards organic solvent. For lipase, an organic phase is even desirable because it activates most lipases by 10 to 100-fold. Also, the regioselectivity of the transesterification of a 2-octyl-1,4-dihydroxybenzene butyric acid ester reversed from favoring the 4-position in cyclohexane to the 1-position in acetonitrile (Rubio et al., 1991). Rubio et al. (1991) rationalized the reversal in selectivity according to differences in substrate solvation. Cyclohexane solvates the octyl group well, thus, the ester at the less hindered 4-position reacts. Acetonitrile solvates the octyl group poorly, thus the substrate binds in a manner that places the octyl group within the hydrophobic lipase active site. For this reason, the ester at the 1-position now reacts more rapidly.
Issue: The process of selecting organic solvent for enzyme catalysis is known as "solvent engineering". We might consider having a module in EnzyHTP to assess the stability and kinetics of enzyme variants in different solvent environments.
Rank based on importance: (low-ranking ones can temporarily be replaced manually or removed)
Most of the time, using the PyBel (Openbabel 3.1.0) API in enzy_htp to protonate a ligand gives a file with "HETATM" lines. (e.g.: test/preparation/test_protonate.py::test_pybel_protonate_pdb_ligand_4CO)
But in the case of 4NEG, PyBel gives "ATOM" lines while parsing the ligand HEZ. (e.g.: test/preparation/test_protonate.py::test_fix_pybel_output)
There is no obvious reason why PyBel treats these 2 ligand files differently.
This causes a bug: _fix_pybel_output cannot read the protonated ligand from PyBel in the current version of enzy_htp.
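One way to make the reader independent of which record type PyBel emits is to normalize the records before parsing. The sketch below assumes the ligand file contains only ligand atoms, so every ATOM record can safely become HETATM; `force_hetatm` is a hypothetical helper, not part of enzy_htp.

```python
def force_hetatm(pdb_lines):
    """Rewrite ATOM records as HETATM, leaving all other lines unchanged."""
    fixed = []
    for line in pdb_lines:
        # Both record names fill columns 1-6, so later columns do not shift.
        if line.startswith("ATOM  "):
            line = "HETATM" + line[6:]
        fixed.append(line)
    return fixed
```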
The get_protonation function is affected by the residue name in the input PDB file. For the case I tested, this is specific for histidine. If the initial PDB file names every histidine as HIS, get_protonation will determine the protonation state and change HIS to HIE/HID/HIP accordingly. But if in the input PDB file histidine already has a name of either HIE/HID/HIP, pdb2pqr will only add hydrogen atoms according to their names.
So before running the get_protonation function, it is better to change every HIE/HID/HIP back to HIS just in case. This step can also be the first step of the get_protonation function.
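The renaming step could look like the sketch below. It works on raw PDB lines and relies on the residue name occupying columns 18-20 of ATOM/HETATM records; `reset_histidine_names` is a hypothetical helper, not an existing EnzyHTP function.

```python
HIS_VARIANTS = {"HIE", "HID", "HIP"}


def reset_histidine_names(pdb_lines):
    """Rename HIE/HID/HIP residues back to HIS in ATOM/HETATM records."""
    renamed = []
    for line in pdb_lines:
        # Residue name sits in columns 18-20, i.e. string indices 17:20.
        if line.startswith(("ATOM", "HETATM")) and line[17:20] in HIS_VARIANTS:
            line = line[:17] + "HIS" + line[20:]
        renamed.append(line)
    return renamed
```

This only changes the residue names. Whether hydrogens already present in the input also need to be stripped before protonation is a separate question that the sketch does not address.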
Use CPU for minimization to avoid the error related to http://archive.ambermd.org/202203/0149.html
Another thing I noticed is that the atom names of the new residue generated by PyMol are not in the classical PDB naming style. I compared them and found that mainly the hydrogen names differ. Fixing this is a good lv. 3 task! I suggest we choose from the 2 plans below:
check commit b8283de
Either also return a mapper or always keep the consistent index with the input file.
If the structure parser has to assign more than 9 new chain IDs, the 10th unique chain ID is 10, which shifts the later fields on each line by one column and causes a problem when reading the resulting PDB.
Example: PDB ID 7tpt. See the screenshot in the attached figure.
Suggested change: Assign chain IDs as follows: A-Z, a-z, and 0-9. This way, a maximum of 62 unique one-character chain IDs can be generated. If the input structure has more than 62 unique chains (a very extreme case), the get_structure() function can assign integers starting from 10. But because of the limitation of the PDB format, these chain IDs are not supposed to be printed out by the get_file_str() function.
Besides, the current get_file_str() function outputs chains in the order 0-9, A-Z, a-z. A better order would consistently be A-Z, a-z, and 0-9. The later integer chain IDs should be ordered as integers rather than as strings.
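The proposed assignment and ordering can be sketched as follows. Both function names are hypothetical; the sketch only illustrates the scheme described above and does not reflect the current get_structure() or get_file_str() code.

```python
import string
from itertools import count

# 26 + 26 + 10 = 62 one-character chain IDs, in the proposed order.
SINGLE_CHAR_IDS = string.ascii_uppercase + string.ascii_lowercase + string.digits


def chain_id_candidates():
    """Yield 'A'-'Z', 'a'-'z', '0'-'9', then the integers 10, 11, ..."""
    yield from SINGLE_CHAR_IDS
    yield from count(10)


def chain_sort_key(chain_id):
    """Sort one-character IDs as A-Z, a-z, 0-9; integer IDs follow, numerically."""
    if isinstance(chain_id, int):
        return (1, chain_id)
    return (0, SINGLE_CHAR_IDS.index(chain_id))
```

A parser would take the first candidate that is not already used in the structure, and a writer would call `sorted(chain_ids, key=chain_sort_key)` and skip integer IDs when producing PDB text.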
Current engine (PDB2PQR) in protonate_stru does not support modified AA. It will just delete them in the output.
Bad ligand protonation:
New ion:
Name of co-crystal molecules:
If res_setting is a string, cpu_mem has to be an integer or string with pure numbers like '3'. The current version cannot parse strings with trailing unit 'G'.
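A small sketch of a more tolerant parser. It assumes that a bare number means gigabytes and that only a trailing 'G' or 'GB' needs to be accepted; `parse_mem_gb` is a hypothetical helper name.

```python
import re


def parse_mem_gb(cpu_mem) -> float:
    """Convert 3, '3', '3G' or '3GB' to a number of gigabytes."""
    if isinstance(cpu_mem, (int, float)):
        return float(cpu_mem)
    match = re.fullmatch(
        r"\s*(\d+(?:\.\d+)?)\s*(?:GB|G)?\s*", cpu_mem, flags=re.IGNORECASE
    )
    if match is None:
        raise ValueError(f"cannot parse memory setting: {cpu_mem!r}")
    return float(match.group(1))
```

Other units such as 'M' are rejected on purpose, so an unexpected value raises instead of being silently misread.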
Using the job manager no longer limits all jobs in the same workflow to identical resources.
We need a checker that tells us the diversity features of a structure. Such features would be:
Combine experience from KE_accre scripts about getting unique mutation and get relative mutation:
The ideal way to run various external software by EnzyHTP on a cluster and maximize the use of the resources is:
We originally had a hot fix for this in the branch shaoqz/tmp_art_resi
Look at including CONECT lines from the original PDB for cases like lasso (works for disulfide bonds)
Check whether the interface supports different nmropt settings for the multiple minimizations called in different parts of the workflow, e.g.: PDBMin vs PDBMD
5FLM.pdb, an RNA polymerase, contains the nucleotide U. The PDB structure parser recognizes the U as a metal ion in the ligand.
import enzy_htp.structure as struct
import enzy_htp.structure.structure_operation as stru_oper
from enzy_htp.preparation import protonate as prot

sp = struct.PDBParser()


def test_protonate(test_pdb: str):
    # parse the PDB file, then protonate the structure and its ligands at pH 7.0
    stru = sp.get_structure(test_pdb)
    prot.protonate_stru(stru, 7.0, protonate_ligand=True)


def args():
    pass


def main():
    pdb_list = ["5FLM.pdb", "1F6G.pdb"]
    for pdbs in pdb_list:
        test_protonate(f"../{pdbs}")
        # f-string so the file name is actually interpolated
        print(f"finish {pdbs}")


if __name__ == "__main__":
    main()
The pre-existing .rst files will mess up the MD run if the structure is different. In .out there will be:
ERROR: natom mismatch in inpcrd/restrt and prmtop files!
In old EnzyHTP, this bug surfaces much later when running over files from an incomplete run with a different mutation, as:
Traceback (most recent call last):
  File "/gpfs52/data/yang_lab/shaoqz/KE-DE/R5/group_2_1/KE-metrics.py", line 130, in <module>
    main()
  File "/gpfs52/data/yang_lab/shaoqz/KE-DE/R5/group_2_1/KE-metrics.py", line 117, in main
    Dipoles = PDB.get_bond_dipole(pdb_obj.qm_cluster_fchk, a1qm, a2qm)
  File "/home/shaoq1/bin/EnzyHTP/Class_PDB.py", line 2837, in get_bond_dipole
    if 'Sum' in line:
Exception: Cannot find bond:25-26
This is caused by using an existing .mdcrd file from the last run with a different mutation.
We need to figure out the different scenarios of its application and what information the residue key needs in each case.
PDB2PQR can identify the disulfide bonds and change the residue name to CYX but I need to find in their API how to obtain the information of the residue idx of those residues.
Following the work of EnzyHTP, one direction is to develop a module that allows reactive docking of a substrate into the cavity of an enzyme mutant.
Background: Expanding the substrate scope for synthetically-useful enzymes inevitably involves engineering the size of the active site cavity. Following the work of reactive docking (the computational protocol to dock different substrates to a wild-type enzyme in a pre-reaction complex form), we should build a module that allows docking of a substrate to an enzyme mutant whose wild-type counterpart the substrate does not fit because of the size mismatch. The module, or the "cut-to-fit" strategy, may involve two functions:
Needed for getting aligned indices for homo-polymers, to automatically sync mutations across chains