mobleylab / benchmarksets

Benchmark sets for binding free energy calculations: Perpetual review paper, discussion, datasets, and standards

License: BSD 3-Clause "New" or "Revised" License

Languages: TeX 84.98%, Python 15.02%

benchmarksets's Introduction

Benchmark sets for free energy calculations

This repository relates to the perpetual review paper "Predicting binding free energies: Frontiers and benchmarks" by David L. Mobley, Germano Heinzelmann, Niel M. Henriksen, and Michael K. Gilson. The repository focuses on benchmark sets for binding free energy calculations: it houses the perpetual review paper itself, along with supporting files, further discussion, datasets, and (we hope, ultimately) standards for datasets and data deposition.

The latest version of the paper is always available on this GitHub repository, as are all previous versions. Additionally, all release versions -- current and prior -- are archived with Zenodo DOIs, with the latest release always resolvable via the most recent Zenodo DOI.

The paper

Versions

The most up-to-date version of our perpetual review paper is always available here. Additionally, this repository provides the authoritative source for all versions of this paper. Released versions of the paper are also archived as preprints on eScholarship, and have Zenodo DOIs as noted above. An early version of this work was published in Annual Review of Biophysics 46:531-558 (2017).

Publication in Annual Review

While a portion of this work was originally published with Annual Reviews, the version here is substantially expanded and updated, and will continue to deviate further from the AR version. Thus, we refer to this version as the "perpetual review version," and this is reflected in its title.

Ongoing updates and credit

Source files for the paper are deposited here on this GitHub repository, as detailed below, and comments/suggestions, etc. are welcome via the issue tracker (https://github.com/MobleyLab/benchmarksets/issues).

The Annual Review portion of this work is posted with permission from the Annual Review of Biophysics, Volume 46 © 2017 by Annual Reviews. Only the Annual Reviews version of the work is peer reviewed; versions posted here are effectively preprints updated at the authors' discretion. The right to create derivative works (exercised here) is also exercised with permission from the Annual Review of Biophysics, Volume 46 © 2017 by Annual Reviews, http://www.annualreviews.org/

A list of authors is provided below.

Citing this work

To cite this work, please cite both:

  • The latest eScholarship version (archiving point releases of this repo) at https://escholarship.org/uc/item/9p37m6bq, with the authors currently listed there and the title "Predicting binding free energies: Frontiers and benchmarks (a perpetual review)"
  • Our Annual Review of Biophysics work (DOI)

The vision

The field vitally needs benchmark sets to test and advance free energy calculations, as we detail in our paper. Currently, there are no such standard benchmark systems. And when good test systems are found, the relevant data tends to be published but then forgotten, and never becomes widely available. Here, we want the community to be involved in selecting benchmark systems, highlighting their key challenges, and making the data and results readily available to drive new science.

To make this happen, we need community input. Please bring new, relevant work to our attention, including experimental or modeling work on the benchmark systems currently available here, or new work on systems that might make good candidate benchmark systems for the future. And please help us create consensus around a modest set of benchmark systems which can be used to drive forward progress in the field.

The benchmark sets

Currently proposed benchmark sets are detailed in the paper and include:

  • Host-guest systems
    • CB7
    • Gibb deep cavity cavitands (GDCCs) OA and TEMOA
    • Cyclodextrins (alpha and beta)
  • Lysozyme model binding sites
    • apolar L99A
    • polar L99A/M102Q
  • Bromodomain BRD4-1

Other near-term candidates include:

  • Thrombin
  • Suggest and vote on your favorites via a feature request below

Community involvement is needed to pick and advance the best benchmarks.

Get involved

We need your help to pick the most informative systems, identify the challenges they present, and help make them standard benchmarks. Please provide your input:

Vote on what we should do next

For long-term directions, please help us prioritize what we ought to be doing in terms of benchmarks and other changes. Please click below to vote on one of these priorities or to suggest your own (such as addition of specific benchmark systems):

Feature Requests

Submit an issue

If you have a specific suggestion or request relating to the material on GitHub or our paper, please submit a request on our issue tracker.

Submit a pull request

We also welcome contributions to the material which is already here to extend it (see Section IV in our paper) and encourage you to actually propose changes via a "pull request", even to the paper itself. This will allow us to track your contributions, as well. Specifically, the full list of contributors to the updated paper and data can be appended to subsequent versions of this work, as they would be for a software project. New versions of this work are assigned unique, cite-able DOIs and essentially constitute preprints, so they can be cited as interim research products.

Authors

  • David L. Mobley (UCI)
  • Germano Heinzelmann (Universidade Federal de Santa Catarina)
  • Niel M. Henriksen (UCSD)
  • Michael K. Gilson (UCSD)

Your name, too, can go here if you help us substantially revise/extend the paper.

Acknowledgments

We want to thank the following people who contributed to this repository and the paper, in addition to those acknowledged within the text itself:

  • David Slochower (UCSD, Gilson lab): Grammar corrections and improved table formatting
  • Nascimento (in a comment on bioRxiv): Highlighted PDB code error for n-phenylglycinonitrile
  • Jian Yin (UCSD, Gilson lab): Provided host-guest structures and input files for the CB7 and GDCC host-guest sets described in the paper

Please note that GitHub's automatic "contributors" list does not provide a full accounting of everyone contributing to this work, as some contributions have been received by e-mail or other mechanisms.

Versions

  • AR: Annual Review of Biophysics 46:531-558 (2017). This version split from this repo around the time of the 1.0 release below.
  • v1.0: As posted to bioRxiv
  • v1.0.1 (10.5281/zenodo.155330): Incorporating improved tables and typo fixes from D. Slochower; also, versions now have unique DOIs via Zenodo.
  • v1.0.4 (10.5281/zenodo.167349): Maintenance version fixing an incorrect PDB code and adding a new reference and some new links.
  • v1.1 (10.5281/zenodo.197428): Adds significant additional discussion on potential future benchmark sets, needs for workflow science, etc. See release notes for more details. Versions also now include the date and version number within the PDF.
  • v1.1.1 (10.5281/zenodo.254619): Adds input files for host-guest benchmarks; some revisions to text as recommended by Annual Reviews. See release notes for more details.
  • v1.1.2 (10.5281/zenodo.569575): Adds consistently handled SMILES for aromatics, Annual Reviews copyright/rights info in TeX and README, additional citation information for one reference, and new discussion of some new bromodomain absolute binding free energy work.
  • v1.1.3 (10.5281/zenodo.571227): Changes title to include "(a perpetual review)" to make more clear that this is not the same paper as the Annual Reviews version; makes clarifications to README.md about which version is which.
  • v1.1.4 (10.5281/zenodo.838361): Updates README.md to reflect publication; clarify differences in material; reflect availability on eScholarship. Updates paper to reflect migration to eScholarship rather than bioRxiv.
  • v1.2 (10.5281/zenodo.839047): Addition of bromodomain BRD4(1) test cases as a new "soft" benchmark, with help from Germano Heinzelmann. Addition of Heinzelmann as an author. Addition of files for BRD4(1) benchmark. Removed bromodomain material from future benchmarks in view of its presence now as a benchmark system.
  • v1.3: Adds cyclodextrin benchmarks to the data and the paper; removes most cyclodextrin material from future benchmarks. Addition of Niel Henriksen as an author based on his work on this. BRD4(1) changes: reorganizes data files; improves the BRD4(1) README; switches sd to sdf files; gives each BRD4(1) ligand a unique identifier specific to this paper.

Changes not yet in a release

  • Add info on how to cite this paper to main README.md
  • Fix experimental reference for catechol binding free energy value in Table VIII

Manifest

  • paper: Provides LaTeX source files and final PDF for the current version of the manuscript (reformatted and expanded from the version submitted to Ann. Rev. and with 2D structures added to the tables); images, etc. are also available in sub-directories, as is the supporting information.
  • input_files: Ultimately to include structures and simulation input files for all of the benchmark systems present as well as (we hope) gold standard calculated values for these files. Currently this includes:
    • README.md: A more extensive document describing the files present
    • BRD4 structures and simulation input files from Germano Heinzelmann
    • CB7 structures and simulation input files from Jian Yin (Gilson lab)
    • GDCC structures and simulation input files from Jian Yin (Gilson lab)
    • Cyclodextrin structures and simulation input files from Niel Henriksen (Gilson lab)

benchmarksets's People

Contributors

andrrizzi, davidlmobley, gheinzelmann, guoweiqi, jchodera, nhenriksen, slochower


benchmarksets's Issues

Get CD benchmarks in?

@nhenriksen - we just got Germano's bromodomain stuff merged in, so would now be a good time for you to proceed towards getting your cyclodextrin stuff merged in as well? It seems like it will likely become important to do so fairly soon since it's potentially being used for the Yank paper (cc #41 ). Since we're done working on the bromodomain stuff now it should be possible to proceed towards merging without having to resolve conflicts multiple times.

Or, are there other things you need to do first?

Typo in reference for Lysozyme L99A/M102Q - catechol experimental value?

The reference for the binding free energy of catechol to T4 L99A/M102Q in Table VIII is indicated to be [224]. I took a look at the original paper to figure out buffer conditions, but I couldn't find a reference to ITC measurements with catechol (see Table 3 in [224]). Also, in the benchmark sets paper, catechol is the only compound coming from [224] with an error associated with the experimental binding free energy.

Is it possible that this is a typo? It could be that the value comes from [19] instead.

Provide suitable README files for CB7, GDCC benchmarks

The CB7 and GDCC benchmarks do not have a README providing data or references to other computational works. We should provide these since (a) it's the right thing to do, and (b) we want to be consistent with the other data sets.

(This was mentioned in #47 here #47 (comment) but I'm creating an issue so we don't forget.)

Mixed 1-4 scaling?

@nhenriksen - we were trying to work with the cyclopentanol (guest 4) example from CD set 1, and noticed that the AMBER prmtop file has mixed 1-4 scaling factors (SCEE). Was this intended and, if so, can you explain why?

This makes conversion into some code bases (GROMACS for example) impossible, and I've also never seen it before, so I am very curious where this came from/why it's done here.

Thanks.

cc @elkhoury
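For anyone hitting the same question, here is a quick way to spot mixed scaling factors in a prmtop -- a minimal sketch, assuming ParmEd is installed; the filename is hypothetical:

```python
# Sketch: flag prmtop files that carry mixed 1-4 scaling factors.
import parmed

parm = parmed.load_file("guest4.prmtop")  # hypothetical filename

# Modern prmtops store per-dihedral 1-4 scaling under these flags.
scee = set(parm.parm_data.get("SCEE_SCALE_FACTOR", []))
scnb = set(parm.parm_data.get("SCNB_SCALE_FACTOR", []))

if len(scee) > 1 or len(scnb) > 1:
    print(f"Mixed 1-4 scaling: SCEE={sorted(scee)}, SCNB={sorted(scnb)}")
else:
    print("Uniform 1-4 scaling.")
```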

Include additional data, references

References:

  • Read Lindorff-Larsen, Miao papers on lysozyme binding; include references and relevant insights. Lindorff-Larsen reference: https://elifesciences.org/articles/17505
  • Reference Pan paper: http://pubs.acs.org/doi/abs/10.1021/acs.jctc.7b00172
  • Add Minh/Xie reference on multiple binding modes in L99A (10.1021/acs.jctc.6b01183); this same paper also notes difficulty in converging Yank binding free energy calculations (section 3.3).
  • Add additional trypsin references -- Tiwary, De Fabritiis, Noe, Doerr, Buch, Dickson, Amaro

Data curation

  • Provide isomeric SMILES for all compounds outside the LaTeX (see the sketch after this list)
  • Add info on ionization states. Tables typically show 2D structures for neutral forms; files provide a guess. This needs explaining. Suggestion: "Add to the legends of the tables with the benchmark compounds and data some information about the relationship between what is shown in the table and what is provided in the molecule files. We may state that the molecule files provide a reasonable guess for the ionization states for the free ligands, but it is ultimately up to the user to decide what they want to do about ionization states if they are trying to match experiment."
  • Update octa-acid uncertainty information (see DLM tasks)
  • Clean host/guest files so the guest is not multi-residue, and hosts have residue number 1.
  • Possibly add Pan/Xu FKBP data/example/inputs, see DLM tasks
  • Possibly add lysozyme inputs from Rizzi, see DLM tasks
  • Check carboxylic acid bond order problem (?) -- #47 (comment)
  • Possibly re-generate guest starting structures using docking, #50
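For the first item, a minimal sketch of how the isomeric SMILES might be generated outside the LaTeX, assuming RDKit is available and that guests live in per-compound SDF files (the filenames are hypothetical):

```python
# Sketch: emit canonical isomeric SMILES for each deposited guest file.
from rdkit import Chem

for path in ["guest-1.sdf", "guest-2.sdf"]:  # hypothetical filenames
    mol = Chem.SDMolSupplier(path, removeHs=False)[0]
    if mol is None:
        print(f"{path}: could not parse")
        continue
    smi = Chem.MolToSmiles(mol, isomericSmiles=True)
    print(f"{path}\t{smi}")
```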

Additional discussion/paper editing:

  • Discuss hydration free energies and FreeSolv (by request; readers felt this was important to mention)
  • Note availability of Minh/Xie curated set of lysozyme binders, which now lives on Mobleylab GitHub
  • Clearly delineate soft and hard benchmarks in subsection headers
  • Fix typo -- p8, beginning of GDCC section 2, "directly with bound hosts" -> "directly with bound guests"
  • Note L99A/M102H as of possible future interest.
  • Add Ponder insights from SAMPL7 webinar (available online on Zenodo) to paper -- particularly that changing flexibility around the upper ring affects binding of guests by 4-5 kcal/mol (said about 8:34 am), and that the diphenyl ether has two coupled torsions that can't be fit as a sum of 1D C-O torsions, so they had to use a 2D torsion-torsion coupling term (!!).

Other

  • License the writing under CC-BY, since only code can be MIT
  • Update eScholarship links to GitHub and vice versa

Attempt to associate benchmarks with readily available computational materials

This is the computational analog of #2 -- one would like to make it easy to do new studies on existing benchmark systems for a variety of possible tests as detailed in the paper, including things like:

  • Test a new method on systems studied with an existing forcefield and method
  • Test a new forcefield
  • Cross-compare simulation packages
  • Test sampling methods

etc.

We need to plan how to facilitate this. We'll need to sort out how to make available computational materials - structures, input files, etc. Ultimately, we will likely even want a way to specify specific order parameters to analyze for convergence, etc. (e.g., something machine-handleable which can tell automated analysis to be sure to check sampling of Val103 in lysozyme L99A).
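As a strawman, such a machine-handleable spec might look something like this (a hypothetical schema; every key here is invented for illustration):

```python
# Sketch: a spec an automated analysis tool could read to know which
# degrees of freedom it must verify are adequately sampled.
benchmark_spec = {
    "system": "lysozyme-L99A",
    "inputs": {"topology": "complex.prmtop", "coordinates": "complex.rst7"},
    "check_sampling": [
        # e.g. require convergence checks on Val103's side-chain rotamer
        {"residue": "Val103", "observable": "chi1", "type": "dihedral"},
    ],
}
```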

Split out sections into separate TeX files for better editing

Since the community is welcome to edit/propose changes to the paper, I perhaps should split out the major sections into separate TeX files to make it easier to deal with multiple changes at once without editing clashes.

On the other hand, maybe this will make it harder since people will have to figure out which file they need to edit.

Provide compound ID for all files

I think we should probably move towards a model where all ligands (or guests) in each benchmark set have an appropriate, unique, paper-specific numerical compound ID, rather than the current model where this is dependent on what set we're looking at. For example:

  • CB7 Tables 1&2: Has unique CID we assigned
  • GDCC Table 3: Has unique CID we assigned, but will get broken if we want to provide structures docked into hosts as there are two hosts but only one set of compound IDs
  • GDCC Table 4: Has unique CID we assigned
  • CD Table 5 and 6: Has unique CID we assigned
  • lysozyme Tables 7 and 8: No CIDs, uses compound names only
  • BRD4(1) Table 9: Uses heterogeneous identifiers -- "Compound 4", "alprazolam", "Bzt-7", "JQ1(+)" etc.; this is probably the worst offender since some of these are pretty unsuitable as filenames due to special characters and/or spaces (e.g. some tools can't load files with spaces in their filenames and/or handle some of these special characters).

@GHeinzelmann @nhenriksen - thoughts? My preference I think is to make sure every set has a unique numerical compound ID in the tables and that this is used for all of the relevant files.
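To illustrate, a sketch of the kind of mapping this would imply (the `brd4-NN` ID scheme is hypothetical, not a settled proposal):

```python
# Sketch: assign paper-specific compound IDs and filesystem-safe names
# to the heterogeneous BRD4(1) identifiers from Table 9.
import re

brd4_ligands = ["Compound 4", "alprazolam", "Bzt-7", "JQ1(+)"]

def safe_name(name: str) -> str:
    """Replace spaces and special characters so the name works in filenames."""
    return re.sub(r"[^A-Za-z0-9_-]+", "_", name).strip("_")

for cid, name in enumerate(brd4_ligands, start=1):
    print(f"brd4-{cid:02d}  ({safe_name(name)})  <- {name}")
```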

Add machine-readable tables for all the sets

Can we add Markdown tables for the CB7 and GDCC sets like we have for the CD sets? Those are really helpful.

I realize this information is in the manuscript itself, but when setting up calculations on the entire set of systems, it's way easier to use the Markdown (or just csv) tables than the PDF. For example, for each file I'm processing, I can fairly easily write a function to parse the tables, return the host and guest, and store the experimental binding affinity for later analysis. Even better would be to list host and guest residue names, along with charge (I should be able to get the charge from the SMILES without too much difficulty, but having it listed directly would avoid dependencies on, e.g., OpenEye or other chemistry-parsing code and help ensure everyone starts with the same exact state).
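For concreteness, a rough sketch of the kind of parsing this enables, assuming RDKit and a table layout like the CD sets' (the filename and column names are hypothetical):

```python
# Sketch: parse a Markdown data table and derive net charge from SMILES.
from rdkit import Chem

def parse_md_table(path):
    """Return one dict per data row, keyed by the header row's column names."""
    with open(path) as fh:
        lines = [l.strip() for l in fh if l.strip().startswith("|")]
    header = [c.strip() for c in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # skip the |---| separator row
        cells = [c.strip() for c in line.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows

for row in parse_md_table("cd-set1.md"):  # hypothetical filename/columns
    mol = Chem.MolFromSmiles(row["SMILES"])
    charge = Chem.GetFormalCharge(mol) if mol else None
    print(row["Guest"], row["dG (kcal/mol)"], "net charge:", charge)
```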

I also realize I could submit a PR myself -- and it's on my to-do list -- but by listing it here, someone might take a stab at it before it surfaces to the top for me.

Resurrect my spec sheet for benchmark set data

Ultimately, we want to have computational data available for benchmark systems to make it easy for new researchers to reproduce and then learn from (by building on or deviating from) the work of previous researchers. To facilitate this, we need to sort out more guidance in terms of how such computational data should be made available. Ideally, I think it would be made available in a way such that if you wanted to begin studies on my system, you could do it automatically given my archived data files, without even having to have a human being read a set of README files.

To make this possible, we need to decide what data we would provide and how.

At one point I started a Google Doc for discussion of how we could make this happen, and I need to resurrect that and get discussion going again here and elsewhere.

Add additional supporting data on host-guest and lysozyme systems

We will probably want to provide some additional data to accompany the existing benchmarks already noted in the paper in order to facilitate new science beyond the benchmarks proposed:

  • Provide detailed lists of additional binders/nonbinders for the host-guest benchmark systems
  • Provide a full list of lysozyme binders/nonbinders with references, since the Shoichet lab no longer seems to be maintaining their lists

gdcc-set2 and CB experimental Delta H values are missing

Dear Mobley's team,

This is not really an issue, more of a request. I am using the host-guest systems for my research and I need experimental Delta H results. I realized that Delta H values are missing for gdcc-set2, CB set 1, and CB set 2, and I was wondering if you would be able to provide them. I appreciate your support on this.

Best regards,
Sahar

Fix PDB code in table

From Nascimento on Biorxiv:

It seems that there is a mismatch in one of the lysozyme T4 (M102Q) complexes cited in table VI. The crystal structure 2RBO does not contain n-phenylglycinonitrile as a ligand. Instead, 2-nitrothiophene is the binder there. So, the PDB code should be 2RBN to correctly point to the complex between T4 Lys M102Q and n-phenylglycinonitrile. Just my 2c for this very interesting paper!

Add README.md for host-guest inputs; add additional info

#22 added an extensive set of host-guest input files for the host-guest sets from the paper, courtesy of Jian Yin from the Gilson lab. I need to adjust the README she kindly provided into a README.md, and add a manifest of what files were added and how they were organized.

Add additional supplementary data, perhaps in markdown files?

In a recent round of edits on bromodomain ligands, Mike Gilson suggested:

Would it make sense to add some info to the table providing the references for the computational papers for each ligand to date? On the other hand, this would deviate from what we are doing in the other benchmark data tables...

@GHeinzelmann - what do you think of this? Not for BRD4(1) specifically, but should we perhaps be compiling supplemental data (perhaps in markdown files that other people can easily edit) for each benchmark set that lists all of the studies of each ligand, perhaps by DOI, maybe also with a spot where people can remark on key insights from each study?

This would provide a way for the community to effectively add notes to this repo on what they think has been shown in the literature; then we could link to it from the paper but it wouldn't be part of the paper itself.

Should we use table titles and not numbers in markdown files?

@nhenriksen @GHeinzelmann - I notice both of you reference (in your markdown files, which are great!) specific table numbers in the paper. I wonder if we should be referencing tables by title rather than by number. Otherwise, if we change things in the paper such that tables auto-renumber, then the table references will all be wrong and someone will have to fix them. If we just refer to them by title, then we won't have to remember to change them.

Thoughts?

CD mol2 files coordinates starting in the binding pocket

Hi all, we'd like to use the CD input files to run YANK calculations. In particular, we'd like to start from the .mol2 files currently in the nieldev branch to prepare our solvation boxes in TIP4P-EW waters. A couple of questions (tagging @nhenriksen who is working on the branch):

  1. Could you confirm that the mol2 files already have the same protonation state/charges that were used in the reference calculation?
  2. Would it be possible to have the coordinates in the mol2 match those in the final rst7 file, so that the guest is in the binding pocket (see the sketch after these questions)? I can work on this myself in case you don't have time but are still interested; I'll have to do it anyway in the next couple of days to set up my simulations.
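In case it helps, a minimal ParmEd sketch of what question 2 would involve (`:GST` is a hypothetical guest residue mask; adjust to the actual residue name in the deposited files):

```python
# Sketch: write a guest mol2 whose coordinates match the equilibrated rst7.
import parmed

struct = parmed.load_file("complex.prmtop", xyz="complex.rst7")
guest = struct[":GST"]  # select the guest via an Amber-style mask
guest.save("guest-bound.mol2", overwrite=True)
```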

Parmed bug affecting cyclodextrin files

I just want to make sure you've noticed this issue: ParmEd/ParmEd#898. Briefly, manipulating the cyclodextrin mol2 files with ParmEd results in a ring breaking. A work-around would be assigning a single residue number to all cyclodextrin atoms (currently 7 for beta-CD and 6 for alpha-CD).

@davidlmobley, if somebody in your group has run cyclodextrin calculations with YANK using non-OpenEye charges, this bug surely affected the setup.
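A text-level sketch of that work-around, side-stepping ParmEd entirely (filenames and the merged substructure name are hypothetical, and the @<TRIPOS>SUBSTRUCTURE section may also need pruning to match):

```python
# Sketch: rewrite a mol2 so every atom belongs to a single substructure.
def merge_residues(path_in, path_out, subst_name="MGO"):
    with open(path_in) as fh:
        lines = fh.readlines()
    out, in_atoms = [], False
    for line in lines:
        if line.startswith("@<TRIPOS>"):
            in_atoms = line.strip() == "@<TRIPOS>ATOM"
            out.append(line)
            continue
        cols = line.split()
        if in_atoms and len(cols) >= 8:
            cols[6], cols[7] = "1", subst_name  # subst_id, subst_name fields
            line = " ".join(cols) + "\n"
        out.append(line)
    with open(path_out, "w") as fh:
        fh.write("".join(out))

merge_residues("beta-cd.mol2", "beta-cd-merged.mol2")  # hypothetical names
```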

Provide bound-state starting structures for hosts

The CB7 and GDCC guest input files do not have coordinates which correspond to a bound state in the host.

Per Niel:

They are close [to bound], but clearly not a plausible bound state. Jane made these files, and I don't see a way to fix this without manually setting them up or extracting conformations from the equilibrated prmtop/rst7 files.

I now have a Jupyter notebook I've prepared for SAMPL6 which can dock guests to hosts, so we should be able to re-generate these files from compound isomeric SMILES strings. It'll just take me a bit of time to get to that.
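As a stopgap before the docking step, a minimal RDKit sketch for regenerating a free-guest starting structure from an isomeric SMILES (cyclopentanol shown as an example; this does not place the guest in the host, which is what the docking notebook would add):

```python
# Sketch: build 3D guest coordinates from SMILES via RDKit embedding.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("OC1CCCC1")        # cyclopentanol (CD set 1 guest 4)
mol = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol, randomSeed=42)   # generate 3D coordinates
AllChem.MMFFOptimizeMolecule(mol)           # quick geometry clean-up
Chem.MolToMolFile(mol, "guest-start.sdf")   # hypothetical output name
```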

Decide what criteria benchmark systems should meet

It would be good to develop a set of criteria a benchmark system should typically meet in terms of data quality, structure availability, etc.

Originally, I thought that we would be able to develop a universal set of such criteria (i.e. high quality structures of such-and-such a resolution, ITC or SPR binding affinities, etc.), but then as the paper developed we realized that different types of data are needed depending on the purpose of a test, as in Section II.A ("hard" and "soft" benchmarks). So, it may not be that we can provide a universal set of criteria -- but it would be good to discuss criteria that might apply in the different categories.

Deposit calculated values for provided files when available

To move this in the direction of helping people benchmark, we should provide calculated values from gold standard calculations with the provided files, when available. These should be in a markdown file in the relevant directory, I think.

@nhenriksen - is this something you're able to add? I think you have values for all of the files you've deposited?

@GHeinzelmann - I think you may not?

@Janeyin600 - do you?

At some point we'll actually need to repeat the lysozyme calculations (or another group will) and get input files for those, and calculated values, in here as well.

Attempt to associate benchmarks with readily available experimental materials

An interesting question is whether it is possible to facilitate new experiments on existing benchmark systems. Specifically, could we make it easy to access the necessary materials for new experiments? For example, for host-guest systems, one could imagine laying out exactly which host and which guest to buy from which supplier.

Perhaps there are vendors who would participate in this, or perhaps even NIST could provide standard reference materials?
