mosdef-hub / reproducibility_study

Repo for data collection, discussion, etc for a MoSDeF reproducibility study.

License: MIT License

Languages: Python 41.41%, Shell 1.25%, Jinja 0.25%, Roff 0.44%, Jupyter Notebook 56.64%

reproducibility_study's Issues

Include Compound JSON files in each statepoint

To ensure that we can recover the exact same mBuild Compound for future studies, we should also save a JSON serialization of the compound at each statepoint.

For example:

```python
filled_system.save("starting_compound.json")
```

This will ensure that we preserve the hierarchy of the compounds and any other information that is unique to mBuild.

This can happen at the same time as the other file I/O when writing out the systems for our engines.
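
A minimal sketch of the round trip (assuming mBuild's JSON support via `save`/`load`; the SMILES stand-in for the filled box is hypothetical):

```python
import mbuild as mb

# Stand-in for the filled box built elsewhere in init/project.
filled_system = mb.load("CCO", smiles=True)  # hypothetical example molecule

# Serialize the full Compound hierarchy (names, labels, bonds) to JSON.
filled_system.save("starting_compound.json", overwrite=True)

# In a later study, reload an identical Compound.
restored = mb.load("starting_compound.json")
```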

Add continuous integration

Now that this repo is public, let's add continuous integration! I'm most familiar with GitHub Actions and Codecov, so if I work on this I will go with those unless anyone has a preference for a different service.

Add user instructions for how to run init/project

The signac structure is awesome, but a bit more abstract than what I have used before. Let's add instructions for folks who may not be familiar with how to initialize and run this project with signac.

Sanitize simulation outputs

From chat with @tcmoore3 and @justinGilmer:
Each engine-specific project.py file should have an additional operation: one that sanitizes the log (containing energy and other property information) and the trajectory (containing positions, box information, and possibly types and bonds) into a standard format (perhaps a txt or npy file).
This way, the analysis project can operate on the entire workspace and look for standard output files.

We should discuss what all information should be saved and what format to use.
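
A minimal sketch of what such an operation could look like (the operation name, filenames, and column set are placeholders; each engine would supply its own column mapping):

```python
import numpy as np
from flow import FlowProject


class Project(FlowProject):
    pass


@Project.operation
@Project.pre.isfile("log-raw.txt")     # engine-specific log; name is a placeholder
@Project.post.isfile("log-clean.txt")  # standardized output consumed by analysis
def sanitize_output(job):
    """Map engine-specific log columns onto an agreed standard set."""
    raw = np.genfromtxt(job.fn("log-raw.txt"), names=True)
    columns = ["step", "potential_energy", "temperature", "pressure", "volume"]
    clean = np.column_stack([raw[name] for name in columns])
    np.savetxt(job.fn("log-clean.txt"), clean, header=" ".join(columns))
```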

Molecules to study

A table of molecules we plan to include in this study, as well as conditions. Edit as needed.

NOTE: This does not currently include any challenge molecules we discussed before.

| Molecule | Force field | Conditions | Ensembles | Reasoning |
| --- | --- | --- | --- | --- |
| United-Atom Methane | TraPPE-UA | 3 (T, rho), 3 (T, P) | NVT, NpT, GEMC-NVT | LJ only |
| United-Atom Ethane | TraPPE-UA | 1 (T, rho), 1 (T, P) | NVT, NpT | Adds bonds |
| United-Atom Propane | TraPPE-UA | 1 (T, rho), 1 (T, P) | NVT, NpT | Adds angles |
| United-Atom Butane | TraPPE-UA | 1 (T, rho), 1 (T, P) | NVT, NpT, GEMC-NVT | Adds dihedrals |
| United-Atom Hexane | TraPPE-UA | 3 (T, rho), 3 (T, P) | NVT, NpT | Adds dihedrals and angles |
| SPC/E Water | SPC/E | 3 (T, rho), 3 (T, P) | NVT, NpT | Electrostatics |
| Atomistic Ethanol | OPLS-AA | 3 (T, rho), 3 (T, P) | NVT, NpT | Adds dihedrals |

Engine specific smoothing functions?

Let's compile a list of the smoothing functions per engine to ensure we choose the same one across the study.

Edit: copied from the Google sheet.

| Engine | Long-Range Correction (LRC) | Switch | Shift | Hard cutoff | XPLOR |
| --- | --- | --- | --- | --- | --- |
| GROMACS | | | | | |
| LAMMPS | Yes | Yes | Yes | Yes | |
| HOOMD | No | No | Yes | Yes | Yes |
| GOMC | Yes | Yes | Yes | Yes | |
| Cassandra | Yes | Yes | Yes | Yes | |
| MCCCS-MN | Yes | No | Yes | Yes | |

Option to change log filename in `plot_job_property_with_t0`

Currently the function only reads a file named "log.txt".
This might not work for systems with two boxes, where we have something like "log_box1.txt" and "log_box2.txt".
A new optional argument, log_filename, can be introduced in the function.
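
A minimal sketch of the change (the surrounding signature is assumed; only the `log_filename` keyword is the suggested addition):

```python
import numpy as np


def plot_job_property_with_t0(job, prop, log_filename="log.txt"):
    """Plot `prop` vs. step, marking the detected start of equilibration."""
    # The default preserves current behavior; two-box GEMC jobs can pass
    # log_filename="log_box1.txt" or "log_box2.txt".
    data = np.genfromtxt(job.fn(log_filename), names=True)
    # ... existing plotting logic, unchanged ...
```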

Selectively delete workspace folder based on molecule

Given a molecule name (and/or engine name), delete all the corresponding statepoint folders from the workspace.

Helpful when something goes wrong with the simulation of one molecule type and the workspace needs to be cleaned up just for that molecule.
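
A minimal sketch using signac's job filtering (statepoint key names like `molecule` and `engine` are assumptions about the schema):

```python
import signac

project = signac.get_project()

# Remove every statepoint folder matching one molecule (and optionally one engine).
for job in project.find_jobs({"molecule": "methaneUA", "engine": "lammps"}):
    job.remove()
```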

LJ tail correction

Putting this here instead of on the spreadsheet. We need to be consistent with our handling of the LJ tail correction. If we are going to use it, we will have to implement it in HOOMD-blue. I know LAMMPS and GROMACS have switches for handling the LJ tail correction, but I'm not sure about the other engines.
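
For reference, the standard analytic tail correction to the energy for a homogeneous single-component LJ fluid truncated at $r_c$ (the textbook expression, e.g. Allen & Tildesley):

```math
U_{\mathrm{tail}} = \frac{8\pi N \rho \epsilon \sigma^{3}}{3}
\left[\frac{1}{3}\left(\frac{\sigma}{r_{c}}\right)^{9} - \left(\frac{\sigma}{r_{c}}\right)^{3}\right]
```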

How to collate/combine output data

Each engine-specific project.py file will create simulation/analysis output; how should we combine all of these outputs?
There has been some discussion about using Zenodo: would we just upload each job directory (since the hashed name will be the same across systems) as a separate entry to Zenodo?

Create containers with our software stack

I think the most reproducible way to sample with our various engines would be to use a Docker/Singularity container. It would also make the installation of engines not available on conda (GOMC, MCCCS) much easier.

Changing the water and ethanol box sizes

In our meeting, we discussed and agreed on the following changes to the project:

  • Change the water box size to ~32 Å, around 1,000 waters.
  • Reduce the number of ethanol molecules from 700 to 500.

I made the exact changes to the molecule counts and box sizes in the Excel sheet (box_sizes tab). In summary, they are:

  • Water box side length = 32.07 Å, with 1,100 water molecules
  • Ethanol box side length = 36.46 Å, with 500 ethanol molecules

Feel free to change these numbers slightly, but the MC folks, or at least I, like to keep a round number of molecules, since the MC steps are a multiple of the number of molecules. I prefer multiples of 100; for example, 1029 molecules yields awkward output steps.
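
As a sanity check, the quoted water box size follows from the molecule count and an assumed liquid density of ~0.998 g/cm³:

```python
N = 1100               # water molecules
M = 18.015             # g/mol, molar mass of water
rho = 0.998            # g/cm^3, assumed liquid density near ambient conditions
N_A = 6.02214076e23    # molecules/mol

volume = N * M / (rho * N_A)        # box volume in cm^3
side = volume ** (1.0 / 3.0) * 1e8  # cm -> Angstrom

print(f"{side:.2f} Angstrom")       # ~32.07, matching the spreadsheet
```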

How do we want to access processed data from simulation trajectories?

We need a way to store processed data (for now, RDFs, densities, etc.) in this repo for further analysis. The trajectory information must be easily accessible for each simulation engine, but none of the trajectory data itself should be pushed to this repository, due to data storage concerns.

My first thought was to commit the workspace folder to the git repo but keep all of the trajectory files in the .gitignore, so only the processed files would be tracked by git. We could also have an entirely separate directory into which the processed information is copied and made accessible to the analysis routines; that could be organized by job, by simulation engine, or even by molecule. Anyone have any suggestions?

Variations in density depending on calculation method

I ran into some confusion with the tail correction where I was getting different densities depending on how the density is calculated; see the snippet below for an example. Which method is the most correct?

Code snippet (uses #137):

```python
import numpy as np
import unyt as u
from pymbar.timeseries import subsampleCorrelatedData as subsample

import reproducibility_project.src.analysis.equilibration as eq
from reproducibility_project.src.engines.hoomd.project import clean_data

# logfile points at the attached log-npt.txt from the job below.
system_mass = 14436.0 * u.amu  # job.sp.mass * u.amu * job.sp.N_liquid
data = np.genfromtxt(logfile, names=True)
data = clean_data(data)
volume = data["volume"] * u.nm**3

# Method 1: average the instantaneous densities, <m/V>.
density = np.mean(system_mass / volume).to("g/cm**3")
# Method 2: divide by the average volume, m/<V>.
density2 = (system_mass / np.mean(volume)).to("g/cm**3")

iseq, _, _, _ = eq.is_equilibrated(volume)
if iseq:
    uncorr, i, g, N = eq.trim_non_equilibrated(volume)
    indices = subsample(uncorr, g=g, conservative=True)
    pymbar_volume = volume[indices]
    pymbar_density = (system_mass / np.mean(pymbar_volume)).to("g/cm**3")
    pymbar_density2 = np.mean(system_mass / pymbar_volume).to("g/cm**3")
print(pymbar_density, pymbar_density2, density, density2)
```

Output:

```
0.3598699603928832 g/cm**3 0.3748876935629535 g/cm**3 0.37496358408470365 g/cm**3 0.3604007295859862 g/cm**3
```
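
One reason the estimators must differ (a general statistics fact, not specific to this snippet): since $1/V$ is convex, Jensen's inequality gives

```math
\left\langle \frac{m}{V} \right\rangle \;\ge\; \frac{m}{\langle V \rangle}
```

so the averaged-ratio estimates (`density`, `pymbar_density2`) sit systematically above the ratio-of-averages estimates (`density2`, `pymbar_density`), with a gap that grows with the volume fluctuations. Which one to report depends on which ensemble average we want to compare against.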

Attached logfile from job 6bb57f05676b7dc09c020295463f8846 in lrc_shift_subproject
log-npt.txt

Sampling rates

What sampling rates do we want to use for our properties and trajectories in our simulations?

NOTE: Suggestions for MD

  • trajectory_sample_rate: 10000 timesteps
  • property_sample_rate: 1000 timesteps

Just a generic first guess, but the sooner we decide on this the better. I think @bc118 mentioned that the MC folks have already defined values for their sims.

Equilibration and sampling

How should we decide that our systems are equilibrated and decorrelated?

  • Add pymbar to environment
  • Create a function that works for MD (using pymbar's timeseries module on the potential energy should work; see the sketch below)
  • Find a solution for equilibration/decorrelation times for MC
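
A minimal sketch of the MD case (assumes pymbar 3.x naming, where the functions are `detectEquilibration` and `subsampleCorrelatedData`; the log filename and column name are placeholders):

```python
import numpy as np
from pymbar import timeseries

# 1-D potential energy series read from a standardized log file.
potential_energy = np.genfromtxt("log.txt", names=True)["potential_energy"]

# t0: first equilibrated frame, g: statistical inefficiency, Neff: effective samples.
t0, g, Neff = timeseries.detectEquilibration(potential_energy)

# Discard the non-equilibrated region, then subsample to decorrelated frames.
equilibrated = potential_energy[t0:]
indices = timeseries.subsampleCorrelatedData(equilibrated, g=g)
decorrelated = equilibrated[indices]
```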

How to handle jobs that don't pass `is_equilibrated` after completion.

As we're making progress on the post-simulation analysis workflow, one of the added features is testing whether a simulation is considered equilibrated given a threshold (#55, #56). Right now, if a job fails that test, an error is raised but nothing else happens. We should discuss how to handle restarting a job in this scenario.

Trajectory Name For Diffusion Calculations

Looking at the calculations for the RDF and for diffusion, both apply their analysis to the default trajectory name trajectory.gsd. However, the diffusion calculation needs to be applied to an NVT production simulation, whereas the RDF has no such stipulation. Unless the plan is to do all of our analysis on the NVT simulation instead of the "production" NPT simulation, we may want to allow a keyword to be passed to these analysis routines to make sure the correct trajectory is grabbed. It might also be nice to add naming-convention guidelines to the project so we all stay relatively consistent.

Here are links to the relevant lines of code:

```python
rdf = _gsd_rdf(job.fn("trajectory.gsd"), frames, stride, bins, r_min, r_max)

msd, timesteps = _gsd_msd(job.fn("trajectory.gsd"), skip, stride, unwrapped)
```
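
A minimal sketch of that keyword (the parameter name `trajectory_filename` is hypothetical; `_gsd_rdf` is the existing helper linked above):

```python
def rdf_analysis(job, frames, stride, bins, r_min, r_max,
                 trajectory_filename="trajectory.gsd"):
    # Hypothetical keyword so each analysis can target the right run, e.g.
    # "trajectory-npt.gsd" for RDFs and "trajectory-nvt.gsd" for diffusion.
    return _gsd_rdf(job.fn(trajectory_filename), frames, stride, bins, r_min, r_max)
```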

Updates to SPE for Foyer forcefield parameters

According to discussion with @ramanishsingh, the parameters used for the MCCCS components were taken directly from the TraPPE website rather than from the Foyer force field whenever a molecule uses the TraPPE-UA model (methane, benzene, pentane). Due to ~1e-4 truncation differences between epsilon parameters specified in K (TraPPE units) and in kJ/mol (Foyer units), the resulting force fields differ enough to show up in a single point energy calculation.

MethaneUA `_CH4` epsilon:

  • TraPPE original: 148.0 K
  • TraPPE Foyer: 1.23054 kJ/mol = 148.0018969973220 K

| Engine | FF | Energy (K) | Energy (kJ/mol) | Percentage diff w.r.t. LAMMPS |
| --- | --- | --- | --- | --- |
| LAMMPS | TraPPE_foyer | - | 536743.553800000 | 0 |
| MCCCS-MN | TraPPE_orig | 64554583.8347322 | 536736.674124320 | 0.001281744 |
| MCCCS-MN | TraPPE_foyer | 64555411.2663036 | 536743.553773189 | 4.99513E-09 |
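
A sketch of the round trip behind the discrepancy (the exact back-converted digits depend on which value of the gas constant each code uses; here R is scipy's CODATA value):

```python
from scipy.constants import R  # gas constant, J/(mol*K)

eps_K = 148.0                        # TraPPE-UA _CH4 epsilon in K
eps_kJmol = eps_K * R / 1000.0       # exact conversion, ~1.2305404675 kJ/mol
eps_truncated = round(eps_kJmol, 5)  # 1.23054, as stored in the Foyer XML

# Converting the truncated value back to K no longer yields exactly 148.0 K,
# which is what shows up in the single point energies above.
print(eps_truncated * 1000.0 / R)
```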

Analysis TO-DOs

  • Fully define energy breakdowns for each simulation engine
  • Pick/write RDF analysis code
  • Pick/write diffusion coefficient code

What constraints are needed for MD simulations?

I am running into some issues when using constraints = all-angles for a box of ethanol in GROMACS (with the LINCS algorithm). The simulation ran fine with constraints = all-bonds, so I am wondering which constraints are appropriate for the project, or for MD specifically.

Choose standard units.

We should decide on a standard unit system to stick with for the standardized output.
It makes sense to me to stick with Foyer's units.

| Quantity | Units |
| --- | --- |
| distance | nm |
| angle | radians |
| mass | amu |
| energy | kJ/mol |

Let's make sure there's nothing missing and get consensus. :)

Should we use `constrainmol` to handle bond constraints?

It might be reasonable to use constrainmol to solve for our coordinates, especially for our constrained molecules, as this will allow us to build our molecules, parametrize them, and then use the r_eq values to define their bond lengths.

https://github.com/rsdefever/constrainmol/blob/0baa4c25abf2d01ad4c3078a541b4fd943514121/constrainmol/constrainmol.py#L30-L36

I don't think this handles angle constraints, though.

Add forcefields to init

Right now the force fields are not filled into the job statepoints. Because the Forcefield object is not serializable, it makes the most sense to me to use either a name string or the path to an XML file; I lean toward the string. Then we could do something like:

```python
import foyer

if job.sp.forcefield == "OPLSAA":
    ff = foyer.forcefields.load_OPLSAA()
elif job.sp.forcefield == "TRAPPE":
    ff = foyer.Forcefield(forcefield_files="path/to/benzene_trappe-ua_like.xml")
```

Also, are we using the included TraPPE XML for benzene only?

Hydrogen bond analysis

During the March 7 gather.town meeting, we decided to conduct H-bond analysis (average per frame, and lifetime if possible) for the systems that have hydrogen bonds.
We can use MDAnalysis (https://docs.mdanalysis.org/2.0.0/documentation_pages/analysis/hydrogenbonds.html#input) or mdtraj (https://mdtraj.org/1.9.4/examples/hbonds.html) for the analysis.
This requires all engines to use the same topology file when saving their .gsd trajectories (it will also be helpful for the RDF analysis), since we will have to select the donor and acceptor atoms. The topology file can be generated by saving the filled_box produced by the construct_system function (https://github.com/ramanishsingh/reproducibility_study/blob/afecf059e5dfbfc140623ecb273b8d353d58971c/reproducibility_project/src/engines/mcccs/project.py#L627).
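
A minimal sketch using mdtraj's built-in criteria (filenames are placeholders; the .gsd trajectories would first need a topology mdtraj can read, e.g. generated from the saved compound):

```python
import mdtraj as md

# A shared topology across engines lets us identify donors/acceptors consistently.
traj = md.load("trajectory.dcd", top="topology.pdb")

# Hydrogen bonds present in at least 10% of frames (donor, H, acceptor triplets).
persistent = md.baker_hubbard(traj, freq=0.1)
print(f"{len(persistent)} persistent hydrogen bonds")

# Per-frame lists, for the average number of H-bonds per frame.
per_frame = md.wernet_nilsson(traj)
print(sum(len(frame) for frame in per_frame) / traj.n_frames)
```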

System parameters

Let's discuss the system parameters here. We've talked quite a bit about it on the calls, and a little bit in Slack, but an issue in the repo would probably be a better place to discuss this.

Maybe once it's settled someone can make a PR to update the README with the decided details.

Standardize matplotlib styles

We should include a matplotlibrc file so we do not need to worry as much about plot formatting. If anyone already has a nice rc file, I'd be happy to include something like that.
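
For example, the shared style could be applied at the top of each plotting script (the style file path is a placeholder; a matplotlibrc in the working directory is also picked up automatically):

```python
import matplotlib.pyplot as plt

# Apply the repo-wide style before any plotting calls.
plt.style.use("reproducibility_project/src/analysis/project.mplstyle")

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1], label="example")
ax.legend()
fig.savefig("example.png")
```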

Easy packages for Thermo Analysis

So for LAMMPS, the default is to write your data to a log file that records the setup of each simulation run and then the specified thermo data at each interval (currently I'm writing out step, temperature, pressure, total energy, kinetic energy, potential energy, and density). These files take a bit of parsing to grab the correct data for checking something like equilibration. Ryan DeFever has a really nice utility called Lammps Thermo that makes this process really simple.

I know GROMACS has a similar, extremely useful package: Panedr.

It looks like Panedr is available on conda-forge, but Lammps Thermo has to be pip-installed from a git clone. If possible, I would like to add both of these packages to the environment.yml file.
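
For example, Panedr turns a GROMACS .edr energy file into a pandas DataFrame (the filename and column names are placeholders; available columns depend on what GROMACS wrote):

```python
import panedr

# Parse the binary energy file into a time-indexed DataFrame.
df = panedr.edr_to_df("ener.edr")
print(df[["Potential", "Pressure", "Density"]].describe())
```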

NaN values from `is_equilibrated`?

My NPT sim is clearly not equilibrated at the beginning, and maybe my pressure coupling is too strong (guessing from the initial oscillations in volume; see the attached plots of volume vs. timestep), but the simulation volume eventually appears to equilibrate. I would think this kind of data would be fine for pymbar, but if the whole run is used, I get NaN values:
[attached: two plots of volume vs. timestep showing large initial oscillations before the volume settles]

It's odd, though, because if I chop off the first 50 values, is_equilibrated works as expected. What should the standard procedure be for choosing where to start this equilibration analysis? I'm including my logfile in case anyone wants to test for themselves.
log-npt.txt

Signac Flow Job Labels

We need to be consistent in labeling our jobs if we're going to have a single combined workspace. Speaking with @justinGilmer and @daico007, there are a few ways to address this:

  1. A simple method would be to use boolean statements that check whether files exist. For example:

```python
@Project.label
def lammps_applied_ff(job):
    return job.isfile("box.lammps") and (job.sp.simulation_engine == "lammps")
```

  2. A second option could be to use signac's label filtering syntax. I'm not an expert on that, so I'll leave it to someone else to explain how it might work, but it could also simplify things.
