mosdef-hub / reproducibility_study
Repo for data collection, discussion, etc. for a MoSDeF reproducibility study.
License: MIT License
For each statepoint, to ensure that we can get back the exact same mBuild Compound for future studies, we should also save out a JSON serialization of each mBuild Compound.
For example:
filled_system.save("starting_compound.json")
This will ensure that we preserve the hierarchy of the compounds, and any other information that is unique to mBuild.
This can happen at the same time as the other file I/O when writing out the systems for our engines.
Now that this repo is public, let's add continuous integration! I'm most familiar with GitHub Actions and Codecov, so if I work on this I will go with those unless anyone prefers a different service.
The signac structure is awesome, but a little higher level than I have used before--let's add instructions for folks who may not be familiar with how to initialize/run this project with signac.
From chat with @tcmoore3 and @justinGilmer:
Each engine-specific project.py file should have an additional operation: one that sanitizes the log (containing energy and other property information) and the trajectory (containing positions, box information, and possibly types and bonds) into a standard format (perhaps a txt or npy file).
This way, the analysis project can operate on the entire workspace and look for standard output files.
We should discuss what all information should be saved and what format to use.
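As a starting point for that discussion, here is a minimal sketch of what such a sanitizing step could look like: it renames engine-specific columns to a shared set of names and writes a whitespace-delimited text file. The function name `write_standard_log`, the output layout, and the shared column names are assumptions for illustration and are not part of the repo; the names on the left of the mapping are the usual LAMMPS thermo headers.

```python
from pathlib import Path

# Hypothetical mapping from LAMMPS thermo headers to shared column names.
LAMMPS_TO_STANDARD = {
    "Step": "timestep",
    "Temp": "temperature",
    "Press": "pressure",
    "PotEng": "potential_energy",
    "KinEng": "kinetic_energy",
    "Density": "density",
}


def write_standard_log(rows, column_map, out_path):
    """Write rows (a list of dicts keyed by engine column names) to a text file.

    Only columns that appear in column_map and in every row are kept,
    renamed to the shared names. Returns the list of written column names.
    """
    engine_cols = [c for c in column_map if all(c in row for row in rows)]
    header = " ".join(column_map[c] for c in engine_cols)
    lines = [header]
    for row in rows:
        lines.append(" ".join(str(row[c]) for c in engine_cols))
    Path(out_path).write_text("\n".join(lines) + "\n")
    return header.split()
```

Each engine would supply its own mapping, so the analysis project only ever sees the shared names.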
A table of molecules we plan to include in this study, as well as conditions. Edit as needed.
NOTE: This does not currently include any challenge molecules we discussed before.
Molecules | Forcefield | Conditions | Ensembles | Reasoning |
---|---|---|---|---|
United-Atom Methane | TraPPE-UA | 3 (T, rho), 3 (T,P) | NVT, NpT, GEMC-NVT | LJ only |
United-Atom Ethane | TraPPE-UA | 1 (T, rho), 1 (T,P) | NVT, NpT | Adds bonds |
United-Atom Propane | TraPPE-UA | 1 (T, rho), 1 (T,P) | NVT, NpT | Adds angles |
United-Atom Butane | TraPPE-UA | 1 (T, rho), 1 (T,P) | NVT, NpT, GEMC-NVT | Adds dihedrals |
United-Atom Hexane | TraPPE-UA | 3 (T, rho), 3 (T,P) | NVT, NpT | Adds dihedrals and angles |
SPC/E Water | SPC/E | 3 (T, rho), 3 (T,P) | NVT, NpT | Electrostatics |
Atomistic Ethanol | OPLS-AA | 3 (T, rho), 3 (T,P) | NVT, NpT | Adds dihedrals |
Subprojects: input files, README, etc. need to be updated or deleted.
Let's compile a list of the smoothing functions per engine to ensure we choose the same one across the study.
Edit: copied from the Google Sheet.
-- | Long-Range Correction (LRC) | Switch | Shift | Hard cutoff | XPLOR |
---|---|---|---|---|---|
GROMACS | | | | | |
LAMMPS | Yes | Yes | Yes | Yes | |
HOOMD | No | No | Yes | Yes | Yes |
GOMC | Yes | Yes | Yes | Yes | |
Cassandra | Yes | Yes | Yes | Yes | |
MCCCS-MN | Yes | No | Yes | Yes | |
Currently the function only reads a file named "log.txt".
This might not work for systems with two boxes, where we have something like "log_box1.txt", and "log_box2.txt".
A new optional argument, log_filename, can be introduced in the function.
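A minimal sketch of how that argument could look. The function names `read_log_lines` and `read_two_box_logs` are hypothetical stand-ins, since the real function is not shown here; only the `log_filename` keyword and the file names come from the description above.

```python
from pathlib import Path


def read_log_lines(job_dir, log_filename="log.txt"):
    """Return the non-empty lines of a log file inside job_dir.

    log_filename defaults to "log.txt" so existing single-box calls keep working.
    """
    log_path = Path(job_dir) / log_filename
    if not log_path.is_file():
        raise FileNotFoundError(f"No log file at {log_path}")
    return [line for line in log_path.read_text().splitlines() if line.strip()]


def read_two_box_logs(job_dir):
    """Read both box logs of a two-box job, keyed by file name."""
    return {
        name: read_log_lines(job_dir, log_filename=name)
        for name in ("log_box1.txt", "log_box2.txt")
    }
```

Keeping "log.txt" as the default means no existing caller has to change.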
Help everyone get set up on whichever cluster they use as well.
Given a molecule name (and/or engine name) delete all the statepoint folders from the workspace.
Helpful when something goes wrong with the simulation of one molecule type and the workspace needs to be cleared up just for that molecule.
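A rough sketch of such a cleanup helper, using only the standard library. It assumes each job folder in the workspace contains a `signac_statepoint.json` file, and that the state point has a `molecule` key (an assumed name) alongside `simulation_engine`. The function name `clear_statepoints` is hypothetical, and it defaults to a dry run so nothing is deleted by accident.

```python
import json
import shutil
from pathlib import Path


def clear_statepoints(workspace, molecule, engine=None, dry_run=True):
    """Delete job folders whose state point matches molecule (and engine, if given).

    Returns the list of matching folders; nothing is deleted while dry_run is True.
    """
    matches = []
    for job_dir in sorted(Path(workspace).iterdir()):
        sp_file = job_dir / "signac_statepoint.json"
        if not sp_file.is_file():
            continue  # not a job folder
        sp = json.loads(sp_file.read_text())
        if sp.get("molecule") != molecule:
            continue
        if engine is not None and sp.get("simulation_engine") != engine:
            continue
        matches.append(job_dir)
        if not dry_run:
            shutil.rmtree(job_dir)
    return matches
```

Inside the project itself, the same thing is probably cleaner through signac's own API (finding jobs by a state point filter and removing them); the version above only shows the selection logic.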
Boxes built on my Mac for methane come out with different initial atom coordinates, which yields different initial energies.
Putting this here instead of on the spreadsheet. We need to be consistent with our handling of the LJ tail correction. If we are going to use it, we will have to implement it in HOOMD-blue. I know LAMMPS and GROMACS have switches for handling the LJ tail correction, but I'm not sure about the other engines.
It might be helpful to verify our unit test coverage.
Each engine-specific project.py file will create simulation/analysis output, how should we combine all these outputs?
Some discussion about using Zenodo: would we just upload each job directory (as the hashed name will be the same across systems) as a separate entry to Zenodo?
I think the most reproducible way to sample with our various engines would be to use a Docker/Singularity container. It would also make installing the engines that are not available on conda (GOMC, MCCCS-MN) much easier.
In our meeting, we discussed and agreed on the following changes to the project:
I made the exact changes to the molecule count and box sizes in the Excel sheet, box_sizes tab. In summary, they are:
Feel free to change these numbers slightly, but the MC folks, or at least I, like to keep a rounded number of molecules, as the MC steps are a multiple of the number of molecules. I prefer rounding to the nearest 100. For example, 1029 molecules yield weird output steps.
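For anyone adjusting the numbers, a small sketch of the arithmetic: round the molecule count to the nearest 100, then rescale the cubic box edge so the number density stays the same. Both function names are hypothetical, and the 4.0 box edge is an arbitrary example value, not one from the spreadsheet.

```python
def round_molecule_count(n_molecules, multiple=100):
    """Round to the nearest multiple (halves round up), never below one multiple."""
    rounded = (n_molecules + multiple // 2) // multiple * multiple
    return max(rounded, multiple)


def rescale_box_length(box_length, n_old, n_new):
    """Scale a cubic box edge so the number density N / L**3 is unchanged."""
    return box_length * (n_new / n_old) ** (1.0 / 3.0)


n_new = round_molecule_count(1029)  # 1000
new_length = rescale_box_length(4.0, 1029, n_new)  # 4.0 is an arbitrary example edge
```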
We need a way to store processed data (for now RDFs, densities, etc.) and keep it in this repo for further analysis. The trajectory information should stay easily accessible for each simulation engine, but none of the trajectories themselves should be pushed to this repository, due to data storage concerns.
My first thought was to have the workspace folder be saved to the git repo, but keep all of the trajectory information in the .gitignore so only the processed files would be tracked by git. We could also just have an entire separate directory to copy the information over and be accessible to the analysis routines. That could also be separated by job or by simulation engine, or even by molecule. Anyone have any suggestions?
I ran into some confusion with the tail correction where I was getting different densities based on how the density is calculated--see snippet for example. Which method is the most correct?
Code snippet (uses #137):
import unyt as u
import numpy as np
from pymbar.timeseries import subsampleCorrelatedData as subsample
import reproducibility_project.src.analysis.equilibration as eq
from reproducibility_project.src.engines.hoomd.project import clean_data
system_mass = 14436.0 * u.amu # job.sp.mass * u.amu * job.sp.N_liquid
data = np.genfromtxt(logfile, names=True)
data = clean_data(data)
volume = data["volume"] * u.nm**3
density = np.mean(system_mass/volume).to("g/cm**3")
density2 = (system_mass/np.mean(volume)).to("g/cm**3")
iseq, _, _, _ = eq.is_equilibrated(volume)
if iseq:
    uncorr, i, g, N = eq.trim_non_equilibrated(volume)
    indices = subsample(uncorr, g=g, conservative=True)
    pymbar_volume = volume[indices]
    pymbar_density = (system_mass/np.mean(pymbar_volume)).to("g/cm**3")
    pymbar_density2 = np.mean(system_mass/pymbar_volume).to("g/cm**3")
print(pymbar_density, pymbar_density2, density, density2)
0.3598699603928832 g/cm**3 0.3748876935629535 g/cm**3 0.37496358408470365 g/cm**3 0.3604007295859862 g/cm**3
Attached logfile from job 6bb57f05676b7dc09c020295463f8846 in lrc_shift_subproject
log-npt.txt
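One note on why the two estimates differ at all: the mean of a ratio is not the ratio of the means. Because 1/V is convex for V > 0, Jensen's inequality gives mean(m/V) >= m/mean(V), and to second order the relative gap is about Var(V)/mean(V)**2, so it grows with the volume fluctuations. That matches the ordering of the numbers printed above, though it does not by itself settle which estimator we should adopt. A minimal sketch with made-up volumes (not taken from the attached log):

```python
from statistics import mean

mass = 14436.0  # amu, the system mass used in the snippet above
# Made-up volumes (nm**3) with large fluctuations, for illustration only.
volumes = [50.0, 60.0, 70.0, 90.0, 130.0]

mean_of_ratio = mean(mass / v for v in volumes)  # like np.mean(system_mass / volume)
ratio_of_mean = mass / mean(volumes)             # like system_mass / np.mean(volume)

# Jensen's inequality: 1/V is convex for V > 0, so this gap is never negative.
gap = mean_of_ratio - ratio_of_mean
print(mean_of_ratio, ratio_of_mean, gap)
```

If the early, unequilibrated frames with large volume oscillations are included, the gap between the two estimates should be larger than for a well-equilibrated window.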
What sampling rates do we want to use for our properties and trajectories in our simulations?
trajectory_sample_rate: 10000 timesteps
property_sample_rate: 1000 timesteps
Just a generic guess at first, but the sooner we decide on this the better. I think @bc118 mentioned that the MC folks have defined values for their sims already.
How should we decide that our systems are equilibrated and decorrelated?
As we're making progress on some of the post-simulation analysis workflow, one of the added features is testing whether a simulation is considered equilibrated given a threshold (#55, #56). Right now, if it fails that test an error is raised, but nothing else happens. We should probably discuss how to handle restarting a job in this scenario.
Per discussion with @CalCraven and @ramanishsingh, we think it's best that we pin down all the software used in this project (in the environment.yml).
Looking at the calculations for RDF and for diffusion, both run their analysis on a default trajectory name of trajectory.gsd. However, the diffusion calculation needs to be applied to an NVT production simulation, whereas the RDF does not have that stipulation. Unless the plan is to also do all of our analysis from the NVT simulation, instead of the "production" simulation that is in NPT, we might want to allow a keyword to be passed to these analysis routines to make sure the correct trajectory is grabbed. It also might be nice to add some naming-convention guidelines to the project so we all stay relatively consistent.
Here are links to the relevant lines of code:
According to discussion with @ramanishsingh, the parameters used for the MCCCS components were taken directly from the TraPPE website instead of the Foyer forcefield when the molecule uses the TraPPE-UA model (methane, benzene, pentane). Due to minor 1e-4 truncation differences between epsilon parameters specified in K (TraPPE units) and in kJ/mol (Foyer units), the resulting forcefields are different enough to show differences in a single point energy calculation.
MethaneUA _CH4 epsilon:
TraPPE original is specified as 148.0 K
TraPPE Foyer = 1.23054 kJ/mol = 148.0018969973220 K
Engine | FF | Energy (K) | Energy (kJ/mol) | Percentage diff w.r.t. LAMMPS |
---|---|---|---|---|
LAMMPS | TraPPE_foyer | - | 536743.553800000 | 0 |
MCCCS-MN | TraPPE_orig | 64554583.8347322 | 536736.674124320 | 0.001281744 |
MCCCS-MN | TraPPE_foyer | 64555411.2663036 | 536743.553773189 | 4.99513E-09 |
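For reference, the last column can be reproduced from the kJ/mol energies, assuming it is defined as |E - E_LAMMPS| / |E_LAMMPS| x 100 (the definition is my assumption; the helper name `percent_diff` is hypothetical):

```python
def percent_diff(value, reference):
    """Absolute percentage difference of value with respect to reference."""
    return abs(value - reference) / abs(reference) * 100.0


lammps_energy = 536743.553800000  # kJ/mol, LAMMPS with TraPPE_foyer
mcccs_orig = 536736.674124320     # kJ/mol, MCCCS-MN with TraPPE_orig
mcccs_foyer = 536743.553773189    # kJ/mol, MCCCS-MN with TraPPE_foyer

print(percent_diff(mcccs_orig, lammps_energy))
print(percent_diff(mcccs_foyer, lammps_energy))
```

With the Foyer epsilon, the remaining difference is close to the floating-point resolution of energies of this magnitude.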
I am running into some issues when using constraints = all-angles for a box of ethanol in GROMACS (with the LINCS algorithm). The simulation ran fine with constraints = all-bonds, so I am wondering which constraints are appropriate for the project, or for MD specifically.
We should decide on a standard unit system to stick with for standard output.
It makes sense to me to stick with foyer units.
Quantity | Units |
---|---|
distance | nm |
angle | radians |
mass | amu |
energy | kJ/mol |
Let's make sure there's nothing missing and get consensus. :)
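To make the table concrete, here is a small sketch of converting values from an engine that reports angstroms, degrees, and kcal/mol (as LAMMPS does with `units real`) into the units proposed above. The function name `to_standard_units` and the example inputs are illustrative assumptions; mass needs no conversion in that case, since g/mol and amu are numerically the same.

```python
import math

KCAL_TO_KJ = 4.184      # thermochemical calorie, exact by definition
ANGSTROM_TO_NM = 0.1


def to_standard_units(distance_angstrom, angle_degrees, energy_kcal_per_mol):
    """Convert one (distance, angle, energy) triple to nm, radians, kJ/mol."""
    return (
        distance_angstrom * ANGSTROM_TO_NM,
        math.radians(angle_degrees),
        energy_kcal_per_mol * KCAL_TO_KJ,
    )


# Arbitrary example values, not parameters from any forcefield in this study.
distance_nm, angle_rad, energy_kj = to_standard_units(1.54, 114.0, 1.0)
```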
It might be reasonable to use constrainmol to solve for our coordinates, especially for our constrained molecules, as this will allow us to build up our molecules, parametrize them, and then use their r_eq values to define their bond lengths.
I don't think this handles angle constraints, though.
Right now the forcefields are not filled in the job statepoints. Because the Forcefield object is not serializable, it makes the most sense to me to use either a string or the path to an xml file. I lean towards string. Then we could do something like
import foyer

# Pick the forcefield from the string stored in the statepoint.
if job.sp.forcefield == "OPLSAA":
    ff = foyer.forcefields.load_OPLSAA()
elif job.sp.forcefield == "TRAPPE":
    ff = foyer.Forcefield(forcefield_files="path/to/benzene_trappe-ua_like.xml")
Also, are we using the included TraPPE xml for benzene only?
During the March 7 gather.town meeting, we decided to conduct H-bond analysis (average per frame, and lifetime if possible) for the systems that have hydrogen bonds.
We can use MDAnalysis (https://docs.mdanalysis.org/2.0.0/documentation_pages/analysis/hydrogenbonds.html#input) or mdtraj (https://mdtraj.org/1.9.4/examples/hbonds.html) for conducting the analysis.
This requires all the engines to use the same topology file when saving their .gsd trajectories (this will also be helpful for the RDF analysis), as we will have to select the donor and acceptor atoms. The topology file can be generated by saving the filled_box generated using the construct_system function (https://github.com/ramanishsingh/reproducibility_study/blob/afecf059e5dfbfc140623ecb273b8d353d58971c/reproducibility_project/src/engines/mcccs/project.py#L627).
We need to add the following (MD can ignore the vapor data):
Let's discuss the system parameters here. We've talked quite a bit about it on the calls, and a little bit in Slack, but an issue in the repo would probably be a better place to discuss this.
Maybe once it's settled someone can make a PR to update the README with the decided details.
We should include a matplotlibrc file so we do not need to worry as much about the plot formatting. If anyone has a nice rc file already, I'd be happy to include something like that.
For LAMMPS, the default is to write a log file that contains information about setting up each simulation run, followed by the specified thermo data at each interval (currently, I'm writing out step, temp, press, total energy, kinetic energy, potential energy, and density). These files take a bit of parsing to grab the correct data for checking something like equilibration. Ryan DeFever has a really nice utility called Lammps Thermo that makes this process really simple.
I know GROMACS has a similar, extremely useful package: Panedr.
It looks like Panedr is available on conda-forge, but Lammps Thermo has to be pip installed from a git clone. If possible, I would like to add both of these packages to the environment.yml file.
My NPT sim is clearly not equilibrated at the beginning and maybe my pressure coupling is too high (guessing from the initial oscillations in volume, see attached plots of volume vs timestep), but eventually it appears that the simulation volume equilibrates. I would think that this kind of data would be fine for pymbar, but if the whole run is used, I get NaN values:
It's odd, though, because if I chop off the first 50 values, is_equilibrated works as expected. What should be the standard procedure for choosing where to start this equilibration analysis? I'm including my logfile in case anyone wants to test for themselves.
log-npt.txt
We need to be consistent with labeling our jobs if we're going to have a singular combined workspace. Speaking with @justinGilmer and @daico007, there might be a few ways to address this:
@Project.label
def lammps_applied_ff(job):
    return job.isfile("box.lammps") and (job.sp.simulation_engine == "lammps")