mosdef-hub / reproducibility_study

Repo for data collection, discussion, etc for a MoSDeF reproducibility study.

License: MIT License

Languages: Python 41.41%, Shell 1.25%, Jinja 0.25%, Roff 0.44%, Jupyter Notebook 56.64%

reproducibility_study's Issues

Include Compound JSON files in each statepoint

To ensure that we can recover the exact same mBuild Compound for future studies, we should also save a JSON serialization of the compound at each statepoint.

For example:

```python
filled_system.save("starting_compound.json")
```

This will ensure that we preserve the hierarchy of the compounds and any other information that is unique to mBuild.

This can happen at the same time as the other file I/O when writing out the systems for our engines.
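
A minimal sketch of the round trip (assuming mBuild's JSON support via `save`/`load`; the SMILES stand-in for the filled box is hypothetical):

```python
import mbuild as mb

# Stand-in for the filled box built elsewhere in init/project.
filled_system = mb.load("CCO", smiles=True)  # hypothetical example molecule

# Serialize the full Compound hierarchy (names, labels, bonds) to JSON.
filled_system.save("starting_compound.json", overwrite=True)

# In a later study, reload an identical Compound.
restored = mb.load("starting_compound.json")
```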

Add continuous integration

Now that this repo is public, let's add continuous integration! I'm most familiar with GitHub Actions and Codecov, so if I work on this I will go with those unless anyone has a preference for a different service.

Add user instructions for how to run init/project

The signac structure is awesome, but a bit more abstract than what I have used before. Let's add instructions for folks who may not be familiar with how to initialize and run this project with signac.

Sanitize simulation outputs

From chat with @tcmoore3 and @justinGilmer:
Each engine-specific project.py file should have an additional operation: one that sanitizes the log (containing energy and other property information) and the trajectory (containing positions, box information, and possibly types and bonds) into a standard format (perhaps a txt or npy file).
This way, the analysis project can operate on the entire workspace and look for standard output files.

We should discuss what all information should be saved and what format to use.
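
A minimal sketch of what such an operation could look like (the operation name, filenames, and column set are placeholders; each engine would supply its own column mapping):

```python
import numpy as np
from flow import FlowProject


class Project(FlowProject):
    pass


@Project.operation
@Project.pre.isfile("log-raw.txt")     # engine-specific log; name is a placeholder
@Project.post.isfile("log-clean.txt")  # standardized output consumed by analysis
def sanitize_output(job):
    """Map engine-specific log columns onto an agreed standard set."""
    raw = np.genfromtxt(job.fn("log-raw.txt"), names=True)
    columns = ["step", "potential_energy", "temperature", "pressure", "volume"]
    clean = np.column_stack([raw[name] for name in columns])
    np.savetxt(job.fn("log-clean.txt"), clean, header=" ".join(columns))
```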

Molecules to study

A table of molecules we plan to include in this study, as well as conditions. Edit as needed.

NOTE: This does not currently include any challenge molecules we discussed before.

| Molecule | Force field | Conditions | Ensembles | Reasoning |
| --- | --- | --- | --- | --- |
| United-Atom Methane | TraPPE-UA | 3 (T, rho), 3 (T, P) | NVT, NpT, GEMC-NVT | LJ only |
| United-Atom Ethane | TraPPE-UA | 1 (T, rho), 1 (T, P) | NVT, NpT | Adds bonds |
| United-Atom Propane | TraPPE-UA | 1 (T, rho), 1 (T, P) | NVT, NpT | Adds angles |
| United-Atom Butane | TraPPE-UA | 1 (T, rho), 1 (T, P) | NVT, NpT, GEMC-NVT | Adds dihedrals |
| United-Atom Hexane | TraPPE-UA | 3 (T, rho), 3 (T, P) | NVT, NpT | Adds dihedrals and angles |
| SPC/E Water | SPC/E | 3 (T, rho), 3 (T, P) | NVT, NpT | Electrostatics |
| Atomistic Ethanol | OPLS-AA | 3 (T, rho), 3 (T, P) | NVT, NpT | Adds dihedrals |

Engine specific smoothing functions?

Let's compile a list of the smoothing functions per engine to ensure we choose the same one across the study.

Edit: copied from the Google sheet.

| Engine | Long-Range Correction (LRC) | Switch | Shift | Hard cutoff | XPLOR |
| --- | --- | --- | --- | --- | --- |
| GROMACS | | | | | |
| LAMMPS | Yes | Yes | Yes | Yes | |
| HOOMD | No | No | Yes | Yes | Yes |
| GOMC | Yes | Yes | Yes | Yes | |
| Cassandra | Yes | Yes | Yes | Yes | |
| MCCCS-MN | Yes | No | Yes | Yes | |

Option to change log filename in `plot_job_property_with_t0`

Currently the function only reads a file named "log.txt".
This might not work for systems with two boxes, where we have something like "log_box1.txt" and "log_box2.txt".
A new optional argument, log_filename, can be introduced in the function.
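
A minimal sketch of the change (the surrounding signature is assumed; only the `log_filename` keyword is the suggested addition):

```python
import numpy as np


def plot_job_property_with_t0(job, prop, log_filename="log.txt"):
    """Plot `prop` vs. step, marking the detected start of equilibration."""
    # The default preserves current behavior; two-box GEMC jobs can pass
    # log_filename="log_box1.txt" or "log_box2.txt".
    data = np.genfromtxt(job.fn(log_filename), names=True)
    # ... existing plotting logic, unchanged ...
```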

Selectively delete workspace folder based on molecule

Given a molecule name (and/or engine name), delete all the corresponding statepoint folders from the workspace.

Helpful when something goes wrong with the simulation of one molecule type and the workspace needs to be cleaned up just for that molecule.
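
A minimal sketch using signac's job filtering (statepoint key names like `molecule` and `engine` are assumptions about the schema):

```python
import signac

project = signac.get_project()

# Remove every statepoint folder matching one molecule (and optionally one engine).
for job in project.find_jobs({"molecule": "methaneUA", "engine": "lammps"}):
    job.remove()
```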

LJ tail correction

Putting this here instead of on the spreadsheet. We need to be consistent with our handling of the LJ tail correction. If we are going to use it, we will have to implement it in HOOMD-blue. I know LAMMPS and GROMACS have switches for handling the LJ tail correction, but I'm not sure about the other engines.
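
For reference, the standard analytic tail correction to the energy for a homogeneous single-component LJ fluid truncated at $r_c$ (the textbook expression, e.g. Allen & Tildesley):

```math
U_{\mathrm{tail}} = \frac{8\pi N \rho \epsilon \sigma^{3}}{3}
\left[\frac{1}{3}\left(\frac{\sigma}{r_{c}}\right)^{9} - \left(\frac{\sigma}{r_{c}}\right)^{3}\right]
```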

How to collate/combine output data

Each engine-specific project.py file will create simulation/analysis output; how should we combine all of these outputs?
There has been some discussion about using Zenodo: would we just upload each job directory (since the hashed name will be the same across systems) as a separate entry to Zenodo?

Create containers with our software stack

I think the most reproducible way to sample with our various engines would be to use a Docker/Singularity container. It would also make the installation of engines not available on conda (GOMC, MCCCS) much easier.

Changing the water and ethanol box sizes

In our meeting, we discussed and agreed on the following changes to the project:

  • Change the water box size to ~32 Å, around 1,000 waters.
  • Reduce the number of ethanol molecules from 700 to 500.

I made the exact changes to the molecule counts and box sizes in the Excel sheet (box_sizes tab). In summary, they are:

  • Water box side length = 32.07 Å, with 1,100 water molecules
  • Ethanol box side length = 36.46 Å, with 500 ethanol molecules

Feel free to change these numbers slightly, but the MC folks, or at least I, like to keep a round number of molecules, since the MC steps are a multiple of the number of molecules. I prefer multiples of 100; for example, 1029 molecules yields awkward output steps.
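
As a sanity check, the quoted water box size follows from the molecule count and an assumed liquid density of ~0.998 g/cm³:

```python
N = 1100               # water molecules
M = 18.015             # g/mol, molar mass of water
rho = 0.998            # g/cm^3, assumed liquid density near ambient conditions
N_A = 6.02214076e23    # molecules/mol

volume = N * M / (rho * N_A)        # box volume in cm^3
side = volume ** (1.0 / 3.0) * 1e8  # cm -> Angstrom

print(f"{side:.2f} Angstrom")       # ~32.07, matching the spreadsheet
```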

How do we want to access processed data from simulation trajectories?

We need a way to store processed data (for now, RDFs, densities, etc.) in this repo for further analysis. The trajectory information must be easily accessible for each simulation engine, but none of the trajectory data itself should be pushed to this repository, due to data storage concerns.

My first thought was to commit the workspace folder to the git repo but keep all of the trajectory files in the .gitignore, so only the processed files would be tracked by git. We could also have an entirely separate directory into which the processed information is copied and made accessible to the analysis routines; that could be organized by job, by simulation engine, or even by molecule. Anyone have any suggestions?

Variations in density depending on calculation method

I ran into some confusion with the tail correction where I was getting different densities depending on how the density is calculated; see the snippet below for an example. Which method is the most correct?

Code snippet (uses #137):

```python
import numpy as np
import unyt as u
from pymbar.timeseries import subsampleCorrelatedData as subsample

import reproducibility_project.src.analysis.equilibration as eq
from reproducibility_project.src.engines.hoomd.project import clean_data

# logfile points at the attached log-npt.txt from the job below.
system_mass = 14436.0 * u.amu  # job.sp.mass * u.amu * job.sp.N_liquid
data = np.genfromtxt(logfile, names=True)
data = clean_data(data)
volume = data["volume"] * u.nm**3

# Method 1: average the instantaneous densities, <m/V>.
density = np.mean(system_mass / volume).to("g/cm**3")
# Method 2: divide by the average volume, m/<V>.
density2 = (system_mass / np.mean(volume)).to("g/cm**3")

iseq, _, _, _ = eq.is_equilibrated(volume)
if iseq:
    uncorr, i, g, N = eq.trim_non_equilibrated(volume)
    indices = subsample(uncorr, g=g, conservative=True)
    pymbar_volume = volume[indices]
    pymbar_density = (system_mass / np.mean(pymbar_volume)).to("g/cm**3")
    pymbar_density2 = np.mean(system_mass / pymbar_volume).to("g/cm**3")
print(pymbar_density, pymbar_density2, density, density2)
```

Output:

```
0.3598699603928832 g/cm**3 0.3748876935629535 g/cm**3 0.37496358408470365 g/cm**3 0.3604007295859862 g/cm**3
```
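
One reason the estimators must differ (a general statistics fact, not specific to this snippet): since $1/V$ is convex, Jensen's inequality gives

```math
\left\langle \frac{m}{V} \right\rangle \;\ge\; \frac{m}{\langle V \rangle}
```

so the averaged-ratio estimates (`density`, `pymbar_density2`) sit systematically above the ratio-of-averages estimates (`density2`, `pymbar_density`), with a gap that grows with the volume fluctuations. Which one to report depends on which ensemble average we want to compare against.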

Attached logfile from job 6bb57f05676b7dc09c020295463f8846 in lrc_shift_subproject
log-npt.txt

Sampling rates

What sampling rates do we want to use for our properties and trajectories in our simulations?

NOTE: Suggestions for MD

  • trajectory_sample_rate: 10000 timesteps
  • property_sample_rate: 1000 timesteps

Just a generic first guess, but the sooner we decide on this the better. I think @bc118 mentioned that the MC folks have already defined values for their sims.

Equilibration and sampling

How should we decide that our systems are equilibrated and decorrelated?

  • Add pymbar to environment
  • Create a function that works for MD (using pymbar's timeseries module on the potential energy should work; see the sketch below)
  • Find a solution for equilibration/decorrelation times for MC
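
A minimal sketch of the MD case (assumes pymbar 3.x naming, where the functions are `detectEquilibration` and `subsampleCorrelatedData`; the log filename and column name are placeholders):

```python
import numpy as np
from pymbar import timeseries

# 1-D potential energy series read from a standardized log file.
potential_energy = np.genfromtxt("log.txt", names=True)["potential_energy"]

# t0: first equilibrated frame, g: statistical inefficiency, Neff: effective samples.
t0, g, Neff = timeseries.detectEquilibration(potential_energy)

# Discard the non-equilibrated region, then subsample to decorrelated frames.
equilibrated = potential_energy[t0:]
indices = timeseries.subsampleCorrelatedData(equilibrated, g=g)
decorrelated = equilibrated[indices]
```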

How to handle jobs that don't pass `is_equilibrated` after completion.

As we're making progress on the post-simulation analysis workflow, one of the added features is testing whether a simulation is considered equilibrated given a threshold (#55, #56). Right now, if a job fails that test, an error is raised but nothing else happens. We should discuss how to handle restarting a job in this scenario.

Trajectory Name For Diffusion Calculations

Looking at the calculations for the RDF and for diffusion, both apply their analysis to the default trajectory name trajectory.gsd. However, the diffusion calculation needs to be applied to an NVT production simulation, whereas the RDF has no such stipulation. Unless the plan is to do all of our analysis on the NVT simulation instead of the "production" NPT simulation, we may want to allow a keyword to be passed to these analysis routines to make sure the correct trajectory is grabbed. It might also be nice to add naming-convention guidelines to the project so we all stay relatively consistent.

Here are links to the relevant lines of code:

```python
rdf = _gsd_rdf(job.fn("trajectory.gsd"), frames, stride, bins, r_min, r_max)

msd, timesteps = _gsd_msd(job.fn("trajectory.gsd"), skip, stride, unwrapped)
```
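
A minimal sketch of that keyword (the parameter name `trajectory_filename` is hypothetical; `_gsd_rdf` is the existing helper linked above):

```python
def rdf_analysis(job, frames, stride, bins, r_min, r_max,
                 trajectory_filename="trajectory.gsd"):
    # Hypothetical keyword so each analysis can target the right run, e.g.
    # "trajectory-npt.gsd" for RDFs and "trajectory-nvt.gsd" for diffusion.
    return _gsd_rdf(job.fn(trajectory_filename), frames, stride, bins, r_min, r_max)
```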

Updates to SPE for Foyer forcefield parameters

According to discussion with @ramanishsingh, the parameters used for the MCCCS components were taken directly from the TraPPE website rather than from the Foyer force field whenever a molecule uses the TraPPE-UA model (methane, benzene, pentane). Due to ~1e-4 truncation differences between epsilon parameters specified in K (TraPPE units) and in kJ/mol (Foyer units), the resulting force fields differ enough to show up in a single point energy calculation.

MethaneUA `_CH4` epsilon:

  • TraPPE original: 148.0 K
  • TraPPE Foyer: 1.23054 kJ/mol = 148.0018969973220 K

| Engine | FF | Energy (K) | Energy (kJ/mol) | Percentage diff w.r.t. LAMMPS |
| --- | --- | --- | --- | --- |
| LAMMPS | TraPPE_foyer | - | 536743.553800000 | 0 |
| MCCCS-MN | TraPPE_orig | 64554583.8347322 | 536736.674124320 | 0.001281744 |
| MCCCS-MN | TraPPE_foyer | 64555411.2663036 | 536743.553773189 | 4.99513E-09 |
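
A sketch of the round trip behind the discrepancy (the exact back-converted digits depend on which value of the gas constant each code uses; here R is scipy's CODATA value):

```python
from scipy.constants import R  # gas constant, J/(mol*K)

eps_K = 148.0                        # TraPPE-UA _CH4 epsilon in K
eps_kJmol = eps_K * R / 1000.0       # exact conversion, ~1.2305404675 kJ/mol
eps_truncated = round(eps_kJmol, 5)  # 1.23054, as stored in the Foyer XML

# Converting the truncated value back to K no longer yields exactly 148.0 K,
# which is what shows up in the single point energies above.
print(eps_truncated * 1000.0 / R)
```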

Analysis TO-DOs

  • Fully define energy breakdowns for each simulation engine
  • Pick/write RDF analysis code
  • Pick/write diffusion coefficient code

What constraints are needed for MD simulations?

I am running into some issues when using constraints = all-angles for a box of ethanol in GROMACS (with the LINCS algorithm). The simulation ran fine with constraints = all-bonds, so I am wondering which constraints are appropriate for the project, or for MD specifically.

Choose standard units.

We should decide on a standard unit system to stick with for the standardized output.
It makes sense to me to stick with Foyer's units.

| Quantity | Units |
| --- | --- |
| distance | nm |
| angle | radians |
| mass | amu |
| energy | kJ/mol |

Let's make sure there's nothing missing and get consensus. :)

Should we use `constrainmol` to handle bond constraints?

It might be reasonable to use constrainmol to solve for our coordinates, especially for our constrained molecules, as this will allow us to build our molecules, parametrize them, and then use the r_eq values to define their bond lengths.

https://github.com/rsdefever/constrainmol/blob/0baa4c25abf2d01ad4c3078a541b4fd943514121/constrainmol/constrainmol.py#L30-L36

I don't think this handles angle constraints, though.

Add forcefields to init

Right now the force fields are not filled into the job statepoints. Because the Forcefield object is not serializable, it makes the most sense to me to use either a name string or the path to an XML file; I lean toward the string. Then we could do something like:

```python
import foyer

if job.sp.forcefield == "OPLSAA":
    ff = foyer.forcefields.load_OPLSAA()
elif job.sp.forcefield == "TRAPPE":
    ff = foyer.Forcefield(forcefield_files="path/to/benzene_trappe-ua_like.xml")
```

Also, are we using the included TraPPE XML for benzene only?

Hydrogen bond analysis

During the March 7 gather.town meeting, we decided to conduct H-bond analysis (average per frame, and lifetime if possible) for the systems that have hydrogen bonds.
We can use MDAnalysis (https://docs.mdanalysis.org/2.0.0/documentation_pages/analysis/hydrogenbonds.html#input) or mdtraj (https://mdtraj.org/1.9.4/examples/hbonds.html) for the analysis.
This requires all engines to use the same topology file when saving their .gsd trajectories (it will also be helpful for the RDF analysis), since we will have to select the donor and acceptor atoms. The topology file can be generated by saving the filled_box produced by the construct_system function (https://github.com/ramanishsingh/reproducibility_study/blob/afecf059e5dfbfc140623ecb273b8d353d58971c/reproducibility_project/src/engines/mcccs/project.py#L627).
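
A minimal sketch using mdtraj's built-in criteria (filenames are placeholders; the .gsd trajectories would first need a topology mdtraj can read, e.g. generated from the saved compound):

```python
import mdtraj as md

# A shared topology across engines lets us identify donors/acceptors consistently.
traj = md.load("trajectory.dcd", top="topology.pdb")

# Hydrogen bonds present in at least 10% of frames (donor, H, acceptor triplets).
persistent = md.baker_hubbard(traj, freq=0.1)
print(f"{len(persistent)} persistent hydrogen bonds")

# Per-frame lists, for the average number of H-bonds per frame.
per_frame = md.wernet_nilsson(traj)
print(sum(len(frame) for frame in per_frame) / traj.n_frames)
```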

System parameters

Let's discuss the system parameters here. We've talked quite a bit about it on the calls, and a little bit in Slack, but an issue in the repo would probably be a better place to discuss this.

Maybe once it's settled someone can make a PR to update the README with the decided details.

Standardize matplotlib styles

We should include a matplotlibrc file so we do not need to worry as much about plot formatting. If anyone already has a nice rc file, I'd be happy to include something like that.
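
For example, the shared style could be applied at the top of each plotting script (the style file path is a placeholder; a matplotlibrc in the working directory is also picked up automatically):

```python
import matplotlib.pyplot as plt

# Apply the repo-wide style before any plotting calls.
plt.style.use("reproducibility_project/src/analysis/project.mplstyle")

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1], label="example")
ax.legend()
fig.savefig("example.png")
```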

Easy packages for Thermo Analysis

So for LAMMPS, the default is to write your data to a log file that records the setup of each simulation run and then the specified thermo data at each interval (currently I'm writing out step, temperature, pressure, total energy, kinetic energy, potential energy, and density). These files take a bit of parsing to grab the correct data for checking something like equilibration. Ryan DeFever has a really nice utility called Lammps Thermo that makes this process really simple.

I know GROMACS has a similar, extremely useful package: Panedr.

It looks like Panedr is available on conda-forge, but Lammps Thermo has to be pip-installed from a git clone. If possible, I would like to add both of these packages to the environment.yml file.
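
For example, Panedr turns a GROMACS .edr energy file into a pandas DataFrame (the filename and column names are placeholders; available columns depend on what GROMACS wrote):

```python
import panedr

# Parse the binary energy file into a time-indexed DataFrame.
df = panedr.edr_to_df("ener.edr")
print(df[["Potential", "Pressure", "Density"]].describe())
```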

NaN values from `is_equilibrated`?

My NPT sim is clearly not equilibrated at the beginning, and maybe my pressure coupling is too strong (guessing from the initial oscillations in volume; see the attached plots of volume vs. timestep), but the simulation volume eventually appears to equilibrate. I would think this kind of data would be fine for pymbar, but if the whole run is used, I get NaN values:
[attached: two plots of volume vs. timestep showing large initial oscillations before the volume settles]

It's odd, though, because if I chop off the first 50 values, is_equilibrated works as expected. What should the standard procedure be for choosing where to start this equilibration analysis? I'm including my logfile in case anyone wants to test for themselves.
log-npt.txt

Signac Flow Job Labels

We need to be consistent in labeling our jobs if we're going to have a single combined workspace. Speaking with @justinGilmer and @daico007, there are a few ways to address this:

  1. A simple method would be to use boolean statements that check whether files exist. For example:

```python
@Project.label
def lammps_applied_ff(job):
    return job.isfile("box.lammps") and (job.sp.simulation_engine == "lammps")
```

  2. A second option could be to use signac's label filtering syntax. I'm not an expert on that, so I'll leave it to someone else to explain how it might work, but it could also simplify things.
