
msmbuilder / msmbuilder


Statistical models for biomolecular dynamics

Home Page: http://msmbuilder.org

License: GNU Lesser General Public License v2.1

Python 87.03% Shell 0.23% C++ 4.86% C 7.79% Batchfile 0.01% HTML 0.10%
molecular-dynamics python msmbuilder dimensionality-reduction hmm feature-extraction tica pca markov-model clustering

msmbuilder's Introduction

MSMBuilder


MSMBuilder is a Python package that implements a series of statistical models for high-dimensional time series. It is particularly focused on the analysis of atomistic simulations of biomolecular dynamics. For example, MSMBuilder has been used to model protein folding and conformational change from molecular dynamics (MD) simulations. MSMBuilder is available under the LGPL (v2.1 or later).

Capabilities include:

  • Feature extraction into dihedrals, contact maps, and more.
  • Geometric clustering with a variety of algorithms.
  • Dimensionality reduction using time-structure based independent component analysis (tICA) and principal component analysis (PCA).
  • Markov state model (MSM) construction.
  • Rate-matrix MSM construction.
  • Hidden Markov model (HMM) construction.
  • Timescale and transition path analysis.

Check out the documentation at msmbuilder.org and join the mailing list. For a broader overview of MSMBuilder, take a look at our slide deck.

Installation

The preferred installation mechanism for msmbuilder is with conda:

$ conda install -c omnia msmbuilder

If you don't have conda, or are new to scientific python, we recommend that you download the Anaconda scientific python distribution.

Workflow

An example workflow might be as follows:

  1. Set up a system for molecular dynamics, and run one or more simulations for as long as you can on as many CPUs or GPUs as you have access to. There are a lot of great software packages for running MD, e.g. OpenMM, Gromacs, Amber, CHARMM, and many others. MSMBuilder is not one of them.

  2. Transform your MD coordinates into an appropriate set of features.

  3. Perform some sort of dimensionality reduction with tICA or PCA. Reduce your data into discrete states by using clustering.

  4. Fit an MSM, rate matrix MSM, or HMM. Perform model selection using cross-validation with the generalized matrix Rayleigh quotient.
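
As a rough illustration of the MSM part of step 4, the sketch below estimates a transition matrix by counting transitions at a fixed lag time and normalizing each row. It is plain Python, not MSMBuilder's API: the function name `estimate_transition_matrix` and the toy state sequences are assumptions made for this example, and production estimators (for instance reversible maximum likelihood) are more involved.

```python
def estimate_transition_matrix(sequences, n_states, lag_time=1):
    """Row-normalized transition counts at the given lag time."""
    counts = [[0.0] * n_states for _ in range(n_states)]
    for seq in sequences:
        # Pair each frame with the frame lag_time steps later.
        for i, j in zip(seq[:-lag_time], seq[lag_time:]):
            counts[i][j] += 1.0
    transmat = []
    for row in counts:
        total = sum(row)
        if total:
            transmat.append([c / total for c in row])
        else:
            # No observed outgoing transitions: fall back to a uniform row.
            transmat.append([1.0 / n_states] * n_states)
    return transmat

# Toy discrete trajectories, e.g. cluster labels per frame (made up).
trajs = [[0, 0, 1, 1, 0, 0], [1, 1, 1, 0]]
T = estimate_transition_matrix(trajs, n_states=2)
```

Counting each trajectory separately, as done here, avoids creating spurious transitions between the end of one trajectory and the start of the next.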

msmbuilder's People

Contributors

brookee, brookehus, cing, cxhernandez, dr-nate, gkiss, hwaymentsteele, jeiros, kyleabeauchamp, maxentile, mpharrigan, msultan, nhstanley, peastman, pfrstg, rbharath, rmcgibbo, robertarbon, schwancr, skearnes, smsaladi, sunhwan, synapticarbors


msmbuilder's Issues

Implied Timescales

Have you looked at implied timescales for any full protein systems?

PS: feel free to close and respond via email if this is too early to discuss in public forum.

ImportError: No module named trajectory

Traceback (most recent call last):
  File "/home/cxh/anaconda/bin/hmsm", line 5, in <module>
    pkg_resources.run_script('mixtape==0.2', 'hmsm')
  File "/home/cxh/anaconda/lib/python2.7/site-packages/setuptools-2.2-py2.7.egg/pkg_resources.py", line 488, in run_script
  File "/home/cxh/anaconda/lib/python2.7/site-packages/setuptools-2.2-py2.7.egg/pkg_resources.py", line 1354, in run_script
  File "/home/cxh/anaconda/lib/python2.7/site-packages/mixtape-0.2-py2.7-linux-x86_64.egg/EGG-INFO/scripts/hmsm", line 8, in <module>
    app.start()
  File "/home/cxh/anaconda/lib/python2.7/site-packages/mixtape-0.2-py2.7-linux-x86_64.egg/mixtape/cmdline.py", line 220, in start
    instance = klass(args)
  File "/home/cxh/anaconda/lib/python2.7/site-packages/mixtape-0.2-py2.7-linux-x86_64.egg/mixtape/commands/fitghmm.py", line 118, in __init__
    self.featurizer = mixtape.featurizer.load(args.featurizer)
  File "/home/cxh/anaconda/lib/python2.7/site-packages/mixtape-0.2-py2.7-linux-x86_64.egg/mixtape/featurizer.py", line 82, in load
    featurizer = cPickle.load(f)
ImportError: No module named trajectory

Having trouble finding where that import is called though...

Add command for Viterbi path

The command would take an HMM and a single trajectory as input, and write out the Viterbi assignments to an h5 or .dat file.
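
For reference, a minimal sketch of the Viterbi recursion such a command would rely on. It is written in plain Python and works on precomputed log probabilities; the function name and the three input arrays are assumptions for illustration, not mixtape's interface. For a Gaussian HMM, `log_emit[t][k]` would be the log density of frame t under state k.

```python
import math

def viterbi(log_start, log_trans, log_emit):
    """Most likely state path.

    log_start[k]    : log probability of starting in state k
    log_trans[j][k] : log probability of moving from state j to state k
    log_emit[t][k]  : log likelihood of frame t under state k
    """
    n_states = len(log_start)
    score = [log_start[k] + log_emit[0][k] for k in range(n_states)]
    backptr = []
    for t in range(1, len(log_emit)):
        prev, new_score = [], []
        for k in range(n_states):
            # Best predecessor of state k at time t.
            best_j = max(range(n_states), key=lambda j: score[j] + log_trans[j][k])
            prev.append(best_j)
            new_score.append(score[best_j] + log_trans[best_j][k] + log_emit[t][k])
        backptr.append(prev)
        score = new_score
    # Trace back from the best final state.
    path = [max(range(n_states), key=lambda k: score[k])]
    for prev in reversed(backptr):
        path.append(prev[path[-1]])
    return path[::-1]

log = math.log
start = [log(0.5), log(0.5)]
trans = [[log(0.9), log(0.1)], [log(0.1), log(0.9)]]
emit = [[log(0.9), log(0.1)], [log(0.8), log(0.2)],
        [log(0.1), log(0.9)], [log(0.2), log(0.8)]]
path = viterbi(start, trans, emit)
```

Working in log space avoids the underflow that products of many small probabilities would cause on long trajectories.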

Atom indices issue

kyleb@kb-intel:~/dat/kinase/hmsm$ hmsm means-ghmm --filename hmms.jsonlines --lag-time 1 --n-states 3  --dir Trajectories/ --top system.subset.pdb --ext h5 -d AtomPairs.dat -o out.csv
None
Traceback (most recent call last):
  File "/home/kyleb/opt/bin/hmsm", line 5, in <module>
    pkg_resources.run_script('mixtape==0.1', 'hmsm')
  File "/home/kyleb/opt/lib/python2.7/site-packages/pkg_resources.py", line 505, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/kyleb/opt/lib/python2.7/site-packages/pkg_resources.py", line 1245, in run_script
    execfile(script_filename, namespace, namespace)
  File "/home/kyleb/opt/lib/python2.7/site-packages/mixtape-0.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/hmsm", line 8, in <module>
    app.start()
  File "/home/kyleb/opt/lib/python2.7/site-packages/mixtape-0.1-py2.7-linux-x86_64.egg/mixtape/cmdline.py", line 174, in start
    instance = klass(args)
  File "/home/kyleb/opt/lib/python2.7/site-packages/mixtape-0.1-py2.7-linux-x86_64.egg/mixtape/commands/sample.py", line 112, in __init__
    self.atom_indices = np.loadtxt(args.atom_indices, dtype=int, ndmin=1)
  File "/home/kyleb/opt/lib/python2.7/site-packages/numpy/lib/npyio.py", line 719, in loadtxt
    raise ValueError('fname must be a string, file handle, or generator')
ValueError: fname must be a string, file handle, or generator

It seems that args.atom_indices is actually None here, presumably because I'm using pairwise distances instead.
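
One way to guard against this, sketched under the assumption that the command should accept either atom indices or distance pairs: only read the index file when a path was actually supplied. `load_optional_indices` is a hypothetical helper, not part of mixtape, and it uses plain file parsing as a stand-in for `np.loadtxt` so that the example has no dependencies.

```python
import os
import tempfile

def load_optional_indices(path):
    """Return a list of ints read from path, or None when no path was given."""
    if path is None:
        return None
    with open(path) as f:
        return [int(tok) for line in f for tok in line.split()]

# Demo with a temporary file standing in for AtomIndices.dat.
with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False) as tmp:
    tmp.write("0\n4\n7\n")
indices = load_optional_indices(tmp.name)
os.remove(tmp.name)

# When the flag was not passed, the argument is None and nothing is loaded.
missing = load_optional_indices(None)
```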

sampled-ghmm bug

kyleb@kb-intel:~/dat/kinase/hmsm$ hmsm sample-ghmm --filename hmms.jsonlines --lag-time 1 --n-per-state 10 --n-states 3 --dir Trajectories --top system.subset.pdb --ext h5 -a AtomIndices.dat -o out.csv
AtomIndices.dat
Namespace(atom_indices='AtomIndices.dat', dir='Trajectories', distance_pairs=None, ext='h5', filename='hmms.jsonlines', lag_time=1, n_per_state=10, n_states=3, out='out.csv', top='system.subset.pdb')
loading all data...
done loading
Traceback (most recent call last):
  File "/home/kyleb/opt/bin/hmsm", line 5, in <module>
    pkg_resources.run_script('mixtape==0.1', 'hmsm')
  File "/home/kyleb/opt/lib/python2.7/site-packages/pkg_resources.py", line 505, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/kyleb/opt/lib/python2.7/site-packages/pkg_resources.py", line 1245, in run_script
    execfile(script_filename, namespace, namespace)
  File "/home/kyleb/opt/lib/python2.7/site-packages/mixtape-0.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/hmsm", line 8, in <module>
    app.start()
  File "/home/kyleb/opt/lib/python2.7/site-packages/mixtape-0.1-py2.7-linux-x86_64.egg/mixtape/cmdline.py", line 175, in start
    instance.start()
  File "/home/kyleb/opt/lib/python2.7/site-packages/mixtape-0.1-py2.7-linux-x86_64.egg/mixtape/commands/sample.py", line 127, in start
    weights = discrete_approx_mvn(xx, self.model['means'][k], self.model['vars'][k])
  File "/home/kyleb/opt/lib/python2.7/site-packages/mixtape-0.1-py2.7-linux-x86_64.egg/mixtape/discrete_approx.py", line 149, in discrete_approx_mvn
    np.random.randn(len(moments))) < 1e-5
AssertionError

Also, it seemed like this was pretty slow. Is the centroid-grabbing expected to be faster?

Means bug

kyleb@kb-intel:~/dat/kinase/hmsm$ hmsm  means-ghmm --filename hmms.jsonlines --lag-time 1 --n-states 3 -o out.csv --dir Trajectories --top system.subset.pdb --ext h5 -a AtomIndices.dat 
Traceback (most recent call last):
  File "/home/kyleb/opt/bin/hmsm", line 5, in <module>
    pkg_resources.run_script('mixtape==0.1', 'hmsm')
  File "/home/kyleb/opt/lib/python2.7/site-packages/pkg_resources.py", line 505, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/kyleb/opt/lib/python2.7/site-packages/pkg_resources.py", line 1245, in run_script
    execfile(script_filename, namespace, namespace)
  File "/home/kyleb/opt/lib/python2.7/site-packages/mixtape-0.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/hmsm", line 8, in <module>
    app.start()
  File "/home/kyleb/opt/lib/python2.7/site-packages/mixtape-0.1-py2.7-linux-x86_64.egg/mixtape/cmdline.py", line 175, in start
    instance.start()
  File "/home/kyleb/opt/lib/python2.7/site-packages/mixtape-0.1-py2.7-linux-x86_64.egg/mixtape/commands/pullmeans.py", line 62, in start
    logprob = log_multivariate_normal_density(xx, self.model['means'],
NameError: global name 'log_multivariate_normal_density' is not defined

Better helpstring for cmd arguments with lists of ints

  -k N_STATES [N_STATES ...], --n-states N_STATES [N_STATES ...]
                        Number of states in the models. Default = [2,]
  -l LAG_TIMES [LAG_TIMES ...], --lag-times LAG_TIMES [LAG_TIMES ...]
                        Lag time(s) of the model(s). Default = [1,]

IMHO, the comma in the helpstring suggests using comma-delimited inputs, whereas it actually wants space delim.
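
A possible rewording, as a sketch with `argparse`: state the delimiter explicitly and show an example invocation in the help text. The flag names follow the help output above; the exact help wording and the `prog` string are only suggestions.

```python
import argparse

parser = argparse.ArgumentParser(prog="hmsm fit-ghmm")
parser.add_argument(
    "-k", "--n-states", type=int, nargs="+", default=[2],
    help="Number of states in the models, as space-separated integers "
         "(e.g. -k 2 3 4). Default: 2",
)
parser.add_argument(
    "-l", "--lag-times", type=int, nargs="+", default=[1],
    help="Lag time(s) of the model(s), as space-separated integers "
         "(e.g. -l 1 10 100). Default: 1",
)

# nargs="+" collects every following integer into a list.
args = parser.parse_args(["-k", "2", "3", "4", "-l", "100"])
```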

Minor bugs in setup.py

1) pjoin in setup.py should be changed to os.path.join.
2) It should check that the Cython version is greater than 0.16.

Commands

  1. FitEM
  2. Unpack
  3. Sample
  4. PullMeans

hmsm means-ghmm ignores '--n-per-state' flag ...

… and pulls only a single structure from each state.
Edit: on closer inspection, the structures appear to be pulled from the same state.

hmsm sample-ghmm, on the other hand, works like a charm.

box info / previous imaging lost when running pull_structures.py

This is typically not an issue for protein folding, but can become one in catalysis and conformational change applications when you care about ligand molecules, co-factors, ions, and waters in addition to the protein dynamics.

I raise this issue with mixtape, because that's where I've first noticed the problem.
In all fairness the problem could lie within mdconvert or with MSMBuilder's ConvertDataToHDF.py - I haven't checked their output yet.

My specific steps:

    1. Amber12 .netcdf trajectories of protein with explicit solvent.
    2. .binpos trajectories with only the x closest waters around a specified set of residues.
    3. Convert the .binpos files from step 2 into .xtc with mdconvert.
    4. Use ConvertDataToHDF.py to get .lh5 files.
    5. Run mixtape on the output of step 4 and generate a set of HMMs.
    6. Extract structures from all of the states with pull_structures.py.
    7. Visualize them with PyMol:
      => waters are scattered between several neighboring periodic boxes.
      => this means the center of mass in each frame is a different one, which throws off the alignment that's happening as part of pull_structures.py

HMM MFPTs

We should be able to compute mean first passage times in the HMM between points in phase space.
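
As a starting point, here is a sketch of the standard MFPT calculation between discrete states of a transition matrix: for a target state j, solve m_i = 1 + sum over k != j of T[i][k] * m_k for all i != j. The issue asks for MFPTs between points in phase space, which would need an extra mapping from points to states that is not shown; the function name and the toy matrix are assumptions for illustration.

```python
def mean_first_passage_times(transmat, target):
    """MFPT (in units of the lag time) from every state to `target`."""
    n = len(transmat)
    others = [i for i in range(n) if i != target]
    size = len(others)
    # Augmented system (I - Q) m = 1, with Q = T restricted to non-target states.
    aug = [[(1.0 if i == k else 0.0) - transmat[i][k] for k in others] + [1.0]
           for i in others]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(size):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    mfpt = [0.0] * n  # the MFPT from the target to itself is defined as 0 here
    for row, i in enumerate(others):
        mfpt[i] = aug[row][size] / aug[row][row]
    return mfpt

T = [[0.9, 0.1, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.1, 0.9]]
mfpt_to_2 = mean_first_passage_times(T, target=2)
```

For this toy chain the two equations are 0.1*m0 - 0.1*m1 = 1 and -0.1*m0 + 0.2*m1 = 1, which give m1 = 20 and m0 = 30 lag times.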

Install error

I got the following running setup.py

Installed /home/harrigan/opt/python2.7/lib/python2.7/site-packages/mixtape-0.2-py2.7-linux-x86_64.egg
Processing dependencies for mixtape==0.2
Searching for cvxopt>=1.1.5
Reading http://pypi.python.org/simple/cvxopt/
Best match: cvxopt 1.1.6
Downloading https://pypi.python.org/packages/source/c/cvxopt/cvxopt-1.1.6.tar.gz#md5=190ff21beba1c27eb6a9673cfc2ba1a3
Processing cvxopt-1.1.6.tar.gz
Writing /tmp/easy_install-BfejGj/cvxopt-1.1.6/setup.cfg
Running cvxopt-1.1.6/setup.py -q bdist_egg --dist-dir /tmp/easy_install-BfejGj/cvxopt-1.1.6/egg-dist-tmp-jEdQ48
error: Setup script exited with error: SandboxViolation: os.open('/tmp/tmpUm9T8C/EPU3wK', 131266, 384) {}

The package setup script has attempted to modify files on your system
that are not within the EasyInstall build area, and has been aborted.

This package cannot be safely installed by EasyInstall, and may not
support alternate installation locations even if you run its setup
script by hand.  Please inform the package's author and the EasyInstall
maintainers to find out if a fix or workaround is available.

It's probably my fault, but it told me to inform the package's author :)

Reduce memory usage

Obviously this is low priority for now, but things like pullmeans are memory hogs when they could be loading things sequentially, a la iterload.

Of course, the fix for this might wait until Christian implements a fancy new featurizer.
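
To illustrate the sequential approach, the sketch below keeps only the n best-scoring frames while consuming data chunk by chunk, so memory use does not grow with trajectory length. `fake_chunks` is a made-up stand-in for a chunked reader such as iterload, and `score` stands in for whatever per-frame quantity pullmeans ranks by; neither is mixtape code.

```python
import heapq

def top_n_streaming(chunks, score, n):
    """Keep the n highest-scoring (score, frame_index) pairs without holding all frames."""
    best = []    # min-heap of (score, global_frame_index)
    offset = 0
    for chunk in chunks:
        for local_i, frame in enumerate(chunk):
            item = (score(frame), offset + local_i)
            if len(best) < n:
                heapq.heappush(best, item)
            else:
                # Push the new item and drop the lowest-scoring entry,
                # so the heap always holds the n best seen so far.
                heapq.heappushpop(best, item)
        offset += len(chunk)
    return sorted(best, reverse=True)

def fake_chunks():
    # Stand-in for a chunked trajectory reader; yields lists of "frames".
    yield [0.1, 0.9, 0.3]
    yield [0.7, 0.2]
    yield [0.95]

closest = top_n_streaming(fake_chunks(), score=lambda x: x, n=2)
```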

Nans only with cuda

-bash-4.1$ hmsm fit-em -d AtomPairs.dat  --platform cuda --dir Trajectories --ext h5
Loading data into memory + vectorization: 5.954751 s
Fitting with 10 timeseries from 10 trajectories with 14010 total observations
{'fusion_prior': 0.01, 'n_states': 2, 'platform': 'cuda', 'reversible_type': 'mle', 'thresh': 0.01, 'n_em_iter': 100, 'n_lqa_iter': 10, 'n_features': 666}
/cbio/jclab/home/kyleb/opt/lib/python2.7/site-packages/mixtape-0.1-py2.7-linux-x86_64.egg/mixtape/ghmm.py:181: UserWarning: cuda do_estep: transounts contains NaNs
  curr_logprob, stats = self._impl.do_estep()
/cbio/jclab/home/kyleb/opt/lib/python2.7/site-packages/mixtape-0.1-py2.7-linux-x86_64.egg/mixtape/ghmm.py:388: UserWarning: cuda do_estep: transounts contains NaNs
  logprob, _ = self._impl.do_estep()
Nonfinite numbers in transmat !!

The output json file has Nans everywhere. However, when I use the CPU platform, things appear fine. Any thoughts? Here is my cmd line:

hmsm fit-em -d AtomPairs.dat  --platform cuda --dir Trajectories --ext h5

optimization.

@rbharath: I added working CPU code in C/C++, and it should be pretty easy to use with OpenMP threads. I haven't written the python wrappers yet, so it's not "hooked in" atm. I'm working on the CUDA port now.

hmsm means-ghmm instability produces MemoryError and ValueError

/g/g90/kiss2/Progs/Python2.7/bin/python2.7 /g/g90/kiss2/mixtape-master/scripts/hmsm means-ghmm -i ../hmms.jsonlines --featurizer ../features_rmsd.dat --n-states 7 --n-per-state 20 --lag-time 100 -o rmsd_HMM_7states_20struct.csv --dir "../../Trajectories" --ext lh5 --top ../../2g2z_TetInt_TIP3P_300Wat.pdb
# Most often gives a MemoryError
Traceback (most recent call last):
  File "/g/g90/kiss2/mixtape-master/scripts/hmsm", line 8, in <module>
    app.start()
  File "/g/g90/kiss2/Progs/Python2.7/lib/python2.7/site-packages/mixtape-0.1-py2.7-linux-x86_64.egg/mixtape/cmdline.py", line 219, in start
    instance.start()
  File "/g/g90/kiss2/Progs/Python2.7/lib/python2.7/site-packages/mixtape-0.1-py2.7-linux-x86_64.egg/mixtape/commands/pullmeans.py", line 73, in start
    features, ii, ff = mixtape.featurizer.featurize_all(self.filenames, featurizer, self.topology)
  File "/g/g90/kiss2/Progs/Python2.7/lib/python2.7/site-packages/mixtape-0.1-py2.7-linux-x86_64.egg/mixtape/featurizer.py", line 13, in featurize_all
    x = featurizer.featurize(t)
  File "/g/g90/kiss2/Progs/Python2.7/lib/python2.7/site-packages/mixtape-0.1-py2.7-linux-x86_64.egg/mixtape/featurizer.py", line 49, in featurize
    traj.superpose(self.reference_traj, atom_indices=self.atom_indices)
  File "/g/g90/kiss2/Progs/Python2.7/lib/python2.7/site-packages/mdtraj-0.7.0-py2.7-linux-x86_64.egg/mdtraj/trajectory.py", line 785, in superpose
    self.xyz = self_displace_xyz + ref_offset
MemoryError
# But on occasion also a ValueError:
Traceback (most recent call last):
  File "/g/g90/kiss2/mixtape-master/scripts/hmsm", line 8, in <module>
    app.start()
  File "/g/g90/kiss2/Progs/Python2.7/lib/python2.7/site-packages/mixtape-0.1-py2.7-linux-x86_64.egg/mixtape/cmdline.py", line 219, in start
    instance.start()
  File "/g/g90/kiss2/Progs/Python2.7/lib/python2.7/site-packages/mixtape-0.1-py2.7-linux-x86_64.egg/mixtape/commands/pullmeans.py", line 95, in start
    df = pd.DataFrame(data)
  File "/g/g90/kiss2/Progs/Python2.7/lib/python2.7/site-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 201, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/g/g90/kiss2/Progs/Python2.7/lib/python2.7/site-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 323, in _init_dict
    dtype=dtype)
  File "/g/g90/kiss2/Progs/Python2.7/lib/python2.7/site-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 4463, in _arrays_to_mgr
    index = extract_index(arrays)
  File "/g/g90/kiss2/Progs/Python2.7/lib/python2.7/site-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 4511, in extract_index
    raise ValueError('arrays must all be same length')
ValueError: arrays must all be same length

Version Check for Sklearn

sklearn.cluster changed the name of its parameter for the number of clusters from "k" to "n_clusters". It's probably a good idea to add a version check for sklearn in setup.py so that this and other random bugs don't show up.
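
A minimal sketch of such a check. The helper names are hypothetical and the minimum version used in the demo is a placeholder, since the issue does not say which release renamed the parameter. In setup.py the installed string would come from `sklearn.__version__`.

```python
def parse_version(text):
    """Turn '0.14.1' into (0, 14, 1); stops at the first non-numeric component."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)

def require_version(name, installed, minimum):
    """Raise a clear error when an installed version is older than required."""
    if parse_version(installed) < parse_version(minimum):
        raise RuntimeError(
            "%s >= %s is required, found %s" % (name, minimum, installed))

# Placeholder version numbers; passes silently because 0.14.1 >= 0.13.
require_version("scikit-learn", "0.14.1", "0.13")
```

Comparing tuples of integers rather than raw strings matters here: as strings, "0.9" would sort after "0.13".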

example gives error

Thanks for making your HMM implementation public! I had a quick look only. Running the provided example I'm getting the following error.

In [1]: np.version
Out[1]: '1.6.1'

Reading day=1
Reading day=2
Reading day=3
Reading day=4
Reading day=5
Reading day=6
Reading day=7
Reading day=8
Reading day=9
Reading day=10
Reading day=11
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/<ipython-input-2-f74a882d89fb> in <module>()
     41 n_components = 3
     42 model = VonMisesHMM(n_components, n_iter=1000)
---> 43 model.fit([X])
     44 hidden_states = model.predict(X)
     45 

/usr/lib/python2.7/dist-packages/sklearn/hmm.pyc in fit(self, obs)
    441 
    442             # Maximization step

--> 443             self._do_mstep(stats, self.params)
    444 
    445         return self

/lib/python2.7/site-packages/vmhmm-0.1-py2.7-linux-x86_64.egg/vmhmm.pyc in _do_mstep(self, stats, params)
    256         if 'k' in params:
    257             invkappa = self._fitinvkappa(posteriors, obs, self._means_)
--> 258             self._kappas_ = inverse_mbessel_ratio(invkappa)
    259 
    260 

/lib/python2.7/site-packages/vmhmm-0.1-py2.7-linux-x86_64.egg/vmhmm.pyc in __call__(self, y)
    304     def __call__(self, y):
    305         if not self._is_fit:
--> 306             self._fit()
    307 
    308         y = np.asarray(y)

/lib/python2.7/site-packages/vmhmm-0.1-py2.7-linux-x86_64.egg/vmhmm.pyc in _fit(self)
    298 
    299         self._xj = xj
--> 300         self._cvals = cvals[:, 0]
    301         self._k = k
    302         self._is_fit = True

IndexError: invalid index

structures: atomic coordinate scramble gives 'big bang' in a box

Commands:

    • Create rmsd-based CSV file that pulls 20 structures closest the mean of each state's distribution:
    /g/g90/kiss2/Progs/Python2.7/bin/python2.7 /g/g90/kiss2/mixtape-master/scripts/hmsm means-ghmm -i ../hmms.jsonlines --featurizer ../features_rmsd.dat --n-states 7 --n-per-state 20 --lag-time 100 -o 
rmsd_HMM_7states_20struct.csv --dir "../../Trajectories" --ext lh5 --top ../../2g2z_TetInt_TIP3P_300Wat.pdb
  1. - Pull structures:
    /g/g90/kiss2/Progs/Python2.7/bin/python2.7 /g/g90/kiss2/mixtape-master/scripts/hmsm structures --ext pdb --top ../../2g2z_TetInt_TIP3P_300Wat.pdb rmsd_HMM_7states_20struct.csv

The first two lines of the resulting .pdb file (state0):

ATOM      1  N   GLN A   0    -166.060 284.310 443.470  1.00  0.00           N  
ATOM      2  H1  GLN A   0     322.940-487.690-410.530  1.00  0.00           H  

Which looks a little bit something like this (State0; box diagonal on the order of 1000 Angstrom):
[image: bbiab_2]

or this (State1; box diagonal ~100 Angstrom):
[image: bbiab_1]

or this (State 5; box diagonal 1.6 Angstrom):
[image: bbiab_2_diag_1 6a]

All three structures are from the same 7-state HMM.

featurization bug (fixed)

@rbharath fixed a bug that may or may not have been somewhat serious during the featurization, which is relevant for hmsm sample-ghmm. The bugfix is the changes to Mixtape.featurizer in 6b01107.

The bug was introduced by me in 55bef1f, which was merged about a month ago according to github.

instability example: TypeError when running fit-ghmm

Description:

  • This error occurs on occasion.

  • When I run it again, things are fine.

    head -n 1 hmms.jsonlines
    /g/g90/kiss2/mixtape-master/scripts/hmsm fit-ghmm --featurizer features_rmsd.dat -k 2 3 4 5 6 7 8 -l 100 --dir ../Trajectories --ext lh5 --top ../2g2z_TetInt_TIP3P_300Wat.pdb

Traceback (most recent call last):
  File "/g/g90/kiss2/mixtape-master/scripts/hmsm", line 8, in <module>
    app.start()
  File "/g/g90/kiss2/Progs/Python2.7/lib/python2.7/site-packages/mixtape-0.1-py2.7-linux-x86_64.egg/mixtape/cmdline.py", line 219, in start
    instance.start()
  File "/g/g90/kiss2/Progs/Python2.7/lib/python2.7/site-packages/mixtape-0.1-py2.7-linux-x86_64.egg/mixtape/commands/fitghmm.py", line 129, in start
    self.fit(subsampled, subsampled, n_states, lag_time, 0, args, outfile)
  File "/g/g90/kiss2/Progs/Python2.7/lib/python2.7/site-packages/mixtape-0.1-py2.7-linux-x86_64.egg/mixtape/commands/fitghmm.py", line 173, in fit
    json.dump(result, outfile)
  File "/g/g90/kiss2/Progs/Python2.7/lib/python2.7/json/__init__.py", line 181, in dump
    for chunk in iterable:
  File "/g/g90/kiss2/Progs/Python2.7/lib/python2.7/json/encoder.py", line 428, in _iterencode
    for chunk in _iterencode_dict(o, _current_indent_level):
  File "/g/g90/kiss2/Progs/Python2.7/lib/python2.7/json/encoder.py", line 402, in _iterencode_dict
    for chunk in chunks:
  File "/g/g90/kiss2/Progs/Python2.7/lib/python2.7/json/encoder.py", line 326, in _iterencode_list
    for chunk in chunks:
  File "/g/g90/kiss2/Progs/Python2.7/lib/python2.7/json/encoder.py", line 436, in _iterencode
    o = _default(o)
  File "/g/g90/kiss2/Progs/Python2.7/lib/python2.7/json/encoder.py", line 178, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: (31.019121170043945+0j) is not JSON serializable
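
One possible workaround, sketched under the assumption that the offending values are complex numbers whose imaginary part is zero (as in the `(31.019121170043945+0j)` above): give the JSON encoder a `default` hook that converts them to floats. The hook name and the `"timescales"` key are made up for this example; where the complex value originates in mixtape is not established here.

```python
import json

def to_jsonable(obj):
    """Fallback for json.dump: convert complex values that are numerically real."""
    if isinstance(obj, complex):
        if abs(obj.imag) < 1e-10:
            return obj.real
        raise TypeError("complex value with nonzero imaginary part: %r" % (obj,))
    raise TypeError(repr(obj) + " is not JSON serializable")

# Hypothetical result record containing one complex-typed value.
result = {"timescales": [(31.019121170043945 + 0j), 12.5]}
line = json.dumps(result, default=to_jsonable)
```

Raising on a genuinely complex value, rather than silently dropping the imaginary part, keeps real numerical problems visible.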

Command for plotting

The ideal situation is to have interactive plots. You've fit 100 models with different n_states, lag_times, multiple repetitions, etc, and you want to look at them along many different axes, like likelihoods, timescales, see the populations, etc. You'd also like to look at just subsets of the rows in the jsonlines file -- maybe you decide that one lag time is really no good, so you want to view these plots with that one excluded. Or with only that one.

My tentative idea, although I'm not really sure how to pull this off technology-wise, would be to have a linked table + scatter plots. The table would have each model as a row, and all these different properties in columns. Rows would be toggle-able or selectable, and as you select or unselect rows, data is added/removed from a set of linked scatter plots. At first, I think the axes on the scatter plots would be fixed -- so it'd be like timescales and log likelihoods vs. n states and lag time.

I think something based on D3.js in the browser is the way to go here. I'm not sure if Bokeh is powerful enough to handle the linked chart-table thing though.

Args in subcommand

In the commandline branch, change the signature of the start method. Instead, have args be passed through the init method.
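
A sketch of the proposed shape, with arguments supplied at construction and a `start` method that takes none. The class names and the dict used in place of a parsed argparse namespace are illustrative only.

```python
class Command(object):
    """Base class: arguments are supplied once, at construction time."""

    def __init__(self, args):
        self.args = args

    def start(self):
        raise NotImplementedError

class Sample(Command):
    def start(self):
        # Subclasses read everything they need from self.args.
        return "sampling %d structures per state" % self.args["n_per_state"]

def run(klass, args):
    instance = klass(args)    # args are passed through __init__
    return instance.start()   # start() takes no arguments

message = run(Sample, {"n_per_state": 10})
```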

Optional arguments not optional

Draw samples at the center of each state in a Gaussian HMM.

optional arguments:
  -h, --help            show this help message and exit
  --filename FILENAME   Path to the jsonlines output file containg the HMMs
  --lag-time LAG_TIME   Training lag time of the model to select from
  --n-states N_STATES   Number of states in the model to select from
  -o OUTPUT_CSV_FILE, --out OUTPUT_CSV_FILE
                        File to which to save the output, in CSV format

hmsm means-ghmm: error: argument -o/--out is required

Change `top` arguments to `ref_traj`

In many cases, the top argument is actually a reference trajectory with a single frame--and definitely not an MDTraj.Topology object. Might be good to clarify.

Load / save

In recent changes to the MSMB tICA code, we've moved towards a model where we:

  1. Create a pickled metric / featurizer
  2. Use pickle for clustering and analysis

I wonder if this would be useful here as well.

Another useful thing might be a load / save for the HMM itself, so that for example we can load the model and run some calculations (e.g. logprob) on an additional trajectory.

Towards a featurizer

So right now, a lot of the trajectory featurization is hard-coded.

What do you think about moving towards a model where we have two steps:

  1. Create+save featurizer
  2. Load+apply featurizer, build HMSM

We don't have to sort out all of the details of the featurizer right now, but this could make the current code a lot more flexible.

Also, I think it would be advantageous in terms of reducing the number of command line arguments for the build HMSM step--right now, we have to input all of the featurizer details at the same time that we input HMSM parameters. IMHO, this separation would be beneficial to users.

In terms of why this is appropriate as an early--rather than late--feature, it will allow me to begin characterizing the robustness of my models w.r.t. features, which is one of the main benefits here.
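
The two-step idea could look roughly like the sketch below. Everything in it is hypothetical: the `PairDistanceFeaturizer` class, the `save_featurizer` / `load_featurizer` helpers, and the use of a JSON description as the storage format (chosen here only to keep the example self-contained; the issue does not fix a format).

```python
import json

class PairDistanceFeaturizer(object):
    """Hypothetical featurizer: distances between selected pairs of atoms."""

    def __init__(self, pairs):
        self.pairs = [tuple(p) for p in pairs]

    def featurize(self, frame):
        # frame is a list of (x, y, z) tuples; returns one distance per pair.
        return [sum((a - b) ** 2 for a, b in zip(frame[i], frame[j])) ** 0.5
                for i, j in self.pairs]

def save_featurizer(featurizer):
    return json.dumps({"kind": "pair_distance", "pairs": featurizer.pairs})

def load_featurizer(text):
    spec = json.loads(text)
    if spec["kind"] != "pair_distance":
        raise ValueError("unknown featurizer kind: %r" % spec["kind"])
    return PairDistanceFeaturizer(spec["pairs"])

# Step 1: create and save the featurizer.
saved = save_featurizer(PairDistanceFeaturizer(pairs=[(0, 1), (0, 2)]))

# Step 2: load it later and apply it; model building would follow.
featurizer = load_featurizer(saved)
features = featurizer.featurize([(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 2.0)])
```

With this split, the model-building command only needs a path to the saved featurizer plus the HMSM parameters.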

Bug in pullmeans

All of the arrays in data need to be the same length. Sometimes that is not the case. This is a bug in the implementation of pull means that only shows up sometimes, perhaps when the length of sorted_indices < args.n_per_state?

            if len(p) > 0:
                data['index'].extend(sorted_indices[-self.args.n_per_state:])
                data['filename'].extend(sorted_filenms[-self.args.n_per_state:])
                data['state'].extend([k]*self.args.n_per_state)
            else:
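
If that guess is right, the mismatch comes from repeating the state label `n_per_state` times even when fewer frames were selected. A sketch of one possible fix, written as a standalone function with hypothetical names rather than as a patch to pullmeans.py:

```python
def append_state_rows(data, k, sorted_indices, sorted_filenames, n_per_state):
    """Append up to n_per_state rows for state k, keeping all columns aligned."""
    chosen_indices = sorted_indices[-n_per_state:]
    chosen_filenames = sorted_filenames[-n_per_state:]
    data['index'].extend(chosen_indices)
    data['filename'].extend(chosen_filenames)
    # Repeat the state label once per row actually selected, not n_per_state times.
    data['state'].extend([k] * len(chosen_indices))

data = {'index': [], 'filename': [], 'state': []}
# State 0 has only 2 candidate frames although 20 were requested.
append_state_rows(data, 0, [5, 9], ['a.h5', 'b.h5'], n_per_state=20)
append_state_rows(data, 1, [1, 2, 3], ['a.h5', 'a.h5', 'c.h5'], n_per_state=2)
```

With equal-length columns, building a DataFrame from `data` no longer fails with "arrays must all be same length".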

Rev. MLE Linesearch Failure

I'm 99% sure this is a known issue with whatever BFGS solver you're using, as we've seen similar in MSMB. I guess the main thing is to check that we are in fact sufficiently converged.

/home/kyleb/opt/lib/python2.7/site-packages/mixtape-0.1-py2.7-linux-x86_64.egg/mixtape/ghmm.py:235: UserWarning: Maximum likelihood reversible transition matrixoptimization failed: ABNORMAL_TERMINATION_IN_LNSRCH
  self.transmat_, self.populations_ = _reversibility.reversible_transmat(counts)

discrete_approx_mvn (sample-ghmm) is kind of slow

  • BFGS optimization over the lagrange multipliers on the constraints. Not clear that it can be hot-started any better.
  • Stochastic gradient descent?
  • Profile the performance of the objective function / gradient calculation in detail
  • Possible things that could help:
    • Call sklearn.utils.extmath.fast_dot, which -- depending on the version of numpy -- links against a faster BLAS.
    • Write the inner loop for computing the obj / grad in C.
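
For the profiling step, a sketch using the standard library's `cProfile`. The `objective_and_gradient` function is a placeholder, not the real objective from discrete_approx.py; in practice the call inside the loop would be replaced by the actual objective and gradient evaluation.

```python
import cProfile
import io
import pstats

def objective_and_gradient(x):
    # Placeholder objective: sum of squares and its gradient.
    value = sum(v * v for v in x)
    grad = [2.0 * v for v in x]
    return value, grad

profiler = cProfile.Profile()
profiler.enable()
for _ in range(200):
    objective_and_gradient([0.5] * 1000)
profiler.disable()

# Collect the five most expensive entries by cumulative time into a string.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
```

Printing `report` shows call counts and per-call times, which helps decide whether the time goes into the objective itself or into the optimizer around it.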

Build Fails on i686 Ubuntu Machine

The setup.py install script fails with the following error on my i686 ubuntu machine:

bramsundar@bramsundar-HP-Pavilion-dv6-Notebook-PC:~/Devel/mixtape$ sudo python setup.py install                                                        

Attempting to autodetect OpenMP support...
Compiler supports OpenMP

############################################################
The nvcc compiler could not be located in your $PATH. To
enable CUDA acceleration, either add it to your path, or set
$CUDAHOME
############################################################ 

running install
Checking .pth file support in /usr/local/lib/python2.7/dist-packages/
/usr/bin/python -E -c pass
TEST PASSED: /usr/local/lib/python2.7/dist-packages/ appears to support .pth files
running bdist_egg
running egg_info
writing requirements to mixtape.egg-info/requires.txt
writing mixtape.egg-info/PKG-INFO
writing top-level names to mixtape.egg-info/top_level.txt
writing dependency_links to mixtape.egg-info/dependency_links.txt
reading manifest file 'mixtape.egg-info/SOURCES.txt'
writing manifest file 'mixtape.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-i686/egg
running install_lib
running build_py
running build_ext
skipping 'src/reversibility.c' Cython extension (up-to-date)
skipping 'platforms/cpu/wrappers/GaussianHMMCPUImpl.cpp' Cython extension (up-to-date)
building 'mixtape._ghmm' extension
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/local/lib/python2.7/dist-packages/numpy/core/include -Iplatforms/cpu/kernels/include/ -Iplatforms/cpu/kernels/ -I/usr/include/python2.7 -c platforms/cpu/wrappers/GaussianHMMCPUImpl.cpp -o build/temp.linux-i686-2.7/platforms/cpu/wrappers/GaussianHMMCPUImpl.o -fopenmp
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for Ada/C/ObjC but not for C++ [enabled by default]
In file included from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarraytypes.h:1760:0,
                 from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarrayobject.h:17,
                 from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/arrayobject.h:4,
                 from platforms/cpu/wrappers/GaussianHMMCPUImpl.cpp:314:
/usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
In file included from platforms/cpu/kernels/logsumexp.hpp:10:0,
                 from platforms/cpu/kernels/forward.hpp:8,
                 from platforms/cpu/kernels/do_estep.hpp:18,
                 from platforms/cpu/wrappers/GaussianHMMCPUImpl.cpp:321:
/usr/lib/gcc/i686-linux-gnu/4.6.1/include/emmintrin.h:32:3: error: #error "SSE2 instruction set not enabled"
In file included from platforms/cpu/kernels/include/sse_mathfun.h:32:0,
                 from platforms/cpu/kernels/logsumexp.hpp:14,
                 from platforms/cpu/kernels/forward.hpp:8,
                 from platforms/cpu/kernels/do_estep.hpp:18,
                 from platforms/cpu/wrappers/GaussianHMMCPUImpl.cpp:321:
/usr/lib/gcc/i686-linux-gnu/4.6.1/include/xmmintrin.h:32:3: error: #error "SSE instruction set not enabled"
In file included from platforms/cpu/kernels/logsumexp.hpp:14:0,
                 from platforms/cpu/kernels/forward.hpp:8,
                 from platforms/cpu/kernels/do_estep.hpp:18,
                 from platforms/cpu/wrappers/GaussianHMMCPUImpl.cpp:321:
platforms/cpu/kernels/include/sse_mathfun.h:45:9: error: ‘__m128’ does not name a type
platforms/cpu/kernels/include/sse_mathfun.h:51:9: error: ‘__m64’ does not name a type
platforms/cpu/kernels/include/sse_mathfun.h:93:3: error: ‘__m128’ does not name a type
platforms/cpu/kernels/include/sse_mathfun.h:94:3: error: ‘__m64’ does not name a type
platforms/cpu/kernels/include/sse_mathfun.h:112:1: error: ‘v4sf’ does not name a type
platforms/cpu/kernels/include/sse_mathfun.h:214:1: error: ‘v4sf’ does not name a type
platforms/cpu/kernels/include/sse_mathfun.h:332:1: error: ‘v4sf’ does not name a type
platforms/cpu/kernels/include/sse_mathfun.h:449:1: error: ‘v4sf’ does not name a type
platforms/cpu/kernels/include/sse_mathfun.h:568:16: error: variable or field ‘sincos_ps’ declared void
platforms/cpu/kernels/include/sse_mathfun.h:568:16: error: ‘v4sf’ was not declared in this scope
platforms/cpu/kernels/include/sse_mathfun.h:568:24: error: ‘v4sf’ was not declared in this scope
platforms/cpu/kernels/include/sse_mathfun.h:568:30: error: ‘s’ was not declared in this scope
platforms/cpu/kernels/include/sse_mathfun.h:568:33: error: ‘v4sf’ was not declared in this scope
platforms/cpu/kernels/include/sse_mathfun.h:568:39: error: ‘c’ was not declared in this scope
In file included from platforms/cpu/wrappers/GaussianHMMCPUImpl.cpp:321:0:
platforms/cpu/kernels/do_estep.hpp:79:37: error: expected declaration before end of line
error: command 'gcc' failed with exit status 1
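
The `#error "SSE2 instruction set not enabled"` lines come from gcc's own intrinsics headers, which refuse to compile unless SSE2 code generation is switched on. 32-bit x86 targets such as the `i686-linux-gnu` one in this log do not enable it by default, while x86_64 does. A minimal sketch of how a build script might pick the extra flag is below. The helper name `sse_compile_args` is hypothetical and is not part of this project's setup.py.

```python
import platform


def sse_compile_args(machine=None):
    """Return extra gcc flags needed to compile SSE2 intrinsics.

    Hypothetical helper for illustration; not taken from the project.
    """
    machine = (machine or platform.machine()).lower()
    # 32-bit x86 gcc targets leave SSE2 off by default, so emmintrin.h
    # hits its '#error "SSE2 instruction set not enabled"' guard.
    if machine in ("i386", "i486", "i586", "i686", "x86"):
        return ["-msse2"]
    # x86_64 enables SSE2 by default; non-x86 machines get no flag here.
    return []


print(sse_compile_args("i686"))
print(sse_compile_args("x86_64"))
```

The returned list could be passed as `extra_compile_args` to a setuptools `Extension`. This only helps if the CPU itself supports SSE2.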

hmsm fit-ghmm

I'm not sure exactly what the issue is, but the latest commit doesn't work on my project. I just get a blank JSON file, and the run stops after:

Loading data into memory + vectorization: 2381.359713 s
Fitting with 2177 timeseries from 1287 trajectories with 14722812 total observations

Add randomness to hot-starting in GaussianFusionHMM

In any model on an input space with more than a small handful of degrees of freedom, there are going to be numerous local minima in the HMM likelihood function. The way to be honest about this, IMO, is to make sure that our hot-start is random enough that running the code multiple times with different random seeds will fall into different local minima. Currently, GaussianFusionHMM has a random_state argument to __init__, but it's ignored.

  • One option would be to make the KMeans hot-starting use random initialization instead of KMeans++, with the random seed passed through.
  • We could also add an option to the command line to run multiple repetitions with different random seeds.
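
The two suggestions above can be sketched roughly as follows. This is a standard-library illustration only, not the GaussianFusionHMM code: the function names `random_init_means` and `repeated_starts` are hypothetical, and the toy `data` stands in for real observations. With scikit-learn, the equivalent idea would presumably be `KMeans(init='random', random_state=seed)`.

```python
import random


def random_init_means(observations, n_states, seed):
    """Pick n_states distinct observations as initial state means.

    Hypothetical helper: the seed is passed through rather than ignored.
    """
    rng = random.Random(seed)
    return rng.sample(observations, n_states)


def repeated_starts(observations, n_states, seeds):
    """Build one random hot start per seed, for several independent fits."""
    return {seed: random_init_means(observations, n_states, seed)
            for seed in seeds}


# Toy 2-D observations standing in for featurized trajectory frames.
data = [(float(i), float(i % 7)) for i in range(100)]
starts = repeated_starts(data, n_states=3, seeds=[0, 1, 2])
for seed in sorted(starts):
    print(seed, starts[seed])
```

Each seed gives a reproducible starting point, so comparing the fitted likelihoods across seeds would show how sensitive the model is to initialization.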

cc: @gkiss, @rbharath

Command: Save Structures

So we want to take a CSV file as input, parse it, and extract the listed PDBs that are either the in-state mean exemplars or random samples.

Eventually, we might want to allow an optional "provenance" csv that maps protein coordinate trajectories to the original explicit solvent data, to allow for Gert's (and our own) application.

One question is whether the sample -> CSV -> extract pipeline is here to stay, or whether we want to merge them.

Here's some boilerplate code:

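import os

import mdtraj as md
import pandas as pd

# Command that produces out.csv (left commented out):
#os.system("hmsm  means-ghmm --filename hmms.jsonlines --lag-time 1 --n-states 3 -o out.csv --dir Trajectories --top system.subset.pdb --ext h5 -a AtomIndices.dat")

# Make sure the output directory exists before saving.
if not os.path.isdir("./PDB"):
    os.makedirs("./PDB")

df = pd.read_csv("./out.csv")
for k, row in df.iterrows():
    # Load the trajectory and keep only the frame listed for this state.
    trj = md.load(row["filename"])[int(row["index"])]
    trj.save("./PDB/State%d-mean.pdb" % row["state"])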

I suppose we should also aim to (eventually) standardize this procedure with respect to msmbuilder as well.
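
As a rough sketch of the parsing half of such a command, the CSV can be turned into an extraction plan before any trajectory is loaded. The column names `filename`, `index`, and `state` are assumed from the boilerplate; the function name `extraction_plan` and the example rows are hypothetical.

```python
import csv
import io
import os


def extraction_plan(csv_text, out_dir="PDB"):
    """Map each CSV row to (trajectory filename, frame index, output path)."""
    plan = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        out_path = os.path.join(out_dir,
                                "State%d-mean.pdb" % int(row["state"]))
        plan.append((row["filename"], int(row["index"]), out_path))
    return plan


# Made-up rows in the assumed column layout.
example = "filename,index,state\ntraj0.h5,12,0\ntraj3.h5,7,1\n"
plan = extraction_plan(example)
for filename, index, out_path in plan:
    print(filename, index, out_path)
```

Keeping the plan separate from the extraction step would make it easy to merge the sample and extract stages later, or to keep them apart.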

Test Suite for Mixtape

The mixtape library is starting to grow relatively large. I think it's high time we introduce a test suite to ensure that new features don't break existing ones.

Any suggestions on how to do this? How do msmbuilder and openmm handle tests?
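
One possible starting point, sketched with the standard-library `unittest` module, is to test small numerical kernels against a naive reference. The `logsumexp` function here is a hypothetical pure-Python stand-in, not the library's C++ kernel.

```python
import math
import unittest


def logsumexp(values):
    """Numerically stable log(sum(exp(v))); hypothetical reference version."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


class TestLogSumExp(unittest.TestCase):
    def test_matches_naive(self):
        values = [0.1, -2.0, 3.5]
        naive = math.log(sum(math.exp(v) for v in values))
        self.assertAlmostEqual(logsumexp(values), naive, places=12)

    def test_no_overflow(self):
        # math.exp(1000.0) overflows a float; the shifted form should not.
        self.assertAlmostEqual(logsumexp([1000.0, 1000.0]),
                               1000.0 + math.log(2.0))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestLogSumExp)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Tests written this way can be collected with `python -m unittest discover`, and the same files also run under other common test runners.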
