probcomp / cgpm

Library of composable generative population models which serve as the modeling and inference backend of BayesDB.

License: Apache License 2.0

Python 99.66% Shell 0.34%
probabilistic-programming data-analysis bayesian-inference tabular-data machine-learning

cgpm's People

Contributors

avinson, axch, gregory-marton, leocasarsa, riastradh-probcomp, schaechtle, zane


cgpm's Issues

experiment: chain_cmi - Was gpmcc trained on a different data generator?

I wanted to check whether the error in the gpmcc_cmi results was due to too few samples at the conditioned value, so I made the following plots. It seems that the data on which gpmcc was trained are different from those produced by simulate_abc.

[Plot: simulate_ABC(N=250)]

[Plot: engine._X]

[Plot: engine.simulate(-1, [0,1,2]), 4 samples for each of the 60 states]

Resolve matplotlib warnings from example.py

Running ipython -i example.py gives:

MatplotlibDeprecationWarning: The use of 0
(which ends up being the _last_ sub-plot) is deprecated in 1.4 and will
raise an error in 1.5

and

/usr/local/lib/python2.7/dist-packages/matplotlib/backend_bases.py:2399: MatplotlibDeprecationWarning: Using default event loop until function
specific to this GUI is implemented
  warnings.warn(str, mplDeprecation)

To get a traceback (and a chance to drop into pdb), turn the warnings into errors:

In [1]: import warnings
In [2]: warnings.filterwarnings('error')

The third warning

/usr/local/lib/python2.7/dist-packages/matplotlib/collections.py:590:
FutureWarning: elementwise comparison failed; returning scalar instead,
but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):

is a matplotlib issue, fixed in 1.5.0

Do column parameters of uncollapsed columns need to be transitioned before hypers?

As it stands, the row transition of the random forest might leave the internal regressor in a state where it cannot compute logpdf_marginal: observations have been incorporated (so counts[x] > 0), but the internal regressor was never retrained to use those counts.

Options:

  • Patch the computation of calc_predictive_logp to detect this case and fall back to the uniform outlier model. Definitely not ideal, since we want hyperparameter transitions to account for the regressor and not just the outlier model.
  • Enforce the transition order as specified above (params and then hyperparams) for the uncollapsed dimensions.
  • Brainstorm more about alternatives.

Design and implement conditional component models

Typically discriminative. The conditional component model is fundamentally different from a foreign predictor.

  • In a foreign predictor, the latents of CC are unknown to kepler_law.
    MODEL period GIVEN apogee, perigee USING kepler_law
  • In a conditional component model
    MODEL eccentricity GIVEN * USING regression
    where * are the variables in the same view.

Control all entropy in gpmcc by passing around rng explicitly

Currently each state relies on setting the global rng via np.random.seed. Every function that does any random computation must instead take an rng object and use it for its random bits. The change will be quite comprehensive, and requires updating:

  • All the simulate and sample_params methods in classes that implement DistributionGpm.
  • All the simulate_* in utils.general (used during inference).
  • All the simulate_* in state (use for queries).
  • All the simulate_* in utils.test (used for synthetic data generation).

There are probably more locations, which a git-grep or similar search will need to identify.
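The target calling convention could be sketched as follows (the function names and hyperparameter shapes here are illustrative, not the actual cgpm API): every random procedure takes an explicit rng argument instead of touching global state.

```python
import numpy as np

# Hypothetical sketch of threading an explicit rng through random
# procedures instead of relying on global np.random state.
def sample_params(hypers, rng):
    # e.g. a Normal component drawing its mean from the prior
    return {'mu': rng.normal(hypers['m'], hypers['s'])}

def simulate(params, n, rng):
    # all random bits come from the caller-supplied rng
    return rng.normal(params['mu'], 1.0, size=n)

rng = np.random.default_rng(42)
params = sample_params({'m': 0.0, 's': 1.0}, rng)
x = simulate(params, 5, rng)
```

Because the rng is explicit, two runs seeded identically reproduce each other exactly, and independent states can hold independent generators.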

Design and implement interface for MixtureGpm.

Dim, View, and State are all mixtures of other GPMs.

  • Dim is a mixture of DistributionGpm.
  • View is a multivariate Dim (an approximation of a joint distribution over variables which are marginally DistributionGpms). A single Dim in a View is a standard univariate DP mixture of DistributionGpm.
  • State is a mixture of Dim, organizing each mixture component into a View.

At this stage I am inclined to eliminate the superficial difference between Dim and View, and require every use of even a single Dim to be wrapped in a View. Why? Because inference code written for Dim cannot be reused in View, while the reverse is possible.

State._transition_column should propose m-1 new views if not singleton

See: https://github.com/probcomp/gpmcc/blob/master/gpmcc/state.py#L247

According to the auxiliary variable algorithm (Algorithm 8, Neal 2000), there are supposed to be a total of m new views, including the current one if it is a singleton, such that the Gibbs step is stationary on the conditional distribution over the extended space of k + m views.

In the case of m = 1 the code does the right thing, because m - 1 = 0 new views are created when the current view is a singleton; but this is incorrect for general m, and in particular for the function's default argument, which is apparently m = 3.
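The intended rule can be stated in a few lines (a sketch of the counting logic only, not the actual State._transition_column code): the Gibbs step always considers m auxiliary views in total, and a singleton current view is reused as one of them.

```python
# Sketch of the auxiliary-view count in Neal's Algorithm 8: always m
# auxiliary views in total; if the column currently occupies a singleton
# view, that view is reused as one auxiliary, so only m - 1 fresh views
# are instantiated.
def num_new_views(is_singleton, m):
    return m - 1 if is_singleton else m

# m = 1 with a singleton: zero fresh views, matching the current code.
assert num_new_views(is_singleton=True, m=1) == 0
# The default m = 3 should create 2 fresh views for a singleton, not 0.
assert num_new_views(is_singleton=True, m=3) == 2
assert num_new_views(is_singleton=False, m=3) == 3
```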

Unify state.transition to use max(N,S) for better inference control

Quoting @madeleineth

@fsaad from that code, it looks like it'd actually be simpler to merge the time and iterations code into a single loop, so the kernel execution and progress code is not repeated. Maybe a few helper functions would make it cleaner. And by merging the loops, you get rid of the artificial distinction. Something like:
    while True:
        p = _proportion_done(N, S, ...)
        if p >= 1:
            break
        _show_progress(p, ...)
        _transition(...)
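A runnable version of that loop might look like the following. The helper names and budget semantics are assumptions (progress is taken as the maximum of the iteration fraction and the elapsed-time fraction, so whichever budget is exhausted first ends inference):

```python
import time

# Hypothetical sketch: merge the iteration-count (N) and wall-clock (S,
# seconds) loops into one, as suggested above.
def _proportion_done(N, S, iters, start):
    p_iter = float(iters) / N if N else 0.0
    p_time = (time.time() - start) / S if S else 0.0
    return max(p_iter, p_time)

def transition(kernel, N=None, S=None):
    if N is None and S is None:
        N = 1  # run a single step if no budget is given
    iters, start = 0, time.time()
    while True:
        p = _proportion_done(N, S, iters, start)
        if p >= 1:
            break
        kernel()
        iters += 1
    return iters
```

With only N set the loop runs exactly N kernel applications; with only S set it runs until the time budget is spent; with both, it stops at whichever limit is hit first.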

make serialization json-friendly

@fsaad I would like Engine.to_metadata and Engine.from_metadata to generate/accept plain-old-python-objects (not numpy) so that in gpmcc_metamodel, I can serialize/deserialize with json.{loads,dumps} instead of my current _engine_to_json and _json_to_engine hacks.
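One way to achieve this (a sketch, not the actual Engine code) is a recursive converter that strips numpy types from the metadata before handing it to json:

```python
import json
import numpy as np

# Hypothetical converter: recursively replace numpy arrays and scalars
# with plain Python objects so metadata round-trips through
# json.dumps / json.loads.
def to_plain(obj):
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, np.generic):
        return obj.item()
    if isinstance(obj, dict):
        return {k: to_plain(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_plain(v) for v in obj]
    return obj

# Illustrative metadata dict with numpy values mixed in.
metadata = {'X': np.arange(3), 'alpha': np.float64(1.5), 'cctypes': ['normal']}
blob = json.dumps(to_plain(metadata))
```

If to_metadata applied such a converter on the way out, from_metadata could accept the json.loads output directly, with no engine-specific serialization hacks.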

Remove rowids from dim.

Needs a significant refactoring but essential for SMC implementation. Dims should be wrapped in a View to unify the location of the inference algorithm.

Moreover dim is a GPM which exposes its latents anyway in its queries.

Enable column transitions with a conditional model

Currently disabled, need to ensure the following cases

Invariants to maintain

  • Conditional models are never initialized into singletons.
  • Conditional models are not proposed to move into singletons in the column CRP.
  • Conditional models are not isolated (become singletons) when columns in their Views are being transitioned.

Correctly computing the transition probability

  • When an unconditional column is proposed to a View containing a conditional column, we need to compute both the marginal probability of the unconditional column under the View's clustering and the change in marginal logp of the conditional column after retraining under the proposal.

suspicious code in simple_predictive_probability

This variable name does not look right:

    is_observed = row > state.n_rows                                        
    if is_observed:                                                         
        logp = _simple_predictive_probability_unobserved(state, col, x)     
    else:                                                                   
        logp = _simple_predictive_probability_observed(state, row, col, x)  

Line 271 of gpmcc/utils/sampling.py. Presumably row > state.n_rows identifies a hypothetical (unobserved) row, so the variable should be named is_unobserved; note also that the comparison should probably be row >= state.n_rows, since the observed rows are 0 through state.n_rows - 1.

Unfix seeds for random procedures

Evaluating procedures such as simulate and mutual_information twice produces exactly the same result. Passing seeds to them manually is not a workable solution when they must be evaluated many times in a loop.
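One possible fix (a sketch using modern numpy, not the cgpm internals) is to derive an independent stream per call site from a single root seed, so repeated calls produce fresh draws while the overall experiment stays reproducible:

```python
import numpy as np

# Hypothetical sketch: spawn independent child streams from one root
# SeedSequence instead of fixing a single global seed.
root = np.random.SeedSequence(2024)
streams = [np.random.default_rng(child) for child in root.spawn(3)]

# Each call site uses its own stream, so repeated evaluations differ.
draws = [rng.normal(size=4) for rng in streams]
```

Spawning is deterministic given the root seed, so rerunning the whole script regenerates exactly the same set of distinct streams.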

Initializing a StateGPM (and State.incorporate_dim) with a conditional model?

Initializing a Dim without the regression metadata will crash (the entry points are State.incorporate_dim and State.__init__ with a conditional model).

The ViewGPM is smart enough to take the metadata from its unconditional dims and append that metadata to the conditional Dim's metadata before creating it.

Possible solutions: like State.update_cctypes, delegate all initializing of Dim to the enclosing View. Requires refactoring.

Create test for simulate of DistributionGpms.

For instance, the posterior predictive of Normal-Gamma Normal is a Student t. Do we achieve identical simulation if we instead simulate rho ~ Gamma(Hn), mu ~ Normal(rho,Hn), then x ~ Normal(mu,rho) where Hn are the posterior hyperparameters?
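The example above can be turned into a concrete two-sample check (a sketch; the function names and the generic test harness are assumptions, not the cgpm test suite). For the Normal with a Normal-Gamma prior, the posterior predictive is a Student t with df = 2a, location m, and scale² = b(k+1)/(ak), where (m, k, a, b) are the posterior hyperparameters:

```python
import numpy as np
from scipy import stats

def simulate_hierarchical(m, k, a, b, n, rng):
    # rho ~ Gamma(a, rate=b); mu ~ Normal(m, 1/(k*rho)); x ~ Normal(mu, 1/rho)
    rho = rng.gamma(shape=a, scale=1.0 / b, size=n)
    mu = rng.normal(m, np.sqrt(1.0 / (k * rho)))
    return rng.normal(mu, np.sqrt(1.0 / rho))

def simulate_predictive(m, k, a, b, n, rng):
    # Student t predictive: df = 2a, loc = m, scale^2 = b*(k+1)/(a*k)
    scale = np.sqrt(b * (k + 1.0) / (a * k))
    return m + scale * rng.standard_t(df=2.0 * a, size=n)

rng = np.random.default_rng(0)
x1 = simulate_hierarchical(0.0, 1.0, 2.0, 2.0, 50000, rng)
x2 = simulate_predictive(0.0, 1.0, 2.0, 2.0, 50000, rng)
# A two-sample KS test should not reject: the two simulators agree.
ks = stats.ks_2samp(x1, x2)
```

A generic version of this test would iterate over every DistributionGpm, comparing its simulate output against the analytically known posterior predictive wherever one exists.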

Update the particle learning dimension for new hyperparameter inference scheme.

Currently almost completely broken after massive changes. More generally, we need to extend the current implementation of SMC on a single dimension to SMC on an entire crosscat state. Talk with vkm about the design of this inference algorithm.

To get the particle demo running, check out the particle-demo tag; see README.md for details.

Make sure the tests run.

Actually, rewrite the whole test suite. The best test we currently have is test_simulate, which should be made generic. I refrained from touching the tests until I had rewritten the source to a point where updating them made sense, since the intermediate GPM interfaces were not complete.
