baxtereaves / baxcat_cxx

C++ implementation of cross-categorization

License: GNU General Public License v3.0

C++ 63.13% Makefile 0.17% Python 33.88% Cython 2.83%

baxcat_cxx's Issues

Overflow in continuous components

If the data in a continuous column have very high (or very low) magnitude, we get an overflow error when we try to calculate sum_x_sq. We should add a default z-normalizing converter for continuous columns.

Problem: The easy way to do this is to add lambdas to Engine._converters for those columns, but pickle can't deal with lambdas.
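One way around the pickle problem is a small callable class instead of a lambda, since instances of a module-level class pickle fine. A minimal sketch, assuming converters are just callables applied per value; ZNormalizer and its invert method are hypothetical names, not part of the current Engine API:

```python
import math
import pickle


class ZNormalizer:
    """Picklable stand-in for a lambda converter (hypothetical name)."""

    def __init__(self, values):
        # Skip missing entries (None or NaN) when estimating mean and std.
        xs = [float(v) for v in values
              if v is not None and not math.isnan(float(v))]
        if not xs:
            raise ValueError("need at least one observed value")
        self.mean = math.fsum(xs) / len(xs)
        dev = [x - self.mean for x in xs]
        # Rescale deviations before squaring so the squares cannot overflow.
        scale = max(abs(d) for d in dev)
        if scale == 0.0:
            self.std = 1.0  # constant column: avoid dividing by zero
        else:
            mean_sq = math.fsum((d / scale) ** 2 for d in dev) / len(dev)
            self.std = scale * math.sqrt(mean_sq)

    def __call__(self, x):
        return (x - self.mean) / self.std

    def invert(self, z):
        # Map a normalized value back to the original units.
        return z * self.std + self.mean


column = [1.0e200, 2.0e200, 3.0e200]  # squaring these directly would overflow
normalizer = ZNormalizer(column)

if __name__ == "__main__":
    restored = pickle.loads(pickle.dumps(normalizer))
    print([restored(v) for v in column])
```

The inverse is kept on the same object so draws and predictions can be mapped back to the user's units.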

Add support for cyclic data

Use VonMises with normal prior on the center. A couple of notes:

The conjugate prior in Daniel Fink's "Compendium of Conjugate Priors" is wrong.
The conjugate prior on the scale depends on the center, which makes things difficult.
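For reference, the von Mises likelihood itself is cheap to evaluate; the hard part is the prior. A rough sketch of the log density using only the standard library. The function names are made up for illustration, and the Bessel series here is only suitable for moderate kappa (it will overflow for very concentrated data):

```python
import math


def log_bessel_i0(kappa):
    """log I0(kappa) from the series sum_k (kappa/2)^(2k) / (k!)^2.

    Fine for moderate kappa; a scaled Bessel routine would be needed
    for very large concentrations.
    """
    term = 1.0
    total = 1.0
    k = 0
    while term > 1e-17 * total:
        k += 1
        term *= (kappa / 2.0) ** 2 / (k * k)
        total += term
    return math.log(total)


def vonmises_logpdf(x, mu, kappa):
    """Log density of a von Mises with center mu and concentration kappa."""
    return (kappa * math.cos(x - mu)
            - math.log(2.0 * math.pi)
            - log_bessel_i0(kappa))
```

With kappa = 0 this reduces to the uniform density on the circle, which is a handy sanity check.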

Conditional probability scaling

Conditional probabilities aren't scaled properly. Fix it and write tests.

Add support for count data

Model count data with the Poisson-gamma conjugate model.
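The Poisson-gamma model has closed-form marginal and predictive probabilities, which is what a component model needs. A sketch of the math, assuming a Gamma(shape a, rate b) prior on the Poisson rate; the function names are illustrative and not taken from the codebase:

```python
import math


def poisson_gamma_posterior(xs, a, b):
    """Posterior Gamma(shape, rate) parameters after observing counts xs."""
    return a + sum(xs), b + len(xs)


def poisson_gamma_log_predictive(y, a, b):
    """log p(y | a, b): a negative binomial after integrating out the rate."""
    return (math.lgamma(a + y) - math.lgamma(a) - math.lgamma(y + 1)
            + a * math.log(b / (b + 1.0)) - y * math.log(b + 1.0))


def poisson_gamma_log_marginal(xs, a, b):
    """log p(xs | a, b) with the Poisson rate integrated out."""
    s, n = sum(xs), len(xs)
    return (a * math.log(b) - math.lgamma(a) + math.lgamma(a + s)
            - (a + s) * math.log(b + n)
            - sum(math.lgamma(x + 1) for x in xs))
```

The marginal should equal the product of sequential predictives, which makes a convenient unit test.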

Add support for magnitude (lower-bounded) data

We have a couple of choices of models. The easy choice is LogNormal because there is a closed-form marginal and predictive probability when both parameters are unknown. The problem with LogNormal is that it only has that one shape. It would be nice to use Gamma, but we'd have to estimate the normalizer.

CRP(alpha) prior is not ideal

In keeping with the component model priors, the CRP alpha prior should be vague. If the prior is set to Gamma(1/n, n), we get a mean of 1 and heavy tails. Also, a gamma prior allows for easy sampling; see Mike West (1992).
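To check the claim about the mean, here is a quick sketch. It assumes Gamma(1/n, n) means shape 1/n and scale n, which is the parameterization under which the mean comes out to 1; the helper names are made up:

```python
import math


def crp_alpha_prior(n_rows):
    """Shape and scale of the proposed Gamma(1/n, n) prior (assumed shape/scale)."""
    return 1.0 / n_rows, float(n_rows)


def gamma_mean_var(shape, scale):
    """Mean and variance of a gamma distribution in shape/scale form."""
    return shape * scale, shape * scale ** 2


def gamma_logpdf(x, shape, scale):
    """Log density of a gamma distribution in shape/scale form, for x > 0."""
    return ((shape - 1.0) * math.log(x) - x / scale
            - math.lgamma(shape) - shape * math.log(scale))


shape, scale = crp_alpha_prior(100)
mean, var = gamma_mean_var(shape, scale)
```

Under this reading the variance is n, so the prior gets vaguer as the table gets longer while the mean stays at 1.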

Refactor DataContainer

DataContainer holds two vectors, _data and _is_initialized. This probably causes some cache issues when we grab data and check whether they're initialized. It might be better to store one vector of tuples. We might even want each datum to be a struct with the fields value and missing; then we'd do

DataContainer<double> data(x, is_missing);
double xi = data[i].value;
// or
double xi = data[i];  // throws out of bounds if missing
bool xi_missing = data[i].missing;

Support multi-machine parallelization

Currently Engine does run() in parallel for each model, but each model also has parallel tasks of its own. We need to be able to run tasks on many machines so we can exploit the extra parallelism in cross-categorization. This is especially important when the data have many columns and many views: the hyperparameter update for each column can run in parallel, and the row assignment for each view can run in parallel.

Since the engine object has a _mapper attribute, we could easily allow the user to supply their own mapper function on initialization. To supply the standard mapper:

mapper = lambda f, args: list(map(f, args))
engine = Engine(df, mapper=mapper)

...or multiprocessing...

from multiprocessing.pool import Pool

pool = Pool()
mapper = pool.map

engine = Engine(df, mapper=mapper)

...or IPython Parallel...

import ipyparallel as ipp

c = ipp.Client()
view = c.load_balanced_view()  # map lives on a view, not on the Client
mapper = view.map_sync

engine = Engine(df, mapper=mapper)

...or whatever.

Add explicit binary data model

Model binary data with Bernoulli with beta prior.
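The beta-Bernoulli model is about as simple as conjugate models get. A sketch of the two quantities a component model would need, assuming a Beta(a, b) prior on the success probability; the function names are illustrative only:

```python
import math


def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_bernoulli_log_marginal(xs, a=1.0, b=1.0):
    """log p(xs | a, b) for 0/1 data with the success probability integrated out."""
    k, n = sum(xs), len(xs)
    return log_beta(a + k, b + n - k) - log_beta(a, b)


def beta_bernoulli_predictive(xs, a=1.0, b=1.0):
    """p(next x = 1 | xs, a, b)."""
    return (a + sum(xs)) / (a + b + len(xs))
```

For example, with a uniform Beta(1, 1) prior and data [1, 1, 0], the marginal is 1/12 and the predictive probability of a 1 is 3/5.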

Add row_similarity with respect to certain columns

We'd like to know how similar certain objects are with respect to certain features.

def row_similarity(row_a, row_b, wrt=None):
    pass
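One possible definition: for each model, look only at the views that contain the wrt columns, and count how often the two rows land in the same category. A sketch under that assumption. The models argument and the "col_view" / "row_cluster" layout are hypothetical stand-ins for whatever state the Engine actually stores:

```python
def row_similarity(models, row_a, row_b, wrt=None):
    """Fraction of relevant (model, view) pairs in which two rows share a category.

    models: list of dicts (hypothetical layout) with
        "col_view":    {column name: view index}
        "row_cluster": {view index: list of category assignments, one per row}
    wrt: columns to restrict to; None means all columns.
    """
    hits = 0
    total = 0
    for model in models:
        col_view = model["col_view"]
        cols = col_view.keys() if wrt is None else wrt
        # Several wrt columns may share a view; count each view once.
        views = {col_view[c] for c in cols}
        for v in views:
            z = model["row_cluster"][v]
            hits += z[row_a] == z[row_b]
            total += 1
    return hits / total


models = [
    {"col_view": {"x": 0, "y": 0, "z": 1},
     "row_cluster": {0: [0, 0, 1], 1: [0, 1, 1]}},
    {"col_view": {"x": 0, "y": 1, "z": 1},
     "row_cluster": {0: [0, 1, 1], 1: [0, 0, 0]}},
]
```

Here rows 0 and 1 have similarity 0.5 over all columns but 1.0 with respect to "y" alone, which is the kind of distinction the wrt argument is meant to expose.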

Add data subsampling

The models within an Engine should be able to work on independent subsets of the data. The interface will work something like this:

engine = Engine(df, n_models=32)
engine.init_models(subsample_size=.5)

This will create an Engine with 32 models, each of which works on half the data.
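Under the hood this could be as simple as drawing an independent set of row indices per model. A sketch, assuming subsample_size is a fraction in (0, 1] and rows are sampled without replacement; subsample_rows is a hypothetical helper, not an existing Engine method:

```python
import random


def subsample_rows(n_rows, n_models, subsample_size=1.0, seed=None):
    """Draw one independent, sorted set of row indices for each model."""
    if not 0.0 < subsample_size <= 1.0:
        raise ValueError("subsample_size must be in (0, 1]")
    rng = random.Random(seed)
    # Always keep at least one row per model.
    k = max(1, round(subsample_size * n_rows))
    return [sorted(rng.sample(range(n_rows), k)) for _ in range(n_models)]


row_sets = subsample_rows(n_rows=10, n_models=32, subsample_size=0.5, seed=1337)
```

Each model would then be initialized on df.iloc[rows] for its own rows, and anything that aggregates across models has to account for rows a given model never saw.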

`dist::multinomial` is actually a categorical distribution

The baxcat::dist::multinomial PMF is a categorical PMF, which is what we wanted. It should be renamed.
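To make the distinction concrete: a categorical PMF scores a single draw, while a multinomial PMF scores a vector of counts and carries a combinatorial term. A Python sketch of the two (not the C++ code; both functions are illustrative and assume every probability in p is strictly positive):

```python
import math


def categorical_logpmf(x, p):
    """log P(X = x) for a single draw with outcome index x."""
    return math.log(p[x])


def multinomial_logpmf(counts, p):
    """log P(counts) for sum(counts) draws, including the n! / prod(c!) term."""
    n = sum(counts)
    out = math.lgamma(n + 1)
    for c, pk in zip(counts, p):
        out += c * math.log(pk) - math.lgamma(c + 1)
    return out
```

The two agree only when the counts are one-hot, that is, for a single draw.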

Refactor cxx namespaces

The namespace situation in the cxx code is out of hand. Past bax went a little crazy.
