baxtereaves / baxcat_cxx

C++ implementation of cross-categorization

License: GNU General Public License v3.0

C++ 63.13% Makefile 0.17% Python 33.88% Cython 2.83%

baxcat_cxx's Issues

Overflow in continuous components

If the data in a continuous column have very high (or very low) magnitude, we get an overflow error when we try to calculate sum_x_sq. We should add a default z-normalizing converter for continuous columns.

Problem: The easy way to do this is to add lambdas to Engine._converters for those columns, but pickle can't deal with lambdas.
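
One way around the pickle limitation is a small callable class in place of the lambda, since pickle can serialize instances of module-level classes. The sketch below is only an illustration: the name ZNormalizer and its invert method are hypothetical and not part of baxcat, and how Engine._converters would store or call such an object is an assumption.

```python
import pickle
import statistics


class ZNormalizer:
    """Picklable replacement for a lambda converter (hypothetical name)."""

    def __init__(self, values):
        self.mean = statistics.fmean(values)
        # Fall back to 1.0 so a constant column does not divide by zero.
        self.std = statistics.pstdev(values) or 1.0

    def __call__(self, x):
        # Map a raw value to z-score space before it reaches the model.
        return (x - self.mean) / self.std

    def invert(self, z):
        # Map a z-score back to the original units.
        return z * self.std + self.mean


column = [1.0e8, 2.0e8, 3.0e8]
converter = ZNormalizer(column)

if __name__ == "__main__":
    # Unlike a lambda, the instance survives a pickle round trip.
    restored = pickle.loads(pickle.dumps(converter))
    print(restored(3.0e8))
```

Keeping the mean and standard deviation on the object also makes it possible to convert sampled or predicted values back to the original scale.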

Add support for cyclic data

Use VonMises with normal prior on the center. A couple of notes:

  • The conjugate prior in Daniel Fink's "Compendium of Conjugate Priors" is wrong.
  • The conjugate prior on the scale depends on the center, which makes things difficult.

Add support for magnitude (lower-bounded) data

We have a couple of choices of models. The easy choice is LogNormal because there is a closed-form marginal and predictive probability when both parameters are unknown. The problem with LogNormal is that it only has that one shape. It would be nice to use Gamma, but we'd have to estimate the normalizer.
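
To make the closed-form claim concrete, here is a minimal sketch of the LogNormal log marginal likelihood. It assumes a Normal-Gamma prior on the mean and precision of log(x); the function name and the hyperparameter names m0, k0, a0, b0 are illustrative choices, not baxcat's.

```python
import math


def lognormal_log_marginal(xs, m0=0.0, k0=1.0, a0=1.0, b0=1.0):
    """Log marginal likelihood of positive data under a LogNormal model.

    Assumes log(x) ~ Normal(mu, 1/rho) with a Normal-Gamma prior:
    mu | rho ~ Normal(m0, 1/(k0*rho)), rho ~ Gamma(shape=a0, rate=b0).
    """
    ys = [math.log(x) for x in xs]
    n = len(ys)
    ybar = sum(ys) / n
    ss = sum((y - ybar) ** 2 for y in ys)

    # Standard Normal-Gamma posterior updates.
    kn = k0 + n
    an = a0 + 0.5 * n
    bn = b0 + 0.5 * ss + k0 * n * (ybar - m0) ** 2 / (2.0 * kn)

    log_normal_part = (
        math.lgamma(an) - math.lgamma(a0)
        + a0 * math.log(b0) - an * math.log(bn)
        + 0.5 * (math.log(k0) - math.log(kn))
        - 0.5 * n * math.log(2.0 * math.pi)
    )
    # Jacobian of the log transform: each datum contributes -log(x).
    return log_normal_part - sum(ys)
```

No numerical integration is needed here, which is the convenience that a Gamma component model would give up.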

CRP(alpha) prior is not ideal

In keeping with the component model priors, the CRP alpha prior should be vague. If the prior is set to Gamma(1/n, n), we get a mean of 1 and heavy tails. A gamma prior also allows for easy sampling; see Mike West (1992).
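
A quick check of the proposed prior, reading Gamma(1/n, n) as shape 1/n and scale n (an assumption about the parameterization): the mean is shape * scale = 1 and the variance is shape * scale^2 = n, so the prior becomes more diffuse as the number of rows grows. The helper names below are hypothetical.

```python
import random


def crp_alpha_prior_moments(n):
    """Mean and variance of Gamma(shape=1/n, scale=n)."""
    shape, scale = 1.0 / n, float(n)
    return shape * scale, shape * scale ** 2


def draw_crp_alpha(n, rng):
    """Draw alpha from the prior; gammavariate takes (shape, scale)."""
    return rng.gammavariate(1.0 / n, float(n))


if __name__ == "__main__":
    rng = random.Random(0)
    print(crp_alpha_prior_moments(100))
    print([draw_crp_alpha(100, rng) for _ in range(5)])
```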

Refactor DataContainer

DataContainer holds two vectors, _data and _is_initialized. This probably causes some cache issues when we grab data and check whether they're initialized. It might be better to store one vector of tuples. We might even want each datum to be a struct with the fields value and missing. Then we'd do

DataContainer<double> data(x, is_missing);
double x = data[i].value;
// or
double x = data[i];  // throws out of bounds if missing
bool x_missing = data[i].missing;

Support multi-machine parallelization

Currently, Engine executes run() in parallel across models, but each model also has tasks of its own that could run in parallel. We need to be able to distribute tasks across many machines so we can exploit this extra parallelism in cross-categorization. This is especially important when the data have many columns and many views: the hyperparameter update for each column can run in parallel, as can the row assignment for each view.

Since the engine object has a _mapper attribute, we could easily allow the user to supply their own mapper function on initialization. To supply the standard mapper:

mapper = lambda f, args: list(map(f, args))
engine = Engine(df, mapper=mapper)

...or multiprocessing...

from multiprocessing.pool import Pool

pool = Pool()
mapper = pool.map

engine = Engine(df, mapper=mapper)

...or IPython Parallel...

import ipyparallel as ipp

c = ipp.Client()
mapper = c[:].map_sync  # map lives on a view, not on the Client itself

engine = Engine(df, mapper=mapper)

...or whatever.
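
To show how a pluggable mapper might be wired in, here is a minimal sketch. ToyEngine and _run_model are hypothetical stand-ins, not baxcat's Engine; the only assumption carried over from the issue is that a mapper takes a function and a sequence of arguments and returns the results in order. ThreadPool is used in place of a process pool only to keep the demo free of pickling concerns.

```python
from multiprocessing.pool import ThreadPool


def serial_mapper(f, args):
    """The standard mapper: run every task in the current process."""
    return list(map(f, args))


def _run_model(task):
    # Placeholder for running one model's transitions; a real model
    # would carry state instead of a bare seed.
    seed, n_iter = task
    return seed + n_iter


class ToyEngine:
    """Hypothetical stand-in for Engine; shows only the mapper plumbing."""

    def __init__(self, seeds, mapper=serial_mapper):
        self._seeds = list(seeds)
        self._mapper = mapper

    def run(self, n_iter):
        # One task per model; the mapper decides where each task runs.
        tasks = [(seed, n_iter) for seed in self._seeds]
        return list(self._mapper(_run_model, tasks))


if __name__ == "__main__":
    print(ToyEngine([0, 1, 2]).run(10))
    with ThreadPool(2) as pool:
        print(ToyEngine([0, 1, 2], mapper=pool.map).run(10))
```

Because the engine only ever calls the mapper with a function and a list, any backend that honors that call shape can be swapped in without touching the engine.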

Add data subsampling

The models within an Engine should be able to work on independent subsets of the data. The interface will work something like this:

engine = Engine(df, n_models=32)
engine.init_models(subsample_size=.5)

This will create an Engine with 32 models, each of which works on half of the data.
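
A minimal sketch of how the row subsets could be drawn, assuming subsample_size is a fraction of the rows and that each model samples independently without replacement. The function name subsample_rows and the seed parameter are hypothetical.

```python
import random


def subsample_rows(n_rows, n_models, subsample_size, seed=0):
    """Return one sorted list of row indices per model."""
    rng = random.Random(seed)
    # Keep at least one row so that no model starts empty.
    k = max(1, int(round(subsample_size * n_rows)))
    return [sorted(rng.sample(range(n_rows), k)) for _ in range(n_models)]


if __name__ == "__main__":
    subsets = subsample_rows(n_rows=100, n_models=32, subsample_size=0.5)
    print(len(subsets), len(subsets[0]))
```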
