baxtereaves / baxcat_cxx
C++ implementation of cross-categorization
License: GNU General Public License v3.0
If the data in a continuous column have very high (or low) magnitude, we get an overflow error when we try to calculate sum_x_sq. We should add a default z-normalizing converter for continuous columns.
Problem: The easy way to do this is to add lambdas to Engine._converters for those columns, but pickle can't deal with lambdas.
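One way around the pickle problem is a small callable class in place of a lambda. The sketch below is only an illustration: the class name `ZNormalizer` and its `fit` and `invert` methods are hypothetical, and it assumes a converter is just a callable applied to each value. The moments are computed on rescaled values so that fitting does not itself overflow or underflow.

```python
import math


class ZNormalizer:
    """Hypothetical converter: a callable object standing in for a lambda."""

    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    @classmethod
    def fit(cls, values):
        # Skip missing entries (None or NaN) when estimating the moments.
        xs = [x for x in values if x is not None and not math.isnan(x)]
        # Rescale by the largest magnitude so squaring cannot overflow
        # for huge values or underflow to zero for tiny ones.
        scale = max(abs(x) for x in xs) or 1.0
        us = [x / scale for x in xs]
        mean_u = sum(us) / len(us)
        var_u = sum((u - mean_u) ** 2 for u in us) / len(us)
        std_u = math.sqrt(var_u) or 1.0  # constant column: avoid dividing by 0
        return cls(mean_u * scale, std_u * scale)

    def __call__(self, x):
        # Map a raw value to its z-score.
        return (x - self.mean) / self.std

    def invert(self, z):
        # Map a z-score back to the original scale.
        return z * self.std + self.mean
```

Because an instance is a plain object of a module-level class, `pickle.dumps(converter)` works where pickling a lambda would fail.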
Use VonMises with normal prior on the center. A couple of notes:
Conditional probabilities aren't scaled properly. Fix it and write tests.
Model count data with the Poisson-gamma conjugate model.
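As a rough sketch of what the Poisson-gamma component needs, the functions below compute the conjugate posterior, the log marginal likelihood, and the log posterior predictive. The function names are hypothetical, and the gamma prior is assumed to use a shape and rate parameterization.

```python
from math import exp, lgamma, log


def poisson_posterior(counts, shape, rate):
    # Gamma(shape, rate) prior on the Poisson rate updates to
    # Gamma(shape + sum(counts), rate + n).
    return shape + sum(counts), rate + len(counts)


def poisson_log_marginal(counts, shape, rate):
    # log p(counts) with the Poisson rate integrated out.
    post_shape, post_rate = poisson_posterior(counts, shape, rate)
    return (shape * log(rate) - lgamma(shape)
            + lgamma(post_shape) - post_shape * log(post_rate)
            - sum(lgamma(x + 1) for x in counts))


def poisson_log_predictive(y, counts, shape, rate):
    # log p(y | counts): a negative binomial in the posterior parameters.
    a, b = poisson_posterior(counts, shape, rate)
    return (lgamma(a + y) - lgamma(a) - lgamma(y + 1)
            + a * log(b / (b + 1.0)) - y * log(b + 1.0))


def poisson_predictive(y, counts, shape, rate):
    # Probability-scale version of the predictive.
    return exp(poisson_log_predictive(y, counts, shape, rate))
```

A useful sanity check for a test suite is the chain rule: the marginal of n counts should equal the marginal of the first n - 1 counts times the predictive of the last one.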
We have a couple of choices of models. The easy choice is LogNormal because there is a closed-form marginal and predictive probability when both parameters are unknown. The problem with LogNormal is that it only has that one shape. It would be nice to use Gamma, but we'd have to estimate the normalizer.
In keeping with the component model priors, the CRP alpha prior should be vague. If the prior is set to Gamma(1/n, n), we get a mean of 1 and heavy tails. Also, a gamma prior allows for easy sampling; see Mike West (1992).
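To make the prior concrete, here is a minimal sketch that reads Gamma(1/n, n) as shape 1/n and scale n, which is the reading that gives a mean of 1. Taking n to be the number of rows is an assumption, and the function names are hypothetical.

```python
import random


def crp_alpha_prior(n):
    # (shape, scale) of the Gamma(1/n, n) prior; n is assumed to be
    # the number of rows being partitioned.
    return 1.0 / n, float(n)


def prior_mean_and_variance(n):
    shape, scale = crp_alpha_prior(n)
    # Gamma moments: mean = shape * scale, variance = shape * scale**2.
    return shape * scale, shape * scale ** 2


def draw_alpha(n, rng=random):
    # random.gammavariate takes (shape, scale), matching the reading above.
    shape, scale = crp_alpha_prior(n)
    return rng.gammavariate(shape, scale)
```

Under this reading the mean stays at 1 while the variance grows linearly with n, which is where the heavy tails come from.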
The data container holds two vectors, _data and _is_initialized. This probably causes some cache issues when we grab data and check whether they're initialized. It might be better to store one vector of tuples. We might even want each datum to be a struct with the fields value and missing; then we'd do
DataContainer<double> data(x, is_missing);
double x = data[i].value;
// or
double x = data[i]; // throws out of bounds if missing
bool x_missing = data[i].missing;
Currently Engine does run() in parallel for each model, but each model has parallel tasks of its own. We need to be able to run tasks on many machines so we can exploit the extra parallelism in cross-categorization. This is especially important when the data have many columns and many views: the hyperparameter update for each column goes in parallel, and the row assignment for each view goes in parallel.
Since the engine object has a _mapper attribute, we could easily allow the user to supply their own mapper function on initialization. To supply the standard mapper:
mapper = lambda f, args: list(map(f, args))
engine = Engine(df, mapper=mapper)
...or multiprocessing...
from multiprocessing.pool import Pool
pool = Pool()
mapper = pool.map
engine = Engine(df, mapper=mapper)
...or IPython Parallel...
import ipyparallel as ipp
c = ipp.Client()
mapper = c[:].map_sync
engine = Engine(df, mapper=mapper)
...or whatever.
Model binary data with Bernoulli with beta prior.
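A minimal sketch of the quantities a beta-Bernoulli component would need, assuming data coded as 0/1 and a Beta(a, b) prior on the success probability. The function names are hypothetical.

```python
from math import lgamma


def log_beta(a, b):
    # Log of the beta function B(a, b).
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def bernoulli_log_marginal(data, a, b):
    # log p(data) with the success probability integrated out:
    # B(a + k, b + n - k) / B(a, b), where k is the number of ones.
    n, k = len(data), sum(data)
    return log_beta(a + k, b + n - k) - log_beta(a, b)


def bernoulli_predictive_one(data, a, b):
    # p(next value = 1 | data) under the Beta(a + k, b + n - k) posterior.
    n, k = len(data), sum(data)
    return (a + k) / (a + b + n)
```

With a uniform Beta(1, 1) prior and data [1, 1, 0], the predictive probability of a 1 is 3/5.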
We'd like to know how similar certain objects are with respect to certain features.
def row_similarity(row_a, row_b, wrt=None):
    pass
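One possible definition, sketched under assumptions: average over models the fraction of relevant views in which the two rows share a category. The state layout used here (a `col_to_view` dict and a `row_to_category` dict per model) is hypothetical, and the extra `models` argument stands in for whatever the engine stores internally.

```python
def row_similarity(models, row_a, row_b, wrt=None):
    # models: list of dicts, each with
    #   "col_to_view": {column name -> view index}
    #   "row_to_category": {view index -> list of category ids, one per row}
    # wrt: optional list of column names; only views holding those
    #      columns are compared. None means every view.
    total = 0.0
    for model in models:
        col_to_view = model["col_to_view"]
        row_cats = model["row_to_category"]
        cols = list(col_to_view) if wrt is None else wrt
        views = {col_to_view[c] for c in cols}
        # Count the views in which both rows land in the same category.
        same = sum(row_cats[v][row_a] == row_cats[v][row_b] for v in views)
        total += same / len(views)
    return total / len(models)
```

Restricting `wrt` to a few columns answers "how similar are these rows with respect to these features" by ignoring views that contain none of them.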
The models within an Engine should be able to work on independent subsets of the data. The interface will work something like this:
engine = Engine(df, n_models=32)
engine.init_models(subsample_size=.5)
This will create an Engine with 32 models, each of which works on half the data.
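A rough sketch of how the row subsets might be drawn, assuming "independent" means each model samples its rows separately and without replacement. The helper name `draw_row_subsets` and its `seed` argument are hypothetical.

```python
import random


def draw_row_subsets(n_rows, n_models, subsample_size, seed=None):
    # subsample_size is the fraction of rows each model sees (e.g. 0.5).
    rng = random.Random(seed)
    k = max(1, round(subsample_size * n_rows))
    # One independently drawn, sorted list of row indices per model.
    return [sorted(rng.sample(range(n_rows), k)) for _ in range(n_models)]
```

Each model would then be initialized on only the rows in its own index list, for example `df.iloc[subset]` if `df` is a pandas DataFrame.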
The baxcat::dist::multinomial PMF is a categorical PMF, which is what we wanted. It should be renamed.
The namespace situation in the cxx code is out of hand. Past bax went a little crazy.
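To illustrate why the name matters, the sketch below contrasts the two PMFs in Python. These are illustrative functions, not the library's C++ code, and they assume strictly positive probabilities. A categorical PMF scores a single outcome index, while a multinomial PMF scores a vector of counts and includes the multinomial coefficient.

```python
from math import lgamma, log


def categorical_logpmf(x, probs):
    # Log probability of observing outcome index x in a single draw.
    return log(probs[x])


def multinomial_logpmf(counts, probs):
    # Log probability of a vector of counts over sum(counts) draws,
    # including the multinomial coefficient n! / (c_1! ... c_K!).
    n = sum(counts)
    out = lgamma(n + 1)
    for c, p in zip(counts, probs):
        out += c * log(p) - lgamma(c + 1)
    return out
```

The two agree only when the counts describe exactly one draw, such as `[0, 1, 0]` versus outcome index `1`.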