
cloudforest's People

Contributors

rbkreisberg, ricochet2200, ryanbressler


cloudforest's Issues

nfold with numeric data

Hello, I've got a regression model I'm trying to build. At the moment it seems like the nfold utility only splits on nominal data. I can easily build something on my end, but I figured I would mention it.

Great tool btw!

Optimization for evaloob

If evaloob proves to boost performance, an optimized function should be added that splits from the coded split and also includes the corrections for missing values etc.

Importance Overhaul: What Method(s) to Get P Values?

P-Values for variable importance are desirable as they are easier to interpret and will be potentially easier to drop in to our other tools.

A couple of different methods seem viable for this. The ace method as used in rf-ace involves repeatedly growing a forest including artificial contrasts of all features and using Wilcoxon tests.

Another method, presented in "Identification of Statistically Significant Features from Random Forests", tests the change produced by permuting each feature and testing on OOB cases after each tree. This is potentially more computationally efficient since only one forest needs to be grown.

Another interesting paper that might be complementary is "Understanding variable importances in forests of randomized trees" which presents work on totally randomized trees, Extra-Trees and random forests suggesting that the more randomized implementations might be of use when we are concerned primarily with feature selection.

AdaBoost for regression is broken

The current attempt to normalize the data and run it through an AdaBoost-like function doesn't produce good results. We should implement a simple cutoff-based method as has been published before.

Optimization for numerical features with few values.

Add a new numeric feature type (and detect such features on data load) that uses a pre-stored list of all the distinct values instead of sorting on each split.

I suspect this will be faster for sparse features, for feature types like hamming scat that have mostly one value, and for ordinal features with few values.

It could also support optimized mode finding for ordinal regression.

Column oriented features

In the original AFM format it is possible to choose between feature placements (columns/rows):

Based on the headers, the AFM reader is able to discern the right orientation (features as rows or columns) of the matrix.

Here only row-oriented feature placement is allowed.

Does not compile

growforest.go does not compile:
Line 336: CloudForest.DensityTarget not defined

Thread-safe voting

The documentation for Tree.Vote states

Since BallotBox is not thread safe trees should not vote into the same BallotBox in parallel.

However, both CatBallotBox and NumBallotBox declare themselves thread safe. Aren't these statements at odds? Is voting thread safe or not?

I do not have hard data but, from anecdotal experience, I tend to believe voting is indeed not thread safe. From the implementation of Tree.VoteCases it seems the state of the traversal is kept inside the Tree, which would cause unpredictable behavior if two or more votes run in parallel. Is my interpretation correct?

Document how to interpret importance.tsv

I have this sweet matrix, but I have very little idea what it means :)

N:sold_price    0   0   0   NaN 0   0
N:current_list_price    3.3253227556326355e+13  3215    2.672728164839731e+15   2.672728164839731e+15   40  1.675
N:lat   3.123188413118099e+12   1092    8.52630436781241e+13    8.52630436781241e+13    40  3.525
N:lon   1.3448376446951357e+13  1181    3.970633145962389e+14   3.970633145962389e+14   40  2.75
C:zip   1.1397540975320078e+14  243 6.924006142506948e+14   6.924006142506948e+14   40  2.275
C:property_type 7.441395153460111e+11   63  1.1720197366699675e+12  1.3788467490234912e+12  34  7
N:sqft  3.3402937521053375e+13  1185    9.895620240612062e+14   9.895620240612062e+14   40  2.175
N:lot_sqft  8.34956576885391e+12    973 2.0310318732737138e+14  2.0310318732737138e+14  40  3.475
N:bathrooms 9.852657867573695e+13   397 9.778762933566892e+14   9.778762933566892e+14   40  2.5
N:bedrooms  7.126175247115305e+13   254 4.525121281918218e+14   4.525121281918218e+14   40  3.85
N:year_built    3.1225402849528066e+12  923 7.205261707528602e+13   7.205261707528602e+13   40  3.825
N:favorited 0   0   0   NaN 0   0
N:current_photos_count  9.073837896732547e+12   1028    2.3319763394602644e+14  2.3319763394602644e+14  40  2.599999999999999
C:commission_percent    2.6080912579452184e+13  200 1.3040456289726092e+14  1.3040456289726092e+14  40  5.324999999999998

Sometimes growforest runs for a long time in the last few trees

I've noticed more than once that growforest tends to output the first trees relatively fast and slows down at the end, when there are around 5 or 6 trees missing (out of 100).

What I believe is happening is that the recursion sometimes keeps going for a really long time regardless of the depth. I haven't seen it go into an infinite loop or a stack overflow, but I suppose that's possible if my interpretation is correct.

At one point last year I remember seeing this, and I made a change to my local copy that broke out of the recursion when the depth reached some high number that was almost never hit (100 thousand or 1 million, I can't remember). It worked, even though it was very ugly. Applyforest was happy with the .sf file generated; nothing seemed to be wrong.

This time around I'd like to understand what's happening better to see if there is a better solution. I've only been experimenting with combinations of -oob, -progress and -vet so far, so there might be flags already to help with this, I'm not sure.

Need balanced bagging, other strategies, for unbalanced data.

The plan is to implement balanced sampling of cases with replacement at the bagging level as follows:

Sample which class to draw from (uniform distribution to ensure balance on average).
Draw a case from that class with replacement.
Repeat.

We already have cost weighted classification. Please comment or open issues with other strategies.

References:
http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3163175/
http://www.biomedcentral.com/1471-2105/11/523
http://bib.oxfordjournals.org/content/early/2012/03/08/bib.bbs006
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0067863

Regression analog of roughly balanced bagging?

Roughly balanced bagging has proven great for unbalanced classification problems. We should develop a version for regression that samples in-bag cases to ensure roughly uniform density across the range of the target variable.

Report specificity, sensitivity etc for binary classification with `-test`

I made a small patch on my own fork to report a little bit more data with -test when growforest is finishing. It looks like this:

Error: 0.06121835978431722
Accuracy: 48510 / 51673 = 0.9387881485495326
True Negatives 24999 / Total Negatives 26585 = Specificity (True Negative Rate) 0.940342
True Positives 23511 / Total Positives 25088 = Sensitivity (True Positive Rate) 0.937141
True Positives 23511 / Predicted Positives 25097 = Precision (Positive Predictive Value) 0.936805
True Negatives 24999 / Predicted Negatives 26576 = Negative Predictive Value 0.940661
F1 Score: 0.936973
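All six lines derive from four confusion counts. A self-contained sketch that reproduces the numbers above (the `binaryMetrics` helper is hypothetical, not the actual patch):

```go
package main

import "fmt"

// binaryMetrics derives the rates in the sample output from the four
// confusion counts: true positives, false positives, true negatives,
// false negatives.
func binaryMetrics(tp, fp, tn, fn int) (specificity, sensitivity, precision, f1 float64) {
	specificity = float64(tn) / float64(tn+fp)
	sensitivity = float64(tp) / float64(tp+fn)
	precision = float64(tp) / float64(tp+fp)
	f1 = 2 * precision * sensitivity / (precision + sensitivity)
	return
}

func main() {
	// Counts from the sample output: tn=24999, fp=26585-24999=1586,
	// tp=23511, fn=25088-23511=1577.
	spec, sens, prec, f1 := binaryMetrics(23511, 1586, 24999, 1577)
	fmt.Printf("spec=%.6f sens=%.6f prec=%.6f f1=%.6f\n", spec, sens, prec, f1)
	// prints spec=0.940342 sens=0.937141 prec=0.936805 f1=0.936973
}
```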

I didn't make a PR because in my little patch I just assumed I was performing classification with 2 categories (it's what I always do) and didn't check if this was really the case.

Would this be useful in general? If so, I can add the checks to run this only when it makes sense and submit it.

ExtraTrees

Extremely randomized trees. This may require a new feature/target interface depending on how much overlap there is in parameters with the existing splitting code. It should be easy to generate random coded splits (the categorical code does this for high cardinality features); we just need to determine what parameters to expose.

http://orbi.ulg.ac.be/bitstream/2268/9357/1/geurts-mlj-advance.pdf

panic: interface conversion: ... missing method CatToNum

I run into an error using libsvm's sparse format, given the following files:

  1. Training data, 5 instances: https://www.dropbox.com/s/nlzfgj9235ffhsd/cf_test2.train.libsvm?dl=1
1.00000001 2:4.682604e-01 3:6.842105e-01 ...
1.60000001 2:2.247624e+00 3:4.454545e-01 ...
...
  2. Testing data, 2 instances: https://www.dropbox.com/s/t44uc09rdyboiev/cf_test2.test.libsvm?dl=1
0.16666601 2:1.619004e+00 3:3.390276e-01 ...
0.88888801 2:6.182730e-01 3:1.569653e-01 ...

Training goes well, but testing throws an error. Any idea what is going on?

$ growforest -train cf_test2.train.libsvm -rfpred cf_test2.sf -target 0
...
$ applyforest -fm cf_test2.test.libsvm -rfpred cf_test2.sf -preds cf_test2.pred
Target is 0 in feature 0
panic: interface conversion: *CloudForest.DenseNumFeature is not CloudForest.CatFeature: missing method CatToNum

goroutine 1 [running]:
github.com/ryanbressler/CloudForest.(*CatBallotBox).TallyError(0x20832bfe0, 0x2208291c48, 0x2082b40f0, 0x2)
    /Users/jg/src/github.com/ryanbressler/CloudForest/catballotbox.go:91 +0x5c
main.main()
    /Users/jg/src/github.com/ryanbressler/CloudForest/applyforest/applyforest.go:69 +0xb5a

Hashing trick with libsvm format?

Howdy,

Your package looks really great. I have a lot of NLPish data and so it would be great to be able to use sparse data with strings instead of integers. Sort of like VW, but with fewer bells and whistles.

Is this very difficult to add?

Thanks for the awesome open source software!
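For reference, the hashing trick asked about just maps each string feature name to a fixed-width column index, e.g. with FNV (a sketch; the function name is made up):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashFeature maps an arbitrary string feature name to a column index
// in [0, dim), so sparse string-keyed data can be fed to code that
// expects integer column ids, VW-style.
func hashFeature(name string, dim uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(name))
	return h.Sum32() % dim
}

func main() {
	fmt.Println(hashFeature("word:cloud", 1<<20))
}
```

Collisions are tolerated by design; with a dimension like 2^20 they are rare enough that tree ensembles are typically insensitive to them.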

CloudRF for large data

Hi Ryan,

  • How can I use CloudRF to deal with large data on multiple servers?
    My dataset has more than 60,000,000 samples and 2,000 features.
  • Can CloudRF read the libsvm format file? Xi and Yi represent like col:value
    e.g: 3 4:1 5:2 6:8 4.5:9 -1.2: 10
    2 5:1 2:4 5:6 3.5:8 -1.9: 10
    the first column is Y, others are Xi.

Thank you!

Tests, Benchmarks on Public Data

So far all of the testing is on internal ISB data sets...need to do some benchmarking, testing and comparison to other implementations on public data sets.

l1 gini impurity?

Can gini impurity be implemented as 1-sum(correct/total), and will eliminating the square have effects similar to l1 vs l2 regression?
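For reference, the standard (squared) Gini impurity is 1 - sum(p_i^2). Note that 1 - sum(p_i) as literally written is always zero, since the class proportions sum to one; the closest l1-style analog is the misclassification error 1 - max(p_i). A sketch comparing the two:

```go
package main

import "fmt"

// impurities returns the standard Gini impurity 1 - sum(p_i^2) and the
// misclassification error 1 - max(p_i) for a node's class counts.
func impurities(counts []int) (gini, miscls float64) {
	total := 0
	for _, c := range counts {
		total += c
	}
	maxP := 0.0
	gini = 1.0
	for _, c := range counts {
		p := float64(c) / float64(total)
		gini -= p * p // subtract each squared proportion
		if p > maxP {
			maxP = p
		}
	}
	return gini, 1 - maxP
}

func main() {
	g, m := impurities([]int{8, 2}) // node with class counts 8 and 2
	fmt.Printf("gini=%.2f miscls=%.2f\n", g, m)
	// prints gini=0.32 miscls=0.20
}
```

Dropping the square does flatten the criterion: misclassification error is piecewise linear in the proportions, so it rewards purity less sharply than Gini, loosely analogous to l1 vs l2 losses.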

organize or extract some useful utils

Hi,

I notice there are many useful tools in this package, for example the libsvm read/write code, the data matrix code, etc. Do you have any plan to extract them into separate packages? I think it would be good to do so, since other ML-related packages could then be built on these utils.
I'd like to help with this task if you have any plan for it! Please let me know what you think!
Thanks.

Merge forest

Hi,

I'm wondering if it would be good to implement the following feature:
the user can train several forests, and then applyforest can take more than one forest to make predictions

pros: users can easily train forests on several machines without constructing a message interface

and I think RF by default is pretty OK with aggregating several forests, right?
