
cloudforest's People

Contributors

rbkreisberg, ricochet2200, ryanbressler


cloudforest's Issues

nfold with numeric data

Hello, I've got a regression model I'm trying to build. At the moment it seems like the nfold utility only splits on nominal data. I can easily build something on my end, but I figured I would mention it.

Great tool btw!

Optimization for evaloob

If evaloob proves to boost performance, an optimized function should be added that splits from the coded split and also includes the corrections for missing values etc.

Importance Overhaul: What Method(s) to Get P Values?

P-Values for variable importance are desirable as they are easier to interpret and will be potentially easier to drop in to our other tools.

A couple of different methods seem viable for this. The ace method as used in rf-ace involves repeatedly growing a forest including artificial contrasts of all features and using Wilcoxon tests.

Another method, presented in "Identification of Statistically Significant Features from Random Forests", tests the change produced by permuting each feature and testing on OOB cases after each tree. This is potentially more computationally efficient since only one forest needs to be grown.

Another interesting paper that might be complementary is "Understanding variable importances in forests of randomized trees" which presents work on totally randomized trees, Extra-Trees and random forests suggesting that the more randomized implementations might be of use when we are concerned primarily with feature selection.

AdaBoost for regression is broken

The current attempt to normalize the data and run it through an AdaBoost-like function doesn't produce good results. We should implement a simple cutoff-based method as has been published before.

Optimization for numerical features with few values.

Add a new numeric feature type (and detect such features on data load) that uses a pre-stored list of all the distinct values instead of sorting on each split.

I suspect this will be faster for sparse features, for feature types like hamming scat that have mostly one value, and for ordinal features with few values.

It could also support optimized mode finding for ordinal regression.

Column oriented features

In the original AFM format it is possible to choose between feature placements (columns/rows):

Based on the headers, the AFM reader is able to discern the right orientation (features as rows or columns) of the matrix.

Here only row-oriented feature placement is allowed.

Does not compile

growforest.go does not compile:
Line 336: CloudForest.DensityTarget not defined

Thread-safe voting

The documentation for Tree.Vote states

Since BallotBox is not thread safe trees should not vote into the same BallotBox in parallel.

However, both CatBallotBox and NumBallotBox declare themselves thread safe. Aren't these statements at odds? Is voting thread safe or not?

I do not have hard data but, from anecdotal experience, I tend to believe voting is indeed not thread safe. From the implementation of Tree.VoteCases it seems the state of the traversal is kept inside the Tree, which would cause unpredictable behavior if two or more votes run in parallel. Is my interpretation correct?

Document how to interpret importance.tsv

I have this sweet matrix, but I have very little idea what it means :)

N:sold_price    0   0   0   NaN 0   0
N:current_list_price    3.3253227556326355e+13  3215    2.672728164839731e+15   2.672728164839731e+15   40  1.675
N:lat   3.123188413118099e+12   1092    8.52630436781241e+13    8.52630436781241e+13    40  3.525
N:lon   1.3448376446951357e+13  1181    3.970633145962389e+14   3.970633145962389e+14   40  2.75
C:zip   1.1397540975320078e+14  243 6.924006142506948e+14   6.924006142506948e+14   40  2.275
C:property_type 7.441395153460111e+11   63  1.1720197366699675e+12  1.3788467490234912e+12  34  7
N:sqft  3.3402937521053375e+13  1185    9.895620240612062e+14   9.895620240612062e+14   40  2.175
N:lot_sqft  8.34956576885391e+12    973 2.0310318732737138e+14  2.0310318732737138e+14  40  3.475
N:bathrooms 9.852657867573695e+13   397 9.778762933566892e+14   9.778762933566892e+14   40  2.5
N:bedrooms  7.126175247115305e+13   254 4.525121281918218e+14   4.525121281918218e+14   40  3.85
N:year_built    3.1225402849528066e+12  923 7.205261707528602e+13   7.205261707528602e+13   40  3.825
N:favorited 0   0   0   NaN 0   0
N:current_photos_count  9.073837896732547e+12   1028    2.3319763394602644e+14  2.3319763394602644e+14  40  2.599999999999999
C:commission_percent    2.6080912579452184e+13  200 1.3040456289726092e+14  1.3040456289726092e+14  40  5.324999999999998

Sometimes growforest runs for a long time in the last few trees

I've noticed more than once that growforest tends to output the first trees relatively fast and slows down at the end, when there are around 5 or 6 trees missing (out of 100).

What I believe is happening is that the recursion sometimes keeps going for a really long time regardless of the depth. I haven't seen it go into an infinite loop or a stack overflow, but I suppose that's possible if my interpretation is correct.

At one point last year I remember seeing this, and I made a change to my local copy that broke out of the recursion when the depth reached some high number that was almost never hit (100 thousand or 1 million, I can't remember). It worked, even though it was very ugly. Applyforest was happy with the .sf file generated; nothing seemed to be wrong.

This time around I'd like to understand what's happening better to see if there is a better solution. I've only been experimenting with combinations of -oob, -progress and -vet so far, so there might be flags already to help with this, I'm not sure.

Need balanced bagging, other strategies, for unbalanced data.

The plan is to implement balanced sampling of cases with replacement at the bagging level as follows:

Sample which class to draw from (uniform distribution to ensure balance on average).
Draw a case from that class with replacement.
Repeat.

We already have cost weighted classification. Please comment or open issues with other strategies.

References:
http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3163175/
http://www.biomedcentral.com/1471-2105/11/523
http://bib.oxfordjournals.org/content/early/2012/03/08/bib.bbs006
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0067863

Regression analog of roughly balanced bagging?

Roughly balanced bagging has proven great for unbalanced classification problems. We should develop a version for regression that samples in-bag cases to ensure roughly uniform density across the range of the target variable.

Report specificity, sensitivity etc for binary classification with `-test`

I made a small patch on my own fork to report a little bit more data with -test when growforest is finishing. It looks like this:

Error: 0.06121835978431722
Accuracy: 48510 / 51673 = 0.9387881485495326
True Negatives 24999 / Total Negatives 26585 = Specificity (True Negative Rate) 0.940342
True Positives 23511 / Total Positives 25088 = Sensitivity (True Positive Rate) 0.937141
True Positives 23511 / Predicted Positives 25097 = Precision (Positive Predictive Value) 0.936805
True Negatives 24999 / Predicted Negatives 26576 = Negative Predictive Value 0.940661
F1 Score: 0.936973
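All six lines derive from four confusion counts. A self-contained sketch that reproduces the numbers above (the `binaryMetrics` helper is hypothetical, not the actual patch):

```go
package main

import "fmt"

// binaryMetrics derives the rates in the sample output from the four
// confusion counts: true positives, false positives, true negatives,
// false negatives.
func binaryMetrics(tp, fp, tn, fn int) (specificity, sensitivity, precision, f1 float64) {
	specificity = float64(tn) / float64(tn+fp)
	sensitivity = float64(tp) / float64(tp+fn)
	precision = float64(tp) / float64(tp+fp)
	f1 = 2 * precision * sensitivity / (precision + sensitivity)
	return
}

func main() {
	// Counts from the sample output: tn=24999, fp=26585-24999=1586,
	// tp=23511, fn=25088-23511=1577.
	spec, sens, prec, f1 := binaryMetrics(23511, 1586, 24999, 1577)
	fmt.Printf("spec=%.6f sens=%.6f prec=%.6f f1=%.6f\n", spec, sens, prec, f1)
	// prints spec=0.940342 sens=0.937141 prec=0.936805 f1=0.936973
}
```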

I didn't make a PR because in my little patch I just assumed I was performing classification with 2 categories (it's what I always do) and didn't check if this was really the case.

Would this be useful in general? If so, I can add the checks to run this only when it makes sense and submit it.

ExtraTrees

Extremely randomized trees. This may require a new feature/target interface depending on how much overlap there is in parameters with the existing splitting code. It should be easy to generate random coded splits (the categorical code does this for high cardinality features); we just need to determine what parameters to expose.

http://orbi.ulg.ac.be/bitstream/2268/9357/1/geurts-mlj-advance.pdf

panic: interface conversion: ... missing method CatToNum

I run into an error using libsvm's sparse format, given the following files:

  1. Training data, 5 instances: https://www.dropbox.com/s/nlzfgj9235ffhsd/cf_test2.train.libsvm?dl=1
1.00000001 2:4.682604e-01 3:6.842105e-01 ...
1.60000001 2:2.247624e+00 3:4.454545e-01 ...
...
  2. Testing data, 2 instances: https://www.dropbox.com/s/t44uc09rdyboiev/cf_test2.test.libsvm?dl=1
0.16666601 2:1.619004e+00 3:3.390276e-01 ...
0.88888801 2:6.182730e-01 3:1.569653e-01 ...

Training goes well, but testing throws an error. Any idea what is going on?

$ growforest -train cf_test2.train.libsvm -rfpred cf_test2.sf -target 0
...
$ applyforest -fm cf_test2.test.libsvm -rfpred cf_test2.sf -preds cf_test2.pred
Target is 0 in feature 0
panic: interface conversion: *CloudForest.DenseNumFeature is not CloudForest.CatFeature: missing method CatToNum

goroutine 1 [running]:
github.com/ryanbressler/CloudForest.(*CatBallotBox).TallyError(0x20832bfe0, 0x2208291c48, 0x2082b40f0, 0x2)
    /Users/jg/src/github.com/ryanbressler/CloudForest/catballotbox.go:91 +0x5c
main.main()
    /Users/jg/src/github.com/ryanbressler/CloudForest/applyforest/applyforest.go:69 +0xb5a

Hashing trick with libsvm format?

Howdy,

Your package looks really great. I have a lot of NLPish data and so it would be great to be able to use sparse data with strings instead of integers. Sort of like VW, but with fewer bells and whistles.

Is this very difficult to add?

Thanks for the awesome open source software!
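For reference, the hashing trick asked about just maps each string feature name to a fixed-width column index, e.g. with FNV (a sketch; the function name is made up):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashFeature maps an arbitrary string feature name to a column index
// in [0, dim), so sparse string-keyed data can be fed to code that
// expects integer column ids, VW-style.
func hashFeature(name string, dim uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(name))
	return h.Sum32() % dim
}

func main() {
	fmt.Println(hashFeature("word:cloud", 1<<20))
}
```

Collisions are tolerated by design; with a dimension like 2^20 they are rare enough that tree ensembles are typically insensitive to them.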

CloudRF for large data

Hi Ryan,

  • How can I use CloudRF to deal with large data on multiple servers?
    My dataset has more than 60,000,000 samples and 2,000 features.
  • Can CloudRF read the libsvm format file? Xi and Yi represent like col:value
    e.g: 3 4:1 5:2 6:8 4.5:9 -1.2: 10
    2 5:1 2:4 5:6 3.5:8 -1.9: 10
    the first column is Y, others are Xi.

Thank you!

Tests, Benchmarks on Public Data

So far all of the testing is on internal ISB data sets...need to do some benchmarking, testing and comparison to other implementations on public data sets.

l1 gini impurity?

Can gini impurity be implemented as 1-sum(correct/total), and will eliminating the square have effects similar to l1 vs l2 regression?
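For reference, the standard (squared) Gini impurity is 1 - sum(p_i^2). Note that 1 - sum(p_i) as literally written is always zero, since the class proportions sum to one; the closest l1-style analog is the misclassification error 1 - max(p_i). A sketch comparing the two:

```go
package main

import "fmt"

// impurities returns the standard Gini impurity 1 - sum(p_i^2) and the
// misclassification error 1 - max(p_i) for a node's class counts.
func impurities(counts []int) (gini, miscls float64) {
	total := 0
	for _, c := range counts {
		total += c
	}
	maxP := 0.0
	gini = 1.0
	for _, c := range counts {
		p := float64(c) / float64(total)
		gini -= p * p // subtract each squared proportion
		if p > maxP {
			maxP = p
		}
	}
	return gini, 1 - maxP
}

func main() {
	g, m := impurities([]int{8, 2}) // node with class counts 8 and 2
	fmt.Printf("gini=%.2f miscls=%.2f\n", g, m)
	// prints gini=0.32 miscls=0.20
}
```

Dropping the square does flatten the criterion: misclassification error is piecewise linear in the proportions, so it rewards purity less sharply than Gini, loosely analogous to l1 vs l2 losses.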

organize or extract some useful utils

Hi,

I notice there are many useful tools in this package, for example the libsvm read/write code, the data matrix code, etc. Do you have any plan to extract them into separate packages? I think it would be good to do so, since other ML-related packages could then be built on these utils.
I'd like to help with this task if you have any plan for it! Please let me know what you think!
Thanks.

Merge forest

Hi,

I'm wondering if it would be good to implement the following feature:
the user can train several forests, and then applyforest can take more than one forest to make predictions

pros: users can easily train forests on several machines without constructing a message interface

and I think RF by default is pretty OK with aggregating several forests, right?
