GithubHelp home page GithubHelp logo

storm-ml-1's Introduction

open-streamer

What is open-streamer ?

Open-Streamer is a library base on Storm platform,it is described by Trident api.and it focus on the real-time algorithm and online learnning algorithm,this library has implemented some classical algorithms type,like classifier, clustering,Regression,Cardinality,and Average Counting.etc,you can build some smart applications with this library over big data environment,it's easy to use this library on your project.I will give the start-tutorial for you to help you start to use this library.This library is not so ORIGINAL,you must know the Machine Learnning algorithm Library Over Storm : Trident-Ml,open-streamer extends trident-ml,Thanks Trident-ml's open source spirit.

Open-Stream Algorithms Overviews:

  • Average
    • Moving Average[1]
    • EWMA average[2]
  • Cardinality
    • LogLog Cardinality[3]
    • HyperLogLog cardinality[4]
    • Adaptive Counting Cardinality[5]
    • Linear Counting
  • Classification
    • Committee Classifier[6]
    • Passive Aggressive Classifier[7]
    • Perceptron Classifier[8]
    • Winnow Classifier[9]
    • Balanced Winnow Classifier[10]
    • Modify Banalced Winnow Classifier[11]
  • Clustering
    • Birch
    • Canopy
    • K-means
  • Frequency Counting
    • Count Sketch[12]
    • Lossy Counting[13]
    • Stick Sampling Counting[14]
    • Space Saving[15]
    • Top-k
  • Regression
    • Ftrl regression[16]
    • Perceptron Regression[17]
    • Passive Aggression Regression

Tutorial

You should have a spout for your Topology(DAG),you can Reference https://github.com/pandening/open-streamer/blob/master/src/main/java/com/hujian/trident/ml/examples/data/DoubleSpout.java Then,the data flow from spout will needto be transformed to an Object instance of com.hujian.trident.ml.core.Instance,there is a good and sample instance creator for you in the package: com.hujian.trident.ml.core.InstanceCreator , you can use this creator to create an instance and then emit the data flow to downstream.you should know about Trident's Api,like Function,Filter,StateUpdate,etc,for example,if you want to do some filter work on the data flow,you can let the data flow into a filter of Trident,then emit the data that you want to the downstram.

you can builder your topology with Trident Apis,for example,you can build an topology to run an average algorithm,like Moving Average, the only thing you need to do is adjust the runtime parameter , the follow java code will let you know how to use this library.

average is instance of IAverage,you can let average = new MovingAverage or EWMAAverage.

        TridentTopology tridentTopology = new TridentTopology();

        tridentTopology.newStream(topologyName,new DoubleSpout(10))
                .each(new Fields("item","frequency","type"),
                        new CountEntryInstanceCreator<Double>(),new Fields("instance"))
                .partitionPersist(new MemoryMapState.Factory(),new Fields("instance"),
                        new AverageModelUpdater("average-model-update",average),new Fields("average"))
                .newValuesStream()
                .each(new Fields("average"),new ShowAverageFunction(),new Fields("done"))
                .each(new Fields("done"),new ShowAverageFunction(),new Fields(""));

There is an Integrated java code https://github.com/pandening/open-streamer/blob/master/src/main/java/com/hujian/trident/ml/GPAPPBuilder.java

A complex demo for this library

Hybrid Classifier , a complex demo for this library,you can add Arbitrary Classifiers to the factory,the factory will choose some of its to classify the instrance,in the actual demo,I use 4 classifiers to test the hybrid classifier,a Committee Classifier,and 3 Passive Aggressive Classifier(Pa,Pa-I,PA-II),the data flow will be classified by the Committee Classiffer firstly,the Classification result will store at a singleton class,you can implement your storage by implement IStore,then the data flow will continue flow to downstream, the PA Classifier will receive the instance,the PA classifier will first do classify,get the classication result,then ย  get the classification list of this instance by instance id(each instance will be signed a instanceID),then the project will judge, if Committee's classication result equals PA's result,then end of classifying,get the classification result,and remove the instance from storage,then go to a Trident Function named EndFunction,do some print work,you can do more complexer work here,and,if Comittee's result != PA's result,the data will continue to next classifier PA-I,do some work like PA classifier,and if necessary,the PA-II classifier will do the same work like PA,PA-I,and after PA-II classifier,if there is no same classification result in the result list of this instance,the program will vote an classifier's result to you according to a weight vector,this vector will maintain by each Classifier,if any Classifier can get the classification result,the weight vector will be updated,the rules to update is: -(1) scanning each classifier's classification result,if the Classifier's classification result is the final result,then the classifier's weight will add 1L -(2) after updating the weight vector,for some reasons,we need to normalize the weight vector's sum to 100(or others small value) in the final classifier,the program also do some statistic work,like Right/Error count,you can print the information to watch the process of algorithm running.

Relevant Knowledge

  • Storm [18]
  • Trident [19]
  • Trident-ml [20]
  • Mahout [21]

Authors

Jian Hu,NanKai Edu,Tian Jin,China,2013.9 - 2017.6 (compute science and technology)

Email:[email protected]

Copyright and license

Copyright 2013-2017 Hu Jian

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Links & References

[1] Key Words: Moving Average , Goolgle

[2] http://blog.csdn.net/x_i_y_u_e/article/details/44194761

[3] http://blog.csdn.net/keshixi/article/details/46730231

[4] http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

[5] Fast and Accurate Traffic Matrix Measurement Using Adaptive Cardinality Counting

[6] A Multi-class Linear Learning Algorithm Related to Winnow

[7] Online Passive-Aggressive Algorithms

[8] http://www.cnblogs.com/jerrylead/archive/2011/04/18/2020173.html

[9] https://en.wikipedia.org/wiki/Winnow_(algorithm)

[10] Single-Pass Online Learning: Performance, VotingSchemes and Online Feature Selection

[11] Gender Identification on Twitter Using the Modified Balanced Winnow

[12] http://dimacs.rutgers.edu/~graham/pubs/papers/freqvldbj.pdf

[13] Approximate Frequency Counts over Data Streams

[14] Approximate Frequency Counts over Data Streams

[15] Efficient Computation of Frequent and Top-k Elements in Data Streams

[16] Ad Click Prediction: a View from the Trenches

[17] Online Passive-Aggressive Algorithms

[18] http://storm.apache.org/

[19] https://github.com/apache/storm/tree/master/storm-core/src/jvm/org/apache/storm/trident

[20] https://github.com/pmerienne/trident-ml

[21] http://mahout.apache.org/

storm-ml-1's People

Contributors

pandening avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.