The statreduce from flipkart-incubator

statreduce's Introduction

statreduce

statreduce is a library for writing Hadoop MapReduce jobs with the map step in Java and the reduce step in R, for statistical computations. It provides simple abstractions such as ListStatReducer and MatrixStatReducer.

ListStatReducer takes a list of int/double values as input from the map step, converts it to an R vector, invokes an R function, and stores the result.

MatrixStatReducer takes a matrix of int/double values as input from the map step, converts it to an R matrix, invokes an R function, and stores the result.

To use this library, add statreduce as a dependency:

<dependency>
  <groupId>com.fk</groupId>
  <artifactId>statreduce</artifactId>
  <version>0.0.1</version>
</dependency>

To perform a statistical operation on a vector of int/double values, use ListStatReducer:

public class TemperatureSummaryReducer extends ListStatReducer {
    public TemperatureSummaryReducer(){
        super("tempValues", "avgTemp", "", "avgTemp <- mean(tempValues)", Double.class, Double.class);
    }
}

tempValues is the double vector passed to R.

avgTemp <- mean(tempValues) is the R call that is invoked.

avgTemp is the value returned as the reducer output.
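To make the data flow concrete, the following is a minimal plain-Java sketch of what the R call above computes for a single reduce key. It does not use statreduce or R; the class `TemperatureMeanSketch` and its methods are hypothetical names introduced only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in, not part of statreduce: shows in plain Java what the
// R call avgTemp <- mean(tempValues) computes for the values of one reduce key.
class TemperatureMeanSketch {

    // Copies the values received for one key into a primitive array,
    // loosely mirroring the list-to-vector conversion described above.
    static double[] toVector(List<Double> values) {
        double[] vector = new double[values.size()];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = values.get(i);
        }
        return vector;
    }

    // Arithmetic mean. R's mean() yields NaN for an empty numeric vector,
    // so this sketch does the same.
    static double mean(double[] vector) {
        if (vector.length == 0) {
            return Double.NaN;
        }
        double sum = 0.0;
        for (double v : vector) {
            sum += v;
        }
        return sum / vector.length;
    }

    public static void main(String[] args) {
        double[] tempValues = toVector(Arrays.asList(21.5, 23.0, 19.5));
        System.out.println("avgTemp = " + mean(tempValues));
    }
}
```

In the real reducer the mean is evaluated by R rather than by Java; the sketch only shows the shape of the input (many doubles) and the output (one double).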

To perform a statistical operation on a matrix of int/double values, use MatrixStatReducer:

public class MoneyballOnBasePercentageReducer extends MatrixStatReducer {
    public MoneyballOnBasePercentageReducer() {
        super("playerStats", "obp", ScriptUtils.fromFile("/tmp/on_base_percentage.R"),
                "obp <- calculateOBP(playerStats)", Double.class, Double.class);
    }
}

This calculates the on-base percentage for baseball players by taking playerStats as input and returning obp as output. R functions can be loaded from local files, HDFS, or a JAR.

See MoneyballOnBasePercentageDriver or TemperatureSummaryDriver for usage examples. Sample data and R functions are in the data and functions directories.
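For readers unfamiliar with the statistic, here is a rough plain-Java sketch of an on-base percentage calculation over a matrix of player statistics. It uses the standard formula OBP = (H + BB + HBP) / (AB + BB + HBP + SF). The column order, the class `OnBasePercentageSketch`, and the choice to sum all rows before dividing are assumptions made for this example; the actual `calculateOBP` R function may be written differently.

```java
// Hypothetical illustration, not part of statreduce and not the project's
// calculateOBP R function.
class OnBasePercentageSketch {

    // Assumed column order in each row of the matrix:
    // hits, walks, hit-by-pitch, at-bats, sacrifice flies.
    static final int H = 0, BB = 1, HBP = 2, AB = 3, SF = 4;

    // Sums every column over all rows, then applies
    // OBP = (H + BB + HBP) / (AB + BB + HBP + SF).
    static double onBasePercentage(double[][] playerStats) {
        double[] totals = new double[5];
        for (double[] row : playerStats) {
            for (int c = 0; c < totals.length; c++) {
                totals[c] += row[c];
            }
        }
        double timesOnBase = totals[H] + totals[BB] + totals[HBP];
        double opportunities = totals[AB] + totals[BB] + totals[HBP] + totals[SF];
        // No plate appearances: the ratio is undefined.
        return opportunities == 0 ? Double.NaN : timesOnBase / opportunities;
    }

    public static void main(String[] args) {
        double[][] playerStats = {
            {10, 5, 1, 40, 2},
            {20, 5, 1, 60, 2},
        };
        System.out.println("obp = " + onBasePercentage(playerStats));
    }
}
```

As with the reducer above, the input is a matrix and the output is a single double.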

statreduce's Issues

Generate List output from MatrixStatReducer

MatrixStatReducer can currently generate only a single int/double value as output.
Extend it to generate a list of int/double values as output.

Generate List output from ListStatReducer

ListStatReducer can currently generate only a single int/double value as output.
Extend it to generate a list of int/double values as output.

For example, if a statistical summary of the input series is generated as output, it should be possible to emit all of its values as a list of int/double values in the reducer.
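As an illustration of the kind of list output this issue asks for, the sketch below computes a six-number summary (minimum, first quartile, median, mean, third quartile, maximum) in plain Java and returns it as a list of doubles. The class `SeriesSummarySketch` is hypothetical and is not a proposed statreduce API. The quantile rule is the linear-interpolation rule that R's quantile() uses by default (type 7).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a list-valued reducer result;
// not part of statreduce.
class SeriesSummarySketch {

    // Linear-interpolation quantile over an already sorted array
    // (the rule R's quantile() uses by default, type 7).
    static double quantile(double[] sorted, double p) {
        double h = (sorted.length - 1) * p;
        int lo = (int) Math.floor(h);
        int hi = Math.min(lo + 1, sorted.length - 1);
        return sorted[lo] + (h - lo) * (sorted[hi] - sorted[lo]);
    }

    // Returns [min, 1st quartile, median, mean, 3rd quartile, max].
    static List<Double> summary(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("summary of an empty series is undefined");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double sum = 0.0;
        for (double v : sorted) {
            sum += v;
        }
        List<Double> result = new ArrayList<>();
        result.add(sorted[0]);
        result.add(quantile(sorted, 0.25));
        result.add(quantile(sorted, 0.5));
        result.add(sum / sorted.length);
        result.add(quantile(sorted, 0.75));
        result.add(sorted[sorted.length - 1]);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(summary(new double[] {4.0, 1.0, 3.0, 2.0}));
    }
}
```

A reducer with list output would emit all six values for a key instead of a single number.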
