scalaz / scalaz-analytics

A high-performance, purely-functional library for doing computational analysis and statistics over data in a type-safe way

License: Apache License 2.0

Scala 100.00%

scalaz-analytics's Introduction

scalaz-analytics

Goal

Scalaz Analytics provides a high-performance, purely-functional library for doing computational analysis and statistics over data in a type-safe way.

Introduction & Highlights

Scalaz Analytics is a principled functional programming library for data processing and analytics.

  • Simple and principled
  • First class support for analytics and data science
  • Pure type-safe, functional interface that integrates with other Scalaz projects
  • Supports batch and streaming
  • Efficient on both small and large data sets, single machine and distributed
  • Can be used from a REPL for interactive analysis or as a library for applications

Other libraries

Below is a selection of analytics and data processing libraries that we are using as inspiration. Some of these metrics are somewhat subjective, but they give an idea of what we are looking at in each library. Note that these metrics assume native support, so libraries that achieve these things via another library are not considered.

Library | Scales to Big Data | Supports Batch | Supports Streaming | FP | Easy to Debug | Out of the box analytics
Spark ✔ (mini batch)
Flink
Pandas
R ?
Dask ?
Apex ?
Beam ?

Background

scalaz-analytics's People

Contributors

camjo, danielyli, gitter-badger

scalaz-analytics's Issues

Add Datatypes/Data structures to stdlib

We need to provide a set of types that can be used when constructing computation descriptions in the StdLib.

An initial list might be:

  • Int
  • Long
  • Double
  • String
  • Boolean
  • Byte
  • Decimal
  • Char
  • Float
  • Null
  • Short
  • Timestamp
  • Date

Data Structures (we need to define the semantics of these since they are not concrete collections at this point)

  • Array
  • Map
  • Set
  • Tuple
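
One possible encoding, shown only as a sketch: a sealed `Type[A]` whose instances act as evidence that `A` is allowed in a computation description. Every name here (`TypeSketch`, `Type`, `nameOf`, the instance names) is hypothetical and not taken from the scalaz-analytics code base. Scala's `Set` and tuples merely stand in for the data structures whose semantics are still to be defined, and only a few of the listed primitive types are covered.

```scala
// Hypothetical sketch; none of these names come from scalaz-analytics.
object TypeSketch {
  // A Type[A] value is evidence that A may appear in a computation description.
  sealed trait Type[A] { def name: String }

  object Type {
    private def make[A](n: String): Type[A] = new Type[A] { val name: String = n }

    // A few of the primitive types from the list above.
    implicit val int: Type[Int]         = make("Int")
    implicit val long: Type[Long]       = make("Long")
    implicit val double: Type[Double]   = make("Double")
    implicit val string: Type[String]   = make("String")
    implicit val boolean: Type[Boolean] = make("Boolean")

    // Data structures are derived from evidence for their element types.
    implicit def set[A](implicit a: Type[A]): Type[Set[A]] =
      make(s"Set(${a.name})")
    implicit def tuple2[A, B](implicit a: Type[A], b: Type[B]): Type[(A, B)] =
      make(s"Tuple(${a.name}, ${b.name})")
  }

  // Summons the evidence; this compiles only for supported types.
  def nameOf[A](implicit t: Type[A]): String = t.name
}
```

With this encoding, `TypeSketch.nameOf[(Int, String)]` resolves at compile time, while a type without an instance is rejected by the compiler.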

Add methods to setops

We need to flesh out the setops trait in the design sketch.

This ticket will need to pay close attention to the progress of #4 since the approach we take could impact the types of setops that are possible with our abstraction.

General Usage Examples

This ticket is to capture the types of computations we want to be able to express with this library. I think it's probably best to have this as a discussion ticket for now and it will likely help feed the design of #4.

Numeric Syntax

Like the DatasetSyntax example, we will need syntax for Numerics.

Think basic operators (+, -, /, *), etc.
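
As a rough illustration of what such syntax could look like, the sketch below adds the four operators to a small expression tree through an implicit class, so that writing `a + b` builds a description rather than computing a value. The names `Expr`, `Lit`, `Add`, `Sub`, `Mul`, `Div`, `ExprOps`, and `eval` are all assumptions made for this example; the real syntax would target whatever expression type the library settles on.

```scala
// Hypothetical sketch; the expression type and operator names are assumptions.
object NumericSyntaxSketch {
  // A tiny expression tree standing in for a computation description.
  sealed trait Expr
  final case class Lit(value: Double)    extends Expr
  final case class Add(l: Expr, r: Expr) extends Expr
  final case class Sub(l: Expr, r: Expr) extends Expr
  final case class Mul(l: Expr, r: Expr) extends Expr
  final case class Div(l: Expr, r: Expr) extends Expr

  // Syntax: each operator builds a node instead of evaluating anything.
  implicit class ExprOps(private val l: Expr) {
    def +(r: Expr): Expr = Add(l, r)
    def -(r: Expr): Expr = Sub(l, r)
    def *(r: Expr): Expr = Mul(l, r)
    def /(r: Expr): Expr = Div(l, r)
  }

  // A reference interpreter, included only to show the description is usable.
  def eval(e: Expr): Double = e match {
    case Lit(v)    => v
    case Add(l, r) => eval(l) + eval(r)
    case Sub(l, r) => eval(l) - eval(r)
    case Mul(l, r) => eval(l) * eval(r)
    case Div(l, r) => eval(l) / eval(r)
  }

  // Builds Mul(Add(Lit(1.0), Lit(2.0)), Lit(3.0)) through the syntax.
  val example: Expr = (Lit(1.0) + Lit(2.0)) * Lit(3.0)
}
```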

Consider Removal of Numeric

This refers to a discussion in gitter about Numeric (see https://gitter.im/scalaz/scalaz-analytics/archives/2018/09/26 and https://gitter.im/scalaz/scalaz-analytics/archives/2018/09/27)

Basically, the issue is that Numeric is not fully defined for very common numerical types: for example, Int is not total over division, and Double does not have a modulo operation.

The consensus of the discussion seems to be that using algebraic abstractions would probably be cleaner. Even better would be to make sure that we can interoperate with scalaz-algebra, spire and algebird as easily as possible.
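
To make the idea of algebraic abstractions concrete, here is a minimal sketch that splits the operations into two type classes, so that Int only receives the operations that are total for it. The `Ring` and `Field` traits below are hand-rolled for illustration and are not the definitions from scalaz-algebra, spire, or algebird; `sumOfSquares`, `midpoint`, and `intDiv` are likewise hypothetical helpers.

```scala
// Hypothetical sketch; these are not the spire, algebird, or scalaz type classes.
object AlgebraSketch {
  // Operations that are total for Int as well as Double.
  trait Ring[A] {
    def zero: A
    def one: A
    def plus(x: A, y: A): A
    def times(x: A, y: A): A
  }

  // Division is offered only for types where it does not truncate or throw.
  trait Field[A] extends Ring[A] {
    def div(x: A, y: A): A
  }

  object Ring {
    implicit val intRing: Ring[Int] = new Ring[Int] {
      val zero: Int = 0
      val one: Int  = 1
      def plus(x: Int, y: Int): Int  = x + y
      def times(x: Int, y: Int): Int = x * y
    }

    // Treating Double as a field ignores floating-point rounding.
    implicit val doubleField: Field[Double] = new Field[Double] {
      val zero: Double = 0.0
      val one: Double  = 1.0
      def plus(x: Double, y: Double): Double  = x + y
      def times(x: Double, y: Double): Double = x * y
      def div(x: Double, y: Double): Double   = x / y
    }
  }

  // Needs only a Ring, so it works for Int and Double alike.
  def sumOfSquares[A](xs: List[A])(implicit R: Ring[A]): A =
    xs.foldLeft(R.zero)((acc, x) => R.plus(acc, R.times(x, x)))

  // Needs a Field, so calling it with Int arguments does not compile.
  def midpoint[A](x: A, y: A)(implicit F: Field[A]): A =
    F.div(F.plus(x, y), F.plus(F.one, F.one))

  // Int division is partial, so this helper makes the failure case explicit.
  def intDiv(x: Int, y: Int): Option[Int] =
    if (y == 0) None else Some(x / y)
}
```

The design choice illustrated here is that a function states exactly which operations it needs, instead of demanding a single Numeric that promises more than some types can deliver.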

Right now, implementing this would merely mean removing the Numeric code. Perhaps we should tackle this issue once we have a couple of implementation examples, so we can analyze all the implications more easily?

Dataset design

Creating this issue to discuss the detailed design of Dataset[A] (and potentially DataStream[A]).

As discussed in the meeting with John, I think it's worth thinking through what this API would look like as both batch and stream. We can go several ways with this:

The first would be a unified API like Spark's Dataset (and structured streaming).
https://spark.apache.org/docs/2.3.1/structured-streaming-programming-guide.html#programming-model

Another would be Flink's Dataset/DataStream API (which are built on top of their stateful streams abstraction).
https://ci.apache.org/projects/flink/flink-docs-release-1.5/concepts/programming-model.html

I'd like to flesh out what this API should look like and how it should function in more detail here.

Word Count usage example
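
For reference, the computation itself can be sketched against plain Scala collections. This is not written against any scalaz-analytics API, which is still being designed; `WordCountSketch` and `wordCount` are hypothetical names, and splitting on whitespace after lowercasing is an assumption about what counts as a word.

```scala
// Hypothetical sketch using plain Scala collections, not the library's API.
object WordCountSketch {
  // Counts case-insensitive occurrences of whitespace-separated words.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\s+"))
      .filter(_.nonEmpty) // splitting an empty line yields one empty token
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```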
