
spark-columnar's Introduction

This is a proof-of-concept project that uses shapeless to optimize the in-memory data layout of RDDs in Spark.

The basic idea is that a user-facing RDD of tuples and/or case classes is backed by another RDD that holds only one item per partition; that single item represents all the tuples and/or case classes of the partition. The compute method of the user-facing RDD transforms the one item in each partition of the backing RDD into many items, like a flatMap operation (very similar to MappedRDD). By making sure that any call to persist and related operations on the user-facing RDD is routed to the backing RDD, we get Spark to persist partitions with only one item each, which gives us full control over how the data is organized per partition inside that one item.
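The packing and unpacking described above can be modelled without Spark. The following is only a sketch in plain Scala: `Partitions`, `toBacking` and `toUserFacing` are hypothetical names invented for illustration and are not part of spark-columnar or Spark, and a nested `Vector` stands in for an RDD's partitions.

```scala
object OneItemPerPartition {
  // Toy stand-in for an RDD: an ordered collection of partitions.
  // This is an assumption for illustration; real RDDs are distributed and lazy.
  type Partitions[A] = Vector[Vector[A]]

  // Backing layout: collapse each partition into exactly one packed item.
  def toBacking[A, P](user: Partitions[A])(pack: Vector[A] => P): Partitions[P] =
    user.map(rows => Vector(pack(rows)))

  // User-facing view: expand the single packed item back into rows,
  // in the spirit of a flatMap over the backing partitions.
  def toUserFacing[A, P](backing: Partitions[P])(unpack: P => Iterator[A]): Partitions[A] =
    backing.map(items => items.flatMap(p => unpack(p)))

  def main(args: Array[String]): Unit = {
    val user: Partitions[(Int, Int)] = Vector(Vector((1, 2), (3, 4)), Vector((5, 6)))
    // Here the packed item is simply one array of rows per partition.
    val backing = toBacking(user)(rows => rows.toArray)
    val restored = toUserFacing(backing)(arr => arr.iterator)
    println(backing.map(_.size)) // Vector(1, 1): one item per partition
    println(restored == user)    // true
  }
}
```

Because the backing collection holds one item per partition, whatever caches it only ever sees that packed item, so the choice of `pack` fully determines the stored layout.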

Arrays are an efficient in-memory storage format for primitives. So an efficient way to persist a user-facing RDD[(Int, Int)] is to use a backing RDD that stores each partition as a single item containing two arrays of ints, something like RDD[(Array[Int], Array[Int])]. For one-off situations this can be coded up quickly, but it is not trivial to generalize. This is where shapeless comes in, a library for generic programming in Scala. Shapeless has a data structure called HList, which is like a generalization of tuples and allows us to write a transformation such as that of an RDD[(Int, Int)] to an RDD[(Array[Int], Array[Int])] while abstracting over arity and type. This works for tuples as well as case classes. Shapeless has its complexity (a lot of implicit typeclasses are involved), but when it all works the usage is as simple as this:

val x: RDD[(Int, Int)] = ...
val y: RDD[(Int, Int)] = ColumnarRDD(x)

At this point y should behave the same as x, but have a smaller footprint when cached/persisted.
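For comparison, the one-off case mentioned earlier, where one partition of (Int, Int) rows is stored as two arrays of ints, might look like the sketch below. It is hand-written for this single type and does not use shapeless; `PairColumns`, `pack` and `unpack` are hypothetical names, not the library's API.

```scala
object PairColumns {
  // One partition's worth of (Int, Int) rows stored column-wise.
  // Both arrays have the same length; row i is (firsts(i), seconds(i)).
  type Packed = (Array[Int], Array[Int])

  // Build one primitive array per tuple field.
  def pack(rows: Iterator[(Int, Int)]): Packed = {
    val firsts = Array.newBuilder[Int]
    val seconds = Array.newBuilder[Int]
    rows.foreach { case (a, b) =>
      firsts += a
      seconds += b
    }
    (firsts.result(), seconds.result())
  }

  // Rebuild the tuples lazily: two array lookups and one new tuple per row.
  def unpack(packed: Packed): Iterator[(Int, Int)] = {
    val (firsts, seconds) = packed
    firsts.indices.iterator.map(i => (firsts(i), seconds(i)))
  }

  def main(args: Array[String]): Unit = {
    val packed = pack(Iterator((1, 10), (2, 20), (3, 30)))
    println(packed._1.toList)     // List(1, 2, 3)
    println(packed._2.toList)     // List(10, 20, 30)
    println(unpack(packed).toList) // List((1,10), (2,20), (3,30))
  }
}
```

Every new tuple or case class shape would need its own pair of functions like these, which is the repetition the generic approach is meant to remove.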

Keep in mind that there is some serious overhead in this process. For tuples or case classes of size n, n arrays are constructed and filled for each partition in the ColumnarRDD. After that, for every item read from the ColumnarRDD, n array lookups are done and a tuple or case class is constructed on the fly. So I would not advise using this unless an efficient in-memory representation is very important for the RDD. Also, this is currently just a proof of concept that has not yet been tested in any real project or scenario.

In a test program I created an RDD[(Int, Int, Int, Int)] with 10,000 items, created a ColumnarRDD from it, and cached both in memory; the ColumnarRDD took up 7 times less memory than the original.
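A back-of-envelope estimate suggests why a saving of this order is plausible. The sketch below is not a measurement and does not reproduce the test program. It assumes a 64-bit HotSpot JVM with compressed references (12-byte object headers, 4-byte references, 8-byte alignment), none of which is stated in this README, and it ignores array headers, the references that point at each row, and any sharing of small boxed integers. All names are hypothetical.

```scala
object FootprintEstimate {
  // Assumed JVM layout parameters (see the assumptions stated above).
  val headerBytes = 12
  val refBytes = 4
  val intBytes = 4
  val alignment = 8

  // Round an object size up to the next multiple of the alignment.
  def align(bytes: Int): Int = (bytes + alignment - 1) / alignment * alignment

  // Scala's Tuple4 is not specialized for primitives, so each of its four
  // fields is a reference to a separately allocated boxed integer.
  val boxedIntBytes: Int = align(headerBytes + intBytes)        // 16
  val tuple4Bytes: Int = align(headerBytes + 4 * refBytes)      // 32
  val rowAsTupleBytes: Int = tuple4Bytes + 4 * boxedIntBytes    // 96

  // Column-wise, a row is just one int slot in each of four arrays.
  val rowAsColumnsBytes: Int = 4 * intBytes                     // 16

  val ratio: Double = rowAsTupleBytes.toDouble / rowAsColumnsBytes

  def main(args: Array[String]): Unit =
    println(ratio) // 6.0
}
```

Under these assumptions each row costs about 96 bytes as a tuple and 16 bytes as columns, which is in the same range as the factor reported above.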

ColumnarRDD does not currently work in the Spark REPL. The issue is that the implicit shapeless Generic typeclass instances are generated by macros that basically define a new anonymous class on the spot in your code. So if you create a ColumnarRDD in the REPL, that is also where the new anonymous Generic subclass is defined, and it does not deserialize well because the receiving side does not know this class.

spark-columnar's People

Contributors

koertkuipers
