
justm0rph3u5 / airbnb-spark-thrift


This project is forked from airbnb/airbnb-spark-thrift.


A library for loading Thrift data into Spark SQL

License: Apache License 2.0

Scala 95.09% Thrift 4.91%


Spark Thrift Loader


A library for loading Thrift data into Spark SQL.

Features

It converts Thrift records to Spark SQL rows, making Thrift a first-class citizen in Spark. It automatically derives a Spark SQL schema from a Thrift struct and converts Thrift objects to Spark Rows at runtime. Nested structs are supported, with one restriction: map keys must be primitive types.

It is especially useful for Spark Streaming jobs that consume Thrift events from different streaming sources.

Supported types for Thrift -> Spark SQL conversion

This library supports reading the following types. It uses the following mapping to convert Thrift types to Spark SQL types:

Thrift type    Spark SQL type
-----------    --------------
bool           BooleanType
i16            ShortType
i32            IntegerType
i64            LongType
double         DoubleType
binary         StringType
string         StringType
enum           String
list           ArrayType
set            ArrayType
map            MapType
struct         StructType
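
To make the recursive nature of this mapping concrete, here is a minimal, self-contained sketch in plain Scala. It is a toy model, not the library's implementation: the `ThriftType` hierarchy, `ThriftToSparkSketch`, and `sparkTypeName` are hypothetical names introduced only for illustration, and the result is a descriptive string rather than a real Spark `DataType`. The sketch also assumes that every non-container type (including enum) counts as "primitive" for the map-key restriction, which the text above does not spell out.

    // Toy model of the mapping table above; all names here are hypothetical.
    sealed trait ThriftType
    case object TBool extends ThriftType
    case object TI16 extends ThriftType
    case object TI32 extends ThriftType
    case object TI64 extends ThriftType
    case object TDouble extends ThriftType
    case object TBinary extends ThriftType
    case object TString extends ThriftType
    case object TEnum extends ThriftType
    final case class TList(element: ThriftType) extends ThriftType
    final case class TSet(element: ThriftType) extends ThriftType
    final case class TMap(key: ThriftType, value: ThriftType) extends ThriftType
    final case class TStruct(fields: List[(String, ThriftType)]) extends ThriftType

    object ThriftToSparkSketch {
      // Assumption: anything that is not a container or struct is "primitive".
      private def isPrimitive(t: ThriftType): Boolean = t match {
        case TList(_) | TSet(_) | TMap(_, _) | TStruct(_) => false
        case _                                            => true
      }

      // Returns the name of the Spark SQL type from the table, recursing
      // into element, value, and field types for nested data.
      def sparkTypeName(t: ThriftType): String = t match {
        case TBool             => "BooleanType"
        case TI16              => "ShortType"
        case TI32              => "IntegerType"
        case TI64              => "LongType"
        case TDouble           => "DoubleType"
        case TBinary | TString => "StringType"
        case TEnum             => "String" // as listed in the table
        case TList(e)          => s"ArrayType(${sparkTypeName(e)})"
        case TSet(e)           => s"ArrayType(${sparkTypeName(e)})"
        case TMap(k, v) =>
          // Mirrors the documented restriction on map keys.
          require(isPrimitive(k), "map keys must be primitive")
          s"MapType(${sparkTypeName(k)}, ${sparkTypeName(v)})"
        case TStruct(fields) =>
          fields
            .map { case (name, fieldType) => s"$name: ${sparkTypeName(fieldType)}" }
            .mkString("StructType(", ", ", ")")
      }
    }

For example, `ThriftToSparkSketch.sparkTypeName(TList(TI32))` yields `ArrayType(IntegerType)`, while a map keyed by a list is rejected by the `require` check.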

Examples

Convert Thrift Schema to StructType in Spark

import com.airbnb.spark.thrift.ThriftSchemaConverter

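// Returns the Spark SQL StructType derived from the Thrift class.
// classOf[...] refers to the Thrift record class itself; in Scala,
// ThriftExampleClass.getClass would not.
val thriftStructType = ThriftSchemaConverter.convert(classOf[ThriftExampleClass])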

Convert Thrift Object to Row in Spark

import com.airbnb.spark.thrift.ThriftSchemaConverter
import com.airbnb.spark.thrift.ThriftParser

// Derive the StructType for the Thrift class, then convert one object to a Row
val thriftStructType = ThriftSchemaConverter.convert(classOf[ThriftExampleClass])
val row = ThriftParser.convertObject(
  thriftObject,
  thriftStructType)

Use case: consuming a Kafka stream where each event is a Thrift object

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

import com.airbnb.spark.thrift.ThriftSchemaConverter
import com.airbnb.spark.thrift.ThriftParser

// directKafkaStream, MessageSerializer, Message, TBaseType, LOG and sqlContext
// come from the surrounding application and are not defined by this library.
directKafkaStream.foreachRDD(rdd => {
  val schema = ThriftSchemaConverter.convert(classOf[ThriftExampleClass])

  // Deserialize each event, dropping nulls and events that fail to parse
  val deserializedEvents = rdd
    .map(_.message)
    .filter(_ != null)
    .flatMap(eventBytes => {
      try Some(MessageSerializer.getInstance().fromBytes(eventBytes))
        .asInstanceOf[Option[Message[_]]]
      catch {
        case e: Exception => {
          LOG.warn(s"Failed to deserialize Thrift event: ${e.toString}")
          None
        }
      }
    }).map(_.getEvent.asInstanceOf[TBaseType])

  val rows: RDD[Row] = ThriftParser(
    classOf[ThriftExampleClass],
    deserializedEvents,
    schema)

  val df = sqlContext.createDataFrame(rows, schema)

  // Process the DataFrame for this micro-batch
})

How to get started

Clone the project and run mvn package to build the artifact.

How to contribute

Please open a pull request here and cc @liyintang or @jingweilu1974 for review.
