spark-json-to-parquet-table

This project demonstrates how to explode JSON arrays into a relational format using Spark.

Example:

This source JSON object: {"business_id":"1","categories":["Tobacco Shops","Nightlife","Vape Shops"],"price":100}

will be converted into the following table in Parquet format:

business_id  category       price
1            Tobacco Shops  100
1            Nightlife      100
1            Vape Shops     100

As source input it uses the JSON files provided by the Yelp Dataset Challenge round 9.

This project consists of a Spark app that performs the transformation and a Docker image that provides the required software and libraries to run it.

Spark-App

The Spark app reads all JSON files one after another into a DataFrame, validates the schema with the help of case classes, explodes each array into an additional dataset, and carries all remaining attributes through into the parent table. All generated datasets are persisted as Parquet. The parent table additionally contains the array values as a comma-separated string (helpful for some analyses because it avoids a join). The Spark app is written in Scala and built with SBT. It uses ScalaTest for unit and integration tests. The sources are in ./src.
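As a rough illustration of that transformation, here is a minimal sketch (not the project's actual code; the case class fields follow the example above, and the app name and file paths are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws, explode}

// Hypothetical case class mirroring the sample JSON record above.
case class Business(business_id: String, categories: Seq[String], price: Long)

object ExplodeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("explode-sketch").getOrCreate()
    import spark.implicits._

    // Validate the schema by converting the DataFrame into a typed Dataset.
    val businesses = spark.read.json("business.json").as[Business]

    // Child table: one row per array element.
    val categories = businesses.select($"business_id", explode($"categories").as("category"))

    // Parent table: keep the array values as a comma-separated string.
    val parent = businesses.withColumn("categories", concat_ws(",", col("categories")))

    categories.write.mode("overwrite").parquet("output/businessCategories")
    parent.write.mode("overwrite").parquet("output/businessAsTable")
  }
}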

Spark-Zeppelin Docker Image

To run the Spark app with spark-submit, this project also provides Spark 2.1 together with Zeppelin as a Docker image. The image is available in a public repository on Docker Hub. A sample Zeppelin notebook for analyzing the exploded tables is located at ./zeppelin_notebooks/dataset-analysis.json.

Works with

  • docker 1.13.1
  • spark 2.1
  • scala 2.11
  • sbt 0.13.9

Getting Started

To get this project running you need at least 10 GB of free disk space. Follow these steps to use the compiled Spark app in the provided Docker image.

  1. Download the tar file from Yelp Dataset Challenge round 9 containing the input JSON files.

  2. Run the Docker container

This will download the image from Docker Hub and run it in a container:

docker run -it -p 8088:8080 mirkoprescha/spark-zeppelin

If you want to use Zeppelin immediately, wait roughly 10 seconds until the daemon has started.

  3. Copy yelp_dataset_challenge_round9.tgz to the Docker container

Start another shell session and copy the file into the Docker container (the most recently started one):

docker cp yelp_dataset_challenge_round9.tgz $(docker ps -l -q):/home/

  4. Run the Spark job

Go back to your first session. You should be connected as root inside the Docker container:

cd /home
spark-submit --class com.mprescha.json2Table.Json2Table \
      /usr/local/bin/spark-json-to-table_2.11-1.0.jar \
      /home/yelp_dataset_challenge_round9.tgz

Spark processing will take roughly 5 minutes.

If the job ran successfully, the following output structure is generated in /home/output/:

  • businessAsTable
  • businessAttributes
  • businessCategories
  • businessHours
  • checkinAsTable
  • checkinTimes
  • review
  • tip
  • userAsTable
  • userElite
  • userFriends

Each subdirectory represents an entity type that can be analyzed in the Zeppelin notebook, for example as sketched below.
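A minimal sketch of a Zeppelin paragraph that loads one of the generated Parquet tables; the table path follows the output structure above, and the category column name is an assumption based on the example table:

// Hypothetical Zeppelin paragraph (the spark session is predefined in Zeppelin).
val categories = spark.read.parquet("/home/output/businessCategories")
categories.createOrReplaceTempView("businessCategories")

// Top 10 most frequent categories across all businesses.
spark.sql("SELECT category, COUNT(*) AS cnt FROM businessCategories GROUP BY category ORDER BY cnt DESC").show(10)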

You can verify the result on your machine with du -h output/. This should produce output like this:

root@c6c0a39bc1fa:/home# du -h output/
4.8M	output/businessCategories
17M	output/checkinAsTable
4.2M	output/businessHours
4.8M	output/businessAttributes
703M	output/userAsTable
712M	output/userFriends
25M	output/userElite
9.5M	output/checkinTimes
1.8G	output/review
21M	output/businessAsTable
55M	output/tip
3.3G	output/

  5. Go to the Zeppelin UI: http://localhost:8088/#/

Open the notebook called analysis. Accept ("save") the interpreter bindings. In the menu bar, click the play button to run all paragraphs.

If the notebook is not available, download it from this Git repo and import it into Zeppelin. Alternatively, check the results of the notebook on Zeppelin Hub.

Deploying Changes in the Spark App

Clone this project.

After any changes to the Spark app, you need to build a new package with

sbt package

If all tests are successful, place the package at ./spark-docker/bin/spark-json-to-table_2.11-1.0.jar.
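For reference, a hypothetical build.sbt matching the artifact name spark-json-to-table_2.11-1.0.jar might look like this; the exact settings in the repo may differ:

name := "spark-json-to-table"
version := "1.0"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.1.0" % "provided",
  "org.scalatest" %% "scalatest" % "3.0.1" % "test"
)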

Changes in the Dockerfile

After changes to the Dockerfile, go to the project home directory and run

docker build --file spark-docker/Dockerfile -t mirkoprescha/spark-zeppelin .
