
sureshb208 / spark-etl-pipeline


This project forked from anjijava16/spark-etl-pipeline


Various data stream/batch process demos with Apache Spark (Scala) 🚀



SPARK-ETL-PIPELINE

Demos of various data fetch/transform processes via Spark (Scala)
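To make the fetch/transform idea concrete, here is a toy transform in plain Scala. This is a sketch only: it uses Scala collections in place of Spark RDDs/DataFrames so it runs anywhere, and the record fields (`user`, `page`, `status`) are invented for illustration, not taken from the repo's scripts.

```scala
// Toy ETL sketch: plain Scala collections stand in for Spark RDDs/DataFrames.
// Field names ("user", "page", "status") are invented for illustration.
case class PageView(user: String, page: String, status: Int)

val rawLines = Seq(
  "alice,/home,200",
  "bob,/checkout,500",
  "alice,/products,200"
)

// transform: parse CSV lines, drop server-error rows, count views per user
val views = rawLines
  .map(_.split(","))
  .collect { case Array(u, p, s) => PageView(u, p, s.toInt) }

val viewsPerUser = views
  .filter(_.status < 400)
  .groupBy(_.user)
  .map { case (u, vs) => u -> vs.size }
// viewsPerUser == Map("alice" -> 2)
```

In the actual repo these steps would be expressed as Spark transformations under `src`, but the parse/filter/aggregate shape is the same.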

Scala Projects

File structure

# ├── Dockerfile         : Dockerfile to build the Scala/Spark env
# ├── README.md
# ├── archived           : legacy Spark scripts in Python/Java...
# ├── build.sbt          : (Scala) sbt build file declaring the Spark/Scala dependencies
# ├── config             : configs for various services, e.g. S3, DB, Hive...
# ├── data               : sample data for some Spark script demos
# ├── output             : where the Spark stream/batch jobs write output
# ├── project            : (Scala) other sbt settings: plugins.sbt, build.properties...
# ├── python             : helper Python scripts
# ├── run_all_process.sh : script demoing a minimal end-to-end Spark process
# ├── script             : helper shell scripts
# ├── src                : (Scala) MAIN SCALA SPARK TESTS/SCRIPTS
# ├── target             : where the final compiled jar is output (e.g. target/scala-2.11/spark-etl-pipeline-assembly-1.0.jar)
# └── travis_build.sh    : Travis CI build file

Prerequisites

  1. Modify the configs under config with your own credentials and rename them (e.g. twitter.config.dev -> twitter.config) to access services such as data sources, file systems, and so on.
  2. Install SBT as the Scala dependency-management tool.
  3. Install Java and Spark.
  4. Modify build.sbt to align with your dev env.
  5. Check the Spark ETL scripts under src.
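For step 4, a minimal `build.sbt` for a project like this might look like the sketch below. The Scala version is inferred from the `scala-2.11` target path; the Spark version is an assumption, so check the repo's actual `build.sbt` (and `project/plugins.sbt`, where the sbt-assembly plugin used by `sbt assembly` would be declared).

```scala
// Sketch only -- Spark/Scala versions here are assumptions, not the repo's pins.
name := "spark-etl-pipeline"
version := "1.0"
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "2.4.0",
  "org.apache.spark" %% "spark-sql"       % "2.4.0",
  "org.apache.spark" %% "spark-streaming" % "2.4.0"
)
```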

Process

sbt clean compile -> sbt test -> sbt run -> sbt assembly -> spark-submit <spark-script>.jar

Quick Start

$ git clone https://github.com/yennanliu/spark-etl-pipeline.git && cd spark-etl-pipeline && bash run_all_process.sh

Quick Start Manually

# STEP 0) 
$ cd ~ && git clone https://github.com/yennanliu/spark-etl-pipeline.git && cd spark-etl-pipeline

# STEP 1) download the used dependencies.
$ sbt clean compile

# STEP 2) print Twitter data via Spark Streaming, with `sbt run`
$ sbt run

# STEP 3) create jars from the Spark Scala scripts
$ sbt assembly
$ spark-submit target/scala-2.11/spark-etl-pipeline-assembly-1.0.jar
# get fake page-view event data

# run the script that generates page views
$ sbt package
$ spark-submit \
  --class DataGenerator.PageViewDataGenerator \
  target/scala-2.11/spark-etl-pipeline_2.11-1.0.jar

# open the other terminal to receive the event
$ curl 127.0.0.1:44444
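The idea behind `DataGenerator.PageViewDataGenerator` can be sketched as below. This is an illustration only, not the repo's actual implementation: a tiny socket server that writes fake page-view events to whichever client connects (e.g. the `curl` command above); the port and event fields are assumptions.

```scala
// Illustrative sketch -- the real DataGenerator.PageViewDataGenerator in src may differ.
// A tiny socket server that writes fake page-view events to the first client.
import java.io.PrintWriter
import java.net.ServerSocket
import scala.util.Random

object FakePageViewServer {
  private val pages = Seq("/home", "/about", "/products", "/checkout")

  // one fake page-view event as a JSON-ish line (field names are made up)
  def event(): String =
    s"""{"page": "${pages(Random.nextInt(pages.length))}", "ts": ${System.currentTimeMillis}}"""

  // serve `n` events to the first client that connects on `port`, then stop
  def serveOnce(port: Int, n: Int): Unit = {
    val server = new ServerSocket(port)
    try {
      val client = server.accept() // blocks until e.g. curl connects
      val out = new PrintWriter(client.getOutputStream, true)
      (1 to n).foreach(_ => out.println(event()))
      client.close()
    } finally server.close()
  }
}
```

A Spark Streaming job could then consume these lines with a socket stream pointed at the same host and port.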

Quick Start Docker

# STEP 0) 
$ git clone https://github.com/yennanliu/spark-etl-pipeline.git

# STEP 1) 
$ cd spark-etl-pipeline

# STEP 2) docker build 
$ docker build . -t spark_env

# STEP 3) ONE COMMAND : run the docker env, then sbt compile, assembly, and spark-submit in one go
$ docker run  --mount \
type=bind,\
source="$(pwd)"/.,\
target=/spark-etl-pipeline \
-i -t spark_env \
/bin/bash -c "cd ../spark-etl-pipeline && sbt clean compile && sbt assembly && spark-submit target/scala-2.11/spark-etl-pipeline-assembly-1.0.jar"

# STEP 3') : STEP BY STEP : access docker -> sbt clean compile -> sbt run -> sbt assembly -> spark-submit 
# docker run 
$ docker run  --mount \
type=bind,\
source="$(pwd)"/.,\
target=/spark-etl-pipeline \
-i -t spark_env \
/bin/bash 
# inside docker bash 
root@942744030b57:~ cd ../spark-etl-pipeline && sbt clean compile && sbt run

root@942744030b57:~ cd ../spark-etl-pipeline && sbt assembly && spark-submit target/scala-2.11/spark-etl-pipeline-assembly-1.0.jar

Ref


Dataset


