
sparkling-docker

This project combines multiple technologies, each running in a separate container, and demonstrates how easily a distributed system can be built and run with Docker. It covers Apache HTTPD access log processing with Flume, Hadoop and Spark Streaming.


Overall architecture

The current container architecture can be represented by the following interaction diagram:

[HTTPD]      [FLUME-AGENT]----[FLUME-COLLECTOR]      [SPARK]----[SPARK-DRIVER]
   |               |                  |                 |
 (logs)-------------              [HADOOP]---------------
         {volume}                     |
                                   (hdfs)
  1. The Apache server generates access logs, which are written to a host-bound volume folder (see the volume-sharing sketch after this list)
  2. The Flume Agent reads the access logs through the shared volume and sends the data to a sink running in a separate container
  3. The Flume Collector writes the data to Hadoop's HDFS
  4. The Spark Driver is a Spark Streaming application that scans the HDFS folder and processes the data on the Spark cluster
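
A minimal sketch of how the log volume is expected to be shared between the HTTPD and Flume Agent containers. The container names, paths and flags below are illustrative and not taken from the project's scripts; HOST_VOL is the host folder configured in env.sh (described under Launching containers below):

$ docker run -d --name httpd \
    -v ${HOST_VOL}/httpd-logs:/var/log/httpd \
    centos-httpd:my

$ docker run -d --name flume-agent \
    -v ${HOST_VOL}/httpd-logs:/var/log/httpd:ro \
    centos-flume:1.6.0-my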

Containers list

  • DNS
  • Apache HTTPD
  • Flume Agent
  • Flume Collector
  • Hadoop NameNode (namenode, jobtracker)
  • Hadoop SecondaryNameNode (secondarynamenode)
  • Hadoop DataNode 1 (datanode, tasktracker)
  • Hadoop DataNode 2 (datanode, tasktracker)
  • Spark Master
  • Spark Slave 1
  • Spark Slave 2
  • Spark Streaming Driver Application

Docker Images

List of Images

  • tonistiigi/dnsdock - public DNSDock image from http://hub.docker.com
  • centos:6 - public CentOS image from http://hub.docker.com
  • centos:6-my - CentOS image with some linux tools, supervisord, sshd and bootstrap scripts
  • centos-httpd:my - extended CentOS image with Apache HTTPD server
  • centos-java:7-my - extended CentOS image with JDK
  • centos-flume:1.6.0-my - JDK image with Flume
  • centos-hadoop:1.2.1-my - JDK image with Hadoop
  • centos-spark:1.4.0-my - JDK image with Spark
  • centos-spark-driver:0.0.1-my - Spark image with Spark Streaming Java Application

Images Hierarchy

  |- tonistiigi/dnsdock
  |- centos:6
     |- centos:6-my
        |- centos-httpd:my
        |- centos-java:7-my
           |- centos-flume:1.6.0-my
           |- centos-hadoop:1.2.1-my
           |- centos-spark:1.4.0-my
              |- centos-spark-driver:0.0.1-my

Build Process

All Docker images, except tonistiigi/dnsdock and centos:6, must be built locally. The build sequence should match the hierarchy above.

  • ./build-centos.sh - builds centos:6-my image
  • ./build-httpd.sh - builds centos-httpd:my image
  • ./build-java.sh - builds centos-java:7-my image
  • ./build-flume.sh - builds centos-flume:1.6.0-my image
  • ./build-hadoop.sh - builds centos-hadoop:1.2.1-my image
  • ./build-spark.sh - builds centos-spark:1.4.0-my image
  • ./build-spark-driver.sh - builds centos-spark-driver:0.0.1-my image

Make sure that each base image is built before building the images that depend on it.
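
Each build script is presumably a thin wrapper around docker build; a hypothetical equivalent (the build-context directory name is illustrative):

$ docker build -t centos-hadoop:1.2.1-my ./hadoop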

Launching containers

The only major recommendation here is to start the dnsdock container first, so that all other containers are able to register in DNS and communicate with one another. To minimize the number of exceptions caused by not-yet-running containers, start the containers in the following order:

$ ./start-dns.sh
$ ./start-httpd.sh
$ ./start-hadoop.sh
$ ./start-flume.sh
$ ./start-spark.sh
$ ./start-spark-driver.sh
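
To verify that DNS registration works, open a shell in any running container (see tty.sh under Useful Scripts below) and resolve another container by name. The exact hostnames depend on how dnsdock and the containers are configured, so the name below is only an example:

$ ./tty.sh <httpd-container-id>
# inside the container:
ping -c 1 hadoop-namenode.docker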

Every container writes its data and logs to a host folder, bound via a volume. Make sure that the HOST_VOL variable is set properly in env.sh:

HOST_VOL=/home/nibbler/docker
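
Assuming env.sh only sets variables, the host folder can be created up front before launching anything:

$ . ./env.sh && mkdir -p ${HOST_VOL}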

If Hadoop is being started for the first time, use the start-hadoop-format.sh script instead of start-hadoop.sh. The Services.sh script will then format the Hadoop file system in the NameNode container prior to startup.

$ ./start-hadoop-format.sh
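
Formatting is a one-time operation; in Hadoop 1.x it typically boils down to the following command inside the NameNode container (shown only to illustrate what the script is expected to do; running it again on an existing cluster wipes the HDFS metadata):

hadoop namenode -format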

To start the Flume, Hadoop and Spark component containers step by step, use another set of scripts:

$ ./start-hadoop-namenode.sh
$ ./start-hadoop-namenode-format.sh
$ ./start-hadoop-secondarynamenode.sh
$ ./start-hadoop-datanodes.sh

$ ./start-flume-agent.sh
$ ./start-flume-collector.sh

$ ./start-spark-master.sh
$ ./start-spark-slaves.sh
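
Once everything is up, one way to check that HDFS is healthy and that the collector is writing data is to open a shell in the NameNode container. The commands below are a hypothetical check; the HDFS path to look for depends on the collector's sink configuration:

$ ./tty.sh <hadoop-namenode-container-id>
# inside the container (Hadoop 1.x shell):
hadoop dfsadmin -report     # are all DataNodes registered?
hadoop fs -ls /             # look for the Flume sink directory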

To stop containers separately, use the following scripts:

$ ./stop-spark-driver.sh
$ ./stop-spark.sh
$ ./stop-flume.sh
$ ./stop-hadoop.sh
$ ./stop-httpd.sh
$ ./stop-dns.sh

or kill them all in one shot:

$ ./stop-all.sh

Useful Scripts

Stop the container and remove it, including dangling volumes:

$ ./stop-container.sh -n container-name -c -v

$ ./stop-container.sh -h
This script stops and removes containers by name
OPTIONS:
   -h          Show this message
   -n <name>   Container name
   -c          Remove container
   -v          Remove container and volume
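
Internally the script is presumably a thin wrapper around the standard Docker commands; a hypothetical equivalent of the -c -v combination:

$ docker stop container-name
$ docker rm -v container-name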

Open bash shell to running container:

$ ./tty.sh container-id

Inspect running container:

$ ./inspect.sh container-id
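
Both helpers are presumably thin wrappers around the Docker CLI, roughly equivalent to:

$ docker exec -it container-id bash     # tty.sh
$ docker inspect container-id           # inspect.sh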

Cleanup exited containers:

$ ./cleanup-containers.sh

Cleanup dangling images, e.g. after unsuccessful builds:

$ ./cleanup-images.sh

Cleanup exited containers and dangling images:

$ ./cleanup-all.sh
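
These cleanup scripts most likely rely on the standard Docker filters; a hypothetical equivalent:

$ docker rm $(docker ps -aq -f status=exited)         # exited containers
$ docker rmi $(docker images -q -f dangling=true)     # dangling images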
