This project combines multiple technologies, running in separate containers, and represents how easily distributed system can be built and run with Docker. It covers Apache HTTPD access logs processing with Flume, Hadoop and Spark Streaming.
Current architecture of containers can be represented by the following interaction diagram
[HTTPD] [FLUME-AGENT]----[FLUME-COLLECTOR] [SPARK]----[SPARK-DRIVER] | | | | (logs)------------- [HADOOP]--------------- {volume} | (hdfs)
- Apache server generates access logs, which are written to host-binded volume folder
- Flume Agent is targeted to access logs via setup volume and sends data to sink, running in separate container
- Flume Collector writes data to HDFS of Hadoop
- Spark-Driver is Spark Streaming application, which scans HDFS folder and processes data on Spark cluster
- DNS
- Apache HTTPD
- Flume Agent
- Flume Collector
- Hadoop NameNode (namenode, jobtracker)
- Hadoop SecondaryNameNode (secondarynamenode)
- Hadoop DataNode 1 (datanote, tasktracker)
- Hadoop DataNode 2 (datanote, tasktracker)
- Spark Master
- Spark Slave 1
- Spark Slave 2
- Spark Streaming Driver Application
- tonistiigi/dnsdock - public DNSDock image from http://hub.docker.com
- centos:6 - public CentOS image from http://hub.docker.com
- centos:6-my - CentOS image with some linux tools, supervisord, sshd and bootstrap scripts
- centos-httpd:my - extended CentOS image with Apache HTTPD server
- centos-java:7-my - extended CentOS image with JDK
- centos-flume:1.6.0-my - JDK image with Flume
- centos-hadoop:1.2.1-my - JDK image with Hadoop
- centos-spark:1.4.0-my - JDK image with Spark
- centos-spark-driver:0.0.1-my - Spark image with Spark Streaming Java Application
|- tonistiigi/dnsdock |- centos:6 |- centos:6-my |- centos-httpd:my |- centos-java:7-my |- centos-flume:1.6.0-my |- centos-hadoop:1.2.1-my |- centos-spark:1.4.0-my |- centos-spark-driver:0.0.1-my
All Docker images, except tonistiigi/dnsdock and centos:6, should be build. Build sequence should match the hierarchy above.
- ./build-centos.sh - builds centos:6-my image
- ./build-httpd.sh - builds centos-httpd:my image
- ./build-java.sh - builds centos-java:7-my image
- ./build-flume.sh - builds centos-flume:1.6.0-my image
- ./build-hadoop.sh - builds centos-hadoop:1.2.1-my image
- ./build-spark.sh - builds centos-spark:1.4.0-my image
- ./build-spark-driver.sh - builds centos-spark-driver:0.0.1-my image
Make sure that dependent image is built before proceeding with the next build
The only major recommendation here is to start dnsdock container first, so all other containers will be able to register in DNS and communicate with one another. To minimize the quantity of Exceptions, caused by inactive containers, start containers in the following order.
$ ./start-dns.sh $ ./start-httpd.sh $ ./start-hadoop.sh $ ./start-flume.sh $ ./start-spark.sh $ ./start-spark-driver.sh
Every container writes data and logs to host folder, binded via volume. Make sure that HOST_VOL variable is set properly in env.sh:
HOST_VOL=/home/nibbler/docker
If Hadoop is started first time, make sure to use start-hadoop-format.sh script, instead of start-hadoop.sh. Services.sh script will format Hadoop file system in NameNode container prior to startup.
$ ./start-hadoop-format.sh
To startup Flume, Hadoop and Spark components containers step-by-step, use another set of scripts:
$ ./start-hadoop-namenode.sh $ ./start-hadoop-namenode-format.sh $ ./start-hadoop-secondarynamenode.sh $ ./start-hadoop-datanodes.sh $ ./start-flume-agent.sh $ ./start-flume-collector.sh $ ./start-spark-master.sh $ ./start-spark-slaves.sh
To stop containers separately, use the following scripts:
$ ./stop-spark-driver.sh $ ./stop-spark.sh $ ./stop-flume.sh $ ./stop-hadoop.sh $ ./stop-httpd.sh $ ./stop-dns.sh
or kill them all in one shot:
$ ./stop-all.sh
Stop the container and remove it, including dangling volumes:
$ ./stop-container.sh -n container-name -c -v $ ./stop-container.sh -h This script stops and removes containers by name OPTIONS: -h Show this message -n <name> Container name -c Remove container -v Remove conrainer and volume
Open bash shell to running container:
$ ./tty.sh container-id
Inspect running container:
$ ./inspect.sh container-id
Cleanup exited containers:
$ ./cleanup-containers.sh
Cleanup dangling images, e.g. after unsuccessful builds:
$ ./cleanup-images.sh
Cleanup exited containers and dangling images:
$ ./cleanup-all.sh