bigdata-docker

Run a Hadoop cluster within Docker containers. This project is a fork of saraivaufc/bigdata-docker.

Install Docker (Ubuntu):

$ sudo apt-get remove docker docker-engine docker.io
$ sudo apt-get update
$ sudo apt-get install apt-transport-https ca-certificates  curl software-properties-common
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
$ sudo apt-key fingerprint 0EBFCD88
$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
$ sudo apt-get update
$ sudo apt-get install docker-ce
$ sudo docker run hello-world

Install Docker Compose

$ sudo curl -L "https://github.com/docker/compose/releases/download/1.23.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
$ sudo chmod +x /usr/local/bin/docker-compose

To use a GPU (optional)

Install NVIDIA Docker 2

$ sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit

Update /etc/docker/daemon.json

From:

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

To:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
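If you prefer to script this change instead of editing by hand, a minimal sketch (assumes python3 is available; the real file lives at /etc/docker/daemon.json, a temporary copy is used here so the sketch is safe to run anywhere):

```shell
# Work on a temporary copy of daemon.json rather than the real file
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF

# Merge in "default-runtime": "nvidia" without clobbering other keys
python3 - "$cfg" <<'EOF'
import json, sys

path = sys.argv[1]
with open(path) as f:
    data = json.load(f)
data["default-runtime"] = "nvidia"
with open(path, "w") as f:
    json.dump(data, f, indent=4)
EOF

cat "$cfg"
```

Pointed at /etc/docker/daemon.json (run with sudo), this applies the same change as the manual edit above.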

Restart Docker

$ sudo service docker restart

Build images

$ docker-compose build --parallel

Start the containers via Compose

$ docker-compose up -d

Applications

Application       URL
Hadoop            http://localhost:9870
Hadoop Cluster    http://localhost:8088
Hadoop HDFS       hdfs://localhost:9000
Hadoop WEBHDFS    http://localhost:14000/webhdfs/v1
Hive Server2      http://localhost:10000
Hue               http://localhost:8888 (username: hue, password: secret)
Spark Master UI   http://localhost:4080
Spark Jobs        http://localhost:4040
Livy              http://localhost:8998
Jupyter Notebook  http://localhost:8899
Airflow           http://localhost:8080 (username: airflow, password: airflow)
Flower            http://localhost:8555
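After `docker-compose up -d`, the services above can take a while to become reachable. A small polling sketch (assumes curl is installed; the demo polls a throwaway local server rather than one of the cluster URLs, so it runs anywhere):

```shell
# Sketch: poll a URL until the service behind it answers over HTTP
wait_for_http() {
  local url=$1 tries=${2:-30}
  for _ in $(seq "$tries"); do
    if curl -fs -o /dev/null "$url" 2>/dev/null; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Demo: start a throwaway local server instead of hitting the cluster
python3 -m http.server 8123 >/dev/null 2>&1 &
srv=$!
status=down
wait_for_http http://localhost:8123 && status=up
kill "$srv" 2>/dev/null
echo "$status"
```

In practice the same function can be pointed at e.g. http://localhost:9870 to wait for the Hadoop UI before opening it.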

Tutorials

HDFS

Access the Hadoop Namenode container

docker exec -it hadoop-master bash

List the contents of the root directory

hadoop fs -ls /

Create a directory structure

hadoop fs -mkdir /dados
hadoop fs -ls /
hadoop fs -ls /dados
hadoop fs -mkdir /dados/bigdata
hadoop fs -ls /dados

Test the deletion of a directory

hadoop fs -rm -r /dados/bigdata
hadoop fs -ls /dados

Add an external file to the cluster

cd /root
ls
hadoop fs -mkdir /dados/bigdata
hadoop fs -put /var/log/alternatives.log /dados/bigdata
hadoop fs -ls /dados/bigdata

Copy files

hadoop fs -ls /dados/bigdata
hadoop fs -cp /dados/bigdata/alternatives.log /dados/bigdata/alternatives2.log
hadoop fs -ls /dados/bigdata

List the contents of a file

hadoop fs -ls /dados/bigdata
hadoop fs -cat /dados/bigdata/alternatives.log

Create an HDFS home directory for the Hue user

hadoop fs -mkdir /user/hue
hadoop fs -ls /user/hue
hadoop fs -chmod 777 /user/hue

Hive

Access the Hadoop Namenode container

docker exec -it hadoop-master bash

Run Hive Shell

hive

List databases

> show databases;

Access 'default' Database

> use default;

List database tables

> show tables;

Spark

Documentation: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=read%20csv

Data ingestion in HDFS

# Access the Hadoop Namenode container
docker exec -it hadoop-master bash

# Download ENEM datasets: http://inep.gov.br/microdados

# create spark folder in HDFS
hadoop fs -mkdir /user/spark/

# Data ingestion in HDFS
hadoop fs -put MICRODADOS_ENEM_2018.csv /user/spark/
hadoop fs -put MICRODADOS_ENEM_2017.csv /user/spark/

Access the Spark master node container

docker exec -it spark-master bash

Access Spark shell

spark-shell

Load ENEM 2018 data from HDFS

val df = spark.read.format("csv").option("sep", ";").option("inferSchema", "true").option("header", "true").load("hdfs://hadoop-master:9000/user/spark/MICRODADOS_ENEM_2018.csv")

Show dataframe schema

df.printSchema()

Show how many visually impaired students participated in the ENEM test in 2018.

df.groupBy("IN_CEGUEIRA").count().show()

Show how many students participated in the ENEM test in 2018 grouped by age.

import org.apache.spark.sql.functions.asc
df.groupBy("NU_IDADE").count().sort(asc("NU_IDADE")).show(100, false)
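Before ingesting the full ENEM files, the same kind of group-and-count can be sanity-checked locally on a small sample with plain shell tools (a sketch; the sample file and its rows are made up here to mimic the ';'-separated layout of the real columns):

```shell
# Made-up sample mimicking two ENEM columns with the ';' separator
cat > sample.csv <<'EOF'
NU_IDADE;IN_CEGUEIRA
18;0
18;0
21;1
EOF

# Count rows per NU_IDADE, mirroring df.groupBy("NU_IDADE").count().sort(...)
# NR > 1 skips the header line; sort -n orders the ages numerically
awk -F';' 'NR > 1 {count[$1]++} END {for (k in count) print k, count[k]}' sample.csv | sort -n
```

This prints one line per age with its row count (here "18 2" and "21 1"), which is the same shape of result the Spark job produces at cluster scale.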

Kafka

Connect Kafka Broker 1

docker exec -it kafka-broker1 bash

Create topic

kafka-topics.sh --create --zookeeper zookeeper:2181 --replication-factor 1 --partitions 1 --topic test

List topics

kafka-topics.sh --zookeeper zookeeper:2181 --list

Run Producer on Kafka Broker 1

kafka-console-producer.sh --bootstrap-server kafka-broker1:9091 --topic test

Enter data

>Hello

Connect Kafka Broker 2

docker exec -it kafka-broker2 bash

Run Consumer on Kafka Broker 2

kafka-console-consumer.sh --bootstrap-server kafka-broker1:9091 --from-beginning --topic test

Delete topic

kafka-topics.sh --zookeeper zookeeper:2181 --delete --topic test

License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.