Spawns a Hadoop master node and 4 worker nodes (all in docker) to count bigrams in the given input file


Dockerized Hadoop cluster and BigramCounter

##1. Project Introduction

This project is based on the kiwenlau/hadoop-cluster-docker project. We have restructured it so that, once the cluster is up, it runs a BigramCounter job and computes some bigram statistics.
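To make the counting step concrete, here is a minimal pure-Python sketch of bigram counting in map/reduce style. It is not the project's Java job. The function names `map_bigrams`, `reduce_counts` and `count_bigrams` are hypothetical, and the sketch assumes that tokens are split on whitespace (punctuation stays attached) and that bigrams do not span line breaks.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def map_bigrams(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (bigram, 1) for each adjacent pair of tokens in a line."""
    tokens = line.split()  # assumption: whitespace tokenization
    for first, second in zip(tokens, tokens[1:]):
        yield f"{first} {second}", 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce step: sum the emitted counts for each distinct bigram."""
    totals: Dict[str, int] = defaultdict(int)
    for bigram, count in pairs:
        totals[bigram] += count
    return dict(totals)


def count_bigrams(lines: Iterable[str]) -> Dict[str, int]:
    """Run the map step over every line, then reduce all emitted pairs."""
    pairs = (pair for line in lines for pair in map_bigrams(line))
    return reduce_counts(pairs)


if __name__ == "__main__":
    sample = ["in the city of the", "of the sea"]
    for bigram, count in sorted(count_bigrams(sample).items()):
        print(bigram, count)
```

On a real cluster the map and reduce steps run on different nodes and the framework groups the emitted pairs by key between them; here both steps run in a single process.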

Docker Images required/built:

| Image Name |
| --- |
| kiwenlau/serf-dnsmasq |
| kiwenlau/hadoop-base |
| kiwenlau/hadoop-slave |
| saurabh/hadoop-master |

##2. Images description

This project uses 4 Docker images: serf-dnsmasq, hadoop-base, hadoop-master and hadoop-slave.

#####serf-dnsmasq

Based on ubuntu:15.04. serf and dnsmasq are installed to provide DNS service for the Hadoop cluster.

#####hadoop-base

Based on serf-dnsmasq. openjdk, openssh-server, vim and Hadoop 2.3.0 are installed.

#####hadoop-master

Based on hadoop-base. This is reconfigured to spawn 4 worker nodes and run the bigram counter.

#####hadoop-slave

Based on hadoop-base. This configures the Hadoop slave node.

The following picture shows the image architecture of the project:

![alt text](https://github.com/saurabh3949/Dockerized-Hadoop-with-Bigram-Counter/raw/master/image architecture.jpg "Image Architecture")

##3. Usage

#####Clone the repository and run:

bash start.sh

##4. Input Text file

The master node uses a reduced version of mobydick.txt (in the hadoop-master/files directory) by default. You can replace the contents of this file to run the analysis on a different input.

##5. Output

The script generates the following output with the reduced mobydick.txt:

             _               _
  ___  _   _| |_ _ __  _   _| |_
 / _ \| | | | __| '_ \| | | | __|
| (_) | |_| | |_| |_) | |_| | |_
 \___/ \__,_|\__| .__/ \__,_|\__|
                |_|


Total number of bigrams: 554

Most common bigram: ('of the', 5)

Top 10 bigrams:
('of the', 5)
('in the', 4)
('I find', 2)
('What do', 2)
('city of', 2)
('find myself', 2)
('in my', 2)
('is a', 2)
('to get', 2)
('American desert,', 1)

No. of bigrams required to add up to 10% of all bigrams: 42
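The last statistic can be read as the smallest number of bigrams, taken from most to least frequent, whose counts together reach 10% of all bigram occurrences. The sketch below computes it under that reading. It is not the project's own script: `rank_bigrams` and `bigrams_needed_for_share` are hypothetical names, and breaking ties alphabetically is an assumption, although it agrees with the order of the top-10 list above.

```python
from typing import Dict, List, Tuple


def rank_bigrams(counts: Dict[str, int]) -> List[Tuple[str, int]]:
    """Sort by descending count; ties are broken alphabetically (assumption)."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def bigrams_needed_for_share(counts: Dict[str, int], share: float = 0.10) -> int:
    """Smallest number of top-ranked bigrams whose counts reach `share` of the total."""
    total = sum(counts.values())
    running = 0
    for needed, (_, count) in enumerate(rank_bigrams(counts), start=1):
        running += count
        if running >= share * total:
            return needed
    return 0  # only reached when there are no bigrams at all


if __name__ == "__main__":
    toy = {"of the": 3, "in the": 2, "a b": 1, "b c": 1, "c d": 1}
    print(rank_bigrams(toy)[:2])
    print(bigrams_needed_for_share(toy, share=0.5))
```

This reading is consistent with the output above. The nine bigrams that occur more than once account for 5 + 4 + 7 × 2 = 23 occurrences, and every other bigram occurs once. Ten percent of 554 is 55.4, so 33 more single-occurrence bigrams are needed, giving 9 + 33 = 42.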

##6. Contributors

saurabh3949
