
spark-fim-2s

Frequent Itemsets Mining Algorithms implemented under Spark RDD platform

General

Parallel versions of the Apriori, FP-Growth, and ECLAT algorithms implemented on the Spark RDD platform. The implementations do not use the Spark MLlib library and can run on clusters inside Docker containers.
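To illustrate the task these algorithms solve, here is a minimal, Spark-free sketch of frequent itemset counting (the function name and toy transactions are invented for illustration; the repository's actual implementations distribute this work over Spark RDDs):

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return itemsets (up to size 2) whose relative support >= min_support.

    A toy, in-memory stand-in for what Apriori/FP-Growth/ECLAT compute;
    the real implementations parallelize this counting on Spark RDDs.
    """
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for item in items:
            counts[(item,)] += 1           # count 1-itemsets
        for pair in combinations(items, 2):
            counts[pair] += 1              # count 2-itemsets
    return {iset: c / n for iset, c in counts.items() if c / n >= min_support}

# Toy transactions in the same spirit as the FIMI datasets.
transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
print(frequent_itemsets(transactions, min_support=0.6))
```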

Installation

The following steps will get the code running on your Spark cluster's containers. This setup assumes a configuration of one master and two workers.

Prerequisites

  • Docker installed
  • Docker compose installed

Clone to local

cd docker
git clone https://github.com/WangZX-0630/spark-fim-2s

Build image

cd spark-fim-2s
docker build -t zxwang/spark-hadoop-2s:3.3.2 .

Getting Started

Start cluster

docker-compose up -d

Start the SSH and Hadoop services on all cluster nodes

Passwordless SSH login is required, so the SSH service must be started on every node. In addition, the Hadoop services must also be started on all nodes.

./start.sh

Enter the master node

docker exec -it spark-fim-2s-spark-1 bash

Run test

Run a simple test to check whether the cluster is running properly. The test code, stored in the share/test.py file, prints the number of nodes in the cluster.

root@master:/opt >> spark-submit --master spark://master:7077 share/test.py

Datasets

The following datasets are used in the test.

chess.dat
kosarak.dat
mushroom.dat
retail.dat

These datasets come from the FIMI repository. You can download them from the website and put them under the share/data directory.

wget http://fimi.ua.ac.be/data/chess.dat.gz
wget http://fimi.ua.ac.be/data/kosarak.dat.gz
wget http://fimi.ua.ac.be/data/mushroom.dat.gz
wget http://fimi.ua.ac.be/data/retail.dat.gz
gunzip *.gz
mv *.dat share/data
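The FIMI .dat files store one transaction per line as space-separated integer item IDs. A minimal parser for that format (the function name here is ours, not from the repository) might look like:

```python
def parse_fimi(lines):
    """Parse FIMI-style .dat lines (space-separated integer item IDs,
    one transaction per line) into lists of item IDs."""
    transactions = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            transactions.append([int(tok) for tok in line.split()])
    return transactions

sample = ["1 3 5", "2 3", "", "1 2 3 5"]
print(parse_fimi(sample))  # [[1, 3, 5], [2, 3], [1, 2, 3, 5]]
```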

Run algorithm codes

Put data sets into HDFS

Put the datasets into HDFS. The datasets are stored in the share/data directory. Here, chess.dat is put into HDFS.

root@master:/opt >> hadoop fs -put share/data/chess.dat /

Run fp-growth

Run the FP-Growth algorithm on a single dataset, chess.dat, with a minimum support of 0.3. The time cost will be printed to the console.

root@master:/opt >> spark-submit --master spark://master:7077 share/fp-growth.py chess.dat 0.3
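The repository's fp-growth.py is not reproduced here, but the first pass of any FP-Growth implementation counts item frequencies and discards items below the minimum support, ordering the survivors by descending frequency before building the FP-tree. A plain-Python sketch of that first pass (function name and data are illustrative):

```python
from collections import Counter

def frequent_items(transactions, min_support):
    """FP-Growth first pass: keep items whose relative support >= min_support,
    sorted by descending count (the order used to build the FP-tree)."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in set(t))
    frequent = [(item, c) for item, c in counts.items() if c / n >= min_support]
    frequent.sort(key=lambda ic: (-ic[1], ic[0]))  # count desc, then item id
    return frequent

transactions = [[1, 2, 3], [2, 3], [1, 2], [2], [1, 3]]
print(frequent_items(transactions, min_support=0.3))  # [(2, 4), (1, 3), (3, 3)]
```

In the distributed version, this counting step becomes a map over transactions followed by a reduce-by-key on item counts.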

Run batches of test automatically

In auto_test.py, you can set the minimum support values and the datasets you want to run. The performance results will be stored in the share/result.txt file.

root@master:/opt >> python share/auto_test.py
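A batch runner like this typically sweeps the cross product of datasets and support thresholds. The following is a hypothetical sketch of that loop (auto_test.py's real parameter names and thresholds may differ):

```python
import itertools

# Illustrative parameter grid; edit to match the experiments you want.
datasets = ["chess.dat", "mushroom.dat", "retail.dat", "kosarak.dat"]
min_supports = [0.5, 0.4, 0.3]

def build_commands(datasets, min_supports, script="share/fp-growth.py"):
    """Build one spark-submit command per (dataset, min_support) pair."""
    return [
        f"spark-submit --master spark://master:7077 {script} {ds} {ms}"
        for ds, ms in itertools.product(datasets, min_supports)
    ]

commands = build_commands(datasets, min_supports)
print(len(commands))  # 12 runs
# Each command could then be executed with subprocess.run(cmd.split()),
# appending the measured time cost to share/result.txt.
```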

Extend the number of cluster nodes

You can extend the number of cluster nodes with the following steps:

  1. Create a new directory for the new image, for example spark-fim-4s, and copy all files from spark-fim-2s into it.
cd docker
mkdir spark-fim-4s
cp -r spark-fim-2s/* spark-fim-4s
  2. In config/workers, add more workers.
worker1
worker2
worker3
worker4
  3. Build a new image with a new tag, such as zxwang/spark-hadoop-4s:3.3.2 if two more workers are added.
docker build -t zxwang/spark-hadoop-4s:3.3.2 .
  4. Modify the docker-compose.yml file: update the image name and volumes, and add more Spark workers.

For example, you can add two more worker nodes by adding the following lines to the docker-compose.yml file and updating the image and volumes.

  spark-worker-3:
    image: zxwang/spark-hadoop-4s:3.3.2
    hostname: worker3
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - ~/docker/spark-fim-4s/share:/opt/share
    ports:
      - '8083:8081'
  spark-worker-4:
    image: zxwang/spark-hadoop-4s:3.3.2
    hostname: worker4
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - ~/docker/spark-fim-4s/share:/opt/share
    ports:
      - '8084:8081'
  5. Modify the start.sh file as follows:
docker exec -it spark-fim-4s-spark-1 sh start-hadoop.sh
docker exec -it spark-fim-4s-spark-worker-1-1 sh start-hadoop.sh
docker exec -it spark-fim-4s-spark-worker-2-1 sh start-hadoop.sh
docker exec -it spark-fim-4s-spark-worker-3-1 sh start-hadoop.sh
docker exec -it spark-fim-4s-spark-worker-4-1 sh start-hadoop.sh
  6. Run docker-compose up -d to start the cluster.

Directory Structure

spark-fim-2s
├── Dockerfile
├── config
│   ├── core-site.xml
│   ├── hadoop-env.sh
│   ├── hdfs-site.xml
│   ├── mapred-site.xml
│   ├── ssh_config
│   ├── workers
│   └── yarn-site.xml
├── docker-compose.yml
├── share
│   ├── apriori.py
│   ├── auto_test.py
│   ├── data
│   │   ├── chess.dat
│   │   ├── kosarak.dat
│   │   ├── mushroom.dat
│   │   └── retail.dat
│   ├── fp-growth.py
│   ├── requirements.txt
│   ├── result.txt
│   └── test.py
├── start-hadoop.sh
└── start.sh

spark-fim-2s's People

Contributors

wangzx-0630
