
Fake Streaming Pipeline

This project is designed to simulate a real-time data pipeline.

It generates fake real-time data, such as Spotify-style users and songs, sends it as a stream through a Kafka server, processes it in real time using PySpark (Structured Streaming), writes the processed data to a Cassandra database, and visualizes the results in Grafana.

The simulation includes data for the user (such as name, address, age, and nationality), the song being played (including artist, album name, track name, track duration, and genre), and the time of play.

The data processing includes selecting certain fields, transforming some of the data, and computing the average age of listeners for each artist.
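The per-artist average-age aggregation can be illustrated in plain Python (the field names here are assumptions for illustration, not the project's exact schema; the real pipeline does this with PySpark):

```python
from collections import defaultdict

# Hypothetical play events in the shape described above
# (field names are assumptions, not the project's exact schema).
events = [
    {"user_name": "Alice", "age": 25, "artist": "Radiohead", "track": "Creep"},
    {"user_name": "Bob",   "age": 35, "artist": "Radiohead", "track": "Karma Police"},
    {"user_name": "Carol", "age": 30, "artist": "Daft Punk", "track": "One More Time"},
]

def average_age_by_artist(events):
    """Compute the average listener age for each artist."""
    totals = defaultdict(lambda: [0, 0])  # artist -> [age_sum, play_count]
    for e in events:
        totals[e["artist"]][0] += e["age"]
        totals[e["artist"]][1] += 1
    return {artist: age_sum / count for artist, (age_sum, count) in totals.items()}

print(average_age_by_artist(events))  # {'Radiohead': 30.0, 'Daft Punk': 30.0}
```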


Structure

(architecture diagram)

Why Fake Data?

In a real-world scenario, you would likely have a consistent stream of data coming from a source such as a web app, IoT devices, or logs. However, setting up such a source for the purpose of demonstrating or testing a data pipeline can be challenging, especially when you want a large volume of data. You might not have access to enough real data, or there could be privacy issues with using real user data.

This is where Faker comes in. Faker is a Python library that generates fake data. You can use it to generate data that mimics a variety of real-world data types. By using Faker, you can easily generate a large volume of realistic-looking data for your data pipeline.

Benefits

  • Control over the volume of data

    You can generate as much or as little data as you need, simply by running the Faker function more or fewer times.

  • Privacy

    Since the data is all fake, there are no privacy concerns.

  • Variety

    Faker can generate a wide range of data types, allowing you to simulate a wide range of real-world scenarios.

  • Consistency

    The data generated by Faker is consistent in format, making it easier to process.

Getting Started

This project is set up to run in a dockerized environment, making it easy to get up and running.

Prerequisites

Before starting, ensure you have the following installed:

  • Python 3.10
  • Pipenv
  • Docker and Docker Compose

Installation

  1. Clone the repository and move into its directory

    git clone https://github.com/OZOOOOOH/Fake-Streaming-Pipeline.git
    
    cd Fake-Streaming-Pipeline
  2. Set up Python virtual environment

    2-1. Install all necessary dependencies

    pipenv install

    2-2. Activate the new virtual environment

    pipenv shell
  3. Start Docker containers

    docker-compose up -d
  4. Run the application

    python src/main.py
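Step 4 starts the generate-and-produce loop. A minimal sketch of such a producer, assuming the kafka-python package, a broker on localhost:9092 from the Docker setup, and a hypothetical topic name (not necessarily what main.py does exactly):

```python
import json
import time

def serialize(event: dict) -> bytes:
    """JSON-encode an event for Kafka (datetimes fall back to str)."""
    return json.dumps(event, default=str).encode("utf-8")

if __name__ == "__main__":
    # Requires a running broker; see `docker-compose up -d` above.
    from kafka import KafkaProducer  # pip package: kafka-python

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    while True:
        event = {"user_name": "Alice", "artist": "Radiohead", "age": 25}
        producer.send("spotify-plays", serialize(event))  # topic name is an assumption
        time.sleep(1)
```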

Usage

If you follow the steps above, you can monitor the Kafka server and the Cassandra database to see the data arrive in real time. You can also query the Cassandra database to analyze the processed data.

How to

To monitor the Kafka server

(Kafka monitoring screenshot)

To watch the Grafana dashboard

(Grafana dashboard screenshot)

Data Processing Details

The SpotifyStreamingProcessor class in process.py reads the data from the Kafka stream, processes it, and writes it to Cassandra.

The processing involves selecting certain fields, transforming some of the data, and computing the average listener age for each artist. The processed data and the average age of listeners by artist are then written to Cassandra.

Roadmap

  • Enhance the structure of the application to handle larger volumes of data.
  • Adapt the application to run on cloud services such as AWS.
  • Implement a batch processing pipeline.
  • Extend the application to handle data from multiple sources.
  • Monitor the Cassandra and Kafka containers with Grafana.

Acknowledgments

Projects

Libraries

Dataset
