Data-pipeline

Architecture

[Architecture diagram: data-pipeline-new]

Introduction

In this project, you will run an end-to-end data pipeline on real-time order data using Kafka and the ELK stack, orchestrated with Docker Compose.
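The compose file itself is not shown in this README, so here is a minimal sketch of the kind of docker-compose.yml such a pipeline typically uses. The image tags, ports, and environment values below are illustrative assumptions, not the project's actual configuration:

```yaml
version: "3"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:7.4.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # Advertise localhost so host-side clients (the Python producer and
      # Logstash, which runs on the host per this README) can connect.
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.9.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"

  kibana:
    image: docker.elastic.co/kibana/kibana:8.9.0
    depends_on:
      - elasticsearch
    ports:
      - "5601:5601"
```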

Technology Used

🔹 Python3

🔹 Docker Compose

🔹 Apache Kafka

🔹 ELK Stack

Dependencies

▪️ For Python dependencies, see requirement.txt.

▪️ Logstash is not included in the docker-compose setup, so it must be installed and configured directly on your system.

▪️ In Logstash, install the Kafka integration plugin: `/usr/share/logstash/bin/logstash-plugin install logstash-integration-kafka`
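With the plugin installed, Logstash needs a pipeline that reads from Kafka and writes to Elasticsearch. A minimal sketch follows; the broker address, topic, and index names are assumptions for illustration, not the project's actual pipeline:

```conf
input {
  kafka {
    bootstrap_servers => "localhost:9092"   # assumed broker address
    topics            => ["orders"]         # assumed topic name
    codec             => "json"
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]      # assumed Elasticsearch endpoint
    index => "orders"                       # assumed index name
  }
}
```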

Kafka Architecture

[Kafka architecture diagram]

Kafka is a distributed streaming platform that can handle real-time data feeds. It was initially developed at LinkedIn and later open-sourced through the Apache Software Foundation. Kafka achieves its high throughput and fault tolerance by distributing load across multiple servers.

Key Components:

Producer: The producer creates records and sends them to the Kafka cluster. It is decoupled from consumers and can publish data at high speed (see the producer/consumer sketch after this list).

Consumer: The consumer is responsible for consuming the data produced by the producer. It connects to the Kafka cluster and subscribes to specific topics.

Topic: A topic is a category or feed name to which the records are published. Topics are used to organize the data into categories.

Broker: A broker is a Kafka server that receives the records from producers and serves them to consumers. A Kafka cluster can consist of multiple brokers.

Zookeeper: Zookeeper is a centralized service for maintaining configuration information and providing synchronization and coordination. In a Kafka cluster, Zookeeper helps in electing the cluster's controller and maintaining the broker and partition state.
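To make the producer, consumer, and topic roles concrete, here is a minimal sketch using the kafka-python client (pip install kafka-python). The broker address, "orders" topic, and message fields are illustrative assumptions, not this project's actual code:

```python
import json
import time

from kafka import KafkaConsumer, KafkaProducer

BROKER = "localhost:9092"  # assumed broker address
TOPIC = "orders"           # assumed topic name

# Producer: serialize each order dict to JSON and publish it to the topic.
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
order = {"order_id": 1, "item": "keyboard", "price": 49.99, "ts": time.time()}
producer.send(TOPIC, value=order)
producer.flush()  # block until the record is actually delivered

# Consumer: subscribe to the topic and deserialize records back into dicts.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",  # read from the beginning of the topic
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,      # stop iterating after 5 s of silence
)
for record in consumer:
    print(record.topic, record.value)
```

In the full pipeline, the consumer role is played by Logstash (via the Kafka input plugin shown earlier) rather than a hand-written client.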

Kafka offers several advantages:

🔶 Horizontal scalability: Kafka can handle high volumes of data with a scalable and distributed architecture.

🔶 High throughput: Kafka can handle millions of records per second.

🔶 Fault-tolerance: Kafka ensures data durability and reliability by replicating data across multiple nodes.

🔶 Low latency: Kafka allows real-time processing of data with minimal latency.

If you encounter any issues or have suggestions for improvements, please feel free to contribute or report them on the GitHub repository. We welcome any feedback to enhance the project further.
