

Moringa-wk9 : Apache Kafka and Streamlit exploration

Wednesday 26-April Project Brief: Data Streaming with Kafka


Background: Telecommunications Mobile Money Data Engineering with Kafka

In this project, you will work with telecommunications mobile money data to build a Kafka data engineering solution. You will be provided with a dummy JSON file containing sample data that you will use to test your solution.

The project aims to build a Kafka pipeline that can receive real-time data from telecommunications mobile money transactions and process it for analysis. The pipeline should be designed to handle high volumes of data and ensure that the data is processed efficiently.

To complete this project, you will need to follow these steps:
1. Set up a Kafka cluster: You must set up a Kafka cluster that can handle high volumes of data. You can use either a cloud-based or on-premises Kafka cluster.
2. Develop a Kafka producer: You must develop a Kafka producer that can ingest data from telecommunications mobile money transactions and send it to the Kafka cluster. The producer should be designed to handle high volumes of data and ensure that the data is sent to the Kafka cluster efficiently.
3. Develop a Kafka consumer: You must develop a Kafka consumer to receive data from the Kafka cluster and process it for analysis. The consumer should be designed to handle high volumes of data and ensure that the data is processed efficiently.
4. Process the data: Once you have set up the Kafka pipeline, you must process the data for analysis. This may involve cleaning and aggregating the data, performing calculations, and creating visualizations.
5. Test the solution: You must test your solution using the provided dummy JSON file. The file contains sample data that you can use to verify that your Kafka pipeline is working correctly.

Here’s the dummy JSON file that represents our mobile money data.
{
    "transaction_id": "12345",
    "sender_phone_number": "256777123456",
    "receiver_phone_number": "256772987654",
    "transaction_amount": 100000,
    "transaction_time": "2023-04-19 12:00:00"
}
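
A producer for this pipeline could be sketched as follows. This is a minimal sketch, assuming the kafka-python client library (the attached .py file may use a different client); the connection values mirror the placeholder entries shown in the setup steps below and must be replaced with your own Confluent Cloud details.

```python
import json

# Placeholder connection details, as in the project's .py file; replace with
# the values generated for your own Confluent Cloud cluster instance.
CONF = {
    "bootstrap_servers": "#YOUR_URL#.confluent.cloud:9092",
    "security_protocol": "SASL_SSL",
    "sasl_mechanism": "PLAIN",
    "sasl_plain_username": "#YOUR_USERNAME#",
    "sasl_plain_password": "#YOUR_PASSWORD#",
}
TOPIC = "my_pipeline"


def serialize(txn: dict) -> bytes:
    """Encode one mobile money transaction as UTF-8 JSON bytes for Kafka."""
    return json.dumps(txn).encode("utf-8")


def send_transaction(txn: dict) -> None:
    """Send one transaction to the Kafka topic (requires a live cluster)."""
    # kafka-python is imported lazily so the module loads without a broker.
    from kafka import KafkaProducer

    producer = KafkaProducer(value_serializer=serialize, **CONF)
    producer.send(TOPIC, txn)
    producer.flush()
```

With a running cluster, calling `send_transaction` with the dummy transaction above publishes it to the `my_pipeline` topic.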

Steps to set up the pipeline

1. Go to https://confluent.cloud/ and set up a Kafka cluster and topic.
2. Get the connection details for your cluster instance.
3. In the attached .py file, find the code section with the entries below and update the connection details to reflect those generated for your own Confluent cluster instance.

bootstrap_servers = '#YOUR_URL#.confluent.cloud:9092'
security_protocol = 'SASL_SSL'
sasl_mechanism = 'PLAIN'
sasl_plain_username = '#YOUR_USERNAME#'
sasl_plain_password = '#YOUR_PASSWORD#'
topic = 'my_pipeline'

4. Run the .py file to start the streaming pipeline.
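
The consuming side of the pipeline can be sketched in the same spirit. This is a hedged example, again assuming kafka-python; the `is_large` flag and its threshold are hypothetical, standing in for whatever cleaning and aggregation your analysis step performs.

```python
import json


def process(record_value: bytes) -> dict:
    """Decode one transaction and add a simple derived field for analysis."""
    txn = json.loads(record_value.decode("utf-8"))
    # Hypothetical enrichment: flag unusually large transfers (threshold assumed).
    txn["is_large"] = txn["transaction_amount"] >= 1_000_000
    return txn


def consume() -> None:
    """Read transactions from the topic and process each one (needs a cluster)."""
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "my_pipeline",
        bootstrap_servers="#YOUR_URL#.confluent.cloud:9092",
        security_protocol="SASL_SSL",
        sasl_mechanism="PLAIN",
        sasl_plain_username="#YOUR_USERNAME#",
        sasl_plain_password="#YOUR_PASSWORD#",
        auto_offset_reset="earliest",
    )
    for msg in consumer:
        print(process(msg.value))
```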

========================================================================================

Thursday 27-April Project Brief: Visualizing streaming data with Streamlit


Introduction

In this project, you will create a real-time data visualization dashboard using Streamlit to analyze streaming data from Reddit to identify fraud in telecommunications. The project will involve connecting to Reddit's API, collecting real-time posts, processing the posts to extract useful information, and visualizing the data using Streamlit.

Problem Statement

Fraud in telecommunications is a significant problem that costs the industry billions of dollars annually. Fraudsters use various techniques to exploit telecom infrastructure weaknesses, including hacking into phone systems, stealing identities, and exploiting vulnerabilities in billing systems. The challenge for telecom companies is to detect and prevent fraud in real time, before it causes significant financial damage.

Your task is to develop a real-time data visualization dashboard that monitors Reddit for mentions of telecoms fraud and other related keywords, such as "telecoms scam", "phone fraud", "billing fraud", and "identity theft". You will extract useful information from the posts, such as the post text, user name, subreddit, and date/time, and use this information to analyze the data for patterns and trends related to telecom fraud.

Project Requirements

● Connect to Reddit's API and collect real-time posts related to telecom fraud and other related keywords.
● Process the posts to extract useful information, including the post text, user name, subreddit, and date/time.
● Analyze the data to identify patterns and trends related to telecom fraud and other related keywords.
● Use Streamlit to create an interactive data visualization dashboard that displays real-time information about telecom fraud and other related keywords.
● The dashboard should include at least one chart or graph that displays the data meaningfully, e.g., a bar chart showing the number of fraud mentions by subreddit or a line chart showing the frequency of fraud mentions over time.
● The dashboard should be easy to use and visually appealing, with clear and concise labels and instructions.
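
The keyword analysis and chart requirements above can be sketched as below. This is a minimal illustration, not the actual streamlit_app.py: the post-dict shape (`text`, `subreddit` keys) is an assumption, and in the real app the posts would come from Reddit's API (e.g., via a client such as PRAW) rather than an in-memory list.

```python
from collections import Counter

# Keywords from the project brief.
KEYWORDS = ["telecoms scam", "phone fraud", "billing fraud", "identity theft"]


def count_mentions(posts: list[dict]) -> Counter:
    """Count posts mentioning any fraud keyword, grouped by subreddit.

    Each post is assumed to be a dict with "text" and "subreddit" keys.
    """
    counts: Counter = Counter()
    for post in posts:
        text = post["text"].lower()
        if any(kw in text for kw in KEYWORDS):
            counts[post["subreddit"]] += 1
    return counts


def render_dashboard(posts: list[dict]) -> None:
    """Render a bar chart of fraud mentions by subreddit in Streamlit."""
    # Streamlit and pandas are runtime dependencies of the dashboard itself.
    import pandas as pd
    import streamlit as st

    counts = count_mentions(posts)
    df = pd.DataFrame(
        {"subreddit": list(counts.keys()), "mentions": list(counts.values())}
    ).set_index("subreddit")
    st.title("Telecom fraud mentions on Reddit")
    st.bar_chart(df)
```

Running this module through `streamlit run` with a feed of collected posts would produce the bar chart of fraud mentions by subreddit described in the requirements.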

Deliverables

● Python script to collect and process real-time posts from Reddit API.
● Interactive data visualization dashboard created using Streamlit.
● Deployment of the dashboard to a cloud-based platform.

Steps to access the dashboard

The application code is in the file streamlit_app.py.
The libraries required to run the dashboard are listed in requirements.txt.
The dashboard is accessible at https://joekibz-moringa-wk9-streamlit-app-sywmbo.streamlit.app/

moringa-wk9's People

Contributors: joekibz

Watchers: Kostas Georgiou
