In this project, you will work with telecommunications mobile money data to build a Kafka data engineering solution. You will be provided with a dummy JSON file containing sample data that you will use to test your solution.
The project aims to build a Kafka pipeline that can receive real-time data from telecommunications mobile money transactions and process it for analysis. The pipeline should be designed to handle high volumes of data and ensure that the data is processed efficiently.
To complete this project, you will need to follow these steps:
1. Set up a Kafka cluster: You must set up a Kafka cluster that can handle high volumes
of data. You can use either a cloud-based or on-premises Kafka cluster.
2. Develop a Kafka producer: You must develop a Kafka producer that can ingest data
from telecommunications mobile money transactions and send it to the Kafka cluster.
The producer should be designed to handle high volumes of data and ensure that the
data is sent to the Kafka cluster efficiently.
3. Develop a Kafka consumer: You must develop a Kafka consumer to receive data from
the Kafka cluster and process it for analysis. The consumer should be designed to
handle high volumes of data and ensure that the data is processed efficiently.
4. Process the data: Once you have set up the Kafka pipeline, you must process the data
for analysis. This may involve cleaning and aggregating the data, performing
calculations, and creating visualizations.
5. Test the solution: You must test your solution using the provided dummy JSON file. The
file contains sample data that you can use to ensure that your Kafka pipeline is working
correctly.
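As a rough illustration of steps 2 to 4, the sketch below shows the serialise, deserialise and aggregate logic that would sit around a Kafka client. It is not the project's actual code: the function names are hypothetical, a plain Python list stands in for the Kafka topic, and the second transaction is invented for illustration. The field names follow the dummy JSON record provided with this project.

```python
import json
from collections import defaultdict


def serialize(record: dict) -> bytes:
    """Encode one transaction as UTF-8 JSON, the form a producer would send."""
    return json.dumps(record).encode("utf-8")


def deserialize(message: bytes) -> dict:
    """Decode one message value back into a dict, as a consumer would."""
    return json.loads(message.decode("utf-8"))


def total_sent_by_sender(messages) -> dict:
    """Sum transaction_amount per sender_phone_number over raw message values."""
    totals = defaultdict(int)
    for message in messages:
        record = deserialize(message)
        totals[record["sender_phone_number"]] += record["transaction_amount"]
    return dict(totals)


# In-memory stand-in for a Kafka topic (assumption: no broker is contacted).
topic = []
transactions = [
    {
        "transaction_id": "12345",
        "sender_phone_number": "256777123456",
        "receiver_phone_number": "256772987654",
        "transaction_amount": 100000,
        "transaction_time": "2023-04-19 12:00:00",
    },
    # Hypothetical second record, invented purely for illustration.
    {
        "transaction_id": "12346",
        "sender_phone_number": "256777123456",
        "receiver_phone_number": "256772987654",
        "transaction_amount": 50000,
        "transaction_time": "2023-04-19 12:05:00",
    },
]
for txn in transactions:
    topic.append(serialize(txn))  # a real producer would send these bytes

print(total_sent_by_sender(topic))
```

Keeping the encoding and aggregation in plain functions like these makes them testable without a running cluster; the producer and consumer then only need to move bytes.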
Here’s the dummy JSON file that represents our mobile money data.
{
    "transaction_id": "12345",
    "sender_phone_number": "256777123456",
    "receiver_phone_number": "256772987654",
    "transaction_amount": 100000,
    "transaction_time": "2023-04-19 12:00:00"
}
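Before a record like this is processed, it may be worth validating it. The following is a minimal sketch, assuming every record carries the five fields shown above and that transaction_time always uses the "YYYY-MM-DD HH:MM:SS" layout of the sample. The function name and the validation rules are assumptions, not part of the project brief.

```python
from datetime import datetime

REQUIRED_FIELDS = (
    "transaction_id",
    "sender_phone_number",
    "receiver_phone_number",
    "transaction_amount",
    "transaction_time",
)


def validate_transaction(record: dict) -> dict:
    """Return a cleaned copy of the record, or raise ValueError if it is invalid."""
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    amount = record["transaction_amount"]
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount <= 0:
        raise ValueError("transaction_amount must be a positive number")

    cleaned = dict(record)
    # strptime raises ValueError itself if the timestamp does not match the layout.
    cleaned["transaction_time"] = datetime.strptime(
        record["transaction_time"], "%Y-%m-%d %H:%M:%S"
    )
    return cleaned


sample = {
    "transaction_id": "12345",
    "sender_phone_number": "256777123456",
    "receiver_phone_number": "256772987654",
    "transaction_amount": 100000,
    "transaction_time": "2023-04-19 12:00:00",
}
print(validate_transaction(sample)["transaction_time"])
```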
Steps to set up the pipeline
1- Go to https://confluent.cloud/ and set up a Kafka cluster and topic
2- Get the connection details for your cluster instance
3- In the attached .py file, find the code section with the entries below. Update them to reflect the connection details generated for your own Confluent cluster instance.
bootstrap_servers = '#YOUR_URL#.confluent.cloud:9092'
security_protocol = 'SASL_SSL'
sasl_mechanism = 'PLAIN'
sasl_plain_username = '#YOUR_USERNAME#'
sasl_plain_password = '#YOUR_PASSWORD#'
topic = 'my_pipeline'
4- Run the .py file to start the streaming pipeline
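One possible alternative to editing credentials directly in the .py file is to read them from environment variables. The sketch below is only an illustration: the environment variable names (KAFKA_BOOTSTRAP_SERVERS, KAFKA_USERNAME, KAFKA_PASSWORD) and the function name are hypothetical, while the setting names and the SASL_SSL / PLAIN values are the ones listed in step 3.

```python
import os


def kafka_settings_from_env(env=os.environ) -> dict:
    """Build the connection settings from environment variables.

    Raises KeyError if one of the (hypothetical) variables is not set.
    """
    return {
        "bootstrap_servers": env["KAFKA_BOOTSTRAP_SERVERS"],
        "security_protocol": "SASL_SSL",
        "sasl_mechanism": "PLAIN",
        "sasl_plain_username": env["KAFKA_USERNAME"],
        "sasl_plain_password": env["KAFKA_PASSWORD"],
    }


# Example with placeholder values; a real run would rely on os.environ instead.
settings = kafka_settings_from_env(
    {
        "KAFKA_BOOTSTRAP_SERVERS": "#YOUR_URL#.confluent.cloud:9092",
        "KAFKA_USERNAME": "#YOUR_USERNAME#",
        "KAFKA_PASSWORD": "#YOUR_PASSWORD#",
    }
)
print(sorted(settings))
```

If the attached .py file uses the kafka-python client, whose KafkaProducer accepts keyword arguments with these names, the resulting dict could be unpacked into it. That is an assumption about the file, not something stated in these steps.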
========================================================================================
In this project, you will create a real-time data visualization dashboard using Streamlit to analyze streaming data from Reddit to identify fraud in telecommunications. The project will involve connecting to Reddit's API, collecting real-time posts, processing the posts to extract useful information, and visualizing the data using Streamlit.
Fraud in telecommunications is a significant problem that costs the industry billions of dollars annually. Fraudsters use various techniques to exploit telecom infrastructure weaknesses, including hacking into phone systems, stealing identities, and exploiting vulnerabilities in billing systems. The challenge for telecom companies is to detect and prevent fraud in real-time before it causes significant financial damage.
Your task is to develop a real-time data visualization dashboard that monitors Reddit for mentions of telecoms fraud and other related keywords, such as "telecoms scam", "phone fraud", "billing fraud", and "identity theft". You will extract useful information from the posts, such as the post text, user name, subreddit, and date/time, and use this information to analyze the data for patterns and trends related to telecom fraud.
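The keyword monitoring described above could be sketched as follows. This is a minimal illustration, not the code in streamlit_app.py: it assumes each post has already been fetched and flattened into a plain dict with the hypothetical keys text, user_name, subreddit and created_utc (a Unix timestamp), and it uses simple case-insensitive substring matching.

```python
from datetime import datetime, timezone

# Keywords named in the task description.
KEYWORDS = (
    "telecoms fraud",
    "telecoms scam",
    "phone fraud",
    "billing fraud",
    "identity theft",
)


def matched_keywords(text: str) -> list:
    """Return the keywords that occur in the text, ignoring case."""
    lowered = text.lower()
    return [keyword for keyword in KEYWORDS if keyword in lowered]


def extract_mention(post: dict):
    """Return the fields of interest if the post mentions a keyword, else None."""
    hits = matched_keywords(post["text"])
    if not hits:
        return None
    return {
        "text": post["text"],
        "user_name": post["user_name"],
        "subreddit": post["subreddit"],
        # Convert the Unix timestamp to a timezone-aware UTC datetime.
        "created_at": datetime.fromtimestamp(post["created_utc"], tz=timezone.utc),
        "keywords": hits,
    }


# Hypothetical post, invented for illustration only.
example_post = {
    "text": "Got hit by a Phone Fraud call yesterday",
    "user_name": "example_user",
    "subreddit": "scams",
    "created_utc": 0,
}
print(extract_mention(example_post))
```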
● Connect to Reddit's API and collect real-time posts related to telecom fraud and other related keywords.
● Process the posts to extract useful information, including the post text, user name, subreddit, and date/time.
● Analyze the data to identify patterns and trends related to telecom fraud and other related keywords.
● Use Streamlit to create an interactive data visualization dashboard that displays real-time information about telecom fraud and other related keywords.
● The dashboard should include at least one chart or graph that displays the data meaningfully, e.g., a bar chart showing the number of fraud mentions by subreddit or a line chart showing the frequency of fraud mentions over time.
● The dashboard should be easy to use and visually appealing, with clear and concise labels and instructions.
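The two example charts mentioned above need counts per subreddit and counts over time. A minimal sketch of that aggregation follows, assuming each mention is a dict with the hypothetical keys subreddit and created_at (a datetime). The sample mentions are invented for illustration, and the chart rendering itself is left out.

```python
from collections import Counter
from datetime import datetime


def mentions_by_subreddit(mentions) -> dict:
    """Count fraud mentions per subreddit (data for a bar chart)."""
    return dict(Counter(mention["subreddit"] for mention in mentions))


def mentions_per_hour(mentions) -> dict:
    """Count mentions per hour bucket, keyed 'YYYY-MM-DD HH:00' in time order."""
    counts = Counter(
        mention["created_at"].strftime("%Y-%m-%d %H:00") for mention in mentions
    )
    return dict(sorted(counts.items()))


# Hypothetical mentions, invented for illustration only.
sample_mentions = [
    {"subreddit": "scams", "created_at": datetime(2023, 4, 19, 12, 5)},
    {"subreddit": "scams", "created_at": datetime(2023, 4, 19, 12, 40)},
    {"subreddit": "telecom", "created_at": datetime(2023, 4, 19, 13, 10)},
]
print(mentions_by_subreddit(sample_mentions))
print(mentions_per_hour(sample_mentions))
```

Counts in this shape could then be handed to a Streamlit chart such as st.bar_chart or st.line_chart; how the deployed dashboard actually builds its charts is not described here.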
● Python script to collect and process real-time posts from Reddit API.
● Interactive data visualization dashboard created using Streamlit.
● Deployment of the dashboard to a cloud-based platform.
The application code is in the file streamlit_app.py
The libraries needed to run the dashboard are listed in the file requirements.txt
The dashboard is accessible at the URL https://joekibz-moringa-wk9-streamlit-app-sywmbo.streamlit.app/