- Load both datasets into PySpark DataFrames, and clean and preprocess the data in each DataFrame.
- Extract the week number, quarter, and hour from the transaction date and add these as new columns to the sales DataFrame.
- Calculate total sales by month, product, and the total sales amount for each customer.
- Identify the top 10 customers with the highest total sales amount.
- Generate a CSV for total sales by customer with email address and phone number.
- Export the results to a CSV file by month/year.
PySpark script : Code
- Combine the data from all the JSON files into a single, standardized JSON format.
- Handle any inconsistencies or missing data in the supplier files.
- Detect and resolve duplicate entries.
- Store the consolidated data in a central data repository or database.
Python script : Code
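
One possible shape for the consolidation steps above, using only the standard library. The field aliases, default values, the `(supplier, product_id)` deduplication key, and the `products` table are assumptions for illustration; the actual supplier files may use different fields.

```python
import json
import sqlite3
from pathlib import Path

# Hypothetical mapping from supplier-specific keys to standard field names.
FIELD_ALIASES = {
    "product_id": "product_id", "productId": "product_id", "id": "product_id",
    "name": "name", "product_name": "name",
    "price": "price", "unit_price": "price",
}
DEFAULTS = {"product_id": None, "name": "unknown", "price": None}


def load_supplier_files(folder):
    """Read every *.json file in folder; assumes each holds a JSON array."""
    return {
        path.stem: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(folder).glob("*.json"))
    }


def standardize(record, supplier):
    """Map one raw record onto the standard fields, filling gaps with defaults."""
    row = dict(DEFAULTS)
    for key, value in record.items():
        target = FIELD_ALIASES.get(key)
        if target is not None and value not in (None, ""):
            row[target] = value
    if row["product_id"] is not None:
        row["product_id"] = str(row["product_id"])
    try:
        row["price"] = float(row["price"]) if row["price"] is not None else None
    except (TypeError, ValueError):
        row["price"] = None  # unparseable price is treated as missing
    row["supplier"] = supplier
    return row


def consolidate(records_by_supplier):
    """Standardize all records and keep the first one per (supplier, product_id)."""
    seen = {}
    for supplier, records in records_by_supplier.items():
        for record in records:
            row = standardize(record, supplier)
            if row["product_id"] is None:
                continue  # no key to deduplicate on, so the record is skipped
            seen.setdefault((supplier, row["product_id"]), row)
    return list(seen.values())


def store(rows, conn):
    """Upsert consolidated rows into a SQLite table (stand-in for the database)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products (supplier TEXT, product_id TEXT, "
        "name TEXT, price REAL, PRIMARY KEY (supplier, product_id))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO products "
        "VALUES (:supplier, :product_id, :name, :price)",
        rows,
    )
    conn.commit()


# Example with in-memory data instead of files.
raw = {
    "supplier_a": [
        {"productId": 1, "product_name": "Widget", "unit_price": "9.5"},
        {"productId": 1, "product_name": "Widget", "unit_price": "9.5"},
    ],
    "supplier_b": [{"id": "7", "name": "", "price": None}, {"name": "no id"}],
}
consolidated = consolidate(raw)
conn = sqlite3.connect(":memory:")
store(consolidated, conn)
```

Keeping the first occurrence is only one way to resolve duplicates; preferring the most recently updated record would be equally reasonable if the files carry a timestamp.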
- Calculate the total sales amount for each customer for each month. The result should include the customer name, the month, and the total sales amount.
SQL code 1: Code
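
As a hedged illustration of the monthly aggregation described above, the query can be tried against an in-memory SQLite database from Python. The `customers` and `sales` tables, their columns, and the sample rows are assumptions; only the view name `monthlysales` comes from this document, and the linked SQL may target a different dialect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical schema and sample rows.
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE sales (customer_id INTEGER, sale_date TEXT, amount REAL);

    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO sales VALUES
        (1, '2023-01-05', 100.0),
        (1, '2023-01-20', 50.0),
        (1, '2023-02-01', 200.0),
        (2, '2023-01-10', 75.0);

    -- Total sales per customer per calendar month.
    CREATE VIEW monthlysales AS
    SELECT c.customer_name,
           strftime('%Y-%m', s.sale_date) AS month,
           SUM(s.amount) AS total_sales
    FROM sales AS s
    JOIN customers AS c ON c.customer_id = s.customer_id
    GROUP BY c.customer_name, strftime('%Y-%m', s.sale_date);
    """
)

monthly = conn.execute(
    "SELECT customer_name, month, total_sales "
    "FROM monthlysales ORDER BY customer_name, month"
).fetchall()
```

`strftime('%Y-%m', ...)` is SQLite-specific; other engines would use their own month-truncation function.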
The result, named monthlysales, is used as a view for the second task.
- Calculate the 3-month moving average of each customer's sales. The result should include the customer name, the month, and the moving average of sales for each customer. The moving average should be calculated as an average of the sales for the current month and the two previous months.
SQL code 2: Code
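
A minimal sketch of the moving average, again run through SQLite from Python. Here `monthlysales` is created as a plain table with made-up rows so the example is self-contained, and its column names are assumptions. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Stand-in for the monthlysales view, with hypothetical rows.
    CREATE TABLE monthlysales (customer_name TEXT, month TEXT, total_sales REAL);
    INSERT INTO monthlysales VALUES
        ('Alice', '2023-01', 100.0),
        ('Alice', '2023-02', 200.0),
        ('Alice', '2023-03', 300.0),
        ('Alice', '2023-04', 400.0),
        ('Bob',   '2023-01', 50.0);
    """
)

# Average of the current row and the two preceding rows, per customer.
moving_avg = conn.execute(
    """
    SELECT customer_name,
           month,
           AVG(total_sales) OVER (
               PARTITION BY customer_name
               ORDER BY month
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS moving_avg_sales
    FROM monthlysales
    ORDER BY customer_name, month
    """
).fetchall()
```

The `ROWS` frame counts rows, not calendar months, so this matches the stated definition only if every customer has one row for every month. If months can be missing, the gaps would need to be filled with zero-sales rows first. For the first two months of each customer, the average covers only the months available so far.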
Based on the given scenario, provide a detailed plan for setting up this cloud-based data pipeline. Specify:
- The cloud provider you would choose for this project.
  - We have selected AWS services for most of this pipeline.
- The specific cloud tools or services you would use for data collection, processing, and storage.
  - AWS Glue: to handle the various data formats and prepare the data for smoother cleaning and aggregation.
  - AWS Lambda: to transform the data (cleaning and aggregating) and prepare it for the next step, data analysis.
  - Amazon S3: to store processed and aggregated data. It acts as a data lake from which data can be retrieved efficiently, for example for real-time analytics.
  - Amazon Redshift: a data warehouse for long-term storage of larger, structured, historical datasets, optimized for analytical SQL queries rather than raw file storage like S3.
  - Microsoft Power BI or Tableau: data analytics, either real-time or scheduled (hourly, monthly, or yearly) depending on the needs.
- Any potential challenges or considerations that need to be addressed.
  - Scalability: AWS Glue suits simple ETL, but larger, high-traffic streams, such as Khairul Aming's live sales on Shopee, require more capable services such as Amazon Kinesis Data Streams and Amazon EMR.
  - Fault tolerance: as the e-commerce company grows, efficient monitoring of each data source and a load balancer may be needed.
  - Cost: AWS is known to be expensive under heavier load in certain services. Open-source tools such as Apache Iceberg, Apache Spark (compute), and Airflow or Mage are possible alternatives. If the company is starting small, it can pay for Amazon S3 for cloud storage while using open-source toolkits for the rest.
- Please outline your plan, including the architecture and the tools or services you would utilize for each step of the data pipeline.
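
As one illustration of where AWS Lambda could sit in the plan above, the sketch below shows a possible handler triggered by S3 upload notifications. It only reads the event and tags each uploaded object with a format. The actual cleaning and aggregation would follow and is not shown. The `FORMAT_BY_EXTENSION` mapping and the response shape are assumptions for the example.

```python
import json
from urllib.parse import unquote_plus

# Hypothetical mapping from file extension to the format a later step would parse.
FORMAT_BY_EXTENSION = {".csv": "csv", ".json": "json", ".parquet": "parquet"}


def lambda_handler(event, context):
    """List the S3 objects named in the event and tag each with a data format."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications arrive URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        extension = "." + key.rsplit(".", 1)[-1].lower() if "." in key else ""
        objects.append(
            {
                "bucket": bucket,
                "key": key,
                "format": FORMAT_BY_EXTENSION.get(extension, "unknown"),
            }
        )
    return {"statusCode": 200, "body": json.dumps(objects)}
```

The handler can be exercised locally by passing a dictionary shaped like an S3 event notification, with `None` as the context.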