
dsti_spark_project2021's Introduction

An exercise with the Olist dataset: extract the deliveries that took more than 10 days.

Functionality

  1. Set up Spark with Scala and run the code from a terminal
  2. Get the desired output
  3. Run a sanity check on the result
  4. Generate a CSV report of the desired output

Usage

Clone this repository to your computer using git clone https://github.com/profbiyi/dsti_spark_project2021.git

Installation and setting up environment

  1. Download Spark here. Put the Spark commands in your PATH. (I opted for Spark 3.1.2)
  2. On Windows, download the Hadoop binary here and save it to C:\Windows\System32. I was unable to save to CSV until this was done
  3. On Windows, download winutils for your version of Hadoop here and save it in HADOOP_HOME/bin

How to download the data

Go to https://www.kaggle.com/olistbr/brazilian-ecommerce and download the zip file.

Then unzip archive.zip:

unzip archive.zip

Running the code

There are two ways to run this code.

Option 1: This option runs the whole code once and generates the necessary output

spark-shell -i  code.scala

Option 2: This option allows running the code interactively through the Spark shell

Open a terminal and change directory to where the unzipped files are. Then run the following command:

spark-shell

Notable assumptions

  1. order_purchase_timestamp is assumed to be local time in São Paulo, Brazil, so it is converted to UTC
  2. order_delivered_customer_date is assumed to be local time in Brazil as well (all customers are assumed to be in Brazil), so it is also converted to UTC
  3. A single fixed offset of UTC-3 is used as the timezone for all the orders.
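
To make the effect of these assumptions concrete, here is a minimal sketch in plain Scala (no Spark required) of the same arithmetic: a local UTC-3 timestamp is shifted to UTC, and the delay is counted in whole calendar days between the two UTC dates. The object name DeliveryDelaySketch, its method names, the "yyyy-MM-dd HH:mm:ss" timestamp format and the sample values are assumptions made for illustration; they are not part of the project code.

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

// Hypothetical helper, for illustration only.
object DeliveryDelaySketch {
  // Assumption: timestamps are formatted like "2017-10-02 10:56:33".
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Assumption 3 above: one fixed UTC-3 offset for every order.
  private val brazilOffset = ZoneOffset.ofHours(-3)

  // Read the string as a UTC-3 local time and express the same instant in UTC.
  def toUtc(local: String): LocalDateTime =
    LocalDateTime
      .parse(local, fmt)
      .atOffset(brazilOffset)
      .withOffsetSameInstant(ZoneOffset.UTC)
      .toLocalDateTime

  // Whole calendar days between the two UTC dates (time of day is ignored).
  def delayInDays(purchased: String, delivered: String): Long =
    ChronoUnit.DAYS.between(toUtc(purchased).toLocalDate, toUtc(delivered).toLocalDate)
}
```

With the illustrative values "2017-10-02 10:56:33" (purchase) and "2017-10-10 21:25:13" (delivery), delayInDays returns 9: the delivery time moves past midnight once shifted to UTC, whereas the unshifted dates are only 8 days apart. The conversion can therefore change which orders cross the 10-day threshold.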

How to extract the deliveries

Run the following code: 

import org.apache.spark.sql.DataFrame

// Returns the delivery delay (in days) of every order delivered more than 10 days after purchase.
def extractDeliveriesMoreThan10Days: DataFrame = {
    // Folder that contains the unzipped Olist CSV files; change it to match your machine.
    val path = "C:/Users/ibnqu/Desktop/spark_scala_stuff/olist"
    val ordersPath = s"$path/olist_orders_dataset.csv"
    val ordersDF = spark.read.option("header", "true").csv(ordersPath)
    ordersDF.createOrReplaceTempView("orders")
    // Both timestamps are shifted from UTC-3 to UTC before the day difference is computed.
    spark.sql("""
      select delivery_delay_in_days from (
        select datediff(to_utc_timestamp(order_delivered_customer_date, 'UTC-3'), to_utc_timestamp(order_purchase_timestamp, 'UTC-3')) as delivery_delay_in_days
        from orders
      ) as delays
      where delivery_delay_in_days > 10
    """)
}

Display the deliveries: 

deliveries of more than 10 days

extractDeliveriesMoreThan10Days.show

Sanity check:

Verify that there is no delivery of 10 days or less. The first command should show an empty table and the second should return true:

extractDeliveriesMoreThan10Days.filter(col("delivery_delay_in_days") <= 10).show
extractDeliveriesMoreThan10Days.filter(col("delivery_delay_in_days") <= 10).count == 0

Download as CSV: 

extractDeliveriesMoreThan10Days.coalesce(1).write.option("header", "true").csv("delivery_delay_in_days.csv")
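
Note that Spark writes the path passed to csv as a directory holding a part-*.csv file (plus marker files such as _SUCCESS), not as a single CSV file. The following is a minimal sketch, in plain Scala, of one way to copy that part file out under a friendlier name. The object name CsvReportSketch, its method names and the target file name are assumptions made for illustration; they are not part of the project code.

```scala
import java.io.File
import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical helper, for illustration only.
object CsvReportSketch {
  // Find the first part-*.csv file inside a Spark output directory, if any.
  // listFiles() returns null when the directory does not exist, hence the Option.
  def findPartFile(outputDir: File): Option[File] =
    Option(outputDir.listFiles())
      .getOrElse(Array.empty[File])
      .find(f => f.getName.startsWith("part-") && f.getName.endsWith(".csv"))

  // Copy that part file to `target`, replacing any existing file.
  // Returns None when no part file was found.
  def copyReport(outputDir: File, target: File): Option[Path] =
    findPartFile(outputDir).map { part =>
      Files.copy(part.toPath, target.toPath, StandardCopyOption.REPLACE_EXISTING)
    }
}
```

For example, CsvReportSketch.copyReport(new File("delivery_delay_in_days.csv"), new File("report.csv")) would copy the single part file produced by coalesce(1) to report.csv. By default the write fails if the output directory already exists; adding .mode("overwrite") before .csv(...) replaces it instead.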

Testing

Author

Ahmed Oladapo, mail: [email protected]
