GithubHelp home page GithubHelp logo

traidat / cloudcon-hive Goto Github PK

View Code? Open in Web Editor NEW

This project forked from markgrover/cloudcon-hive

0.0 0.0 0.0 167.55 MB

Data set and queries that I use in my Hive and Impala presentations. Slides are usually posted at slideshare.net/markgrover

cloudcon-hive's Introduction

cloudcon-hive

This repo contains data set and queries I use in my presentations on SQL-on-Hive (i.e. Impala and hive) at various conferences. This started off as a repo that was use in my presentation at CloudCon in San Francisco, so the name of the repo reflects that but now this repo has morphed into a single repository that contains my dataset for demos and such at various different presentations on Hive and Impala.

Files in the repository

Datasets

There are 2 datasets in the repo.

a) The first dataset contains on-time flight performance data from 2008, originally released by Research and Innovative Technology Administration (RITA). The source of this dataset is http://stat-computing.org/dataexpo/2009/the-data.html. The dataset

b) The second dataset contains listing of various airport codes in continental US, Puerto Rico and US Virgin Islands. The source of this dataset is http://www.world-airport-codes.com/ The data was scraped from this website and then cleansed to be in its present CSV form.

Setup

You will need a box with Hadoop and Hive installed. Easiest way to get it to install one of the Demo VMs or use packages available from Apache Bigtop. Cloudera Demo VMs are available from Cloudera's website. You can learn more about Apache Bigtop and install integration test Apache Hadoop and Hive by going to the project's main page and the project's wiki.

  • Git clone this repo, untar dataset and launch hive:

git clone git://github.com/markgrover/cloudcon-hive.git
tar -xzvf cloudcon-hive/2008.tar.gz
hive

Hive commands

You can create tables for the datasets in Hive. These tables can directly be used in Impala, since Hive and Impala share metadata. Of course, you can create these tables directly in Impala itself too, in case you don't want to use Hive. We will only show you commands for creating tables in Hive only though.

  • On hive shell: Create hive table, flight_data:

CREATE TABLE flight_data(
   year INT,
   month INT,
   day INT,
   day_of_week INT,
   dep_time INT,
   crs_dep_time INT,
   arr_time INT,
   crs_arr_time INT,
   unique_carrier STRING,
   flight_num INT,
   tail_num STRING,
   actual_elapsed_time INT,
   crs_elapsed_time INT,
   air_time INT,
   arr_delay INT,
   dep_delay INT,
   origin STRING,
   dest STRING,
   distance INT,
   taxi_in INT,
   taxi_out INT,
   cancelled INT,
   cancellation_code STRING,
   diverted INT,
   carrier_delay STRING,
   weather_delay STRING,
   nas_delay STRING,
   security_delay STRING,
   late_aircraft_delay STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

  • Load the data into the table:

LOAD DATA LOCAL INPATH '2008.csv' OVERWRITE INTO TABLE flight_data;

  • Ensure the table got created and loaded fine:

SHOW TABLES;
SELECT
   *
FROM
   flight_data
LIMIT 10; 

  • Query the table. Find average arrival delay for all flights departing SFO in January:

SELECT
   avg(arr_delay)
FROM
   flight_data
WHERE
   month=1
   AND origin='SFO';

  • On hive shell: create the airports table

CREATE TABLE airports(
   name STRING,
   country STRING,
   area_code INT,
   code STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

  • Load data into airports table:

LOAD DATA LOCAL INPATH 'cloudcon-hive/airports.csv' OVERWRITE INTO TABLE airports;

  • On hive shell, list some rows from the airports table:

SELECT
   *
FROM
   airports
LIMIT 10

  • On hive shell: run a join query to find the average delay in January 2008 for each airport and to print out the airport's name:

SELECT
   name,
   AVG(arr_delay)
FROM
   flight_data f
   INNER JOIN airports a
   ON (f.origin=a.code)
WHERE
   month=1
GROUP BY
   name;

Impala commands

  • Now, we will run the same queries on Impala to compare its performance with Hive. No need to re-create these tables in Impala since Hive and Impala share the same metastore. You can directly access them in Impala.

bash> impala-shell
impala> connect localhost;
impala> show tables;

  • On Impala shell, query the table. Find average arrival delay for all flights departing SFO in January:

SELECT
   avg(arr_delay)
FROM
   flight_data
WHERE
   month=1
   AND origin='SFO';

  • On Impala shell: run a join query to find the average delay in January 2008 for each airport and to print out the airport's name:

SELECT
   name,
   AVG(arr_delay)
FROM
   flight_data f
   INNER JOIN airports a
   ON (f.origin=a.code)
WHERE
   month=1
GROUP BY
   name;

cloudcon-hive's People

Contributors

markgrover avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.