kudeh / udacity-dend-capstone

Udacity Data Engineering Nanodegree Capstone Project

Introduction

For my capstone project, I developed a data pipeline that builds an analytics database for querying information about immigration into the U.S. on a monthly basis. The analytics tables are hosted in an Amazon Redshift database, and the pipeline is implemented with Apache Airflow.

See the notebook for more details and the project write-up.

Datasets

The following datasets were used to create the analytics database:

  • I94 Immigration Data: This data comes from the US National Travel and Tourism Office, found here. Each report contains international visitor arrival statistics by world region and select countries (including the top 20), type of visa, mode of transportation, age group, states visited (first intended address only), and the top ports of entry (for select countries).
  • World Temperature Data: This dataset comes from Kaggle, found here.
  • U.S. City Demographic Data: This dataset contains information about the demographics of all US cities and census-designated places with a population greater than or equal to 65,000. The dataset comes from OpenDataSoft, found here.
  • Airport Code Table: This is a simple table of airport codes and corresponding cities. An airport code may be either an IATA airport code, a three-letter code used in passenger reservation, ticketing, and baggage-handling systems, or an ICAO airport code, a four-letter code used by ATC systems and for airports that do not have an IATA airport code (from Wikipedia). It comes from here.

Data Model

The data model consists of the following tables: immigration, us_cities_demographics, airport_codes, world_temperature, i94cit_res, i94port, i94mode, i94addr, and i94visa.
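
One way to sanity-check that each of the tables listed above was populated is to run a row count against it. The sketch below only generates the SQL strings; the helper name row_count_queries is hypothetical, and a check like this is an assumption, not something the project is confirmed to do.

```python
# Table names taken from the data model above.
TABLES = [
    "immigration", "us_cities_demographics", "airport_codes", "world_temperature",
    "i94cit_res", "i94port", "i94mode", "i94addr", "i94visa",
]


def row_count_queries(tables=TABLES):
    """Return one SELECT COUNT(*) statement per table, keyed by table name.

    Hypothetical helper for illustration; it builds strings only and
    does not connect to Redshift.
    """
    return {name: f"SELECT COUNT(*) FROM {name};" for name in tables}


for name, sql in row_count_queries().items():
    print(f"{name}: {sql}")
```

Each generated statement could then be run through whichever database client the project uses; a count of zero would suggest the corresponding load step did not run.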

Data Pipeline

Setup

  1. Python 3 and Java 8 are required
  2. Create a virtual environment and install dependencies
    $ python3 -m venv venv
    $ source venv/bin/activate
    (venv) $ pip install -r requirements.txt
    (venv) $ ipython kernel install --user --name=projectname
  3. Set the Java version to Java 8 if it is not the default (the /usr/libexec/java_home helper is macOS-only)
    (venv) $ export JAVA_HOME=`/usr/libexec/java_home -v 1.8`
  4. Create the AWS config
    • Create a file named dwh.cfg
    • Add the following contents and fill in the fields
    [CLUSTER]
    HOST=
    DB_NAME=
    DB_USER=
    DB_PASSWORD=
    DB_PORT=
    ARN=
    
    [S3]
    BUCKET=
  5. Initialize Airflow and run the webserver
    (venv) $ export AIRFLOW_HOME=$(pwd)
    (venv) $ airflow initdb
    (venv) $ airflow webserver -p 8080
  6. Run the scheduler (in a new terminal tab)
    (venv) $ export AIRFLOW_HOME=$(pwd)
    (venv) $ airflow scheduler
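
The dwh.cfg file from step 4 uses INI syntax, so it can be read with Python's standard configparser module. The following is a minimal sketch of how the file might be loaded, not the repository's actual code; the helper name load_dwh_config is hypothetical.

```python
import configparser


def load_dwh_config(path="dwh.cfg"):
    """Read dwh.cfg and return the CLUSTER and S3 sections as plain dicts.

    Hypothetical helper for illustration only.
    """
    # interpolation=None keeps a '%' in a password from being treated
    # as an interpolation marker.
    parser = configparser.ConfigParser(interpolation=None)
    # Preserve key case (HOST, DB_NAME, ...); keys are lower-cased by default.
    parser.optionxform = str
    # read() returns the list of files it parsed; empty means the file is missing.
    if not parser.read(path):
        raise FileNotFoundError(f"config file not found: {path}")
    return {section: dict(parser[section]) for section in ("CLUSTER", "S3")}
```

A caller would then access values such as load_dwh_config()["CLUSTER"]["HOST"]. All values come back as strings, so DB_PORT would need an explicit int() conversion where a number is expected.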

Usage

  1. Create the tables:
    (venv) $ python create_tables.py
  2. Access the Airflow UI at localhost:8080
  3. Create Airflow connections:
    • AWS connection
    • Redshift connection
  4. Run etl_dag in the Airflow UI
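
As an alternative to entering the Redshift connection in the UI in step 3, Airflow can also read a connection from an environment variable named AIRFLOW_CONN_<CONN_ID> whose value is a connection URI. The sketch below builds such a URI. The function name redshift_conn_uri, the connection id "redshift", and the sample values are assumptions; the id must match whatever etl_dag actually expects.

```python
from urllib.parse import quote


def redshift_conn_uri(host, port, db_name, user, password):
    """Build a postgres:// URI that Airflow can parse as a connection.

    Hypothetical helper for illustration only.
    """
    # Percent-encode credentials so characters such as '@' or '/'
    # cannot be mistaken for URI delimiters.
    return (
        f"postgres://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(db_name, safe='')}"
    )


# Placeholder values; real ones would come from the [CLUSTER] section of dwh.cfg.
uri = redshift_conn_uri("example-host", 5439, "dev", "awsuser", "p@ss/word")
# "REDSHIFT" assumes a connection id of "redshift".
print(f"export AIRFLOW_CONN_REDSHIFT='{uri}'")
```

The printed export line would be run in the same shell sessions that start the webserver and scheduler, so that both processes see the connection.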
