kr900910 / mortgage_data_analysis

ETL process which downloads, transforms, and loads Freddie Mac/Fannie Mae mortgage data

Shell 47.32% Python 52.68%
etl-pipeline mortgage-data-analysis hive tableau python shell-script fannie-mae fraddie-mac

Mortgage Data Analysis

Initial Setup

  1. Register for Fannie Mae: https://loanperformancedata.fanniemae.com/lppub/index.html#.
  2. Register for Freddie Mac: https://freddiemac.embs.com/FLoan/Bin/loginrequest.php.
  3. Clone the mortgage-data-analysis repository onto an EC2 instance (git clone https://github.com/kr900910/mortgage-data-analysis.git).
  4. Create a temp_download directory inside mortgage-data-analysis (mkdir temp_download).

Download the data

  1. Go to mortgage-data-analysis/loading_and_modeling and run pip install requests==2.5.3.
  2. Type python download_freddie_mac.py. Enter credentials and quarters to download when prompted. This downloads zip files into the current folder for each quarter.
  3. Type python download_fannie_mae.py. Enter credentials and quarters to download when prompted. This downloads zip files into the current folder for each quarter.
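
Both download scripts ask for the quarters to fetch. How they parse that input is not documented here, so the following is only a minimal sketch of one way to turn a first and last quarter into the full list to download. The label format (2012Q3) and the function names parse_quarter and expand_quarters are assumptions for illustration, not names taken from the repository, and the sketch targets Python 3.

```python
import re


def parse_quarter(text):
    """Parse a label such as '2012Q3' into a (year, quarter) tuple."""
    # Hypothetical label format; the real scripts may prompt differently.
    match = re.fullmatch(r"\s*(\d{4})\s*Q([1-4])\s*", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("expected a quarter like 2012Q3, got %r" % text)
    return int(match.group(1)), int(match.group(2))


def expand_quarters(first, last):
    """Return every quarter label from first to last, inclusive."""
    year, quarter = parse_quarter(first)
    end = parse_quarter(last)
    if (year, quarter) > end:
        raise ValueError("first quarter must not come after last quarter")
    labels = []
    while (year, quarter) <= end:
        labels.append("%dQ%d" % (year, quarter))
        quarter += 1
        if quarter > 4:
            # Roll over into the first quarter of the next year.
            year, quarter = year + 1, 1
    return labels


print(expand_quarters("2012Q3", "2013Q2"))
```

Running the sketch prints ['2012Q3', '2012Q4', '2013Q1', '2013Q2']. A download loop would then request one set of files per label.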

Move the data into HDFS directory

  1. Start Hadoop, postgres, and Hive on the EC2 instance.
  2. If this is your first time, type . create_hdfs_dir.sh. This creates the necessary HDFS folders.
  3. Type . unzip_to_HDFS.sh. This unzips the zipped files into mortgage-data-analysis/temp_download, removes the zipped files, loads the unzipped files into HDFS, and removes the unzipped files. Note that this step can take 15-30 minutes, depending on the number of quarters being loaded.
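
The repository does this step with a shell script. Purely as an illustration of the same sequence (unzip, delete the archive, upload to HDFS, delete the local copy), here is a rough Python 3 sketch. The function names and any HDFS target directory you pass in are hypothetical; only the hdfs dfs -put command itself is the standard Hadoop CLI.

```python
import os
import subprocess
import zipfile


def extract_and_remove(zip_path, dest_dir):
    """Unzip zip_path into dest_dir, delete the archive, return extracted file paths."""
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # Skip directory entries so only real files are returned.
        names = [n for n in archive.namelist() if not n.endswith("/")]
        archive.extractall(dest_dir)
    os.remove(zip_path)
    return [os.path.join(dest_dir, n) for n in names]


def hdfs_put_command(local_path, hdfs_dir):
    """Build the `hdfs dfs -put` argument list for one local file."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_dir]


def load_zip_to_hdfs(zip_path, temp_dir, hdfs_dir):
    """Unzip, upload each file to HDFS, then delete the local copy."""
    for path in extract_and_remove(zip_path, temp_dir):
        # check_call raises CalledProcessError if the upload fails,
        # so the local file is kept when HDFS rejects it.
        subprocess.check_call(hdfs_put_command(path, hdfs_dir))
        os.remove(path)
```

Deleting each local file only after its upload succeeds keeps disk usage low in temp_download without risking data loss on a failed put.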

Create Hive tables

  1. Go to mortgage-data-analysis/transforming and type . create_hive_tables.sh. This creates Hive metadata for the base Fannie and Freddie data in HDFS and for the combined data sets. Note that this script can take several hours to run, depending on how many quarters of data are loaded (for 15 quarters, acquisition data took about 10 minutes and performance data took about 2 hours).
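
The actual table definitions live in create_hive_tables.sh and are not reproduced here. To give a feel for what "creating Hive metadata" over files already in HDFS can look like, the sketch below generates a CREATE EXTERNAL TABLE statement. It assumes external tables over pipe-delimited text files; the table name, column names, and HDFS location in the example call are invented for illustration.

```python
def external_table_ddl(table, columns, hdfs_location, delimiter="|"):
    """Return a Hive CREATE EXTERNAL TABLE statement for delimited text files.

    columns is a list of (name, hive_type) pairs, e.g. ("loan_id", "STRING").
    """
    column_sql = ",\n".join(
        "  `%s` %s" % (name, hive_type) for name, hive_type in columns
    )
    return (
        "CREATE EXTERNAL TABLE IF NOT EXISTS %s (\n%s\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '%s'\n"
        "STORED AS TEXTFILE\n"
        "LOCATION '%s';" % (table, column_sql, delimiter, hdfs_location)
    )


# Hypothetical table, columns, and path, shown only to illustrate the output.
print(
    external_table_ddl(
        "demo_acquisition",
        [("loan_id", "STRING"), ("orig_rate", "DOUBLE")],
        "/user/demo/acquisition",
    )
)
```

An external table only records schema and location, so defining it is fast; the long run times noted above would come from whatever queries build the combined data sets.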

Use Tableau to visualize data

  1. Once Hive tables are created, start HiveServer2 by typing hive --service hiveserver2 &.
  2. Set up an ODBC connection to the server in Tableau and visualize the data as necessary. A sample Tableau workbook and a CSV file extracted from one of the Hive tables are available in the mortgage-data-analysis/serving folder.
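
Before configuring the ODBC connection, it can help to confirm that HiveServer2 is actually accepting connections. The sketch below is a generic TCP reachability check in Python 3, not part of the repository. HiveServer2 listens on port 10000 by default, but that is an assumption about your setup if hive.server2.thrift.port has been changed, and the EC2 security group must also allow the port.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

For example, port_is_open("localhost", 10000) run on the EC2 instance shortly after starting HiveServer2 indicates whether the service has finished starting up. A True result only shows that something is listening on the port, not that Hive queries will succeed.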
