vikin91 / data-engineering-capstone-2020

Project for submission in the Data Engineering Nanodegree (please excuse the abundance of typos and low-quality Python code)

SAS 16.87% Jupyter Notebook 72.89% Python 9.87% Shell 0.36%


Capstone Project - Udacity Data Engineering

Evaluation criteria

  1. Project code is clean and modular:
    1. All coding scripts have an intuitive, easy-to-follow structure with code separated into logical functions.
    2. Naming for variables and functions follows the PEP8 style guidelines.
    3. The code should run without errors.
  2. Quality Checks:
    1. The project includes at least two data quality checks.
  3. Data Model:
    1. The ETL processes result in the data model outlined in the write-up.
    2. A data dictionary for the final data model is included.
    3. The data model is appropriate for the identified purpose.
  4. Datasets - project includes:
    1. At least 2 data sources
    2. More than 1 million lines of data.
    3. At least two data formats (e.g. CSV, API, JSON)
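
As an illustration of the "at least two data quality checks" criterion, here is a minimal sketch of two common checks: a table must not be empty, and a key column must contain no nulls or duplicates. It is not taken from the project's notebook. The function names `check_not_empty` and `check_unique_non_null` are hypothetical, and rows are modelled as plain Python dictionaries rather than Spark DataFrames to keep the example self-contained.

```python
def check_not_empty(rows, table_name):
    """Raise ValueError if the table has no rows; otherwise return the row count."""
    if len(rows) == 0:
        raise ValueError(f"Data quality check failed: {table_name} is empty")
    return len(rows)


def check_unique_non_null(rows, key, table_name):
    """Raise ValueError if `key` is missing/None in any row or repeated across rows."""
    values = [row.get(key) for row in rows]
    if any(value is None for value in values):
        raise ValueError(f"Data quality check failed: {table_name}.{key} contains nulls")
    if len(set(values)) != len(values):
        raise ValueError(f"Data quality check failed: {table_name}.{key} contains duplicates")
    return len(values)
```

With Spark, the same checks would typically be expressed with `DataFrame.count()` and a filter on the key column; the logic stays the same.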

Preparation for running

Some datasets are not included in this project due to their size. Before running this project, you need to download the datasets!

Ensure the following files exist at the right locations:

  • ./data/GlobalLandTemperaturesByCity.csv from the World Temperature dataset
  • ./data/i94_apr16_sub.sas7bdat from the I94 Immigration Data of the US National Tourism and Trade Office
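
A quick way to confirm that both files are in place before starting the container is a small check like the sketch below. The helper name `find_missing_files` is hypothetical and not part of the project. The two paths come from the list above and are assumed to be relative to the project root.

```python
from pathlib import Path

# Dataset paths from the list above, relative to the project root.
REQUIRED_FILES = [
    "data/GlobalLandTemperaturesByCity.csv",
    "data/i94_apr16_sub.sas7bdat",
]


def find_missing_files(root=".", required=REQUIRED_FILES):
    """Return the required paths that do not exist as regular files under `root`."""
    base = Path(root)
    return [rel for rel in required if not (base / rel).is_file()]
```

Calling `find_missing_files()` from the project root returns an empty list when both datasets are present, and otherwise the paths that still need to be downloaded.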

Running

To start a Docker container with Jupyter Notebook and Spark, run:

./run_docker.sh

Next, open the Jupyter Notebook in the browser (providing the correct access token):

http://127.0.0.1:8888/notebooks/work/Immigration.ipynb?token=TOKEN

Resources

It is recommended to assign at least 8 GB of memory and at least 4 CPU cores to Docker.
