GithubHelp home page GithubHelp logo

rifatrkf / data-engineering-practice Goto Github PK

View Code? Open in Web Editor NEW

This project forked from danielbeach/data-engineering-practice

0.0 0.0 0.0 19 KB

Data Engineering Practice Problems

Dockerfile 27.64% Python 72.36%

data-engineering-practice's Introduction

Data Engineering Practice Problems

One of the main obstacles of Data Engineering is the large and varied technical skills that can be required on a day-to-day basis.

This aim of this repository is to help you develop and learn those skills. Generally, here are the high level topics that these practice problems will cover.

  • Python data processing.
  • csv, flat-file, parquet, json, etc.
  • SQL database table design.
  • Python + Postgres, data ingestion and retrieval.
  • PySpark
  • Data cleansing / dirty data.

How to work on the problems.

You will need two things to work effectively on most all of these problems.

  • Docker
  • docker-compose

All the tools and technologies you need will be packaged into the dockerfile for each exercise.

For each exercise you will need to cd into that folder and run the docker build command, that command will be listed in the README for each exercise, follow those instructions.

Beginner Exercises

Exercise 1 - Downloading files.

The first exercise tests your ability to download a number of files from an HTTP source and unzip them, storing them locally with Python. cd Exercises/Exercise-1 and see README in that location for instructions.

Exercise 2 - Web Scraping + Downloading + Pandas

The second exercise tests your ability perform web scraping, build uris, download files, and use Pandas to do some simple cumulative actions. cd Exercises/Exercise-2 and see README in that location for instructions.

Exercise 3 - Boto3 AWS + s3 + Python.

The third exercise tests a few skills. This time we will be using a popular aws package called boto3 to try to perform a multi-step actions to download some open source s3 data files. cd Exercises/Exercise-3 and see README in that location for instructions.

Exercise 4 - Convert JSON to CSV + Ragged Directories.

The fourth exercise focuses more file types json and csv, and working with them in Python. You will have to traverse a ragged directory structure, finding any json files and converting them to csv.

Exercise 5 - Data Modeling for Postgres + Python.

The fifth exercise is going to be a little different than the rest. In this problem you will be given a number of csv files. You must create a data model / schema to hold these data sets, including indexes, then create all the tables inside Postgres by connecting to the database with Python.

Intermediate Exercises

Coming Soon!

data-engineering-practice's People

Contributors

danielbeach avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.