
tdd-dbt's Introduction

Overview

This demo aims to provide a set of examples and frameworks for doing test-driven data engineering (TDDE).

Prerequisites

To run this demo, the following applications must be installed.

| Tool | Mac Demo Version | Windows Version |
|------|------------------|-----------------|
| Python | Python 3.8.2 | |
| Docker | Docker Desktop 3.3.3 with Docker 20.10.6 (only Docker itself is really needed) | |
| dbt | dbt 0.20.1_1 | |
| dbt-sqlserver plugin | 0.19.0 | |
| dbt-snowflake plugin | 0.20.1 | |

NOTE: dbt-sqlserver 0.20.0 did not install properly, so it was downgraded to 0.19.0.

Helpful References

Below are some references that were used to learn more about the activities in this repo. Reading them yourself is recommended as you start or continue your dbt journey.

Mac Setup

Docker Setup

Pull the Docker Image

Pull the docker image you'd like to use. The MS SQL Server dockerhub page is a good resource.

COMMAND:

docker pull mcr.microsoft.com/mssql/server:2019-latest

Run the Docker Image

Run the docker image to start a container that has MS SQL Server running within it.

COMMAND:

docker run -d --name sql_server -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=ITsC0mpl1cat3d' -p 1433:1433 mcr.microsoft.com/mssql/server:2019-latest

Docker Verification

There are multiple ways to verify that the container is running SQL Server. Two of them are:

  1. Open a SQL prompt inside the container (via docker tools)
  2. Connect to the SQL Server instance from your machine (as with any normal connection)

Verification Using Docker Tools

Using docker tools, you can open a command prompt inside the SQL Server docker container. The container has the mssql-tools pre-installed.

The command below will open a sqlcmd prompt inside the docker container, using the username and password provided.

docker exec -it sql_server /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P ITsC0mpl1cat3d

Verify you have a running instance with a simple select query (don't forget the go):

1> select @@version
2> go

The response should look similar to the response below:

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Microsoft SQL Server 2019 (RTM-CU12) (KB5004524) - 15.0.4153.1 (X64)
	Jul 19 2021 15:37:34
	Copyright (C) 2019 Microsoft Corporation
	Developer Edition (64-bit) on Linux (Ubuntu 20.04.2 LTS) <X64>

(1 rows affected)
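
For scripted checks, the same query can be run non-interactively. The sketch below is a minimal Python wrapper, assuming the container name, tools path, and credentials from the command above. The function names are illustrative and not part of this repo. It relies on the sqlcmd -Q option, which runs a query and then exits.

```python
import subprocess


def version_check_command(container="sql_server", user="sa", password="ITsC0mpl1cat3d"):
    """Build the docker exec command that runs `select @@version` and exits."""
    return [
        "docker", "exec", container,
        "/opt/mssql-tools/bin/sqlcmd",
        "-S", "localhost",
        "-U", user,
        "-P", password,
        "-Q", "select @@version",  # -Q: run the query, then exit
    ]


def server_is_up(**kwargs):
    """Return True if the query succeeds and the output names SQL Server."""
    try:
        result = subprocess.run(
            version_check_command(**kwargs),
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        # docker is not installed, or the call did not return in time
        return False
    return result.returncode == 0 and "Microsoft SQL Server" in result.stdout
```

Calling server_is_up() right after starting the container may return False for a short while, since the server needs some time before it accepts connections.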

Verification with sqlcmd on Your Machine

Setting up your Mac to use sqlcmd requires a couple of additional steps:

  1. Install the SQL Server drivers and sql tools
  2. Connect to SQL Server using the ODBC connection

First, install the MSSQL ODBC Drivers AND MSSQL tools using the code in this reference (code below).

brew tap microsoft/mssql-release https://github.com/Microsoft/homebrew-mssql-release
brew update
HOMEBREW_NO_ENV_FILTERING=1 ACCEPT_EULA=Y brew install msodbcsql17 mssql-tools

Next, verify that you can connect to the SQL Server Docker instance and run select @@version. The response should match the one from the docker command verification above:

% sqlcmd -S localhost -U sa -P ITsC0mpl1cat3d
1> select @@version
2> go

NOTE: There are alternate ways to connect using the ODBC driver. This may help if you need other means.

  • Connect using localhost + port: sqlcmd -S localhost,1433 -U sa -P ITsC0mpl1cat3d
  • Connect using localhost IP + port: sqlcmd -S 127.0.0.1,1433 -U sa -P ITsC0mpl1cat3d
  • Connect using configured IP + port: sqlcmd -S 0.0.0.0,1433 -U sa -P ITsC0mpl1cat3d

Stopping/Starting SQL Server

If you want to take the SQL Server container down, just run the stop command:

docker stop sql_server

Next time you need the container:

docker start sql_server

tdd-dbt's People

Contributors

donaldsawyer, mwallacemn


tdd-dbt's Issues

TDDE: Transformations in DBT

Develop the transformation to select data from ontime_data, join it to carrier_code, and perform projection. New table is ontime_carrier.

Transformations:

  1. Join on carrier column
  2. description renamed to carrier_desc
  3. Calculate arrived_flag
    1. Y when neither cancelled nor diverted, and arr_delay is not NULL
    2. N when cancelled or diverted
    3. NULL when neither cancelled nor diverted, but arr_delay is NULL
  4. Columns: year, month, day_of_month, carrier_code, carrier_desc, flight_number, origin, destination, arrived_flag
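
The arrived_flag rules can be sketched as a plain-Python reference function, which may be handy for deriving expected values in tests. This is not the dbt model itself. It assumes cancelled and diverted arrive as booleans or 0/1 values and that a SQL NULL is represented as None; the source data's actual encoding is not stated here.

```python
def arrived_flag(cancelled, diverted, arr_delay):
    """Return 'Y', 'N', or None (SQL NULL) for one flight row.

    Assumes cancelled/diverted are truthy when set (True or 1)
    and that a NULL arr_delay is passed as None.
    """
    if cancelled or diverted:
        return "N"
    if arr_delay is None:
        # Neither cancelled nor diverted, but no arrival delay recorded
        return None
    return "Y"
```

Checking cancelled and diverted first means a cancelled flight with a NULL arr_delay is reported as N rather than NULL, which matches the ordering of the rules above.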

Acceptance Criteria:

  1. Tests written in DBT via tests and/or data constraints in model for ontime_carrier
    1. Cancelled
    2. Diverted
    3. Not cancelled/diverted, but arr_delay is NULL. Feature reference
    4. JOIN on missing carrier should result in UNKNOWN for carrier_desc
    5. There should only be one row per unique flight (think: what happens when there are multiple rows?)
    6. Only the listed columns exist, in the listed order
  2. Model for ontime_carrier written
  3. Run on full dataset
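
The missing-carrier criterion amounts to a left join with a default value. The sketch below illustrates the expected behavior in plain Python rather than in a dbt model. It assumes each flight row is a dict with a carrier key and that the carrier_code table has been loaded into a dict mapping code to description; both shapes are illustrative.

```python
def add_carrier_desc(flights, carrier_descriptions):
    """Left join: every flight row is kept; unmatched carriers get 'UNKNOWN'."""
    joined = []
    for row in flights:
        out = dict(row)  # copy so the input rows are not modified
        out["carrier_desc"] = carrier_descriptions.get(row["carrier"], "UNKNOWN")
        joined.append(out)
    return joined
```

A test for this criterion would feed in one flight whose carrier is absent from the lookup and assert that the row survives the join with carrier_desc set to UNKNOWN.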

Docker Setup

The demo needs the ability to launch and set up the SQL Server docker container.

Prerequisites:

  • Docker is installed on the local machine
  • Python 3.8+ should be installed
  • MS SQL Tools are installed (sqlcmd)

Acceptance Criteria:

  1. Script created to launch docker container
  2. Script can take in database name (e.g.: -v db_name=dev, -v db_name=prod)
  3. Databases and tables are created (ontime_data, carrier_code)
  4. Can connect to docker with the following command: sqlcmd -S 0.0.0.0,1433 -U sa -P ITsC0mpl1cat3d
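
One possible shape for such a script is sketched below. It assumes the sql_server container already exists from an earlier docker run, and that a hypothetical create_tables.sql file references the database name as the sqlcmd scripting variable $(db_name). The --db-name and --sql-file options are illustrative names, not part of this repo.

```python
import argparse
import subprocess


def parse_args(argv=None):
    """Parse launcher options; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Start SQL Server and create the demo tables.")
    parser.add_argument("--db-name", default="dev", help="database to create, e.g. dev or prod")
    parser.add_argument("--sql-file", default="create_tables.sql",
                        help="hypothetical SQL script that uses $(db_name)")
    return parser.parse_args(argv)


def setup_command(db_name, sql_file):
    """Build the sqlcmd call that passes db_name as a scripting variable."""
    return [
        "sqlcmd", "-S", "0.0.0.0,1433", "-U", "sa", "-P", "ITsC0mpl1cat3d",
        "-v", f"db_name={db_name}",  # -v: define a scripting variable
        "-i", sql_file,              # -i: run the statements in this file
    ]


def main(argv=None):
    args = parse_args(argv)
    # Assumes the container was created earlier with `docker run --name sql_server ...`
    subprocess.run(["docker", "start", "sql_server"], check=True)
    subprocess.run(setup_command(args.db_name, args.sql_file), check=True)
```

Calling main(["--db-name", "prod"]) would start the container and run the script against a prod database. The setup step may need a retry, since SQL Server takes some time after the container starts before it accepts connections.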

Nice to Have:

  • Docker is launched upon pytest being started

TDDE: Transformations in SQL Stored Proc

Develop the transformation using a stored procedure to select data from ontime_data, join it to carrier_code, and perform projection. New table is ontime_carrier.

Transformations:

  1. Join on carrier column
  2. description renamed to carrier_desc
  3. Calculate arrived_flag
    1. Y when neither cancelled nor diverted, and arr_delay is not NULL
    2. N when cancelled or diverted
    3. NULL when neither cancelled nor diverted, but arr_delay is NULL
  4. Columns: year, month, day_of_month, carrier_code, carrier_desc, flight_number, origin, destination, arrived_flag

Acceptance Criteria:

  1. Tests written in python as part of TDDE framework
    1. Cancelled
    2. Diverted
    3. Not cancelled/diverted, but arr_delay is NULL. Feature reference
    4. JOIN on missing carrier should result in UNKNOWN for carrier_desc
    5. There should only be one row per unique flight (think: what happens when there are multiple rows?)
    6. Only the listed columns exist, in the listed order
  2. Stored Procedure Created
  3. Smoke test run on full dataset
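
The uniqueness and column-order criteria can be checked with small helpers that a Python test could call on the rows read back from ontime_carrier. This is a sketch only. In particular, the choice of columns that identify a "unique flight" (FLIGHT_KEY below) is an assumption, since the issue does not define it, and rows are assumed to be dicts keyed by column name.

```python
from collections import Counter

EXPECTED_COLUMNS = [
    "year", "month", "day_of_month", "carrier_code", "carrier_desc",
    "flight_number", "origin", "destination", "arrived_flag",
]

# Assumption: these columns together identify one flight.
FLIGHT_KEY = ("year", "month", "day_of_month", "carrier_code",
              "flight_number", "origin", "destination")


def duplicate_flights(rows):
    """Return the flight keys that appear on more than one row."""
    counts = Counter(tuple(row[col] for col in FLIGHT_KEY) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def columns_match(actual_columns):
    """True only if the columns are exactly the expected ones, in order."""
    return list(actual_columns) == EXPECTED_COLUMNS
```

A test would then assert that duplicate_flights(rows) is empty and that columns_match(...) holds for the column names reported by the database cursor.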
