
ETL Pipeline

This repository is responsible for the Extract, Transform, Load (ETL) pipeline for the Energy Performance of Buildings Register (EPBR) data migration. Terraform is used to provision and manage infrastructure in the Amazon Web Services (AWS) cloud. The AWS services used include, but are not limited to, Lambda, SNS and SQS. The pipeline is designed to accommodate any database (DB) system, such as Oracle or PostgreSQL.

Table of Contents

  • Overview
  • Setup
  • Usage
  • Test
  • Documentation

Overview

Below is a diagram that illustrates the flow and process of the ETL pipeline.

[Diagram: ETL Architecture]

Setup

The ETL pipeline is currently set up to work with Oracle DB, so for the purpose of this guide we will be using the Oracle DB system. Follow the OS-specific guide from the links below.

Operating System

Usage

Configuration

Each stage of the ETL pipeline requires certain configuration within the SQS message body of the event, where configuration is a property defined in the body. The stages run sequentially.
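For illustration, a sketch of the overall shape of a message body — each stage reads its own key from configuration, and the extract stage adds a sibling data property. Showing all stage keys at once is an assumption for illustration; the sections below show each stage's actual contents.

{
  "configuration": {
    "trigger": {},
    "extract": {},
    "transform": {},
    "load": {}
  },
  "data": {}
}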

Assume we have the following data in an Oracle DB with a table employees:

FIRST_NAME  SURNAME  DATE_OF_BIRTH               EMPLOYEE_ID  COMPANY_ID
John        Smith    1980-11-01 00:00:00.000000  EMPL123456   10

We have another table emails with the following data:

EMPLOYEE_ID  EMAIL
EMPL123456   [email protected]
EMPL123456   [email protected]

Trigger

{
  "trigger": {
    "scan": "SELECT EMPLOYEE_ID FROM employees",
    "extract": {
      "EMPLOYEE": {
        "query": "SELECT * FROM employees WHERE EMPLOYEE_ID = '<%= primary_key %>'",
        "multiple": false
      },
      "EMAIL": {
        "query": "SELECT * FROM emails WHERE EMPLOYEE_ID = '<%= primary_key %>'",
        "multiple": true
      }
    }
  }
}

The configuration for the trigger stage contains information required to scan a database table and produce jobs for the ETL pipeline.

The scan property contains an SQL query that should select only the primary key from a given table. For each row selected from the table, a new ETL job is created in the pipeline. In each new job, the contents of the extract property are placed in the job's extract configuration, replacing <%= primary_key %> with the primary key selected from the table.
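As an illustrative sketch (not the repository's actual code), the trigger stage could build the per-row jobs with Ruby's standard ERB library, assuming each scan row arrives as a one-column hash:

require 'erb'
require 'json'

# For every primary key returned by the scan query, render the extract
# queries with ERB and build the configuration for a new ETL job.
def build_jobs(scan_rows, extract_config)
  scan_rows.map do |row|
    primary_key = row.values.first
    queries = extract_config.transform_values do |entry|
      entry.merge('query' => ERB.new(entry['query']).result_with_hash(primary_key: primary_key))
    end
    { 'configuration' => { 'extract' => { 'queries' => queries } } }
  end
end

jobs = build_jobs(
  [{ 'EMPLOYEE_ID' => 'EMPL123456' }],
  { 'EMPLOYEE' => { 'query' => "SELECT * FROM employees WHERE EMPLOYEE_ID = '<%= primary_key %>'", 'multiple' => false } }
)
puts JSON.pretty_generate(jobs.first)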

Extract

{
  "extract": {
    "queries": {
      "EMPLOYEE": {
        "query": "SELECT * FROM employees WHERE EMPLOYEE_ID = 'EMPL123456'",
        "multiple": false
      },
      "EMAIL": {
        "query": "SELECT * FROM emails WHERE EMPLOYEE_ID = 'EMPL123456'",
        "multiple": true
      }
    }
  }
}

Each query entry, such as EMPLOYEE, contains a query property and a multiple property. The query property contains the SQL query to be run, and the multiple property determines whether the query is expected to return multiple rows, in which case the result is a list. This part of the ETL pipeline extracts the data from the Oracle DB into a data property, which sits at the same level as the configuration property. The EMAIL property will be a list of objects, as shown below.

{
  "configuration": {},
  "data": {
    "EMPLOYEE": {
      "FIRST_NAME": "John",
      "SURNAME": "Smith",
      "DATE_OF_BIRTH": "1980-11-01 00:00:00.000000",
      "EMPLOYEE_ID": "EMPL123456",
      "COMPANY_ID": 10
    },
    "EMAIL": [
      {
        "EMAIL": "[email protected]"
      },
      {
        "EMAIL": "[email protected]"
      }
    ]
  }
}
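A minimal sketch of this behaviour, assuming a connection object that responds to fetch_all(sql) and returns an array of row hashes (the method name is an assumption):

# Run each configured query; 'multiple' decides whether to keep the full
# list of rows or just the first row.
def extract(connection, queries)
  queries.each_with_object({}) do |(name, config), data|
    rows = connection.fetch_all(config['query'])
    data[name] = config['multiple'] ? rows : rows.first
  end
end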

Transform

The transform property contains a rules property, which defines the transformations as a list; each rule has a from path and a to path.

{
  "transform": {
    "rules": [
      {
        "from": ["data", "EMPLOYEE", "FIRST_NAME"],
        "to": ["data","firstName"]
      },
      {
        "from": ["data", "EMPLOYEE", "SURNAME"],
        "to": ["data", "lastName"]
      },
      {
        "from": ["data", "EMPLOYEE", "DATE_OF_BIRTH"],
        "to": ["data", "dateOfBirth"],
        "convert": {
          "type": "date_format",
          "args": ["%Y-%m-%d"]
        }
      },
      {
        "from": ["data", "EMPLOYEE", "COMPANY_ID"],
        "to": ["configuration", "load", "endpoint", "params", "company_id"],
        "convert": {
          "type": "map",
          "args": [{
            "10": 1,
            "20": 2
          }]
        }
      },
      {
        "from": ["data", "EMPLOYEE", "EMPLOYEE_ID"],
        "to": ["configuration", "load", "endpoint", "params", "employee_id"]
      },
      {
        "from": ["data", "EMAIL", "*", "EMAIL"],
        "to": ["data", "emails"]
      }
    ]
  }
}

Some of the rules have a convert property. This makes use of helper methods defined in the helpers file. The type property is the name of the helper method (date_format and map above) and the args property contains the method's arguments. These methods transform the data accordingly; e.g. 1980-11-01 00:00:00.000000 becomes 1980-11-01.
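The real implementations live in the helpers file; purely as an illustration, they could look roughly like this:

require 'date'

# date_format: parse a timestamp string and re-format it, e.g.
# "1980-11-01 00:00:00.000000" with "%Y-%m-%d" becomes "1980-11-01".
def date_format(value, format)
  DateTime.parse(value).strftime(format)
end

# map: look the value up in the supplied mapping, keyed by its string
# form, e.g. 10 with {"10" => 1, "20" => 2} becomes 1.
def map(value, mapping)
  mapping[value.to_s]
end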

The last rule has a wildcard (*) in its from path. The wildcard matches every element of a list, so the rule collects the value at the given key (here EMAIL) from each element into a single array, as sketched below.
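A minimal sketch of how a wildcard from path could be resolved, assuming the document is plain nested hashes and arrays (the function name and sample values are hypothetical):

# Walk a path of keys; '*' fans out over every element of a list.
def dig_path(node, path)
  return node if path.empty?
  key, *rest = path
  key == '*' ? node.map { |element| dig_path(element, rest) } : dig_path(node[key], rest)
end

document = { 'data' => { 'EMAIL' => [{ 'EMAIL' => 'a@example.com' }, { 'EMAIL' => 'b@example.com' }] } }
dig_path(document, ['data', 'EMAIL', '*', 'EMAIL'])
# => ["a@example.com", "b@example.com"]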

An example of the output following the above transformation rules:

{
  "data": {
    "firstName": "John",
    "lastName": "Smith",
    "dateOfBirth": "1980-11-01",
    "emails": ["[email protected]", "[email protected]"]
  },
  "configuration": {
    "load": {
      "endpoint": {
        "params": {
          "company_id": 1,
          "employee_id": "EMPL123456"
        }
      }
    }
  }
}
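Putting the pieces together, a sketch of applying a single rule to the document (wildcard handling omitted; dispatching to the helpers via send is an assumption):

# Read the value at the from path, convert it if a convert property is
# present, then write it at the to path, creating parent hashes as needed.
def apply_rule(document, rule)
  value = rule['from'].reduce(document) { |node, key| node[key] }
  if (convert = rule['convert'])
    value = send(convert['type'], value, *convert['args'])
  end
  *parents, leaf = rule['to']
  target = parents.reduce(document) { |node, key| node[key] ||= {} }
  target[leaf] = value
  document
end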

Load

The above output from the transform stage becomes the input for the load stage. The load configuration is now populated with additional information.

{
  "load": {
    "endpoint": {
      "method": "put",
      "uri": "http://endpoint/api/employees/<%= company_id %>/employees/<%= employee_id %>",
      "params": {
        "company_id": 1,
        "employee_id": "EMPL123456"
      }
    }
  }
}

Using ERB (Embedded Ruby) variables enables the uri value to be generated dynamically. In the above JSON, <%= company_id %> and <%= employee_id %> are ERB variables that are read directly from the params property, which was populated by the transform stage.
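A sketch of rendering the uri with Ruby's standard ERB library, assuming the params keys match the ERB variable names:

require 'erb'

uri_template = 'http://endpoint/api/employees/<%= company_id %>/employees/<%= employee_id %>'
params = { company_id: 1, employee_id: 'EMPL123456' }

# result_with_hash exposes each hash key as a local variable to the template.
uri = ERB.new(uri_template).result_with_hash(params)
# => "http://endpoint/api/employees/1/employees/EMPL123456"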

The API endpoint becomes http://endpoint/api/employees/1/employees/EMPL123456 where the HTTP message body is defined in JSON as:

{
  "firstName": "John",
  "lastName": "Smith",
  "dateOfBirth": "1980-11-01",
  "emails": ["[email protected]", "[email protected]"]
}
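For illustration only, the request could then be issued with Ruby's standard library along these lines (the repository may use a different HTTP client; body abbreviated):

require 'net/http'
require 'json'

uri = URI('http://endpoint/api/employees/1/employees/EMPL123456')
body = { firstName: 'John', lastName: 'Smith', dateOfBirth: '1980-11-01' }

# PUT the transformed data as a JSON message body.
request = Net::HTTP::Put.new(uri, 'Content-Type' => 'application/json')
request.body = body.to_json
response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
puts response.code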

Test

Requirements

  • Docker
  • docker image store/oracle/database-enterprise:12.2.0.1

Use the following commands to run all or a particular type of test:

$ make test # run all acceptance and unit tests

$ make test_integration # run all integration tests (requires docker)

$ make test_e2e # run all end-to-end tests (requires docker)

$ make test_all # run all tests (requires docker)

Documentation

Technical Architecture

The AWS Lambda function located in the lib folder follows the principles of Clean Architecture (CA) and is written in the Ruby programming language.

The outermost layer, consisting of DBs and external interfaces, interacts with the gateway layer.

Adapter  Gateway
Oracle   Database
AWS SQS  Message
Logit    Log

The idea behind this technical decision is to adhere to the open/closed principle: the gateways can accommodate any relevant adapter; for example, a PostgreSQL adapter for the database gateway.
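As a sketch of that idea (class and method names are hypothetical, not the repository's API), the gateway depends only on a small adapter interface, so swapping Oracle for PostgreSQL means supplying a different adapter:

# The gateway delegates to whichever adapter it is constructed with.
class DatabaseGateway
  def initialize(adapter)
    @adapter = adapter
  end

  def fetch_all(sql)
    @adapter.fetch_all(sql)
  end
end

class OracleAdapter
  def fetch_all(sql)
    # execute sql against Oracle and return an array of row hashes
  end
end

class PostgresAdapter
  def fetch_all(sql)
    # same interface, backed by a PostgreSQL driver
  end
end

gateway = DatabaseGateway.new(OracleAdapter.new)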

Use cases:

These use cases depend on the base use case, which checks the incoming SQS message request.
