
cimt-ag / data_vault_pipelinedescription


A concept and syntax to provide a universal data format for storing all essential information needed to implement or generate a data loading process for a data vault model.

Home Page: https://www.cimt-ag.de/leistungen/data-vault-pipeline-description/

License: Apache License 2.0

PLpgSQL 2.46% Python 95.38% Java 2.12% Batchfile 0.04%
data-warehousing datavault20 data-vault

data_vault_pipelinedescription's Introduction

Data Vault Pipeline Description (DVPD)

Concept and reference implementation

(C) Matthias Wegner, cimt ag

Creative Commons License CC BY-ND 4.0

This repository contains the documentation of the "Data Vault Pipeline Description" concept and a reference implementation with multiple test cases and examples.

The concept in "3 words"

The Data Vault Pipeline Description (DVPD) defines a document syntax to describe all metadata that is needed to implement a process which loads one source object into a data vault model.

This provides a standardized interface between all steps of the implementation workflow and allows a decoupling of the tools that are used during design and implementation. As a document, the DVPD also represents an encapsulated, deployable artifact and therefore supports the implementation of automated CI/CD workflows.

The full documentation is in this repository. The best starting point is DVPD_Introduction_and_orientation.md.
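To make the idea concrete, the following minimal sketch shows what a pipeline description could contain for a single source object. The key names and structure are illustrative assumptions, not the official DVPD syntax; the authoritative definition is the core syntax reference in this repository.

```python
import json

# Hypothetical, minimal pipeline description for loading one source object
# into a hub and a satellite. Key names are illustrative, not the official
# DVPD syntax (see the core syntax reference in this repository).
dvpd_sketch = {
    "pipeline_name": "load_customer_from_crm",
    "record_source": "crm.customer_export",
    "fields": [
        {"field_name": "customer_number", "type": "VARCHAR(20)",
         "targets": [{"table": "hub_customer", "column": "customer_number"}]},
        {"field_name": "customer_name", "type": "VARCHAR(200)",
         "targets": [{"table": "sat_customer_crm", "column": "customer_name"}]},
    ],
    "data_vault_model": [
        {"table_name": "hub_customer", "stereotype": "hub",
         "hub_key_column_name": "hk_customer"},
        {"table_name": "sat_customer_crm", "stereotype": "sat",
         "parent_table": "hub_customer"},
    ],
}

print(json.dumps(dvpd_sketch, indent=2))
```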

Motivation

Loading data into a data warehouse is a complex task, even when using the Data Vault methods, which provide a lot of standardization and generalization. Many tools and frameworks try to support the modelling and implementation process.

The functions needed are: specification of the use case, specification and analysis of the source data structure, modelling of the data vault and mapping of the data, implementation of the load process (fetch data from the source, transform it and load it into the data vault model), deployment of the processes, scheduling and execution of the processes, and monitoring of progress. Each of these steps carries deep complexity of its own. A product that supports all of these phases with equally appropriate quality and functional flexibility is nearly impossible to implement.*

So data warehouse platforms often contain a bundle of tools with a mix of commercial products and self-written code. One major function needed in these workflows is the communication of the metadata that is forged during the analysis and modelling steps. This metadata is needed for the implementation and, in the best case, can be used to generate the processing automatically.

DVPD provides a format to solve this problem.

*Such a product needs to cover a high variety of scenarios, but from the perspective of a single project only a small subset is needed. You don't want to pay the price for 300 functions when you only need 10 of them.

What you find in this repository

Concept Documentation

  1. Description of the concept
  2. Reference of the core syntax of DVPD
  3. Analysis of the use case variations to be covered by the syntax:
     a. Data mapping variation taxonomy
     b. Data mapping dependent process generation
     c. Partitioned deletion scenarios

Reference implementation

  1. PostgreSQL tables and views to implement a DVPD compiler
  2. Documentation about the structure and usage of the DVPD views
  3. PostgreSQL tables and views to implement automated testing of the compiler
  4. Test sets
  5. Python scripts to deploy the tables and views automatically

data_vault_pipelinedescription's People

Contributors

albincekaj, jvonhein, mattywausb, velimird


data_vault_pipelinedescription's Issues

add testcase "Hub only"

"This should only load a hub.
Main concern, if a plan will be generated whithout any ""leaf table"""

Add deletion detection properties

"A source may have a complete or partitioned set of a business objects. On this knowledge an internal deletion detection can be implemented. "

  • define syntax
  • load and check syntax with compiler
  • provide derivations by compiler
  • provide test cases for different scenarios
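A minimal sketch of the full-set case, assuming the delivered business keys and the keys currently active in the vault are available as plain Python sets; all names are hypothetical:

```python
from datetime import datetime, timezone

# Hypothetical full-set deletion detection: the source delivers the complete
# set of business keys, so every key that is active in the vault but absent
# from the delivery can be marked as deleted.
def detect_deletions(delivered_keys: set, active_vault_keys: set) -> dict:
    """Return the keys to mark as deleted, plus an audit timestamp."""
    deleted_keys = active_vault_keys - delivered_keys
    return {
        "deleted_keys": sorted(deleted_keys),
        "detected_at": datetime.now(timezone.utc).isoformat(),
    }

# Example: key "C-003" is no longer part of the (complete) delivery.
print(detect_deletions({"C-001", "C-002"}, {"C-001", "C-002", "C-003"}))
```

The partitioned case would work the same way, but restricted to the keys of the delivered partition.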

Check for conflict of mapping when using multiple relations

When a table is targeted by multiple relations, the resulting column structure must be the same for all relations:
  • same number of columns
  • same column names and types

If this is violated, the compiler must log a proper error message and stop.

There is already a stub function in the code:
check_multifield_mapping_consistency_of_column
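A sketch of what such a check could look like, mirroring the intent of the stub; the data shape (columns per relation as name-to-type mappings) and the function signature are assumptions for illustration:

```python
# Hypothetical consistency check over the columns each relation maps into the
# same target table. The data shape is invented; only the intent follows the
# stub check_multifield_mapping_consistency_of_column.
def check_multirelation_mapping_consistency(columns_per_relation: dict) -> list:
    """Return error messages when the relations disagree on the column structure."""
    errors = []
    reference_relation, reference_columns = next(iter(columns_per_relation.items()))
    for relation, columns in columns_per_relation.items():
        if len(columns) != len(reference_columns):
            errors.append(f"relation '{relation}' maps {len(columns)} columns, "
                          f"'{reference_relation}' maps {len(reference_columns)}")
        for name, column_type in columns.items():
            if reference_columns.get(name) != column_type:
                errors.append(f"column '{name}' differs between relations "
                              f"'{relation}' and '{reference_relation}'")
    return errors

# Example: the second relation declares a different type for 'amount'.
print(check_multirelation_mapping_consistency({
    "rel_1": {"order_id": "VARCHAR(20)", "amount": "NUMERIC(10,2)"},
    "rel_2": {"order_id": "VARCHAR(20)", "amount": "INTEGER"},
}))
```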

suppress redundant row hash stage columns

"When the same satellite content is loaded for all field groups, currently the process mapping and staging provide a row hash for every field group. Maybe it is possible zu reduce this to one row hash.
The probability of occurence for this constellation is very low, so dont hurry."
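A sketch of the possible deduplication, assuming the compiler knows which columns each field group maps into the satellite; all names are hypothetical:

```python
# Hypothetical deduplication: if several field groups carry exactly the same
# satellite content, they could share a single row hash stage column.
def shared_row_hash_columns(field_groups: dict) -> dict:
    """Assign a row hash column per field group, reusing one column for identical content."""
    column_for_content = {}
    assignment = {}
    for group_name, mapped_columns in field_groups.items():
        content_key = tuple(mapped_columns)
        column = column_for_content.setdefault(content_key, f"rh_{group_name}")
        assignment[group_name] = column
    return assignment

# fg_2 maps exactly the same content as fg_1 and therefore reuses its row hash column.
print(shared_row_hash_columns({
    "fg_1": ["customer_name", "country_code"],
    "fg_2": ["customer_name", "country_code"],
    "fg_3": ["customer_name"],
}))
```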

Test: modify order in key hash

This test should determine whether the declaration of the field order for a key hash is transported correctly to the DVPI and has an impact on the DVPI summary.

  • Test created
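The sketch below illustrates the order dependence the test should make visible: the same business key values hashed under two different declared field orders yield different key hashes. The concatenation rule (pipe delimiter, MD5) is an assumption for illustration, not necessarily the convention of the reference implementation.

```python
import hashlib

# Illustrative key hash: concatenate the business key values in the declared
# field order and hash the result. Delimiter and hash algorithm are assumptions.
def key_hash(ordered_values: list) -> str:
    return hashlib.md5("|".join(ordered_values).encode("utf-8")).hexdigest()

# The same values in a different declared order produce a different hash,
# which should be reflected in the DVPI and its summary.
print(key_hash(["4711", "DE"]))   # declared order: customer_number, country_code
print(key_hash(["DE", "4711"]))   # declared order: country_code, customer_number
```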

Test: modify priority in diff hash

This test should determine whether the declaration of the field priority for a diff hash is transported correctly to the DVPI and has an impact on the DVPI summary.

  • Test created

add ink_key_explicit_content_order[]

  • Challenge: How to explicitly declare the recursive content? Allow the usage of field names, indicated by prefixing them with "!" (maybe force the use of field names in the case of recursive parents).

Use position in fields array as definition of column position

Currently the position of a field in the incoming result set must be declared explicitly. This is because PostgreSQL array expansion has no way to provide the index of the array element. If this can be fixed somehow, we could remove the explicit index.
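On the Python side of the processing, the position could in principle be derived from the order of the entries in the fields array, as in the hypothetical sketch below. On the PostgreSQL side, expanding the array via a set-returning function with WITH ORDINALITY might be one way to obtain the index, but that is only a suggestion, not part of the current implementation.

```python
import json

# Hypothetical: derive the stage column position from the position of each
# field entry in the DVPD "fields" array instead of an explicit index.
dvpd_fragment = json.loads("""
{"fields": [
    {"field_name": "customer_number"},
    {"field_name": "customer_name"},
    {"field_name": "country_code"}
]}
""")

field_positions = {
    field["field_name"]: position
    for position, field in enumerate(dvpd_fragment["fields"], start=1)
}
print(field_positions)  # {'customer_number': 1, 'customer_name': 2, 'country_code': 3}
```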

Standardize declaration and processing of data types

"Depends on target database and source system or fetch processing
Translation and normalisation is respobility of processing engine.
Recommendenstion: upper case, remove spaces, check syntax
Possible support: separate configuration JSON, mapping of source to target types for every product (needs a optional product specification in the DVPD for source and targets)"
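A minimal sketch of the recommended normalisation, with an optional product-specific mapping that would come from such a separate configuration JSON; the mapping content, the regular expression and the function name are assumptions:

```python
import re

# Hypothetical product-specific mapping of source types to target types
# (would live in a separate configuration JSON per product combination).
SOURCE_TO_TARGET_TYPES = {"NUMBER": "NUMERIC", "VARCHAR2": "VARCHAR"}

# Very loose syntax check: a type name, optionally followed by (n) or (n,m).
TYPE_SYNTAX = re.compile(r"^[A-Z_]+(\(\d+(,\d+)?\))?$")

def normalize_type(declared_type: str) -> str:
    """Upper-case, remove spaces, optionally map the base type, check the syntax."""
    cleaned = declared_type.upper().replace(" ", "")
    base, separator, precision = cleaned.partition("(")
    mapped = SOURCE_TO_TARGET_TYPES.get(base, base) + separator + precision
    if not TYPE_SYNTAX.match(mapped):
        raise ValueError(f"unsupported type declaration: {declared_type!r}")
    return mapped

print(normalize_type("varchar2 (200)"))  # -> VARCHAR(200)
print(normalize_type("Number(10, 2)"))   # -> NUMERIC(10,2)
```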

Describe development and deployment using DVPD in more detail to make the ecosystem clearer

A first draft of the document has been created. Further writing, according to the list below, will be added later.

use case analysis

Information representation needed, annotated with source endpoints and fields

Source specification

Breakdown of the endpoint data structure into a single-table representation

Fields, types, business key parts, parsing rules, increment pattern, tracking/deletion detection method

Tools: metadata discovery, content analysis

Result: Pipelines with field list

data vault modelling

Table structure and mapping of fields

Tool/Resource: Modelling tool, already established model

Vault model completion and verification

Full/conform naming of tables and columns
Essential naming of key and diff hash columns
Integration into the established model (no conflicts)

Tools: Generators, check routines

implementation

Deployable and ""executable"" Artifact for

  • Deployment of DB tables
  • Processing and loading incoming data
    Bandwidth of methods
  • Can be just the dvpd = full generic engine.
  • Dvpd + copy of current engine
  • Generated process (dvpd only provided as documentation)
  • Generated template, with final manual work

(Discussion about pro/cons of full coded artifacts against generic solutions)

Generation of Fetch

Test of pipeline

  • All increment scenarios
  • All historization scenarios

Tools: generated vault-to-source views, generated test data (variety, change over time)

Deployment

  • Schedule

operations

usage of data

Tools: Vault model, columns, types, comments, lineage
