
cimt-ag / data_vault_pipelinedescription

A concept and syntax providing a universal data format for storing all essential information needed to implement or generate a data loading process for a data vault model.

Home Page: https://www.cimt-ag.de/leistungen/data-vault-pipeline-description/

License: Apache License 2.0

PLpgSQL 2.16% Python 95.41% Java 1.86% Batchfile 0.58%
data-warehousing datavault20 data-vault

data_vault_pipelinedescription's Issues

add testcase "Hub only"

This should only load a hub.
Main concern: whether a plan will be generated without any "leaf table".

Add deletion detection properties

A source may deliver a complete or a partitioned set of business objects. Based on this knowledge, an internal deletion detection can be implemented.

  • define syntax
  • load and check syntax with compiler
  • provide derivations by compiler
  • provide test cases for different scenarios
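The derivation the compiler would provide could look roughly like the following sketch in Python (the repository's main language). The function name, its signature, and modelling the complete/partitioned distinction via a `partition` argument are illustrative assumptions, not the project's actual API:

```python
# Hedged sketch of internal deletion detection. partition=None means the
# source delivers a complete set of business objects; otherwise only keys
# inside the delivered partition can be detected as deleted.
def detect_deletions(vault_keys: set, delivered_keys: set,
                     partition: set = None) -> set:
    # Restrict the comparison scope to the delivered partition, if any.
    scope = vault_keys if partition is None else vault_keys & partition
    # Every key in scope that was not delivered is considered deleted.
    return scope - delivered_keys
```

For a complete delivery, `detect_deletions({"a", "b", "c"}, {"a", "c"})` flags `"b"` as deleted; with `partition={"a", "b"}`, key `"c"` is out of scope and never flagged.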

Check for conflict of mapping when using multiple relations

When a table is targeted by multiple relations, the resulting column structure must be the same for all relations:

  • same number of columns
  • same column names and types

If this is violated, the compiler must log a proper error message and stop.

There is already a stub function in the code:
check_multifield_mapping_consistency_of_column
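A sketch of what the completed check might look like, assuming a simplified internal model where each relation resolves to an ordered list of column dicts; the function name, signature, and data shapes are illustrative and do not reflect the actual stub:

```python
# Hypothetical consistency check: all relations targeting the same table
# must produce the same column structure (count, names, types).
import logging

log = logging.getLogger("dvpd_compiler")

def check_multifield_mapping_consistency(relations: dict) -> bool:
    """relations maps a relation name to its ordered column list:
    [{"name": ..., "type": ...}, ...]. Returns False on conflict."""
    items = list(relations.items())
    ref_name, ref_columns = items[0]
    reference = [(c["name"], c["type"]) for c in ref_columns]
    for name, columns in items[1:]:
        candidate = [(c["name"], c["type"]) for c in columns]
        if len(candidate) != len(reference):
            log.error("relation %s maps %d columns, but %s maps %d",
                      name, len(candidate), ref_name, len(reference))
            return False
        for (rn, rt), (cn, ct) in zip(reference, candidate):
            if (rn, rt) != (cn, ct):
                log.error("relation %s: column %s %s conflicts with "
                          "%s %s from relation %s", name, cn, ct,
                          rn, rt, ref_name)
                return False
    return True
```

The caller would abort compilation when the function returns False.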

test: modify order in key hash

This test should determine whether the declaration of the field order for a key hash is transported correctly to the DVPI and has an impact on the DVPI summary.

  • Test created
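To illustrate why the declared field order matters: a key hash is typically computed over the concatenated field values, so swapping the order changes the hash. The delimiter, hash function, and value rendering below are assumptions for the sketch, not the project's actual hashing rules:

```python
# Minimal sketch: the declared field order determines the hash input order.
import hashlib

def key_hash(row: dict, ordered_fields: list) -> str:
    # Concatenate values in declared order (delimiter "|" is an assumption).
    payload = "|".join(str(row[f]) for f in ordered_fields)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

row = {"customer_id": 42, "account_no": "A-7"}
h1 = key_hash(row, ["customer_id", "account_no"])
h2 = key_hash(row, ["account_no", "customer_id"])
assert h1 != h2  # a different declared order yields a different key hash
```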

Describe development and deployment using DVPD in more detail to make ecosystem more clear

A first draft of the document has been created. Further writing, following the list below, will be added later.

use case analysis

Information representation needed, annotated with source endpoints and fields

Source specification

Breakdown of Endpoint data structure to single table representation

Fields, types, business key parts, parsing rules, increment pattern, tracking/deletion detection method

Tools: metadata discovery, content analysis

Result: Pipelines with field list

data vault modelling

Table structure and mapping of fields

Tool/Resource: Modelling tool, already established model

Vault model completion and verification

Full/Conform naming of tables and columns
Essential Naming of key and diff hash columns
Integration into established model (no conflicts)

Tools: Generators, check routines

implementation

Deployable and "executable" artifact for

  • Deployment of DB tables
  • Processing and loading incoming data
    Bandwidth of methods:
  • Just the DVPD (interpreted by a fully generic engine)
  • DVPD plus a copy of the current engine
  • A generated process (DVPD provided only as documentation)
  • A generated template, with final manual work

(Discussion about pros/cons of fully coded artifacts versus generic solutions)

Generation of Fetch

Test of pipeline

  • All increment scenarios
  • All historization scenarios

Tools: generated vault to source views, generated testdata (variety, change over time)

Deployment

  • Schedule

operations

usage of data

Tools: Vault model, columns, types, comments, lineage

Standardize declaration and processing of data types

Depends on the target database and on the source system or fetch processing.
Translation and normalisation are the responsibility of the processing engine.
Recommendation: upper case, remove spaces, check syntax.
Possible support: a separate configuration JSON mapping source to target types for every product (needs an optional product specification in the DVPD for sources and targets).
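A minimal Python sketch of the recommendation above (upper case, remove spaces, check syntax) combined with a per-product mapping lookup; the mapping content and the function name are hypothetical examples, not shipped configuration:

```python
# Sketch: normalise a declared data type, then translate its base type via
# a product-specific mapping. The mapping excerpt is a made-up example.
import re

TYPE_MAP_POSTGRES = {
    "VARCHAR": "VARCHAR",
    "NVARCHAR": "VARCHAR",
    "DATETIME": "TIMESTAMP",
}

def normalize_type(declared: str) -> str:
    # Recommendation from the issue: upper case and remove spaces.
    cleaned = declared.upper().replace(" ", "")
    base = re.match(r"[A-Z0-9_]+", cleaned)
    if base is None or base.group(0) not in TYPE_MAP_POSTGRES:
        raise ValueError(f"unknown data type declaration: {declared!r}")
    # Translate the base type, keep any length/precision suffix.
    return TYPE_MAP_POSTGRES[base.group(0)] + cleaned[base.end():]
```

For example, `normalize_type("nvarchar (50)")` yields `"VARCHAR(50)"`, while an unmapped type raises an error so the compiler can stop with a clear message.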

add ink_key_explicit_content_order[]

  • Challenge: How to declare the recursive content explicitly? Allow usage of field names, indicated by prefixing with "!" (maybe force the use of field names in case of recursive parents)
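A hypothetical parser for such a content-order list, where a "!" prefix marks a field name and any other entry is treated as a different kind of reference; the entry shapes and tags are assumptions based on the proposal above:

```python
# Sketch: classify explicit content-order entries. "!" prefix => field name,
# anything else => other reference (e.g. recursively derived content).
def parse_content_order(entries: list) -> list:
    parsed = []
    for entry in entries:
        if entry.startswith("!"):
            parsed.append(("field", entry[1:]))   # strip the "!" marker
        else:
            parsed.append(("reference", entry))
    return parsed
```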

Test: modify priority in diff hash

This test should determine whether the declaration of the field priority for a diff hash is transported correctly to the DVPI and has an impact on the DVPI summary.

  • Test created
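For illustration, a sketch in which fields are sorted by their declared priority before the diff hash is computed, so changing a priority alters the hash input order; the hashing details are assumptions, not the project's actual rules:

```python
# Sketch: field priority determines the order of values in the diff hash.
import hashlib

def diff_hash(row: dict, priorities: dict) -> str:
    # Lower priority value comes first (an assumption for this sketch).
    ordered = sorted(priorities, key=lambda f: priorities[f])
    payload = "|".join(str(row[f]) for f in ordered)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

row = {"name": "Ada", "city": "Berlin"}
assert diff_hash(row, {"name": 1, "city": 2}) != diff_hash(row, {"name": 2, "city": 1})
```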

suppress redundant row hash stage columns

When the same satellite content is loaded for all field groups, the process mapping and staging currently provide a row hash for every field group. It may be possible to reduce this to one row hash.
The probability of this constellation occurring is very low, so there is no hurry.
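A sketch of the proposed reduction: group the field groups by their (ordered) satellite content, so one row-hash stage column per distinct content set suffices. The data shapes are illustrative assumptions:

```python
# Sketch: field groups with identical (ordered) satellite content can share
# a single row hash instead of one per group.
def dedupe_row_hashes(field_groups: dict) -> dict:
    """field_groups maps a group name to its ordered field tuple.
    Returns one entry per distinct content, listing the groups sharing it."""
    shared = {}
    for group, fields in field_groups.items():
        shared.setdefault(fields, []).append(group)
    return shared
```

With `{"g1": ("a", "b"), "g2": ("a", "b"), "g3": ("c",)}`, only two row hashes are needed instead of three.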

Use position in fields array as definition of column position

Currently the position of a field in the incoming result set must be declared explicitly. This is due to the fact that PostgreSQL array expansion provides no way to obtain the index of the array element. If this can somehow be fixed, we could remove the explicit index.
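If positions were instead derived from the order of the fields array, a sketch could be as simple as the following (the field-entry shape is an assumption):

```python
# Sketch: derive column positions from array order instead of an explicit
# declaration. Field dicts are a hypothetical shape, not the DVPD schema.
def assign_positions(fields: list) -> list:
    return [{**f, "field_position": i + 1} for i, f in enumerate(fields)]

fields = [
    {"field_name": "customer_id", "type": "INT"},
    {"field_name": "name", "type": "VARCHAR(100)"},
]
positioned = assign_positions(fields)
```

On the PostgreSQL side, `WITH ORDINALITY` on set-returning functions such as `unnest` or `jsonb_array_elements` exposes the element index during expansion, which might be the fix the issue hints at.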
