
datawarehouse-hk232's Introduction

Assignments and exercises for the Data Warehouse course at HCMUT

Semester 2023 - 2

Student: Trung Nguyen Viet - Instructor: M. Eng Duc Tien Bui

Developed with Python 3.11.x, PostgreSQL 12.x (pgAdmin 7.x if needed), and Apache NiFi 1.27.x, plus the pip packages listed below.

Setup Note - Decision Support Module

  1. Install the pip packages from a terminal/command line: pip install faker pandas configparser coloredlogs psycopg2 jupyter squarify seaborn scikit-learn

  2. Start Jupyter from the terminal/command line: jupyter notebook

  3. Run a Jupyter notebook file as usual
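Before opening a notebook, it can help to confirm that the packages from step 1 are importable. The following is a minimal sketch, not part of the repository: the helper name missing_packages and the mapping are illustrative. Note that some pip names differ from their import names (scikit-learn is imported as sklearn), and jupyter is checked here through jupyter_core, which is installed as one of its dependencies.

```python
from importlib.util import find_spec

# pip package name -> top-level module name used in `import` statements.
# The mapping is an illustration based on the install command above.
PIP_TO_MODULE = {
    "faker": "faker",
    "pandas": "pandas",
    "configparser": "configparser",
    "coloredlogs": "coloredlogs",
    "psycopg2": "psycopg2",
    "jupyter": "jupyter_core",   # installed as a dependency of jupyter
    "squarify": "squarify",
    "seaborn": "seaborn",
    "scikit-learn": "sklearn",   # pip name and import name differ
}


def missing_packages(pip_to_module=PIP_TO_MODULE):
    """Return the pip names whose module cannot be found in this environment."""
    return [
        pip_name
        for pip_name, module in pip_to_module.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, try: pip install " + " ".join(missing))
    else:
        print("All listed packages are importable.")
```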

Setup Note - Data Warehouse Module

  1. Update the database config and connection info in dwh_pipelines/config.ini

  2. Manually create the databases in your database instance, using the same names as in config.ini

  3. Install dependencies: pip install faker pandas configparser coloredlogs psycopg2 jupyter squarify seaborn scikit-learn

  4. Run the Python scripts:
       python gen_staging.py   (creates the dimension tables)
       python gen_fact.py      (generates the fact table for the data mart)
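As a rough illustration of steps 1 and 2, the sketch below reads connection settings with configparser and turns them into keyword arguments for a PostgreSQL connection. The section name, the key names, and the database names are assumptions made for this example; the real dwh_pipelines/config.ini may be laid out differently.

```python
import configparser

# Hypothetical file contents; the actual config.ini may use other
# section and key names.
SAMPLE_INI = """
[postgres]
host = localhost
port = 5432
user = postgres
password = change-me
staging_db = staging_db
datamart_db = datamart_db
"""


def load_connection_params(ini_text, database_key, section="postgres"):
    """Build connection keyword arguments for one of the configured databases."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    cfg = parser[section]
    return {
        "host": cfg["host"],
        "port": cfg.getint("port"),
        "user": cfg["user"],
        "password": cfg["password"],
        "dbname": cfg[database_key],
    }


params = load_connection_params(SAMPLE_INI, "staging_db")
# Print everything except the password.
print({key: value for key, value in params.items() if key != "password"})
```

With PostgreSQL running and the database from step 2 already created, a dictionary like this can be passed to psycopg2.connect(**params).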

Theory: Basic components of a Data Warehouse

This example uses the Inmon approach to data warehouse design, although some layers do not follow the method completely.
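In an Inmon-style design, data is first integrated into a normalized store, and data marts are derived from it afterwards. The sketch below illustrates that second step in plain Python: normalized records are turned into a dimension with surrogate keys and a fact table that references it. All table names, columns, and values are invented for the example and are not taken from gen_staging.py or gen_fact.py.

```python
# Normalized warehouse records; every name and value here is illustrative.
customers = [
    {"customer_id": "C01", "name": "An", "city": "Ho Chi Minh City"},
    {"customer_id": "C02", "name": "Binh", "city": "Ha Noi"},
]
orders = [
    {"order_id": 1, "customer_id": "C01", "amount": 120.0},
    {"order_id": 2, "customer_id": "C02", "amount": 80.0},
    {"order_id": 3, "customer_id": "C01", "amount": 50.0},
]


def build_customer_dimension(rows):
    """Assign a surrogate key to each customer and keep its attributes."""
    return [{"customer_key": key, **row} for key, row in enumerate(rows, start=1)]


def build_sales_fact(order_rows, dimension):
    """Replace the business key on each order with the surrogate key."""
    key_by_id = {d["customer_id"]: d["customer_key"] for d in dimension}
    return [
        {
            "order_id": order["order_id"],
            "customer_key": key_by_id[order["customer_id"]],
            "amount": order["amount"],
        }
        for order in order_rows
    ]


dim_customer = build_customer_dimension(customers)
fact_sales = build_sales_fact(orders, dim_customer)

# A typical data mart query: total sales per customer key.
totals = {}
for row in fact_sales:
    totals[row["customer_key"]] = totals.get(row["customer_key"], 0.0) + row["amount"]
print(totals)  # {1: 170.0, 2: 80.0}
```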

Architecture

Exercises on Data pipelines

Data Pipeline Components

Example of Data Pipeline using Data Tools - Apache NiFi

A simple data pipeline that generates a file and sends it to the local filesystem

  1. Create and set up the first processor, of type GenerateFlowFile

Screenshot:

  2. Create and set up the second processor, of type PutFile:

Screenshot Screenshot

  3. Start both processors to run the pipeline

Screenshot

  4. Check Data Provenance

Screenshot
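The flow above is configured in the NiFi UI, not in code. For readers who want to see the same idea outside NiFi, here is a plain-Python analogue, assuming nothing about NiFi's own API: one function stands in for GenerateFlowFile (producing content under a unique name) and another for PutFile (writing that content to a directory). The function names and the sample text are hypothetical.

```python
import tempfile
import uuid
from pathlib import Path


def generate_flow_file(custom_text):
    """Stand-in for GenerateFlowFile: return a (filename, content) pair."""
    return f"{uuid.uuid4()}.txt", custom_text.encode("utf-8")


def put_file(flow_file, directory):
    """Stand-in for PutFile: write the content into the target directory."""
    filename, content = flow_file
    target_dir = Path(directory)
    # Create the directory if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(content)
    return target


# Run the two steps back to back, writing into a temporary directory.
with tempfile.TemporaryDirectory() as out_dir:
    written = put_file(generate_flow_file("hello from the pipeline"), out_dir)
    print(written.name, written.read_text(encoding="utf-8"))
```

Unlike NiFi, this sketch keeps no provenance record; in the real flow, step 4 is where each generated file can be traced.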

Example of Data Pipeline using Cloud Service - Azure Data Factory

A simple data pipeline using a "Copy Activity" to load data from a CSV file into an Azure SQL table

Some notable steps:
  1. Create a Blob Storage resource inside a Storage Account

Screenshot

  2. Create the Linked Services used to connect to the Datasets

Screenshot Screenshot

  3. Create the Source and Sink Datasets for each Activity

Screenshot

  4. Add the Copy Activity and import the mapping

Screenshot
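The steps above are carried out in the Azure portal. To make the idea of a copy with a column mapping concrete, the sketch below does the equivalent locally: it reads CSV rows (the source), renames columns according to a mapping, and inserts them into a SQL table (the sink). It uses an in-memory SQLite database as a stand-in for Azure SQL, and the table, column names, and sample rows are all assumptions made for the example; it does not call any Azure Data Factory API.

```python
import csv
import io
import sqlite3

# Illustrative source data and source-column -> sink-column mapping.
CSV_TEXT = "id,full_name,city\n1,An,Ha Noi\n2,Binh,Da Nang\n"
COLUMN_MAPPING = {"id": "customer_id", "full_name": "name", "city": "city"}


def copy_csv_to_table(csv_text, conn, table, mapping):
    """Insert every CSV row into `table`, renaming columns via `mapping`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    sink_columns = list(mapping.values())
    placeholders = ", ".join("?" for _ in sink_columns)
    # Table and column names come from trusted code, values are parameterized.
    sql = f"INSERT INTO {table} ({', '.join(sink_columns)}) VALUES ({placeholders})"
    rows = [tuple(record[source] for source in mapping) for record in reader]
    conn.executemany(sql, rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT, city TEXT)")
copied = copy_csv_to_table(CSV_TEXT, conn, "customers", COLUMN_MAPPING)
print(copied, conn.execute("SELECT * FROM customers").fetchall())
```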

References:

  1. Azure 4 Everyone
  2. Data pipelines with Spotify's Luigi
  3. Stephen David William Blog
  4. Arvutiteaduse instituudi kursused
