
Open Source Modern Data Stack

Welcome to Open Source Modern Data Stack (OSMDS). This repo showcases a complete modern data stack using popular open-source tools.

osmds_architeture.png

Components of the OSMDS

Before we set up the project, let's briefly look at each tool used in this example of the open-source modern data stack so you understand what each component does.

Airbyte

Airbyte is an open-source data integration tool. With Airbyte, you can set up a data pipeline in minutes thanks to its extensive collection of pre-built connectors. Airbyte can replicate data from applications, APIs, and databases into data warehouses and data lakes. Airbyte offers a self-hosted option with Docker Compose that you can run locally. In this example, Airbyte is used to replicate data from a CSV file to Apache Doris.

Apache Doris

Apache Doris is a high-performance, real-time analytical database based on the MPP (Massively Parallel Processing) architecture, known for its extreme speed and ease of use. It can return query results on massive datasets in sub-second time, and it supports both highly concurrent point-query scenarios and high-throughput complex analytical scenarios.

DBT

dbt is an open-source data transformation tool that relies on SQL to build production-grade data pipelines. dbt replaces the usual boilerplate DDL/DML required to transform data with simple, modular SQL SELECT statements and handles dependency management. dbt provides a cloud-hosted option as well as a CLI, a Python API, and integrations with orchestration tools like Apache Airflow, Dagster, etc. In this OSMDS example, dbt performs a simple transformation on the ingested data using a SQL query. Airbyte's native integration with dbt is used to run the transformations.

Apache Superset

Apache Superset is a modern business intelligence, data exploration, and visualization platform. Superset connects to a variety of databases and provides an intuitive interface for visualizing datasets. It offers a wide choice of visualizations as well as a no-code visualization builder. You can run Superset locally with Docker Compose or in the cloud using Preset. Superset sits at the end of this real-time data analytics stack example and is used to visualize the data stored in Apache Doris.

Pre-requisites

To follow along, you need to:

Install Docker and Docker Compose on your machine. You can follow this guide to install Docker and this one to install Docker Compose.

Using Docker Compose to bootstrap OSMDS

This tutorial uses Docker Compose and a shell script to set up the required resources. Docker saves you from installing additional dependencies locally, and you can quickly start and stop the instances.

The shell script setup.sh provides two commands, up and down, to start and stop the instances. The compose files are stored in airbyte/docker-compose-airbyte.yaml, doris/docker-compose-doris.yaml, and superset/docker-compose-superset.yaml. You can go through these files and make any necessary customization, for example, changing the ports where the instances start or installing additional dependencies.
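The two commands described above can be run as follows. This is a minimal sketch assuming you are in the repository root where setup.sh lives:

```shell
# Start the Airbyte, Doris, and Superset instances defined in the compose files.
./setup.sh up

# Stop and remove the instances when you are done.
./setup.sh down
```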

Setting up Airbyte, Doris, Superset with Docker Compose

Setting up Airbyte

The script launches the Airbyte instance at http://localhost:8000. In this tutorial, you will use the File source to read a CSV file hosted at the following URL: https://storage.googleapis.com/covid19-open-data/v2/latest/epidemiology.csv.

NOTE: Log in to the Airbyte web UI with the default credentials found in the .env file.

Configuring data source

Enter Covid 19 data as the source name and select File as the source type. Choose csv as the file format and HTTPS (Public Web) as the storage provider. Finally, paste the URL of the COVID-19 data source.

Configuring data destination

Choose Apache Doris as the data destination type and enter Apache Doris as the destination name. Then fill in the values for Host, HttpPort, QueryPort, DatabaseName, Username, and Password.

Configuring a connection

Finally, establish a connection between the data source and the data destination you created above.

Setting up Doris

The script launches the Doris FE (frontend) instance at http://localhost:8030. You should see the following screen, which indicates that the FE has started successfully. doris_fe_login.png

NOTE: Here we use Doris's built-in default user (root) to log in with an empty password.
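You can also connect to Doris from the command line over the MySQL protocol. This sketch assumes Doris's default query port 9030 (adjust it if you changed the port mapping in the compose file):

```shell
# Connect to the Doris FE with any MySQL-compatible client.
# The default root user has an empty password, so no -p flag is needed.
mysql -h 127.0.0.1 -P 9030 -uroot
```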

Setting up Superset

Once the setup.sh command has completed, visit http://localhost:8088 to access the Superset UI. Enter admin as both the username and password. Choose MySQL (used to connect to the Doris DWH) from the supported databases drop-down, then enter the connection details for your Doris instance.
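Superset expresses database connections as SQLAlchemy URIs. The URI below is a hypothetical sketch, assuming Doris's default MySQL-protocol query port 9030, the root user with an empty password, and that the Superset container reaches Doris via host.docker.internal; adjust the host and database name to match your setup:

```
mysql://root:@host.docker.internal:9030/demo
```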

Using the OSMDS

Once the stack is up and running, you can start using it to ingest and process your data.

Sync data from CSV into Doris using Airbyte

On the connection screen, click the Sync now button to start syncing data from the online file into the Doris data warehouse. Depending on the size of the source, this might take a few minutes.

airbyte_sync_data.png

Once finished, your Doris UI should show the following structure:

The covid19_data table contains the raw JSON emitted by Airbyte, which is then normalized to form the final_covid19_data table. We'll use dbt to transform the data. You can take a look at the Data preview tab to get an idea of the final_covid19_data table's structure. preview_covid19_data.png
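To make the normalization step concrete, here is a rough Python sketch (not part of the repo) of what the dbt staging model in the next section does in SQL: strip escape backslashes from the raw Airbyte record, then extract typed fields from the JSON payload. The sample record is hypothetical:

```python
import json

def normalize_record(airbyte_data: str) -> dict:
    # Mirrors replace(_airbyte_data, '\\', '') in the SQL model:
    # remove escape backslashes so the payload parses as plain JSON.
    cleaned = airbyte_data.replace("\\", "")
    record = json.loads(cleaned)

    def num(field: str):
        # Mirrors get_json_double: cast the field to a float if present.
        value = record.get(field)
        return float(value) if value is not None else None

    return {
        "key": record.get("key"),
        "date": record.get("date"),
        "new_confirmed": num("new_confirmed"),
        "total_confirmed": num("total_confirmed"),
    }

# Hypothetical sample record in the shape Airbyte emits.
row = normalize_record(
    '{"key": "US", "date": "2022-09-15", '
    '"new_confirmed": "58661", "total_confirmed": "95508155"}'
)
```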

Do a simple transformation with dbt

You can use the dbt framework to develop transformations. The staging model (stg_covid19_data) extracts typed columns from the raw JSON:

{{config(
    materialized='table',
    distributed_by = [],
    properties= {"replication_allocation" : "tag.location.default: 1"}
)}}

with tmp_tbl as (
    SELECT replace(_airbyte_data, '\\', '') _airbyte_data
    FROM {{ source('doris_covid19_data', 'covid19_data') }}
)

SELECT
    get_json_string(_airbyte_data, '$.key') `key`
    , get_json_double(_airbyte_data, '$.total_tested') total_tested
    , get_json_double(_airbyte_data, '$.new_recovered') new_recovered
    , get_json_double(_airbyte_data, '$.new_deceased') new_deceased
    , get_json_string(_airbyte_data, '$.date') `date`
    , get_json_double(_airbyte_data, '$.total_confirmed') total_confirmed
    , get_json_double(_airbyte_data, '$.new_tested') new_tested
    , get_json_double(_airbyte_data, '$.total_recovered') total_recovered
    , get_json_double(_airbyte_data, '$.new_confirmed') new_confirmed
    , get_json_double(_airbyte_data, '$.total_deceased') total_deceased
FROM tmp_tbl

Next, create an aggregated model (agg_covid19_data) to get the latest COVID report by day.

{{config(
    materialized='table',
    properties= {"replication_allocation" : "tag.location.default: 1"}
)}}

SELECT
    `date`,
    sum(total_tested) total_tested,
    sum(new_recovered) new_recovered,
    sum(new_deceased) new_deceased,
    sum(total_confirmed) total_confirmed,
    sum(new_tested) new_tested,
    sum(total_recovered) total_recovered,
    sum(new_confirmed) new_confirmed,
    sum(total_deceased) total_deceased
FROM {{ ref('stg_covid19_data') }}
GROUP BY 1
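With both models in place, they can be built with the dbt CLI. This is a sketch assuming the model names used in the snippets above and that you run it from the dbt project directory:

```shell
# Build only the two models from this example; dbt resolves the
# stg_covid19_data -> agg_covid19_data dependency automatically.
dbt run --select stg_covid19_data agg_covid19_data
```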

Visualize data on dashboard with Superset

Once Airbyte has loaded data into Doris, you can add it as a dataset in Superset and start visualizing the data. First, navigate to the Datasets page. Select the Doris connection you created earlier as the database, demo as the schema, and agg_covid19_data as the table.

Now you can create a new chart using the data stored in Doris. Navigate to the Charts tab, choose agg_covid19_data as the dataset, and pick a chart type, for example a bar chart. On the next page, you can add the metrics you want to visualize; for example, the sum of new_recovered, grouped by key.

superset_chart.png

Cleaning up

You can clean up the resources by running the following command.

./setup.sh down

Conclusion

By adopting an open-source modern data stack, you can achieve remarkable improvements in the efficiency of your data pipeline, resulting in significant time savings by reducing manual labor. Furthermore, this approach can lead to substantial cost reductions. When you choose to employ self-hosted open-source solutions such as Airbyte, Doris, dbt, and Superset, you ensure that these vital components operate within your own infrastructure, offering you greater control and security.

The advantage of these projects lies in their provision of Docker Compose files, which simplifies the local setup process considerably. This means that you can easily establish and manage these components on your local environment with minimal effort.

However, if you're aiming to deploy a scalable modern data stack to a cloud environment, you have the option to leverage Terraform and Kubernetes, two powerful tools that automate the deployment process. This automation not only streamlines the setup but also enables you to efficiently manage and scale your data stack in the cloud environment, providing you with flexibility and ease of use.

Supporting Links

About the author


