
cdc-replication-hadoop

Keep an RDB schema (Oracle, MySQL, Postgres) in sync with a Hive structured store, with Kafka added as a buffer between the tables. CDC (Change Data Capture) is one of the best ways to interconnect an OLTP database system with other systems such as a data warehouse, Hive, or Spark.

Purpose

The purpose of this project is to create a solution that lets you synchronize RDB (MySQL, Oracle, Postgres) tables with their Hive equivalents. The synchronization process is done using CDC (change data capture) logs; by using this technique we get almost real-time synchronization between the source and destination tables. For CDC logging, Debezium's MySQL Connector was used: it can monitor and record all of the row-level changes in the databases on a MySQL server.

Note: We assume that we don't want to keep history in the Hive tables; the Hive tables should constantly follow the RDB.
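As a hedged illustration of the CDC technique, a Debezium change event carries the row state before and after the change plus an operation code (`c` insert, `u` update, `d` delete, `r` snapshot read). A minimal consumer might route events like this; the envelope shape follows Debezium's documented format, but the field handling here is a simplified assumption, not code from this repository:

```python
import json

# Map Debezium operation codes to the action the Hive merge step will take.
# 'c' = create/insert, 'u' = update, 'd' = delete, 'r' = snapshot read.
OP_ACTIONS = {"c": "insert", "u": "update", "d": "delete", "r": "insert"}

def route_event(raw_event: str) -> dict:
    """Extract the fields a staging-table writer needs from one CDC event."""
    payload = json.loads(raw_event)["payload"]
    return {
        "action": OP_ACTIONS[payload["op"]],
        # For deletes Debezium sets 'after' to None; fall back to 'before'
        # so the row key is still available for the merge.
        "row": payload["after"] if payload["after"] is not None else payload["before"],
    }

event = json.dumps({
    "payload": {"op": "u",
                "before": {"id": 1, "name": "old"},
                "after": {"id": 1, "name": "new"}}
})
print(route_event(event))  # {'action': 'update', 'row': {'id': 1, 'name': 'new'}}
```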

Requirements

  • RDB changes are sent over CDC logs
  • Kafka buffer to keep CDC logs in small chunks
  • Hive 2.3.2 with ACID support
  • Spark 2.3
  • OrcFile format
  • Terraform scripts for GCP cluster spinup
  • Metastore table for keeping logs about currently running processes. Access to it is over AWS API Gateway; DynamoDB is used as the data storage layer.
  • MySQL Debezium connector to Kafka
  • Airflow scheduler
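The metastore requirement above can be sketched in miniature. The real DynamoDB schema is not documented in this README, so the record fields below (`table_name`, `status`, `started_at`) are illustrative assumptions; the sketch only shows the kind of guard such a table enables, e.g. refusing to start a second sync for a table that is already running:

```python
from datetime import datetime, timezone

# Hypothetical shape of a metastore item; field names are assumptions,
# not the repository's actual DynamoDB schema.
def new_process_record(table: str) -> dict:
    return {
        "table_name": table,      # would be the partition key in DynamoDB
        "status": "RUNNING",      # RUNNING | DONE | FAILED
        "started_at": datetime.now(timezone.utc).isoformat(),
    }

def can_start(existing_records: list, table: str) -> bool:
    """Refuse to start a second sync for a table that is already RUNNING."""
    return not any(
        r["table_name"] == table and r["status"] == "RUNNING"
        for r in existing_records
    )

records = [new_process_record("customers")]
print(can_start(records, "customers"))  # False: a sync is already running
print(can_start(records, "orders"))     # True
```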

Diagram

The diagram below shows the architecture concept. There are four main parts:

  • A relational database cluster, configured to generate CDC logs.
  • Logs are sent to a Kafka buffer.
  • A Spark streaming job reads the raw data and pushes it, in a structured way, to a Hive staging table.
  • The merge process is done in Hive; to make it work we need Hive's transaction support.
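The merge step above depends on Hive's ACID/transaction support. The repository's actual HQL is not shown in this README, so the statement below is a generic sketch of merging a staging table into a target table; the table names (`target_db.customers`, `staging.customers_cdc`), columns, and `op` flag are all assumptions:

```python
# Build a generic Hive MERGE statement for the staging -> target step.
# Table and column names here are illustrative, not from the repository.
def build_merge_sql(target: str, staging: str, key: str = "id") -> str:
    return f"""
MERGE INTO {target} AS t
USING {staging} AS s
ON t.{key} = s.{key}
WHEN MATCHED AND s.op = 'delete' THEN DELETE
WHEN MATCHED THEN UPDATE SET name = s.name
WHEN NOT MATCHED AND s.op != 'delete' THEN INSERT VALUES (s.id, s.name)
""".strip()

print(build_merge_sql("target_db.customers", "staging.customers_cdc"))
```

A statement like this only runs against a transactional (ACID) Hive table, which is why the requirements list Hive 2.3.2 with ACID support and the ORC file format.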

(Diagram: cdc logs)

App

  • Debezium - MySQL Connector can monitor and record all of the row-level changes in the databases on a MySQL server or HA MySQL cluster. The first time it connects to a MySQL server/cluster, it reads a consistent snapshot of all of the databases. When that snapshot is complete, the connector continuously reads the changes that were committed to MySQL 5.6 or later and generates corresponding insert, update and delete events. All of the events for each table are recorded in a separate Kafka topic, where they can be easily consumed by applications and services.

MySQL connector configuration

Debezium Kafka setup

  • Kafka configuration

How to run locally

  • Run sudo docker-compose up -d to install everything and start the services
  • Run sudo docker-compose build --no-cache to completely rebuild the containers

Tests

Tests have been planned and executed on Google Cloud. The environment was prepared using Terraform scripts.

Links
