confluentinc / demo-database-modernization

This demo shows how to stream data to cloud databases with Confluent. It includes fully managed connectors (Oracle CDC, RabbitMQ, MongoDB Atlas) and ksqlDB or Flink SQL as the stream processing engine.

Languages: Python 26.72%, Shell 23.32%, HCL 49.95%
Topics: aws, confluent-cloud, connect, database, ksqldb, terraform, mongodb-atlas, oracle, rabbitmq, streaming-data-pipelines

demo-database-modernization's Introduction

Stream Data to Cloud Databases with Confluent

Amid unprecedented volumes of data being generated, organizations need to harness the value of their data from heterogeneous systems in real time. However, on-prem databases are slow, rigid, and expensive to maintain, limiting the speed at which businesses can scale and drive innovation. Today’s organizations need scalable, cloud-native databases with real-time data. This demo walks you through building streaming data pipelines with Confluent Cloud. You’ll learn about:

  • Confluent’s fully managed source connectors to stream customer data and credit card transactions into Confluent Cloud in real time
  • Stream processing to enrich the data in real time, using aggregates and windowing to build a list of customers whose credit cards may have been stolen
  • A fully managed sink connector to load the enriched data into MongoDB Atlas for real-time fraud analysis (a sketch of such a connector follows this list)
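
As a flavor of the sink side, here is a minimal sketch of how a fully managed MongoDB Atlas sink connector can be declared from ksqlDB. The demo itself provisions its connectors with Terraform, so treat the connector name and the placeholder credentials below as illustrative assumptions rather than the demo's actual configuration.

-- Illustrative sketch only: connector name and all credential values are placeholders
CREATE SINK CONNECTOR mongodb_atlas_sink WITH (
  'connector.class'     = 'MongoDbAtlasSink',
  'kafka.api.key'       = '<kafka-api-key>',
  'kafka.api.secret'    = '<kafka-api-secret>',
  'connection.host'     = '<mongodb-atlas-host>',
  'connection.user'     = '<mongodb-username>',
  'connection.password' = '<mongodb-password>',
  'input.data.format'   = 'JSON',
  'topics'              = 'FD_possible_stolen_card',
  'database'            = '<database-name>',
  'collection'          = '<collection-name>',
  'tasks.max'           = '1'
);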

Break down data silos and stream on-premises, hybrid, and multicloud data to cloud databases such as MongoDB Atlas, Azure Cosmos DB and more, so that every system and application has a consistent, up-to-date, and enhanced view of the data at all times. With Confluent streaming data pipelines, you can connect, process, and govern real-time data for all of your databases. Unlock real-time insights, focus on building innovative apps instead of managing databases, and confidently pave a path to cloud migration and transformation.

To learn more about Confluent’s solution, visit the Database streaming pipelines page.

There are two versions of this demo:

  1. Using ksqlDB as the stream processing engine
    • In this version there are two source connectors (Oracle CDC and RabbitMQ)
    • The Oracle database contains customer information
    • RabbitMQ carries each customer's credit card transactions
  2. Using Flink SQL as the stream processing engine (a sketch of an equivalent Flink SQL query follows this list)
    • In this version there is one source connector (Oracle CDC) and one Python producer
    • The Oracle database contains customer information
    • The Python producer generates sample credit card transactions
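
As a rough illustration of the second version, the windowed fraud aggregation might look like the following in Flink SQL. This is a sketch under assumptions: it presumes an enriched table named transactions_enriched whose transaction_timestamp column is declared as the event-time (watermarked) attribute, with column names mirroring the ksqlDB queries shown later on this page.

-- Sketch only: table and column names are assumed, not taken from the demo's Flink statements
SELECT
  window_start,
  userid,
  SUM(amount)           AS total_credit_spend,
  MAX(avg_credit_spend) AS avg_credit_spend,
  COUNT(*)              AS num_transactions
FROM TABLE(
  TUMBLE(TABLE transactions_enriched, DESCRIPTOR(transaction_timestamp), INTERVAL '2' HOUR))
GROUP BY window_start, window_end, userid
HAVING SUM(amount) > MAX(avg_credit_spend);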

demo-database-modernization's People

Contributors

chuck-confluent, jwfbean, mkananizadeh


demo-database-modernization's Issues

RabbitMQ version wired into the terraform config is no longer available on CloudAMQP

The configuration in main.tf needs to be updated - the version wired into the config is 3.12.1, which is no longer available on CloudAMQP:

│ Error: CreateInstance failed, status: 400, message: map[errors:[map[rmq_version:Invalid RabbitMQ version, available versions are ["3.9.8", "3.9.9", "3.9.11", "3.9.13", "3.9.15", "3.9.16", "3.9.18", "3.9.19", "3.9.20", "3.9.21", "3.9.22", "3.9.23", "3.9.27", "3.10.1", "3.10.2", "3.10.4", "3.10.5", "3.10.6", "3.10.7", "3.10.8", "3.10.10", "3.10.19", "3.10.24", "3.11.5", "3.11.10", "3.11.18", "3.12.2", "3.12.4"]]]]
│
│   with cloudamqp_instance.instance,
│   on main.tf line 231, in resource "cloudamqp_instance" "instance":
│  231: resource "cloudamqp_instance" "instance" {
│
╵
After changing the value to 3.12.2, Terraform was able to create the RabbitMQ instance.

Suggested improvement to fraud logic in ksqldb

Consider this query, which captures the fraud logic of the FD_POSSIBLE_STOLEN_CARD table:

SELECT
        TIMESTAMPTOSTRING(WINDOWSTART, 'yyyy-MM-dd HH:mm:ss') AS WINDOW_START,
        T.USERID,
        T.CREDIT_CARD_NUMBER,
        T.FULL_NAME,
        T.EMAIL,
        T.TRANSACTION_TIMESTAMP,
        SUM(T.AMOUNT) AS TOTAL_CREDIT_SPEND,
        MAX(T.AVG_CREDIT_SPEND) AS AVG_CREDIT_SPEND,
        COUNT(*) AS NUM_TRANSACTIONS
    FROM fd_transactions_enriched T
    WINDOW TUMBLING (SIZE 2 HOURS)
    GROUP BY T.USERID, T.CREDIT_CARD_NUMBER, T.FULL_NAME, T.EMAIL, T.TRANSACTION_TIMESTAMP
    HAVING SUM(T.AMOUNT) > MAX(T.AVG_CREDIT_SPEND) EMIT CHANGES;

The GROUP BY clause is grouping so many different things that there really is no aggregation going on. You can verify that NUM_TRANSACTIONS is always 1.
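
A quick way to check this (assuming the table built from that query is named FD_POSSIBLE_STOLEN_CARD, as above) is a push query that should never emit a row:

-- If the GROUP BY really combined multiple transactions, this would eventually return results
SELECT * FROM FD_POSSIBLE_STOLEN_CARD
WHERE NUM_TRANSACTIONS > 1
EMIT CHANGES;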

I believe this formulation was a workaround for this error when only grouping by id:

Non-aggregate SELECT expression(s) not part of GROUP BY: CREDIT_CARD_NUMBER, FULL_NAME, EMAIL, TRANSACTION_TIMESTAMP
Either add the column(s) to the GROUP BY or remove them from the SELECT.

This comes from creating the customers table with LATEST_BY_OFFSET (see issue).
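
For context, that pattern looks roughly like the sketch below. The stream name is hypothetical, and the column names are taken from the enriched query further down, so the repo's actual statements may differ:

-- Approximate sketch of the existing pattern: derive a table from a raw customer stream
CREATE STREAM fd_cust_raw_stream WITH (
  KAFKA_TOPIC = 'ORCL.ADMIN.CUSTOMERS',
  VALUE_FORMAT = 'JSON_SR'
);

CREATE TABLE fd_customers AS
  SELECT
    id,
    LATEST_BY_OFFSET(first_name)       AS first_name,
    LATEST_BY_OFFSET(last_name)        AS last_name,
    LATEST_BY_OFFSET(email)            AS email,
    LATEST_BY_OFFSET(avg_credit_spend) AS avg_credit_spend
  FROM fd_cust_raw_stream
  GROUP BY id;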

Proposal

Instead of creating a raw customer stream and using LATEST_BY_OFFSET to derive a table, create a table directly from the underlying topic:

CREATE TABLE FD_CUSTOMERS (
  customer_id DOUBLE PRIMARY KEY
) WITH (
  KAFKA_TOPIC = 'ORCL.ADMIN.CUSTOMERS',
  KEY_FORMAT = 'JSON',
  VALUE_FORMAT = 'JSON_SR'
);

NOTE: We avoid the error "column name id already exists" by giving the key a different name, customer_id.

Then create the enriched transaction stream.

CREATE STREAM fd_transactions_enriched WITH (KAFKA_TOPIC = 'transactions_enriched') AS
  SELECT
    C.CUSTOMER_ID,
    T.CREDIT_CARD_NUMBER,
    T.AMOUNT,
    T.TRANSACTION_TIMESTAMP,
    C.FIRST_NAME + ' ' + C.LAST_NAME AS FULL_NAME,
    C.AVG_CREDIT_SPEND,
    C.EMAIL
  FROM fd_transactions T
  INNER JOIN fd_customers C
  ON T.USERID = C.CUSTOMER_ID;

Then create the fraud-detection table. Group only by the columns that actually define a group (in this case customer ID, average credit spend, and full name), and take the email with LATEST_BY_OFFSET rather than grouping on it:

CREATE TABLE fd_possible_stolen_card WITH (KAFKA_TOPIC = 'FD_possible_stolen_card', KEY_FORMAT = 'JSON', VALUE_FORMAT='JSON') AS
SELECT
  T.CUSTOMER_ID,
  LATEST_BY_OFFSET(T.TRANSACTION_TIMESTAMP) AS LATEST_TIMESTAMP,
  COUNT(*) AS NUM_TRANSACTIONS,
  T.FULL_NAME,
  LATEST_BY_OFFSET(T.EMAIL) AS EMAIL,
  SUM(T.AMOUNT) AS TOTAL_CREDIT_SPEND,
  T.AVG_CREDIT_SPEND,
  TIMESTAMPTOSTRING(WINDOWSTART, 'yyyy-MM-dd HH:mm:ss') AS WINDOW_START
FROM FD_TRANSACTIONS_ENRICHED T
WINDOW TUMBLING ( SIZE 2 HOURS ) 
GROUP BY T.CUSTOMER_ID, T.AVG_CREDIT_SPEND, T.FULL_NAME
HAVING (SUM(T.AMOUNT) > T.AVG_CREDIT_SPEND);

Notice the LATEST_BY_OFFSET(T.EMAIL) -- this means if a user changes their email address, it won't reset the aggregation.

Suggest Change in Narrative: Modernize a transactional app from relational (cloud Oracle) to NoSQL (cloud MongoDB Atlas)

Tagging @confluentinc/technical-marketing

Currently this demo is really a streaming ETL job -- extract, transform, and enrich transactional data for the purpose of an analytics service (fraud detection). We have other demos that follow this same general ETL narrative. What we don't have is a good example for making the transition from relational to NoSQL and how Confluent can help.

For background, here is an article about Single Table Design that explains the benefits of moving from a relational database to a NoSQL document database for transactional workloads where the access patterns are well defined ahead of time (i.e., not ad-hoc analysis). After changing the data model to accommodate those access patterns, the developer can move the workload to a document database such as MongoDB Atlas.

But using a different database isn't enough. That old, on-prem data is still valuable, and valuable data will continue to come from on-prem instances of the application. That's where Confluent can help. With Confluent, the customer can set up an always-running, real-time pipeline from on-prem to the cloud and transform the data to match the data model expected by MongoDB. At the end of the day, the customer has expanded their application footprint into the cloud so they can grow the business while still retaining the value of their on-prem application.

Let me know what you think!

Move Oracle to Docker to Simulate On-Prem -> Cloud Migration

It would be good to move Oracle to a local Docker container and then use either:

oracle -> CP connect only -> ccloud

or

oracle -> connect -> CP -> cluster link -> (private network?) -> ccloud

This would show a more authentic hybrid picture for database modernization.

Refactor directory structure

Thanks for putting this together, Maygol!

One suggestion I have is to refactor the directory hierarchy. Sometimes it’s intimidating to see a bunch of separate files in the root. For example, collect all the various connector JSON definitions into a “connectors” folder.
