confluentinc / demo-database-modernization

This demo shows how to stream data to cloud databases with Confluent. It includes fully managed connectors (Oracle CDC, RabbitMQ, MongoDB Atlas) and ksqlDB or Flink SQL as the stream processing engine.

Languages: Python 26.72%, Shell 23.32%, HCL 49.95%
Topics: aws, confluent-cloud, connect, database, ksqldb, terraform, mongodb-atlas, oracle, rabbitmq, streaming-data-pipelines

demo-database-modernization's Introduction

Stream Data to Cloud Databases with Confluent

Amid unprecedented volumes of data being generated, organizations need to harness the value of their data from heterogeneous systems in real time. However, on-prem databases are slow, rigid, and expensive to maintain, limiting the speed at which businesses can scale and drive innovation. Today’s organizations need scalable, cloud-native databases with real-time data. This demo walks you through building streaming data pipelines with Confluent Cloud. You’ll learn about:

  • Confluent’s fully managed source connectors to stream customer data and credit card transactions into Confluent Cloud in real time
  • Stream processing to enrich the data in real time, using aggregates and windowing to build a list of customers whose credit cards may have been stolen
  • A fully managed sink connector to load the enriched data into MongoDB Atlas for real-time fraud analysis (a sketch of such a connector follows this list)
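
As a flavor of the sink side, here is a minimal sketch of how a fully managed MongoDB Atlas sink connector can be declared from ksqlDB. The demo itself provisions its connectors with Terraform, so treat the connector name and the placeholder credentials below as illustrative assumptions rather than the demo's actual configuration.

-- Illustrative sketch only: connector name and all credential values are placeholders
CREATE SINK CONNECTOR mongodb_atlas_sink WITH (
  'connector.class'     = 'MongoDbAtlasSink',
  'kafka.api.key'       = '<kafka-api-key>',
  'kafka.api.secret'    = '<kafka-api-secret>',
  'connection.host'     = '<mongodb-atlas-host>',
  'connection.user'     = '<mongodb-username>',
  'connection.password' = '<mongodb-password>',
  'input.data.format'   = 'JSON',
  'topics'              = 'FD_possible_stolen_card',
  'database'            = '<database-name>',
  'collection'          = '<collection-name>',
  'tasks.max'           = '1'
);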

Break down data silos and stream on-premises, hybrid, and multicloud data to cloud databases such as MongoDB Atlas, Azure Cosmos DB and more, so that every system and application has a consistent, up-to-date, and enhanced view of the data at all times. With Confluent streaming data pipelines, you can connect, process, and govern real-time data for all of your databases. Unlock real-time insights, focus on building innovative apps instead of managing databases, and confidently pave a path to cloud migration and transformation.

To learn more about Confluent’s solution, visit the Database streaming pipelines page.

There are two versions of this demo:

  1. Using ksqlDB as the stream processing engine
    • In this version there are two source connectors (Oracle CDC and RabbitMQ)
    • The Oracle database contains customer information
    • RabbitMQ carries each customer's credit card transactions
  2. Using Flink SQL as the stream processing engine (a sketch of an equivalent Flink SQL query follows this list)
    • In this version there is one source connector (Oracle CDC) and one Python producer
    • The Oracle database contains customer information
    • The Python producer generates sample credit card transactions
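
As a rough illustration of the second version, the windowed fraud aggregation might look like the following in Flink SQL. This is a sketch under assumptions: it presumes an enriched table named transactions_enriched whose transaction_timestamp column is declared as the event-time (watermarked) attribute, with column names mirroring the ksqlDB queries shown later on this page.

-- Sketch only: table and column names are assumed, not taken from the demo's Flink statements
SELECT
  window_start,
  userid,
  SUM(amount)           AS total_credit_spend,
  MAX(avg_credit_spend) AS avg_credit_spend,
  COUNT(*)              AS num_transactions
FROM TABLE(
  TUMBLE(TABLE transactions_enriched, DESCRIPTOR(transaction_timestamp), INTERVAL '2' HOUR))
GROUP BY window_start, window_end, userid
HAVING SUM(amount) > MAX(avg_credit_spend);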

demo-database-modernization's People

Contributors

chuck-confluent, jwfbean, mkananizadeh


demo-database-modernization's Issues

RabbitMQ version wired into the terraform config is no longer available on CloudAMQP

The configuration in main.tf needs to be updated - the version wired into the config is 3.12.1, which is no longer available on CloudAMQP:

│ Error: CreateInstance failed, status: 400, message: map[errors:[map[rmq_version:Invalid RabbitMQ version, available versions are ["3.9.8", "3.9.9", "3.9.11", "3.9.13", "3.9.15", "3.9.16", "3.9.18", "3.9.19", "3.9.20", "3.9.21", "3.9.22", "3.9.23", "3.9.27", "3.10.1", "3.10.2", "3.10.4", "3.10.5", "3.10.6", "3.10.7", "3.10.8", "3.10.10", "3.10.19", "3.10.24", "3.11.5", "3.11.10", "3.11.18", "3.12.2", "3.12.4"]]]]
│
│   with cloudamqp_instance.instance,
│   on main.tf line 231, in resource "cloudamqp_instance" "instance":
│  231: resource "cloudamqp_instance" "instance" {
│
╵
After changing the value to 3.12.2, Terraform was able to create the RabbitMQ instance.

Suggested improvement to fraud logic in ksqldb

Consider this query, which captures the fraud logic of the FD_POSSIBLE_STOLEN_CARD table:

SELECT
        TIMESTAMPTOSTRING(WINDOWSTART, 'yyyy-MM-dd HH:mm:ss') AS WINDOW_START,
        T.USERID,
        T.CREDIT_CARD_NUMBER,
        T.FULL_NAME,
        T.EMAIL,
        T.TRANSACTION_TIMESTAMP,
        SUM(T.AMOUNT) AS TOTAL_CREDIT_SPEND,
        MAX(T.AVG_CREDIT_SPEND) AS AVG_CREDIT_SPEND,
        COUNT(*) AS NUM_TRANSACTIONS
    FROM fd_transactions_enriched T
    WINDOW TUMBLING (SIZE 2 HOURS)
    GROUP BY T.USERID, T.CREDIT_CARD_NUMBER, T.FULL_NAME, T.EMAIL, T.TRANSACTION_TIMESTAMP
    HAVING SUM(T.AMOUNT) > MAX(T.AVG_CREDIT_SPEND) EMIT CHANGES;

The GROUP BY clause is grouping so many different things that there really is no aggregation going on. You can verify that NUM_TRANSACTIONS is always 1.
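
A quick way to check this (assuming the table built from that query is named FD_POSSIBLE_STOLEN_CARD, as above) is a push query that should never emit a row:

-- If the GROUP BY really combined multiple transactions, this would eventually return results
SELECT * FROM FD_POSSIBLE_STOLEN_CARD
WHERE NUM_TRANSACTIONS > 1
EMIT CHANGES;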

I believe this formulation was a workaround for this error when only grouping by id:

Non-aggregate SELECT expression(s) not part of GROUP BY: CREDIT_CARD_NUMBER, FULL_NAME, EMAIL, TRANSACTION_TIMESTAMP
Either add the column(s) to the GROUP BY or remove them from the SELECT.

This comes from creating the customers table with LATEST_BY_OFFSET (see issue).
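
For context, that pattern looks roughly like the sketch below. The stream name is hypothetical, and the column names are taken from the enriched query further down, so the repo's actual statements may differ:

-- Approximate sketch of the existing pattern: derive a table from a raw customer stream
CREATE STREAM fd_cust_raw_stream WITH (
  KAFKA_TOPIC = 'ORCL.ADMIN.CUSTOMERS',
  VALUE_FORMAT = 'JSON_SR'
);

CREATE TABLE fd_customers AS
  SELECT
    id,
    LATEST_BY_OFFSET(first_name)       AS first_name,
    LATEST_BY_OFFSET(last_name)        AS last_name,
    LATEST_BY_OFFSET(email)            AS email,
    LATEST_BY_OFFSET(avg_credit_spend) AS avg_credit_spend
  FROM fd_cust_raw_stream
  GROUP BY id;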

Proposal

Instead of creating a raw customer stream and using LATEST_BY_OFFSET to derive a table, create a table directly from the underlying topic:

CREATE TABLE FD_CUSTOMERS (
  customer_id DOUBLE PRIMARY KEY
) WITH (
  KAFKA_TOPIC = 'ORCL.ADMIN.CUSTOMERS',
  KEY_FORMAT = 'JSON',
  VALUE_FORMAT = 'JSON_SR'
);

NOTE: We avoid the error "column name id already exists" by giving the key a different name, customer_id.

Then create the enriched transaction stream.

CREATE STREAM fd_transactions_enriched WITH (KAFKA_TOPIC = 'transactions_enriched') AS
  SELECT
    C.CUSTOMER_ID,
    T.CREDIT_CARD_NUMBER,
    T.AMOUNT,
    T.TRANSACTION_TIMESTAMP,
    C.FIRST_NAME + ' ' + C.LAST_NAME AS FULL_NAME,
    C.AVG_CREDIT_SPEND,
    C.EMAIL
  FROM fd_transactions T
  INNER JOIN fd_customers C
  ON T.USERID = C.CUSTOMER_ID;

Then create the fraud-detection table. Group only by the columns that actually define a group (in this case customer ID, average credit spend, and full name), and take the email with LATEST_BY_OFFSET rather than grouping on it:

CREATE TABLE fd_possible_stolen_card WITH (KAFKA_TOPIC = 'FD_possible_stolen_card', KEY_FORMAT = 'JSON', VALUE_FORMAT='JSON') AS
SELECT
  T.CUSTOMER_ID,
  LATEST_BY_OFFSET(T.TRANSACTION_TIMESTAMP) AS LATEST_TIMESTAMP,
  COUNT(*) AS NUM_TRANSACTIONS,
  T.FULL_NAME,
  LATEST_BY_OFFSET(T.EMAIL) AS EMAIL,
  SUM(T.AMOUNT) AS TOTAL_CREDIT_SPEND,
  T.AVG_CREDIT_SPEND,
  TIMESTAMPTOSTRING(WINDOWSTART, 'yyyy-MM-dd HH:mm:ss') AS WINDOW_START
FROM FD_TRANSACTIONS_ENRICHED T
WINDOW TUMBLING ( SIZE 2 HOURS ) 
GROUP BY T.CUSTOMER_ID, T.AVG_CREDIT_SPEND, T.FULL_NAME
HAVING (SUM(T.AMOUNT) > T.AVG_CREDIT_SPEND);

Notice the LATEST_BY_OFFSET(T.EMAIL) -- this means if a user changes their email address, it won't reset the aggregation.

Suggest Change in Narrative: Modernize a transactional app from relational (cloud Oracle) to NoSQL (cloud MongoDB Atlas)

Tagging @confluentinc/technical-marketing

Currently this demo is really a streaming ETL job -- extract, transform, and enrich transactional data for the purpose of an analytics service (fraud detection). We have other demos that follow this same general ETL narrative. What we don't have is a good example for making the transition from relational to NoSQL and how Confluent can help.

For background, here is an article about Single Table Design that explains the benefits of moving from a relational database to a NoSQL document database for transactional workloads where the access patterns are well defined ahead of time (i.e., not ad-hoc analysis). After changing the data model to accommodate those access patterns, the developer can move the workload to a document database such as MongoDB Atlas.

But using a different database isn't enough. That old, on-prem data is still valuable, and valuable data will continue to come from on-prem instances of the application. That's where Confluent can help. With Confluent, the customer can set up an always-running, real-time pipeline from on-prem to the cloud and transform the data to match the data model expected by MongoDB. At the end of the day, the customer has expanded their application footprint into the cloud so they can grow the business while still retaining the value of their on-prem application.

Let me know what you think!

Move Oracle to Docker to Simulate On-Prem -> Cloud Migration

It would be good to move Oracle to a local Docker container and then use either:

oracle -> CP connect only -> ccloud

or

oracle -> connect -> CP -> cluster link -> (private network?) -> ccloud

This would show a more authentic hybrid picture for database modernization.

Refactor directory structure

Thanks for putting this together, Maygol!

One suggestion I have is to refactor the directory hierarchy. Sometimes it’s intimidating to see a bunch of separate files in the root. For example, collect all the various connector JSON definitions into a “connectors” folder.
