
getdozer / dozer

Dozer is a real-time data movement tool that leverages CDC from various sources and moves data into various sinks.

Home Page: https://getdozer.io

License: GNU Affero General Public License v3.0

Shell 0.16% Rust 99.08% Dockerfile 0.09% Solidity 0.01% Python 0.02% JavaScript 0.65%
apis data rust sql realtime streaming etl low-code postgres snowflake

dozer's Issues

Dozer APIs & generation

Auth Implementation

Implementation Details

TODO

  • API Auth Middleware for REST (#109)
  • Auth Module for Generation and Validation of tokens
  • CacheReader Implementation
  • gRPC support @duonganhthu43
  • Test cases #133

Notes

Dozer API

  • Dozer APIs are read-only
  • Claims are to be verified in query and get interfaces
  • Claims will be dynamically passed along with cache query and get functions
  • Claims can be generated by calling the management API POST /auth/tokens

Dozer Admin

  • Dozer Admin optionally has access to the shared JWT_SECRET
  • Ability to generate claims
  • Authentication and User/Group management are to be implemented on top of Dozer Admin

Claim Structure

  {
    // Name of the index
    "index_name": "films",

    // FilterExpression to evaluate access
    "access_filter": { "userId": { "$eq": 1 } },

    // fields
    "fields": ["id", "name"]
  }
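
A minimal sketch of how such a claim could be represented in Rust with serde (struct and field names here are illustrative, not the actual dozer types):

use serde::{Deserialize, Serialize};
use serde_json::Value;

// Hypothetical claim payload mirroring the structure above.
#[derive(Debug, Serialize, Deserialize)]
struct ApiClaims {
    // Name of the index the token grants access to.
    index_name: String,
    // FilterExpression evaluated against every query / get call.
    access_filter: Value,
    // Subset of fields the caller is allowed to read.
    fields: Vec<String>,
}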

SEQ NO generation for sources

Draw
https://app.excalidraw.com/l/9s77uqbZtdz/v6LQ6wRDC2

Operations are stored with prefix 1. Key format - format!("{}{:0>19}", 1, seq_no). Value - OperationEvent.
Commits are stored with prefix 2. Key format - format!("{}{:0>19}", 2, connection_id). Value tuple - (seq_no, lsn).
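
A minimal sketch of how these keys could be constructed (helper names are hypothetical, not the actual dozer functions):

// Operation key: prefix 1 + zero-padded sequence number.
fn operation_key(seq_no: u64) -> String {
    format!("{}{:0>19}", 1, seq_no)
}

// Commit key: prefix 2 + zero-padded connection id; the stored value is the (seq_no, lsn) tuple.
fn commit_key(connection_id: u64) -> String {
    format!("{}{:0>19}", 2, connection_id)
}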

Storage logic (draft)

Example with two Postgres sources writing into the same LSN storage (keys/values as stored):

PG Source 1:
  BEGIN (LSN 1)
  CREATE (LSN 2)
  COMMIT (LSN 3)
  Stored: (100000000001, OperationEvent {seq_no: 1, lsn: 2, connection_id: 1})
  Stored: (200000000001, (3, 2))

PG Source 2:
  BEGIN (LSN 1)
  CREATE (LSN 2)
  CREATE (LSN 3)
  COMMIT (LSN 4)
  Stored: (100000000003, OperationEvent {seq_no: 3, lsn: 2, connection_id: 1})
  Stored: (100000000004, OperationEvent {seq_no: 4, lsn: 3, connection_id: 1})
  Stored: (200000000002, (4, 5))

Getting last LSN for source

let {lsn} = db.get(format!("{}{:0>19}", 2, id));

Getting last SEQ NO

// Prefix range over the operations table, as returned by
// storage_client.get_operations_table_read_options()
let op_table_prefix = ($lower_bound.as_bytes().to_vec())..($upper_bound.as_bytes().to_vec());

let mut ro = ReadOptions::default();
ro.set_iterate_range(op_table_prefix);

// Seek to the last stored operation and read its sequence number.
let mut seq_iterator = db.raw_iterator_opt(ro);
seq_iterator.seek_to_last();

let mut initial_value = 0;
if seq_iterator.valid() {
    initial_value = bincode::deserialize(seq_iterator.value().unwrap()).unwrap();
}

return initial_value;

Refactor of the Expression builders

The creation of the different processors of the pipeline requires parsing the expression in different ways.
For example, if there is an aggregation function in a select item, something like ROUND(AVG(Salary/5)), it is necessary to build 3 different expression executors parsing the same AST as input.
In the example we need a pre-aggregator expression Salary/5, an aggregator expression AVG(PreAggrExpr) and a post-aggregator expression ROUND(AggrExpr).
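
A rough sketch of that three-way split on a toy expression type (the enum and names are illustrative, not the actual dozer-sql AST):

// Illustrative toy expression type; not the actual dozer-sql AST.
enum Expr {
    Column(String),
    Literal(f64),
    Div(Box<Expr>, Box<Expr>),
    Avg(Box<Expr>),
    Round(Box<Expr>),
}

// ROUND(AVG(Salary / 5)) split into the three executors described above.
fn split_round_avg() -> (Expr, Expr, Expr) {
    // Pre-aggregator expression: evaluated on every input record.
    let pre_aggr = Expr::Div(
        Box::new(Expr::Column("Salary".to_string())),
        Box::new(Expr::Literal(5.0)),
    );
    // Aggregator expression: consumes the pre-aggregator result.
    let aggr = Expr::Avg(Box::new(Expr::Column("PreAggrExpr".to_string())));
    // Post-aggregator expression: applied to the aggregator output.
    let post_aggr = Expr::Round(Box::new(Expr::Column("AggrExpr".to_string())));
    (pre_aggr, aggr, post_aggr)
}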

Implement Ingestion Source

Moved this code out of dozer-orchestrator for now, as it doesn't comply with the latest changes.

#[cfg(test)]
mod tests {
    use std::{rc::Rc, sync::Arc};

    use crate::orchestration::models::{
        connection::Authentication::PostgresAuthentication,
        source::{MasterHistoryConfig, RefreshConfig},
    };
    use crate::orchestration::{
        models::{
            connection::{Authentication, Connection, DBType},
            source::{HistoryType, Source},
        },
        orchestrator::PgSource,
        sample::{SampleProcessor, SampleSink},
    };
    use dozer_core::dag::{
        channel::LocalNodeChannel,
        dag::{Dag, Endpoint, NodeType},
        mt_executor::{MemoryExecutionContext, MultiThreadedDagExecutor},
    };
    use dozer_ingestion::connectors::{postgres::connector::PostgresConfig, storage::RocksConfig};
    #[test]
    fn run_workflow() {
        let connection: Connection = Connection {
            db_type: DBType::Postgres,
            authentication: PostgresAuthentication {
                user: "postgres".to_string(),
                password: "postgres".to_string(),
                host: "localhost".to_string(),
                port: 5432,
                database: "pagila".to_string(),
            },
            name: "postgres connection".to_string(),
            id: None,
        };
        let source = Source {
            id: None,
            name: "actor_source".to_string(),
            dest_table_name: "ACTOR_SOURCE".to_string(),
            source_table_name: "actor".to_string(),
            connection,
            history_type: HistoryType::Master(MasterHistoryConfig::AppendOnly {
                unique_key_field: "actor_id".to_string(),
                open_date_field: "last_updated".to_string(),
                closed_date_field: "last_updated".to_string(),
            }),
            refresh_config: RefreshConfig::RealTime,
        };
        let storage_config = RocksConfig {
            path: "target/orchestrator-test".to_string(),
        };
        let mut sources = Vec::new();
        sources.push(source);
        let mut pg_sources = Vec::new();
        sources
            .clone()
            .iter()
            .for_each(|source| match source.connection.authentication.clone() {
                Authentication::PostgresAuthentication {
                    user,
                    password,
                    host,
                    port,
                    database,
                } => {
                    let conn_str = format!(
                        "host={} port={} user={} dbname={} password={}",
                        host, port, user, database, password,
                    );
                    let postgres_config = PostgresConfig {
                        name: source.connection.name.clone(),
                        tables: None,
                        conn_str,
                    };
                    pg_sources.push(PgSource::new(storage_config.clone(), postgres_config))
                }
            });
        let proc = SampleProcessor::new(2, None, None);
        let sink = SampleSink::new(2, None);
        let mut dag = Dag::new();
        let proc_handle = dag.add_node(NodeType::Processor(Arc::new(proc)));
        let sink_handle = dag.add_node(NodeType::Sink(Arc::new(sink)));

        pg_sources.clone().iter().for_each(|pg_source| {
            let src_handle = dag.add_node(NodeType::Source(Arc::new(pg_source.clone())));
            dag.connect(
                Endpoint::new(src_handle, None),
                Endpoint::new(proc_handle, None),
                Box::new(LocalNodeChannel::new(10000)),
            )
            .unwrap();
        });

        dag.connect(
            Endpoint::new(proc_handle, None),
            Endpoint::new(sink_handle, None),
            Box::new(LocalNodeChannel::new(10000)),
        )
        .unwrap();

        let exec = MultiThreadedDagExecutor::new(Rc::new(dag));
        let ctx = Arc::new(MemoryExecutionContext::new());
        let _res = exec.start(ctx);
        // assert!(_res.is_ok());
    }
}

Batch Insert of records

Description

Inserting records one by one can be slow for ingestion. To avoid this we can do batched inserts. Batches can be created based on several criteria:

  • time (create batch every x seconds or milliseconds)
  • number of records

Implementation of this feature should consider:

  • Batch cannot be shared across threads
  • Records order should be maintained

Current implementation

At the moment all writes are handled by storage_client. storage_client is injected into the ingestor, which listens for all messages from the connector iterator. On every retrieved message the storage_client insert methods are called.

Suggested implementation

  • Replace storage_client with a record_writer trait (see the sketch after this list). record_writer would have two injected dependencies, including storage_client.
    record_writer would have 3 methods: begin, insert, commit. begin and commit can be ignored when a developer needs to implement a single-row writer.
  • xlog_mapper would not call ingestor.handle_message. Instead, it would return the mapped message to the replicator, and the replicator can call ingestor.handle_message.
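
A minimal sketch of what the record_writer trait could look like (type names and the error type are placeholders, not the actual dozer types):

// Placeholder types standing in for the real dozer message and error types.
struct OperationEvent;
struct StorageError;

// Hypothetical record_writer trait: batch writers use begin/commit to delimit a
// batch, while a single-row writer can leave them as the default no-ops.
trait RecordWriter {
    fn begin(&mut self) -> Result<(), StorageError> {
        Ok(())
    }
    fn insert(&mut self, event: OperationEvent) -> Result<(), StorageError>;
    fn commit(&mut self) -> Result<(), StorageError> {
        Ok(())
    }
}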

Implement cache module with lookup and search capabilities

Features to be supported. This is to be built on top of LMDB.

Documents & Search

Indexing strategies to support all of the following operations (a sketch of a possible read interface follows the list).

  • Get by ID (Key lookup)
  • Get by multiple IDs
  • Filter API with conditional operators (AND, OR)
    • Full text search
    • Geo Search
    • Secondary keys
  • Sort by multiple fields
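
A rough sketch of the kind of read interface implied by this list (names and types are illustrative, not the actual dozer cache API):

// Placeholder types for illustration.
struct Record;
struct QueryExpression; // filter + order_by + limit + skip
struct CacheError;

// Hypothetical lookup/search surface for the LMDB-backed cache.
trait ReadCache {
    // Get by ID (key lookup).
    fn get(&self, key: &[u8]) -> Result<Record, CacheError>;
    // Get by multiple IDs.
    fn multi_get(&self, keys: &[&[u8]]) -> Result<Vec<Record>, CacheError>;
    // Filter with conditional operators, full text / geo / secondary indexes,
    // and multi-field sorting, all expressed in the QueryExpression.
    fn query(&self, index: &str, query: &QueryExpression) -> Result<Vec<Record>, CacheError>;
}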

Writes

  • Inserts / Updates (Replaces) / Deletes in Batch in CDC format (OperationEvent)

Security

  • Document level and property level security

Test cases

  • Various data sets to capture multiple types of use cases.

Benchmarking

  • Performance of all APIs documented

gRPC objects for Query and Filter expressions

Ability to input a Query Expression on the gRPC route: query

Example input on Postman

Query input will have a form like the REST input (MongoDB-style query), except without the $ sign.
Simple example

{
    "filter": {
        "film_id": {
            "eq": 89
        }
    },
    "limit": 90,
    "order_by": [
       
    ],
    "skip": 0
}

And expression

{
    "filter": {
        "and": [
            {
                "film_id": {"gt": 12}
            },
            {
                "film_id": {"lt": 2222}
            }
        ]
    },
    "limit": 90,
    "order_by": [
    ],
    "skip": 0
}
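
A minimal sketch of the operator mapping this implies, prefixing incoming operator names with $ before handing the filter to the cache query layer (purely illustrative; not the actual conversion code):

use serde_json::{Map, Value};

// Recursively prefix operator keys ("eq", "and", ...) with "$" so the gRPC
// input matches the MongoDB-style filter used by the cache / REST layer.
fn prefix_operators(value: Value) -> Value {
    match value {
        Value::Object(map) => {
            let mut out = Map::new();
            for (key, val) in map {
                let new_key = match key.as_str() {
                    "eq" | "lt" | "lte" | "gt" | "gte" | "and" | "or" => format!("${}", key),
                    _ => key,
                };
                out.insert(new_key, prefix_operators(val));
            }
            Value::Object(out)
        }
        Value::Array(items) => Value::Array(items.into_iter().map(prefix_operators).collect()),
        other => other,
    }
}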

Table add to postgres connector publication

Example flow when altering the publication and the SELECT are in the same transaction

Replication conn / Conn / XLog timeline:

Conn (snapshot connection):
  BEGIN READ WRITE ISOLATION LEVEL REPEATABLE READ;
  ALTER PUBLICATION dozer_publication_timesx ADD TABLE users;
  SELECT * FROM users ORDER BY id DESC;
  (assuming that it takes some time to process)

Another session (writer):
  BEGIN
  INSERT INTO users (phone, email) VALUES ('1234567', '[email protected]')
  COMMIT
  XLog: Begin(....), Commit(...)

Conn (snapshot connection):
  COMMIT
  XLog: Begin(....), Commit(...)

Another session (writer):
  BEGIN
  INSERT INTO users (phone, email) VALUES ('1234568', '[email protected]')
  COMMIT
  XLog: Begin(....), Relation(....), Insert(...), Commit(...)

Possible solutions:
https://netflixtechblog.com/dblog-a-generic-change-data-capture-framework-69351fb9099b
https://debezium.io/blog/2021/10/07/incremental-snapshots/

E2E tests

Target

Create infrastructure to run tests with different scenarios and databases

Suggested solution

Have a Docker image with the compiled dozer-orchestrator and create a container for each scenario.
Inside the Docker container, we need to execute the scenario and assert the result.

Structure example

Dockerfile
|-- Tests
     |-- {test case}
           |-- docker-compose.yml
           |-- dozer-run.yml
           |-- Other files

Things to be implemented

  • scenario script execution
  • scenario result assertion
  • integration to CI process

REST & gRPC: Streaming filter using query expression

Streaming only exists in gRPC. For each endpoint input from the orchestrator, we have 4 streaming events to subscribe to:

  • on_insert
  • on_update
  • on_delete
  • on_schema_change
  1. Each event takes a filter expression as input to narrow down the messages the client receives
  2. The filter input should have the same structure as the FilterExpression inside the QueryExpression in the cache
  3. When the event is triggered, the filter is applied to decide whether or not the server should notify the client (see the sketch below)
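
A minimal sketch of that decision step, assuming a hypothetical matches_filter helper that evaluates a FilterExpression against a record (not the actual dozer code):

use serde_json::Value;

// Placeholder types for illustration.
struct Record;
type FilterExpression = Value;

// Hypothetical predicate evaluating a FilterExpression against a record.
fn matches_filter(_filter: &FilterExpression, _record: &Record) -> bool {
    // A real implementation would reuse the cache's filter evaluation.
    true
}

// Only forward the event to the subscriber if its filter matches.
fn should_notify(filter: Option<&FilterExpression>, record: &Record) -> bool {
    match filter {
        Some(f) => matches_filter(f, record),
        None => true,
    }
}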

Implement data APIs for consuming data from the cache.

The following components are needed to expose the Dozer Data APIs.

  • Cache Module #31
  • Auto generated gRPC endpoints
    • Dynamic parameters for gRPC to support all cache features
  • Auto generated REST + JSON APIs for endpoints. (Future Release)
    • Dynamic parameters for REST to support all cache features (Future Release)
  • Frontend SDKs
    • JS SDK

Implement resume in iterator

The current implementation of sync with snapshot and replication messages does not allow the user to resume sync.

Describe the solution you'd like
The iterator should have a resume method with info about the last successful insert. The information can differ based on the database type and the process (snapshot or replication).

Support all message formats in ingestion forwarder

Is your feature request related to a problem? Please describe.
At the moment only the OperationEvent ingestion message type is supported.

Describe the solution you'd like
Implement support for the Schema, Commit and Begin types.

REST & gRPC: Auth Support for gRPC

REST already implements authentication using a JWT token.
The same mechanism should be applied to the gRPC server.

Checklist for this:

  • Implement a generate_token CLI command + API auth config input in the dozer config yaml (dozer-orchestrator) => the output will be a JWT based on the API config
  • Apply the JWT auth layer to the gRPC server (see the sketch below)
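
A rough sketch of what the gRPC JWT layer could look like using a tonic interceptor-style function and the jsonwebtoken crate (the claim struct and secret handling are assumptions, not the actual dozer implementation):

use jsonwebtoken::{decode, DecodingKey, Validation};
use serde::Deserialize;
use tonic::{Request, Status};

// Hypothetical claims; the real token would carry the access_filter etc.
#[derive(Deserialize)]
struct Claims {
    sub: String,
    exp: usize,
}

// Interceptor-style check: reject requests without a valid Bearer token.
fn check_auth(req: Request<()>, secret: &[u8]) -> Result<Request<()>, Status> {
    let header = req
        .metadata()
        .get("authorization")
        .and_then(|v| v.to_str().ok())
        .ok_or_else(|| Status::unauthenticated("missing authorization header"))?;

    let token = header
        .strip_prefix("Bearer ")
        .ok_or_else(|| Status::unauthenticated("expected Bearer token"))?;

    decode::<Claims>(token, &DecodingKey::from_secret(secret), &Validation::default())
        .map_err(|_| Status::unauthenticated("invalid token"))?;

    Ok(req)
}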

Remaining concern:
Along with API authentication, should we include dozer-admin authentication in the next milestone?

  1. User logs in to Dozer Admin using AWS Cognito
  2. A page to set up API authentication - same as the auth config input in the dozer-config yaml

Implement validations for postgres ingestion connector

In order to make the Postgres connector work correctly, we need to check several details before running it (a sketch of some of these checks follows the lists below).

  • Connection
  • User permissions
  • Replication WAL level
  • Tables available (when the user defines which tables they will use)

When the user is not providing replication details:

  • Limit of replication slots
  • Check if it is possible to create a new slot and publication

When the user is providing details:

  • Slot available
  • Last LSN comparison with slot flush LSN
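
A rough sketch of a couple of these checks using tokio_postgres (query texts and thresholds are assumptions; the real validation would use dozer's existing connection helpers):

use tokio_postgres::{Client, Error};

// Check that logical replication is enabled on the server.
async fn check_wal_level(client: &Client) -> Result<bool, Error> {
    let row = client.query_one("SHOW wal_level", &[]).await?;
    let wal_level: String = row.get(0);
    Ok(wal_level == "logical")
}

// Check whether there is room for one more replication slot.
async fn check_slot_capacity(client: &Client) -> Result<bool, Error> {
    let max: String = client
        .query_one("SHOW max_replication_slots", &[])
        .await?
        .get(0);
    let used: i64 = client
        .query_one("SELECT count(*) FROM pg_replication_slots", &[])
        .await?
        .get(0);
    Ok(used < max.parse::<i64>().unwrap_or(0))
}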

GRPC Streaming on change

A feature on gRPC that allows end users to observe changes. We provide 4 streaming routes:

  • on_insert
  • on_update
  • on_delete
  • on_schema_change

Example

service FilmService {
  rpc film(GetFilmRequest) returns (GetFilmResponse);
  rpc by_id(GetFilmByIdRequest) returns (GetFilmByIdResponse);
  rpc query(QueryFilmRequest) returns (QueryFilmResponse);
  rpc on_insert(OnInsertRequest) returns (stream OnInsertResponse);
  rpc on_update(OnUpdateRequest) returns (stream OnUpdateResponse);
  rpc on_delete(OnDeleteRequest) returns (stream OnDeleteResponse);
  rpc on_schema_change(OnSchemaChangeRequest) returns (stream OnSchemaChangeResponse);
}

Summary Implementation

  • gRPC server init with a notifier - a broadcast channel is created based on this notifier
  • Every time an event happens (Insert, Update, Delete, Schema update), the cache sink triggers this notifier => the broadcast channel on the gRPC server receives the event
  • The client invokes a gRPC stream route => i.e. subscribes to the broadcast channel + filters events based on the route => the client receives events accordingly (see the sketch below)
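
A minimal sketch of the notifier/broadcast wiring with tokio's broadcast channel (the event type and channel capacity are placeholders, not the actual dozer types):

use tokio::sync::broadcast;

// Hypothetical event emitted by the cache sink.
#[derive(Clone, Debug)]
enum CacheEvent {
    Insert,
    Update,
    Delete,
    SchemaChange,
}

fn main() {
    // Notifier created at gRPC server init; the cache sink keeps the sender,
    // each client subscription gets its own receiver.
    let (notifier, _rx) = broadcast::channel::<CacheEvent>(1024);

    // Streaming route side: a client subscribes to the broadcast channel.
    let mut subscription = notifier.subscribe();

    // Cache sink side: publish an event whenever an operation is processed.
    let _ = notifier.send(CacheEvent::Insert);

    // The subscription filters events based on the route (on_insert here).
    if let Ok(event) = subscription.try_recv() {
        if matches!(event, CacheEvent::Insert) {
            // Forward to the on_insert stream.
            println!("notify client: {:?}", event);
        }
    }
}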

Schema validation is validating not used tables

Describe the bug
Schemas of unused tables are validated

To Reproduce
Create a PostgreSQL table with unsupported column types and try to run the orchestration

Expected behaviour
Only tables used as sources should be validated

Implement plan generation

#149

Implemented strategy:

Make a sorted inverted index containing all Eq filters and one range filter / sort option.
Make a full text index for each text filter.

All permutations of the Eq filter fields are generated.

Next:

Let the handler handle multiple IndexScans in one plan.

Databricks integration

Is your feature request related to a problem? Please describe.
Allow users to use data from Databricks in ingestion.

Describe the solution you'd like
Implement a new connector based on the current connector interface.

fix: proto generator support timestamp

We have an issue when generating proto files that include a Timestamp field in the schema; by default, the generated proto should import the Google lib support for Timestamp:

import "google/protobuf/timestamp.proto";

Cache not invalidating old value if PK changes

Describe the bug
In the pipeline cache sink we are processing operations. During update operation processing, a record is updated based on the old primary key. Although the primary key can be part of the changes, the new primary key is ignored.

Expected behavior

if old_pk == new_pk {
  cache.update(old_pk, ...);
} else {
  cache.delete(old_pk, ...);
  cache.insert(new_pk, ...);
}

The relevant code path is the Operation::Update { old, new } match arm in the cache sink.

Improve postgres types handling

Is your feature request related to a problem? Please describe.
The current conversion of replication message Bytes into concrete types is not efficient.

Describe the solution you'd like
We need to explore options for implementing a custom parser of raw xlog messages, or find a more efficient way to convert Bytes.
