
getdozer / dozer

Dozer is a real-time data movement tool that leverages CDC from various sources and moves data into various sinks.

Home Page: https://getdozer.io

License: GNU Affero General Public License v3.0

Shell 0.16% Rust 99.08% Dockerfile 0.09% Solidity 0.01% Python 0.02% JavaScript 0.65%
apis data rust sql realtime streaming etl low-code postgres snowflake

dozer's Issues

Dozer APIs & generation

Auth Implementation

Implementation Details

TODO

  • API Auth Middleware for REST (#109)
  • Auth Module for Generation and Validation of tokens
  • CacheReader Implementation
  • gRPC support @duonganhthu43
  • Test cases #133

Notes

Dozer API

  • Dozer APIs are read-only
  • Claims are to be verified in query and get interfaces
  • Claims will be dynamically passed along with cache query and get functions
  • Claims can be generated by calling the management API POST /auth/tokens

Dozer Admin

  • Dozer Admin optionally has access to the shared JWT_SECRET
  • Ability to generate claims
  • Authentication and User/Group management are to be implemented on top of Dozer Admin

Claim Structure

  {
    // Name of the index
    "index_name": "films",

    // FilterExpression to evaluate access
    "access_filter": { "userId": { "$eq": 1 } },

    // fields
    "fields": ["id", "name"]
  }
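
A minimal sketch of how such a claim could be represented in Rust with serde (struct and field names here are illustrative, not the actual dozer types):

use serde::{Deserialize, Serialize};
use serde_json::Value;

// Hypothetical claim payload mirroring the structure above.
#[derive(Debug, Serialize, Deserialize)]
struct ApiClaims {
    // Name of the index the token grants access to.
    index_name: String,
    // FilterExpression evaluated against every query / get call.
    access_filter: Value,
    // Subset of fields the caller is allowed to read.
    fields: Vec<String>,
}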

SEQ NO generation for sources

Draw
https://app.excalidraw.com/l/9s77uqbZtdz/v6LQ6wRDC2

Operations are stored with prefix 1. Key format - format!("{}{:0>19}", 1, seq_no). Value - OperationEvent.
Commits are stored with prefix 2. Key format - format!("{}{:0>19}", 2, connection_id). Value tuple - (seq_no, lsn).
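
A minimal sketch of how these keys could be constructed (helper names are hypothetical, not the actual dozer functions):

// Operation key: prefix 1 + zero-padded sequence number.
fn operation_key(seq_no: u64) -> String {
    format!("{}{:0>19}", 1, seq_no)
}

// Commit key: prefix 2 + zero-padded connection id; the stored value is the (seq_no, lsn) tuple.
fn commit_key(connection_id: u64) -> String {
    format!("{}{:0>19}", 2, connection_id)
}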

Storage logic (draft)

Example with two Postgres sources writing into the same LSN storage (keys/values as stored):

PG Source 1:
  BEGIN (LSN 1)
  CREATE (LSN 2)
  COMMIT (LSN 3)
  Stored: (100000000001, OperationEvent {seq_no: 1, lsn: 2, connection_id: 1})
  Stored: (200000000001, (3, 2))

PG Source 2:
  BEGIN (LSN 1)
  CREATE (LSN 2)
  CREATE (LSN 3)
  COMMIT (LSN 4)
  Stored: (100000000003, OperationEvent {seq_no: 3, lsn: 2, connection_id: 1})
  Stored: (100000000004, OperationEvent {seq_no: 4, lsn: 3, connection_id: 1})
  Stored: (200000000002, (4, 5))

Getting last LSN for source

let {lsn} = db.get(format!("{}{:0>19}", 2, id));

Getting last SEQ NO

// Prefix range over the operations table, as returned by
// storage_client.get_operations_table_read_options()
let op_table_prefix = ($lower_bound.as_bytes().to_vec())..($upper_bound.as_bytes().to_vec());

let mut ro = ReadOptions::default();
ro.set_iterate_range(op_table_prefix);

// Seek to the last stored operation and read its sequence number.
let mut seq_iterator = db.raw_iterator_opt(ro);
seq_iterator.seek_to_last();

let mut initial_value = 0;
if seq_iterator.valid() {
    initial_value = bincode::deserialize(seq_iterator.value().unwrap()).unwrap();
}

return initial_value;

Refactor of the Expression builders

The creation of the different processors of the pipeline requires parsing the expression in different ways.
For example, if there is an aggregation function in a select item, something like ROUND(AVG(Salary/5)), it is necessary to build 3 different expression executors parsing the same AST as input.
In the example we need a pre-aggregator expression Salary/5, an aggregator expression AVG(PreAggrExpr) and a post-aggregator expression ROUND(AggrExpr).
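
A rough sketch of that three-way split on a toy expression type (the enum and names are illustrative, not the actual dozer-sql AST):

// Illustrative toy expression type; not the actual dozer-sql AST.
enum Expr {
    Column(String),
    Literal(f64),
    Div(Box<Expr>, Box<Expr>),
    Avg(Box<Expr>),
    Round(Box<Expr>),
}

// ROUND(AVG(Salary / 5)) split into the three executors described above.
fn split_round_avg() -> (Expr, Expr, Expr) {
    // Pre-aggregator expression: evaluated on every input record.
    let pre_aggr = Expr::Div(
        Box::new(Expr::Column("Salary".to_string())),
        Box::new(Expr::Literal(5.0)),
    );
    // Aggregator expression: consumes the pre-aggregator result.
    let aggr = Expr::Avg(Box::new(Expr::Column("PreAggrExpr".to_string())));
    // Post-aggregator expression: applied to the aggregator output.
    let post_aggr = Expr::Round(Box::new(Expr::Column("AggrExpr".to_string())));
    (pre_aggr, aggr, post_aggr)
}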

Implement Ingestion Source

Moved this code out of dozer-orchestrator for now, as it doesn't comply with the latest changes.

#[cfg(test)]
mod tests {
    use std::{rc::Rc, sync::Arc};

    use crate::orchestration::models::{
        connection::Authentication::PostgresAuthentication,
        source::{MasterHistoryConfig, RefreshConfig},
    };
    use crate::orchestration::{
        models::{
            connection::{Authentication, Connection, DBType},
            source::{HistoryType, Source},
        },
        orchestrator::PgSource,
        sample::{SampleProcessor, SampleSink},
    };
    use dozer_core::dag::{
        channel::LocalNodeChannel,
        dag::{Dag, Endpoint, NodeType},
        mt_executor::{MemoryExecutionContext, MultiThreadedDagExecutor},
    };
    use dozer_ingestion::connectors::{postgres::connector::PostgresConfig, storage::RocksConfig};
    #[test]
    fn run_workflow() {
        let connection: Connection = Connection {
            db_type: DBType::Postgres,
            authentication: PostgresAuthentication {
                user: "postgres".to_string(),
                password: "postgres".to_string(),
                host: "localhost".to_string(),
                port: 5432,
                database: "pagila".to_string(),
            },
            name: "postgres connection".to_string(),
            id: None,
        };
        let source = Source {
            id: None,
            name: "actor_source".to_string(),
            dest_table_name: "ACTOR_SOURCE".to_string(),
            source_table_name: "actor".to_string(),
            connection,
            history_type: HistoryType::Master(MasterHistoryConfig::AppendOnly {
                unique_key_field: "actor_id".to_string(),
                open_date_field: "last_updated".to_string(),
                closed_date_field: "last_updated".to_string(),
            }),
            refresh_config: RefreshConfig::RealTime,
        };
        let storage_config = RocksConfig {
            path: "target/orchestrator-test".to_string(),
        };
        let mut sources = Vec::new();
        sources.push(source);
        let mut pg_sources = Vec::new();
        sources
            .clone()
            .iter()
            .for_each(|source| match source.connection.authentication.clone() {
                Authentication::PostgresAuthentication {
                    user,
                    password,
                    host,
                    port,
                    database,
                } => {
                    let conn_str = format!(
                        "host={} port={} user={} dbname={} password={}",
                        host, port, user, database, password,
                    );
                    let postgres_config = PostgresConfig {
                        name: source.connection.name.clone(),
                        tables: None,
                        conn_str,
                    };
                    pg_sources.push(PgSource::new(storage_config.clone(), postgres_config))
                }
            });
        let proc = SampleProcessor::new(2, None, None);
        let sink = SampleSink::new(2, None);
        let mut dag = Dag::new();
        let proc_handle = dag.add_node(NodeType::Processor(Arc::new(proc)));
        let sink_handle = dag.add_node(NodeType::Sink(Arc::new(sink)));

        pg_sources.clone().iter().for_each(|pg_source| {
            let src_handle = dag.add_node(NodeType::Source(Arc::new(pg_source.clone())));
            dag.connect(
                Endpoint::new(src_handle, None),
                Endpoint::new(proc_handle, None),
                Box::new(LocalNodeChannel::new(10000)),
            )
            .unwrap();
        });

        dag.connect(
            Endpoint::new(proc_handle, None),
            Endpoint::new(sink_handle, None),
            Box::new(LocalNodeChannel::new(10000)),
        )
        .unwrap();

        let exec = MultiThreadedDagExecutor::new(Rc::new(dag));
        let ctx = Arc::new(MemoryExecutionContext::new());
        let _res = exec.start(ctx);
        // assert!(_res.is_ok());
    }
}

Batch Insert of records

Description

Inserting records one by one can be slow for ingestion. To avoid this we can do batched inserts. Batches can be created based on several criteria:

  • time (create batch every x seconds or milliseconds)
  • number of records

Implementation of this feature should consider:

  • Batch cannot be shared across threads
  • Records order should be maintained

Current implementation

At the moment all writes are handled by storage_client. storage_client is injected into the ingestor, which listens for all messages from the connector iterator. On every retrieved message the storage_client insert methods are called.

Suggested implementation

  • Replace storage_client with a record_writer trait (see the sketch after this list). record_writer would have two injected dependencies, including storage_client.
    record_writer would have 3 methods: begin, insert, commit. begin and commit can be ignored when a developer needs to implement a single-row writer.
  • xlog_mapper would not call ingestor.handle_message. Instead, it would return the mapped message to the replicator, and the replicator can call ingestor.handle_message.
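
A minimal sketch of what the record_writer trait could look like (type names and the error type are placeholders, not the actual dozer types):

// Placeholder types standing in for the real dozer message and error types.
struct OperationEvent;
struct StorageError;

// Hypothetical record_writer trait: batch writers use begin/commit to delimit a
// batch, while a single-row writer can leave them as the default no-ops.
trait RecordWriter {
    fn begin(&mut self) -> Result<(), StorageError> {
        Ok(())
    }
    fn insert(&mut self, event: OperationEvent) -> Result<(), StorageError>;
    fn commit(&mut self) -> Result<(), StorageError> {
        Ok(())
    }
}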

Implement cache module with lookup and search capabilities

Features to be supported. This is to be built on top of LMDB.

Documents & Search

Indexing strategies to support all of the following operations (a sketch of a possible read interface follows the list).

  • Get by ID (Key lookup)
  • Get by multiple IDs
  • Filter API with conditional operators (AND, OR)
    • Full text search
    • Geo Search
    • Secondary keys
  • Sort by multiple fields
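
A rough sketch of the kind of read interface implied by this list (names and types are illustrative, not the actual dozer cache API):

// Placeholder types for illustration.
struct Record;
struct QueryExpression; // filter + order_by + limit + skip
struct CacheError;

// Hypothetical lookup/search surface for the LMDB-backed cache.
trait ReadCache {
    // Get by ID (key lookup).
    fn get(&self, key: &[u8]) -> Result<Record, CacheError>;
    // Get by multiple IDs.
    fn multi_get(&self, keys: &[&[u8]]) -> Result<Vec<Record>, CacheError>;
    // Filter with conditional operators, full text / geo / secondary indexes,
    // and multi-field sorting, all expressed in the QueryExpression.
    fn query(&self, index: &str, query: &QueryExpression) -> Result<Vec<Record>, CacheError>;
}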

Writes

  • Inserts / Updates (Replaces) / Deletes in Batch in CDC format (OperationEvent)

Security

  • Document level and property level security

Test cases

  • Various data sets to capture multiple types of use cases.

Benchmarking

  • Performance of all APIs documented

gRPC objects for Query and Filter expressions

Ability to input a Query Expression on the gRPC route: query

Example input on Postman

Query input will have a form like the REST input (MongoDB-style query), except without the $ sign.
Simple example

{
    "filter": {
        "film_id": {
            "eq": 89
        }
    },
    "limit": 90,
    "order_by": [
       
    ],
    "skip": 0
}

And expression

{
    "filter": {
        "and": [
            {
                "film_id": {"gt": 12}
            },
            {
                "film_id": {"lt": 2222}
            }
        ]
    },
    "limit": 90,
    "order_by": [
    ],
    "skip": 0
}
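
A minimal sketch of the operator mapping this implies, prefixing incoming operator names with $ before handing the filter to the cache query layer (purely illustrative; not the actual conversion code):

use serde_json::{Map, Value};

// Recursively prefix operator keys ("eq", "and", ...) with "$" so the gRPC
// input matches the MongoDB-style filter used by the cache / REST layer.
fn prefix_operators(value: Value) -> Value {
    match value {
        Value::Object(map) => {
            let mut out = Map::new();
            for (key, val) in map {
                let new_key = match key.as_str() {
                    "eq" | "lt" | "lte" | "gt" | "gte" | "and" | "or" => format!("${}", key),
                    _ => key,
                };
                out.insert(new_key, prefix_operators(val));
            }
            Value::Object(out)
        }
        Value::Array(items) => Value::Array(items.into_iter().map(prefix_operators).collect()),
        other => other,
    }
}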

Table add to postgres connector publication

Example flow when altering the publication and the SELECT are in the same transaction

Replication conn / Conn / XLog timeline:

Conn (snapshot connection):
  BEGIN READ WRITE ISOLATION LEVEL REPEATABLE READ;
  ALTER PUBLICATION dozer_publication_timesx ADD TABLE users;
  SELECT * FROM users ORDER BY id DESC;
  (assuming that it takes some time to process)

Another session (writer):
  BEGIN
  INSERT INTO users (phone, email) VALUES ('1234567', '[email protected]')
  COMMIT
  XLog: Begin(....), Commit(...)

Conn (snapshot connection):
  COMMIT
  XLog: Begin(....), Commit(...)

Another session (writer):
  BEGIN
  INSERT INTO users (phone, email) VALUES ('1234568', '[email protected]')
  COMMIT
  XLog: Begin(....), Relation(....), Insert(...), Commit(...)

Possible solutions:
https://netflixtechblog.com/dblog-a-generic-change-data-capture-framework-69351fb9099b
https://debezium.io/blog/2021/10/07/incremental-snapshots/

E2E tests

Target

Create infrastructure to run tests with different scenarios and databases

Suggested solution

Have a Docker image with the compiled dozer-orchestrator and create a container for each scenario.
Inside the Docker container, we need to execute the scenario and assert the result.

Structure example

Dockerfile
|-- Tests
     |-- {test case}
           |-- docker-compose.yml
           |-- dozer-run.yml
           |-- Other files

Things to be implemented

  • scenario script execution
  • scenario result assertion
  • integration to CI process

REST & gRPC: Streaming filter using query expression

Streaming only exists in gRPC. For each endpoint input from the orchestrator, we have 4 streaming events to subscribe to:

  • on_insert
  • on_update
  • on_delete
  • on_schema_change
  1. Each event takes a filter expression as input to narrow down the messages the client receives
  2. The filter input should have the same structure as the FilterExpression inside the QueryExpression in the cache
  3. When the event is triggered, the filter is applied to decide whether or not the server should notify the client (see the sketch below)
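
A minimal sketch of that decision step, assuming a hypothetical matches_filter helper that evaluates a FilterExpression against a record (not the actual dozer code):

use serde_json::Value;

// Placeholder types for illustration.
struct Record;
type FilterExpression = Value;

// Hypothetical predicate evaluating a FilterExpression against a record.
fn matches_filter(_filter: &FilterExpression, _record: &Record) -> bool {
    // A real implementation would reuse the cache's filter evaluation.
    true
}

// Only forward the event to the subscriber if its filter matches.
fn should_notify(filter: Option<&FilterExpression>, record: &Record) -> bool {
    match filter {
        Some(f) => matches_filter(f, record),
        None => true,
    }
}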

Implement data APIs for consuming data from the cache.

The following components are needed to expose the Dozer Data APIs.

  • Cache Module #31
  • Auto generated gRPC endpoints
    • Dynamic parameters for gRPC to support all cache features
  • Auto generated REST + JSON APIs for endpoints. (Future Release)
    • Dynamic parameters for REST to support all cache features (Future Release)
  • Frontend SDKs
    • JS SDK

Implement resume in iterator

The current implementation of sync with snapshot and replication messages does not allow the user to resume sync.

Describe the solution you'd like
The iterator should have a resume method with info about the last successful insert. The information can differ based on the database type and the process (snapshot or replication).

Support all message formats in ingestion forwarder

Is your feature request related to a problem? Please describe.
At the moment only the OperationEvent ingestion message type is supported.

Describe the solution you'd like
Implement support for the Schema, Commit and Begin types.

REST & gRPC: Auth Support for gRPC

REST already implements authentication using a JWT token.
The same mechanism should be applied to the gRPC server.

Checklist for this:

  • Implement a generate_token CLI command + API auth config input in the dozer config yaml (dozer-orchestrator) => the output will be a JWT based on the API config
  • Apply the JWT auth layer to the gRPC server (see the sketch below)
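
A rough sketch of what the gRPC JWT layer could look like using a tonic interceptor-style function and the jsonwebtoken crate (the claim struct and secret handling are assumptions, not the actual dozer implementation):

use jsonwebtoken::{decode, DecodingKey, Validation};
use serde::Deserialize;
use tonic::{Request, Status};

// Hypothetical claims; the real token would carry the access_filter etc.
#[derive(Deserialize)]
struct Claims {
    sub: String,
    exp: usize,
}

// Interceptor-style check: reject requests without a valid Bearer token.
fn check_auth(req: Request<()>, secret: &[u8]) -> Result<Request<()>, Status> {
    let header = req
        .metadata()
        .get("authorization")
        .and_then(|v| v.to_str().ok())
        .ok_or_else(|| Status::unauthenticated("missing authorization header"))?;

    let token = header
        .strip_prefix("Bearer ")
        .ok_or_else(|| Status::unauthenticated("expected Bearer token"))?;

    decode::<Claims>(token, &DecodingKey::from_secret(secret), &Validation::default())
        .map_err(|_| Status::unauthenticated("invalid token"))?;

    Ok(req)
}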

Remaining concern:
Along with API authentication, should we include dozer-admin authentication in the next milestone?

  1. User logs in to Dozer Admin using AWS Cognito
  2. A page to set up API authentication - same as the auth config input in the dozer-config yaml

Implement validations for postgres ingestion connector

In order to make the Postgres connector work correctly, we need to check several details before running it (a sketch of some of these checks follows the lists below).

  • Connection
  • User permissions
  • Replication WAL level
  • Tables available (when the user defines which tables they will use)

When the user is not providing replication details:

  • Limit of replication slots
  • Check if it is possible to create a new slot and publication

When the user is providing details:

  • Slot available
  • Last LSN comparison with slot flush LSN
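
A rough sketch of a couple of these checks using tokio_postgres (query texts and thresholds are assumptions; the real validation would use dozer's existing connection helpers):

use tokio_postgres::{Client, Error};

// Check that logical replication is enabled on the server.
async fn check_wal_level(client: &Client) -> Result<bool, Error> {
    let row = client.query_one("SHOW wal_level", &[]).await?;
    let wal_level: String = row.get(0);
    Ok(wal_level == "logical")
}

// Check whether there is room for one more replication slot.
async fn check_slot_capacity(client: &Client) -> Result<bool, Error> {
    let max: String = client
        .query_one("SHOW max_replication_slots", &[])
        .await?
        .get(0);
    let used: i64 = client
        .query_one("SELECT count(*) FROM pg_replication_slots", &[])
        .await?
        .get(0);
    Ok(used < max.parse::<i64>().unwrap_or(0))
}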

GRPC Streaming on change

A feature on gRPC that allows end users to observe changes. We provide 4 streaming routes:

  • on_insert
  • on_update
  • on_delete
  • on_schema_change

Example

service FilmService {
  rpc film(GetFilmRequest) returns (GetFilmResponse);
  rpc by_id(GetFilmByIdRequest) returns (GetFilmByIdResponse);
  rpc query(QueryFilmRequest) returns (QueryFilmResponse);
  rpc on_insert(OnInsertRequest) returns (stream OnInsertResponse);
  rpc on_update(OnUpdateRequest) returns (stream OnUpdateResponse);
  rpc on_delete(OnDeleteRequest) returns (stream OnDeleteResponse);
  rpc on_schema_change(OnSchemaChangeRequest) returns (stream OnSchemaChangeResponse);
}

Summary Implementation

  • gRPC server init with a notifier - a broadcast channel is created based on this notifier
  • Every time an event happens (Insert, Update, Delete, Schema update), the cache sink triggers this notifier => the broadcast channel on the gRPC server receives the event
  • The client invokes a gRPC stream route => i.e. subscribes to the broadcast channel + filters events based on the route => the client receives events accordingly (see the sketch below)
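
A minimal sketch of the notifier/broadcast wiring with tokio's broadcast channel (the event type and channel capacity are placeholders, not the actual dozer types):

use tokio::sync::broadcast;

// Hypothetical event emitted by the cache sink.
#[derive(Clone, Debug)]
enum CacheEvent {
    Insert,
    Update,
    Delete,
    SchemaChange,
}

fn main() {
    // Notifier created at gRPC server init; the cache sink keeps the sender,
    // each client subscription gets its own receiver.
    let (notifier, _rx) = broadcast::channel::<CacheEvent>(1024);

    // Streaming route side: a client subscribes to the broadcast channel.
    let mut subscription = notifier.subscribe();

    // Cache sink side: publish an event whenever an operation is processed.
    let _ = notifier.send(CacheEvent::Insert);

    // The subscription filters events based on the route (on_insert here).
    if let Ok(event) = subscription.try_recv() {
        if matches!(event, CacheEvent::Insert) {
            // Forward to the on_insert stream.
            println!("notify client: {:?}", event);
        }
    }
}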

Schema validation is validating not used tables

Describe the bug
Schemas of unused tables are validated

To Reproduce
Create a PostgreSQL table with unsupported column types and try to run the orchestration

Expected behaviour
Only tables used as sources should be validated

Implement plan generation

#149

Implemented strategy:

Make a sorted inverted index containing all Eq filters and one range filter / sort option.
Make a full text index for each text filter.

All permutations of the Eq filter fields are generated.

Next:

Let the handler handle multiple IndexScans in one plan.

Databricks integration

Is your feature request related to a problem? Please describe.
Allow users to use data from Databricks in ingestion.

Describe the solution you'd like
Implement a new connector based on the current connector interface.

fix: proto generator support timestamp

We have an issue when generating proto files that include a Timestamp field in the schema; by default, the generated proto should import the Google lib support for Timestamp:

import "google/protobuf/timestamp.proto";

Cache not invalidating old value if PK changes

Describe the bug
In the pipeline cache sink we are processing operations. During update operation processing, a record is updated based on the old primary key. Although the primary key can be part of the changes, the new primary key is ignored.

Expected behavior

if old_pk == new_pk {
  cache.update(old_pk, ...);
} else {
  cache.delete(old_pk, ...);
  cache.insert(new_pk, ...);
}

The relevant code path is the Operation::Update { old, new } match arm in the cache sink.

Improve postgres types handling

Is your feature request related to a problem? Please describe.
The current conversion of replication message Bytes into concrete types is not efficient.

Describe the solution you'd like
We need to explore options for implementing a custom parser of raw xlog messages, or find a more efficient way to convert Bytes.
