cashapp / pranadb Goto Github PK

View Code? Open in Web Editor NEW

615.0 13.0 23.0 3.35 MB

License: Apache License 2.0

Go 99.61% Shell 0.05% Makefile 0.15% Dockerfile 0.03% C 0.16%

pranadb's Introduction

PranaDB

PranaDB is a distributed streaming database, designed from the outset to be horizontally scalable.

Do you like relational databases? Do you like Apache Kafka? PranaDB lives at the intersection of the two.

Ingest data from Apache Kafka topics
Define continuously and incrementally updating materialized views over that data.
Use standard SQL to query that data.
Define custom processors to process that data
Stream data directly into and out of PranaDB

Think:

Like Kafka but where you can query the data in your topics
Like a relational database, but where you can get incrementally updating materialized views, and streaming queries.

Status

PranaDB is currently a work in progress and is some of the features are currently available a a tech preview. We aim to get most of the feature complete later this year.

Contributing

Please take a look at the outstanding issues, and chat with us in our Gitter community.

In order to contribute please sign the block CLA

Docs

User manual

Try the demo

Frequently asked questions

pranadb's People

Contributors

Stargazers

Watchers

pranadb's Issues

Write more SQLTests

Remaining tests TODO

pull select expressions (where clauses)
pull project expressions (from clauses)
push select expressions
push project expressions
MV fill with multiple sources (when we support joins)
order by test - test order by multiple columns - When writing this test be careful - if rows have the same order by key, order of rows returned will be undefined in a cluster - so test will give non deterministic results.
query limit test -Use special comment to set query batch size limit to small number, Ingest lots of rows, do a query to select all rows - ensures that pagination works ok
type range test - test storing max and min values for each type, negative values, decimals with different scale and precision and truncation
upper and lower case test
alias tests

errors

test various errors:

invalid source and mv names on create

dropping unknown source and mv

selecting from unknown source and mv

trying to create mv or source same name more than once

syntax errors in mv

syntax errors in pull query

test unsupported features such as sub queries and some functions - make sure it gives good error

server kill test

test killing servers and making sure cluster still works

source compound PKs

More proto tests for different types - nested enums etc

Parser/Planner

Much of the plumbing is already in place here but we need to:

Screen out currently unsupported SQL
Make sure error messages are sane

Clean up Participle AST and use it for DROP/PREPARE/EXECUTE commands

I hacked the Participle AST to support properties and topic info for create source but it could be done better / properly:

Don't use custom struct to represent ColumnSelector - string will do
Source ColumnSelectors should be comma delimited
Source ColumnSelectors should be optional and support zero elements
Source Properties should be comma delimited
Source Properties should be optional and support zero elements
DROP SOURCE/MV commands currently don't use the Participle parser, we should change it so they do
Prepared statement PREPARE and EXECUTE currently don't use the Participle parser, we should change it so they do

Reliable storage engine

Currently Prana uses a fake implementation of the Storage interface.

We will create an implementation of Storage that plugs into the Dragonboat multi-raft implementation, which in turn should use Pebble as the underlying KV store.

Metrics

Simple, non intrusive way to gather metrics from various parts of Prana. Should use a sampling approach for minimal overhead.

Allow different metrics gathering mechanisms to be plugged in (e.g. Prometheus)

Adjust common.Row setters/getters to match common.Type enumerations

Materialized view population/rebuild

When creating a MV currently we do not populate the initial MV state from the state of the sources.

As part of the MV creation process, before the MV is activated it's state must be built by replaying data from its sources through its DAG.

This can be accomplished by performing a table scan on all of it's sources and feeding the data through the MV DAG.

The process needs to be executed on every shard of the cluster.

The tricky part is how to do this without blocking the existing sources or MVs and making sure the view is completely up to date on each node before atomically activating on every node and not missing a row.

This can be accomplished as follows:

When we create the MV we first lock the source and create another table for each of the mv sources, then unlock it and when rows are inserted into the source we also insert them into this other table, with a sequence number.
At the same time we create a snapshot of the source table using the pebble API. We lock the source while this create is happening.
We then build the MV from the snapshot.
We then replay the and rebuild from the rows in the other table. When we’ve fully scanned all the rows in that table. We then lock the source, and make sure no more rows are inserted into the table, if so replay those.
Once those have been replayed, we can activate the MV and delete the temporary table

Sharder

The Sharder is responsible for determining which shard a particular key belongs to.

Eventually we will support both hash and range based approaches and we will allow sharding type to be specified on a source/materialized view level. For streaming data sources with monotonically increasing keys, hash based partitioning makes sense. For many aggregations range based partitioning can have advantages.

Range based partitioning is harder to implement as it requires consistently maintaining a binary search tree of ranges across the cluster.

For phase 1 we will support a hash based approach only. The hashing can be a trivial implementation as we do not support changing number of shards in this phase 1. We will need a consistent hashing approach in later phases.

Push engine

The push engine is used for managing the "push" flow of events from being ingested from sources to propagation to potentially multiple materialized views and reliably forwarding between different nodes.

The processing of push data through the cluster is done by multiple dags (directed acyclic graphs) of push executors that all connnect with each other.

Much of the framework is in place. It needs completing and thoroughly testing.

For phase 1 we will not support the full set of SQL for queries in materialized views, so we do not need the full set of executors. For example we do not need the join operator.

Use Participle parser for parsing DROP statements

Currently DROP statements are parsed in a hacky way using simple string operations. We should use the Participle parser for this.

Prepared Statements

For pull queries we should implement prepared statements - as we do not want to be parsing and planning the query for every invocation.

Data reaper

There should be a retention time for the data in each source or materialized view.

We will create a reaper, which will periodically scan tables and delete records past the retention time.

Cluster correctness tests

We should write a simple test framework that can be run locally and on CI that

Loading up Kafka with test data
Running scripts to create sources and materialized views
Running the load through Kafka
Concurrently executing pull queries against Prana
Starting and stopping nodes
Killing nodes
Verifying MVs have correct state after end of test run

Pebble bug in compaction

There appears to be a Pebble bug which manifests as the following in the logs:

background error: pebble: keys must be added in order: #0,DEL, #52724,SET

Can be reproduced by running basic_source SqlTest in a loop (--repeat 1000;) and waiting. Can take 15-20 mins to occur

Stack trace is:

github.com/cockroachdb/pebble/sstable.(*Writer).addPoint at writer.go:227
github.com/cockroachdb/pebble/sstable.(*Writer).Add at writer.go:217
github.com/cockroachdb/pebble.(*DB).runCompaction at compaction.go:1994
github.com/cockroachdb/pebble.(*DB).flush1 at compaction.go:969
github.com/cockroachdb/pebble.(*DB).flush.func1 at compaction.go:903
runtime/pprof.Do at runtime.go:40
github.com/cockroachdb/pebble.(*DB).flush at compaction.go:900
runtime.goexit at asm_amd64.s:1371

Async Stack Trace
github.com/cockroachdb/pebble.(*DB).maybeScheduleFlush at compaction.go:849

It's been reported on the Pebble Slack channel:

https://cockroachdb.slack.com/archives/CQVRDNE23/p1627592171026500?thread_ts=1627494326.025400&cid=CQVRDNE23

Adding this issue to track it.

SQL test framework

We will create a simple SQL test tool which can be run as part of the local test suite.

The test tool will allow us to write SQL tests in the form of:

A script with a list of commands to be executed against Prana - just like commands that can be typed into the CLI

create source customers as ....
-- add some data to customers - we can have a special command for this
create materialized view cust_heights_by_location as select location, avg(height) from customers group by location
-- now execute a pull query
select * from cust_heights_by_location

Some text showing expected results:
create source customers as ....
[Source created}
create materialized view cust_heights_by_location as select location, avg(height) from customers group by location
[Materialized view created]
select * from cust_heights_by_location

london, 1.54
los angeles, 1.58
[2 rows returned]

Actual results are compared against expected results to see if test passes.

This will make it much easier to test the very many combinations of SQL queries

Ingest/egest sstables for Dragon snapshots

We should investigate using SSTable ingest/egest to implement snapshot save/recover for our dragon state machines.

Integration tests

We should create a simple test framework which can start up a bunch of Prana nodes on the same machine and tests the multi-node behaviour, including failures. These should be runnable as standard go tests

Authn/Authz

For phase 1 we should implement mTLS certificate based auth for Prana for the gRPC API.

Command line client

A simple command line client that uses the gRPC API to execute DDL and DML against Prana

Sources

Implement sources for ingesting data from Kafka.

Much of the plumbing is already in place for this.

One instance of a particular source will be deployed on each node of the cluster. The source instance will maintain a set of Kafka consumers (can be just one for phase 1). As a batch of records is fetched from the consumer it should be passed to the source.ingestRows method for processing. If the batch is processed without error the consumer offsets for the batch should be committed.

If a source has a key already that should be used for the key in the source table. If not, then the user can optionally specify a key from an existing field (e.g. timestamp). A monotonically increasing value can also be generated for the key if needed.

Protobuf field extraction function

We will create a SQL function to extract field information from Protobufs, similar to

https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/scalar-functions/#extractjsonfield

We can consider allowing a simple SQL query (without joins or aggregations) to be specified in source configuration to allow the fields to be extracted at the source level without having to create a materialized view to do this.

Split up common package

It's getting too cluttered.
In particular we should look at splitting out the encoding suff to other packages.

Forwarder optimisation - don't read forwarded rows from storage

Queued rows can be passed in memory so transferData doesn't need to load them from storage

Add VARBINARY support

Support VARBINARY column type - we need this for storing arbitrary []byte

Prana main and executable

We need a main() so we can actually start a Prana server.

This will involve parsing command line args and starting/stopping the Prana server.

We need the build to output an executable, and any required scripts to run the executable.

We need to consider whether we want to configure prana with command line args only or using a config file. If the latter we need to define the config file and settings

Data types

We will initially need the following data types:

TINYINT, INT, BIGINT, DECIMAL, VARCHAR, TIMESTAMP

Some are already implemented, will need to implement DECIMAL and TIMESTAMP - suggest wrapping the TiDB types for now.

Fix natural ordering in tables

The natural ordering of rows in tables is currently broken.

This is because we are encoding keys in little endian order, which means key comparisons (which are big endian) do not work, so iterating tables with some keys brings them back in an unexpected order.

We should write tests for all key types, and compound keys and ensure ordering is correct.

Pull engine

The pull engine is responsible for executing pull queries.

Much of the infrastructure for this is already in place. For phase 1 we will need a limited subset of SQL functionality including:

Query by secondary key (e.g. select * from customer_transactions where customer_id=xyz and month=3 and year=2021)
Point lookup (stretch)
Order by
Select
Projection

Metadata controller

This has the responsibility for managing and persisting the metadata for the database - i.e. the metadata on the schemas, and the materialized views, sources and sinks that they contain.

Each node in the cluster needs to have a consistent view of this information. As mvs/sources/sinks are created or dropped on any particular node, the changes need to be visible in the metadata controller instances on all nodes.

We can reserve a special "system table" with a known id for storing schema metadata then we can just use the standard mechanisms for upserting data in that table and excuting pull queries internally to load the data on startup - i.e. select * from system_meta_table, the process the results to reconstruct the in memory state.

As mvs/sources/sinks are created or dropped on a node, we persist the changes using the Storage API then we need to broadcast this to each node so they are aware. We will provide a notification mechanism via the Cluster API for this.

Remove INT type?

All int types are internally stored as int64.
We don't have a requirement to support every SQL type and do everything that MySQL can do (i.e. we are not trying to be a drop in replacement for MySQL like TiDB), so we don't need to support every SQL type - we are a subset.
However we are still a database, and we don't want to create yet another "SQL-like" language (This is one of the biggest issues with ksqlDB amongst others imo) - one of the key design goals of Prana is to be immediately familiar to anyone who has used SQL before.
Considering that's ok to be a subset, should we drop the INT type?

Varbinary type

For when we just want to store arbitrary bytes in columns from incoming events

Implement UNION ALL push executor

We need a push executor that implements UNION ALL.

UNION ALL takes input from multiple push executors and adds all the received rows into the output rows. Unlike UNION it does not detect duplicates. (For UNION we will use an aggregation plus a UNION ALL)

Need to make the column types of the inputs matches the column types of the inputs - there must be the same number of columns and the column types must be the same or castable/easily convertible into the output column type.

MySQL wire server

One shot optimisation for pull queries

When executing non prepared pull queries, and the query is complete on the remote node in the first call, there is no need to store current query in a remote session and no need to create the session at all if it doesn't exist.

Command parser

We need a simple parser which takes input from the gRPC API or console in the form of prana commands (i.e. the kinds of things you would type into the CLI), e.g.

CREATE SOURCE BLAH ....
CREATE MATERIALIZED VIEW ...
SELECT * FROM SOME_TABLE

and then parses them so it can call the appropriate methods on the command executor.

We already have a sql parser for parsing queries we don't have anything to parse the other commands.

An initial first step would be just to write a "poor man's parser" and do some basic string operations to parse out the create source and create materialized view commands - these have a very simple structure, and probably don't need a "real" parser compiled from a grammar (unless that's easy to do!).

Get server main working

The current server main, just uses test config. It needs to parse config from a config file. We should also provide default cluster configs for easy setup.

Do not fanout when executing metadata query

In the load of the metacontroller, a query is used to select the data. Metadata always lives in shard 1000 so there is no need to fanout to all shards in RemoteExecutor when executing this query.

Implement a special case and only call to shard 1000

Implement notifications server/client

Implement a simple notifications system which does not abuse a raft group.

Pull/push executors should return column names in addition to type information

Return ColumnInfo instead of ColumnType.

Use []Services for Server Stop()

... like we do in Server.Start()

Prana deployment on cloud

Setting up whatever is necessary (new components?) to deploy Prana on AWS

gRPC API

We will create an externally facing gRPC API for Prana that can be used for DDL operations (e.g. creating/dropping mvs, sources etc) and also for executing pull and push queries.

Session close broadcast optimisation

When broadcast session is closed there is no need to broadcast a message for each shard - we can broadcast just one message for the session, and on the receiving side it can remove all sessions for all shards.

Secondary indexes

For efficient secondary key lookups in pull queries we need to implement indexes.

This will involve a new create index command, and a new Index pull executor.

Implement push aggregation [stretch]

This is already partially completed. Needs to be completed and fully tested.

Cluster implementation

Prana currently has a fake implementation of the Cluster interface.

We will create a real implementation.

The Cluster implementation has the job of maintaining cluster wide topology information, such as which nodes are the leaders and followers for each shard.

It also has the job of maintaining cluster wide counters, e.g. for generating table IDs.

To implement these, we will create custom Raft groups using Dragonboat to maintain consistent state machines across the cluster.

The cluster implementation also has the responsibility for notifying the rest of the code via a callback when leader state changes. This can be implemented by plugging into the Dragonboat implementation.

It is also the interface for direct cluster communication between nodes, e.g. for executing pull queries. The interface hides the actual communication mechanism from the rest of the code. The mechanism might be gRPC, direct sockets or some other mechanism.

For phase 1 the cluster will have a fixed number of nodes and shards. Functionality for scaling up / scaling down a cluster will come in later phases.

Error messages

Go through every possible user error, including:

Errors reported at the CLI or from gRPC API, e.g. SQL syntax errors, create source, mv errors etc
Errors due to bad server configuration
Cluster errors
Errors in ingesting messages at sources, due to unable to decode messages, bad source configuration etc.

Make sure all these errors are clear and easy to understand and have max information, and errorcode associated with them

cashapp / pranadb Goto Github PK

pranadb's Introduction

PranaDB

Status

Contributing

Docs

pranadb's People

Contributors

Stargazers

Watchers

Forkers

pranadb's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs