Home Page: https://blog.danthegoodman.com/icedb-v2


IceDB

IceDB is an in-process Parquet merge engine for better data warehousing in S3, using only S3.

No Spark, no JVM, no data warehouse experience required.

IceDB runs stateless, and stores data in easily readable formats to allow any language or framework to parse the log (jsonl) and read the data (parquet), making it far easier to insert, migrate, query, run, and scale than alternatives.

It's queryable by anything that understands parquet, and runs 54x cheaper than managed solutions such as BigQuery, Snowflake, and Athena.

IceDB tracks table schemas as standard SQL types, supports dynamic schema evolution, and divides your data into tables and partitions.

IceDB merges parquet files and manages tombstone cleanup to optimize your data storage for faster queries, very similar to what systems like ClickHouse do under the hood, except IceDB does it while remaining effectively stateless: all state and storage live in S3. This makes it extremely easy to run and scale. When you need concurrent mutations at the table level, you can introduce coordination through exclusive locks. The IceDB log format also uses the widely understood newline-delimited JSON format, making it trivial to read from any language.

It retains many of the features of modern OLAP systems (such as ClickHouse’s Materialized Views), adds some new ones, and makes it way easier to build scalable data systems with a focus on true multi-tenancy.

IceDB can replace systems like BigQuery, Athena, and Snowflake, but with clever data design can also replace provisioned solutions such as a ClickHouse cluster, Redshift, and more.

Query engines such as DuckDB, ClickHouse, CHDB, Datafusion, Pandas, or custom parquet readers in any language can easily read IceDB data in hundreds of milliseconds, and even faster when combined with the IceDB S3 Proxy for transparent queries (the client just thinks it's S3) like with the ClickHouse S3 function s3('https://icedb-s3-proxy/**/*.parquet') or DuckDB's read_parquet('s3://icedb-s3-proxy/**/*.parquet').
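For example, a DuckDB client pointed at the proxy needs nothing IceDB-specific. This is a sketch, not a verified configuration: the endpoint, credentials, and virtual bucket below are placeholders for your own deployment.

import duckdb

ddb = duckdb.connect(":memory:")
ddb.execute("install httpfs")
ddb.execute("load httpfs")
ddb.execute("SET s3_endpoint='localhost:8080'")    # wherever your IceDB S3 Proxy listens
ddb.execute("SET s3_url_style='path'")
ddb.execute("SET s3_use_ssl='false'")
ddb.execute("SET s3_access_key_id='user'")         # whatever credentials your proxy expects
ddb.execute("SET s3_secret_access_key='password'")
# the proxy resolves the virtual bucket and glob to the caller's alive parquet files
print(ddb.sql("select count(*) from read_parquet('s3://icedb-s3-proxy/**/*.parquet')"))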

Quick Start

Clone the repo:

git clone https://github.com/danthegoodman1/icedb
docker compose up -d # starts local minio server
pip install git+https://github.com/danthegoodman1/icedb duckdb

Use pip3 if you are on macOS or have a dual installation

import duckdb
from icedb.log import S3Client, IceLogIO
from icedb import IceDBv3, CompressionCodec
from datetime import datetime
from time import time

# create an s3 client to talk to minio
s3c = S3Client(s3prefix="example", s3bucket="testbucket", s3region="us-east-1", s3endpoint="http://localhost:9000", 
               s3accesskey="user", s3secretkey="password")

example_events = [
    {
        "ts": 1686176939445,
        "event": "page_load",
        "user_id": "user_a",
        "properties": {
            "page_name": "Home"
        }
    }, {
        "ts": 1676126229999,
        "event": "page_load",
        "user_id": "user_b",
        "properties": {
            "page_name": "Home"
        }
    }, {
        "ts": 1686176939666,
        "event": "page_load",
        "user_id": "user_a",
        "properties": {
            "page_name": "Settings"
        }
    }, {
        "ts": 1686176941445,
        "event": "page_load",
        "user_id": "user_a",
        "properties": {
            "page_name": "Home"
        }
    }
]

def part_func(row: dict) -> str:
    """
    Partition by user_id, date
    """
    row_time = datetime.utcfromtimestamp(row['ts'] / 1000)
    part = f"u={row['user_id']}/d={row_time.strftime('%Y-%m-%d')}"
    return part

# Initialize the client
ice = IceDBv3(
    part_func,
    ['event', 'ts'],  # Sort by event, then timestamp of the event within the data part
    "us-east-1",
    "user",
    "password",
    "http://localhost:9000",
    s3c,
    "dan-mbp",
    s3_use_path=True,  # needed for local minio
    compression_codec=CompressionCodec.ZSTD  # Let's force a higher compression level, default is SNAPPY
)

# Insert records
inserted = ice.insert(example_events)
print('inserted', inserted)

# Read the log state
log = IceLogIO("demo-host")
_, file_markers, _, _ = log.read_at_max_time(s3c, round(time() * 1000))
alive_files = list(filter(lambda x: x.tombstone is None, file_markers))

# Setup duckdb for querying local minio
ddb = duckdb.connect(":memory:")
ddb.execute("install httpfs")
ddb.execute("load httpfs")
ddb.execute("SET s3_region='us-east-1'")
ddb.execute("SET s3_access_key_id='user'")
ddb.execute("SET s3_secret_access_key='password'")
ddb.execute("SET s3_endpoint='localhost:9000'")
ddb.execute("SET s3_use_ssl='false'")
ddb.execute("SET s3_url_style='path'")

# Query alive files
query = ("select user_id, count(*), (properties::JSON)->>'page_name' as page "
         "from read_parquet([{}]) "
         "group by user_id, page "
         "order by count(*) desc").format(
    ', '.join(list(map(lambda x: "'s3://" + ice.s3c.s3bucket + "/" + x.path + "'", alive_files)))
)
print(ddb.sql(query))
Example output:

inserted [{"p": "example/_data/u=user_a/d=2023-06-07/c2bc1eef-b2cd-404a-9ec6-097e27d3130f.parquet", "b": 693, "t": 1702822195892}, {"p": "example/_data/u=user_b/d=2023-02-11/2d8cb9b1-450f-455f-84e0-527b8fb35d5f.parquet", "b": 585, "t": 1702822195894}]
┌─────────┬──────────────┬──────────┐
│ user_id │ count_star() │   page   │
│ varchar │    int64     │ varchar  │
├─────────┼──────────────┼──────────┤
│ user_a  │            2 │ Home     │
│ user_a  │            1 │ Settings │
│ user_b  │            1 │ Home     │
└─────────┴──────────────┴──────────┘

For more in-depth examples, see the Examples section

How does IceDB work?

Inserts, merges, and tombstone cleanup are powered by Python and DuckDB. IceDB runs stateless with a log in S3, meaning that you only pay for storage and compute during operations, enabling true serverless analytical processing. It does so in an open and easily readable format to allow for any language or framework to parse the icedb log (jsonl) and read the data (parquet).

The IceDB log keeps track of alive data files, as well as the running schema, which is updated via insertion. Query engines such as DuckDB, ClickHouse, CHDB, Datafusion, Pandas, or custom parquet readers in any language can easily read IceDB data in hundreds of milliseconds, especially when combined with the IceDB S3 Proxy.
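For example, the quick start's log read boils down to a few lines; any process with S3 access can do the same to discover the alive parquet files (connection values as in the quick start):

from time import time
from icedb.log import IceLogIO, S3Client

s3c = S3Client(s3prefix="example", s3bucket="testbucket", s3region="us-east-1",
               s3endpoint="http://localhost:9000", s3accesskey="user", s3secretkey="password")

log = IceLogIO("query-host")
# read the log as of "now" and keep only file markers without tombstones
_, file_markers, _, _ = log.read_at_max_time(s3c, round(time() * 1000))
alive_files = [f for f in file_markers if f.tombstone is None]
print([f.path for f in alive_files])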

See more in ARCHITECTURE.md

Examples

See the examples/ directory for many examples like Materialized Views, custom merge queries, schema validation before insert, and more.

Performance test

IceDB can easily insert hundreds of thousands of rows per second per instance, and query engines can query upwards of hundreds of millions of rows per second.

Performance depends on a variety of things such as query engine, network/disk, and how efficiently your data is merged at query time.

See perf_tests for examples.

Comparisons to other systems

IceDB was made to fill the gap between solutions like ClickHouse and BigQuery, striking a stronger balance between decoupled storage (data and metadata) and compute while being easily self-hosted, open source, easily extensible and flexible, multi-tenant, and ready for massive scale.

Comparison charts: to be transparent, I’ve omitted systems I’ve never used before (such as Databricks), and I do not have extensive experience with some (such as Snowflake).

Why IceDB?

IceDB offers many novel features out of the box that comparable data warehouses and OLAP DBs don't:

  • Native multi-tenancy with prefix control and the IceDB S3 Proxy, including letting end-users write their own SQL queries
  • True separation of data storage, metadata storage, and compute with shared storage (S3)
  • Zero-copy data sharing by permitting access to S3 prefixes for other users
  • Multiple options for query processing (DuckDB, ClickHouse, CHDB, Datafusion, Pandas, custom parquet readers in any language)
  • Open data formats for both the log and data storage
  • Extreme flexibility in functionality due to being in-process and easily manipulated for features like materialized views, AggregatingMergeTree and ReplacingMergeTree-like functionality

Why not BigQuery or Athena?

BigQuery offers a great model of only paying for S3-priced storage when not querying, and being able to summon massive resources to fulfill queries when requested. The issues with BigQuery (and similar systems like Athena) are that:

  • They are egregiously expensive at $5/TB processed
  • Charge on uncompressed data storage (and they reserve compressed billing for their largest customers)
  • They are limited to their respective cloud providers
  • Closed source, no way to self-host or contribute
  • Only one available query engine
  • Slow query startup time

For example, queries on data that might cost $6 on BigQuery would only be around ~$0.10 running IceDB and dynamically provisioning a ClickHouse cluster on fly.io to respond to queries. That's a cost reduction of 60x without sacrificing performance.

While IceDB does require that you manage some logic like merging and tombstone cleaning yourself, the savings, flexibility, and performance far outweigh the small management overhead.

To get the best performance in this model, combine with the IceDB S3 Proxy

Why not ClickHouse, TimescaleDB, RedShift, etc.?

We love ClickHouse; in fact, it's our go-to query engine for IceDB in the form of BigHouse (dynamically provisioned ClickHouse clusters).

The issue with these solutions is the tight coupling between storage, metadata, and compute. The lack of elasticity in these systems requires that you have the resources to answer massive queries sitting idle and ready, while also requiring massive resources for inserting, even when large queries are only occasional (but need to be answered quickly).

IceDB allows for ingestion, merging, tombstone cleaning, and querying all in a serverless model due to compute being effectively stateless, with all state being managed on S3 (plus some coordination if needed).

Ingestion workers can be summoned per-request, merging and tombstone cleaning can be on timers, and querying can provision resources dynamically based on how much data needs to be read.

Why not the Spark/Flink/EMR ecosystem?

Beyond the comical management overhead, performance is shown to be inferior to other solutions, and the flexibility of these systems is paid 10-fold over in complexity.

Why not Iceberg?

Ah, yes... I probably named IceDB too close. In all fairness, I named IceDB before I knew about Iceberg.

Iceberg has a few problems right now in my eyes:

  1. Very few ways to write to it (spark, pyiceberg)
  2. Very complex (look at any example - requires schema definition, cataloging is verbose, it's really a painful DX)
  3. Very few ways to read it (a few DBs like ClickHouse can read the catalog, but you couldn't casually bring it in to your Go code like you can with icedb)
  4. Really painful to import existing data, you basically have to write it through Iceberg which is going to be very slow and wasteful

I specifically designed IceDB to be super easy to use in any language:

  1. The log is just newline-delimited JSON, any language can easily read JSON. Other serialization formats are hard and vary extremely by language (like protobuf)
  2. With the S3 Proxy no readers have to understand the log, they only need to understand how to read parquet (every language can do this now)
  3. Very simple DX. Check the examples, it's far easier to use than pyIceberg, and I'd argue it's more flexible too based on the exposed primitives and extensibility
  4. Strong multitenancy with the S3 proxy (or a custom system), this means you can let your customers run sql queries on their data. The S3 proxy is designed to handle this with virtual buckets and prefix enforcement. You can even extend checks on table permissions before submitting queries :)
  5. Much easier to import existing datasets by writing to _data subdirectory (just rename files) and writing a log file manually to the _log subdirectory using the log format.

When not to use IceDB

  • If you need answers in <100ms, consider ClickHouse or Tinybird (well-designed materialized views in IceDB can provide this performance level as well)
  • If you need tight cloud-provider-specific integrations and can't spare writing a little extra code, consider BigQuery and Athena
  • If your network cards are not very fast, consider ClickHouse
  • If you can't access an S3 service from within the same region/datacenter, and are not able to host something like minio yourself, consider ClickHouse
  • If you need something 100% fully managed, depending on your needs and budget consider managed ClickHouse (Altinity, DoubleCloud, ClickHouse Cloud, Aiven), Tinybird, BigQuery, or Athena

Tips before you dive in

Insert in large batches

Performance degrades linearly with more files because the log gets larger, and the number of parquet files in S3 to be read (or even just listed) grows. Optimal performance is obtained by inserting as infrequently as possible. For example, you might write records to RedPanda first, and have workers that insert in large batches every 3 seconds. Or maybe you buffer in memory first from your API nodes, and flush that batch to disk every 3 seconds (example)
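A minimal sketch of that in-memory buffering, assuming an ice client configured as in the quick start (a real ingestion worker would guard the buffer with a lock or a queue; the 3-second interval is illustrative):

import time
from icedb import IceDBv3

buffer: list[dict] = []
last_flush = time.time()
FLUSH_INTERVAL_SECONDS = 3

def handle_event(ice: IceDBv3, row: dict):
    """Called by your API handler for each incoming event."""
    global last_flush
    buffer.append(row)
    if time.time() - last_flush >= FLUSH_INTERVAL_SECONDS:
        ice.insert(buffer)  # one large batch instead of many tiny parquet files
        buffer.clear()
        last_flush = time.time()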

Merge and Tombstone clean often

Merging increases the performance of queries by reducing the number of files that data is spread across. Merging combines files within the same partition. Parquet natively provides efficient indexing to ensure that selective queries remain performant, even if you are only selecting thousands of rows out of millions from a single parquet file.

Tombstone cleaning removes dead data and log files from S3, and also removes the tracked tombstones in the active log files. This improves the performance of queries by making the log faster to read, but does not impact how quickly the query engine can read the data once it knows the list of parquet files to read.

The more frequently you insert batches, the more frequently you need to merge. And the more frequently you merge, the more frequently you need to clean tombstones in the log and data files.

Large partitions, sort your data well!

Do not make too many partitions: partitions do very little to improve query performance, and too many of them will greatly impair it.

The best thing you can do to improve query performance is to sort your data in the same way you'd query it, and write it to multiple tables if you need multiple access patterns to your data.

For example, if you ingest events from a webapp like Mixpanel and need to be able to query events by a given user over time, and a single event over time, then you should create two tables:

  • A table with a partition format uid={user_id} and a sort of timestamp,event_id for listing a user's events over time (user-view)
  • A table with a partition format of d={YYYY-MM-DD} and a sort of event_id,timestamp for finding events over time (dashboards and insights)
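A sketch of the two partition functions this implies; the field names user_id, timestamp, and event_id are assumed from the example, and ... stands in for the same connection arguments used in the quick start (one client per table, e.g. separated by S3 prefix):

from datetime import datetime
from icedb import IceDBv3

def part_user_view(row: dict) -> str:
    # uid={user_id}, sorted by timestamp, event_id
    return f"uid={row['user_id']}"

def part_dashboards(row: dict) -> str:
    # d={YYYY-MM-DD}, sorted by event_id, timestamp
    day = datetime.utcfromtimestamp(row['timestamp'] / 1000)
    return f"d={day.strftime('%Y-%m-%d')}"

user_view = IceDBv3(part_user_view, ['timestamp', 'event_id'], ...)
dashboards = IceDBv3(part_dashboards, ['event_id', 'timestamp'], ...)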

Schema validation before insert

Reading the log and/or data will fail if there are conflicting schemas across files. IceDB accepts missing and new columns across files, but rejects a column changing data types.

The best way to handle this across multiple ingestion workers that might insert into the same table is to cache (a hash of) the schema in memory, and when a schema change for a given table is detected perform a serializable SELECT FOR UPDATE level isolation lock on some central schema store. You can then determine whether the schema change is allowed (or already happened), and update the remote schema definition to add the new columns as needed.

If the schema change is not allowed (a column changed data type), you can either attempt to solve it (e.g. change a BIGINT to a DOUBLE/DECIMAL), drop the offending rows, or quarantine them in a special {table name}_quarantine table and notify users of the violating rows for manual review.

You could also require users pre-define schemas, and push updates to ingestion workers, or do a similar check sequence when one is detected (omitting the functionality of updating the remote schema from the ingestion worker).
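A minimal sketch of the in-memory check described above; schema_of and reconcile_schema are illustrative placeholders for your own schema introspection and central schema store logic:

import hashlib
import json

known_schema_hashes: dict[str, str] = {}  # table name -> hash of the last schema we saw

def schema_of(rows: list[dict]) -> dict:
    # naive illustration: map top-level keys to python type names
    schema: dict = {}
    for row in rows:
        for k, v in row.items():
            schema.setdefault(k, type(v).__name__)
    return schema

def check_schema(table: str, rows: list[dict]):
    h = hashlib.sha256(json.dumps(schema_of(rows), sort_keys=True).encode()).hexdigest()
    if known_schema_hashes.get(table) != h:
        # schema changed: take a serializable (SELECT ... FOR UPDATE) lock on your
        # central schema store, verify the change only adds columns (no type changes),
        # then update the remote definition -- reconcile_schema is a placeholder
        reconcile_schema(table, schema_of(rows))
        known_schema_hashes[table] = h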

Columns will always show as NULLABLE in the schema, however the only columns that should never be null are the ones required to determine the partition (unless you have defaults on those columns).

See a simple example here on verifying the schema before inserting.

Tracking the running schema

IceDB will track the running schema natively. One caveat to this functionality is that if you remove a column as a part of a partition rewrite and that column never returns, IceDB will not remove that from the schema.

Usage

pip install git+https://github.com/danthegoodman1/icedb
from icedb import IceDBv3

ice = IceDBv3(...)

Partition function (part_func)

A function that takes in a dict and returns a str, determining the partition from the row dict. This function is run for every row and should not modify the original dict.

While not required, formatting this in Hive format lets many query engines read the partition values from the file path and use them as additional query filters. DuckDB can do this natively with read_parquet([...], hive_partitioning=1), and with ClickHouse you can write something like extract(_path, 'u=([^\s/]+)') AS user_id, extract(_path, 'd=([0-9-]+)') AS date

Example:

from datetime import datetime
from icedb import IceDBv3


def part_func(row: dict) -> str:
    """
    We'll partition by user_id, date
    Example: u=user_a/d=2023-08-19
    """
    row_time = datetime.utcfromtimestamp(row['ts'] / 1000)
    part = f"u={row['user_id']}/d={row_time.strftime('%Y-%m-%d')}"
    return part


ice = IceDBv3(partition_strategy=part_func, sort_order=['event', 'timestamp'])

Additionally, a _partition property can be pre-defined on the row, which will skip running part_func and use that value instead (for example if you calculate this on ingest). This property is deleted from the row before insert, so if you intend to re-use the row or want the value stored in the row, provide an additional property with a different name (e.g. __partition or partition). You can use preserve_partition=True to prevent IceDB from deleting this property on each row. Generally it's faster to delete it, as the time to delete for large batches is smaller than the extra data-copy time.
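For example, if the partition was already computed at ingest (a minimal sketch; ice is the client from the quick start):

# part_func is skipped for this row because _partition is already set
row = {
    "ts": 1686176939445,
    "event": "page_load",
    "user_id": "user_a",
    "_partition": "u=user_a/d=2023-06-07",
}
ice.insert([row])  # _partition is removed from the row unless preserve_partition=True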

Sorting Order (sort_order)

Defines the order of top-level keys in the row dict that will be used for sorting inside the parquet file. This should reflect the same order that you filter on for queries, as it directly impacts performance. Generally, start with the lowest-cardinality column and increase in cardinality through the list.

Example sorting by event, then timestamp:

['event', 'timestamp']

This will allow us to efficiently query for events over time as we can pick a specific event, then filter the time range while reducing the amount of irrelevant rows read.
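For example, a query shaped like this sort order (pin the event, then filter the time range) lets the parquet reader skip irrelevant row groups; alive_files, ice, and ddb are as in the quick start, and the column names follow the example above:

files = ", ".join(f"'s3://{ice.s3c.s3bucket}/{f.path}'" for f in alive_files)
query = (
    "select count(*) from read_parquet([{}]) "
    "where event = 'page_load' "
    "and timestamp between 1686176000000 and 1686177000000"
).format(files)
print(ddb.sql(query))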

unique_row_key (_row_id)

If provided, will use a top-level row key as the _row_id for deduplication instead of generating a UUID per-row. Use this if your rows already have some unique ID generated.

Removing partitions (remove_partitions)

The remove_partitions function can be dynamically invoked to remove a given partition from the data set. This can be used for features like TTL expiry, or for removing user data if that is tracked via partition (prefer partition removal for performance, otherwise see partition rewriting below).

This method takes in a function that evaluates a list of unique partitions, and returns the list of partitions to drop.

Then, a log-only merge occurs where the file markers are given tombstones, and their respective log files have tombstones created. No data parts are involved in this operation, so it is very fast.

This requires the merge lock if run concurrently on a table: the lock ensures that no data is copied into another part while you are potentially dropping it.

This is only run on alive files, parts with tombstones are ignored as they are already marked for deletion.
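A hedged sketch of a TTL-expiry callback for a partition format containing d=YYYY-MM-DD (the exact callback signature may differ slightly in your version):

from datetime import datetime, timedelta

def expired(partitions: list[str]) -> list[str]:
    """Receives the table's unique partitions, returns the ones to drop (90-day TTL)."""
    cutoff = datetime.utcnow() - timedelta(days=90)
    to_drop = []
    for p in partitions:
        # e.g. "u=user_a/d=2023-06-07" -> pull out the date component
        date_segs = [seg for seg in p.split("/") if seg.startswith("d=")]
        if date_segs and datetime.strptime(date_segs[0][2:], "%Y-%m-%d") < cutoff:
            to_drop.append(p)
    return to_drop

ice.remove_partitions(expired)  # hold the merge lock while this runs (see above)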

Rewriting partitions (rewrite_partition)

For every part in a given partition, the files are rewritten after being passed through the given SQL query to filter out unwanted rows. Useful for purging data for a given user, deduplication, and more. New parts are created within the same partition, and old files are marked with a tombstone. It is CRITICAL that new columns are not created (against the known schema, not just the file) as the current schema is copied to the new log file, and changes will be ignored by the log.

Because this is writing the same data, it's important to acquire the merge lock during this operation, so this should be used somewhat sparingly.

The target data will be at _rows, so for example your query might look like:

select *
from _rows
where user_id != 'user_a'
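A call might look like the following sketch (the parameter order is illustrative, not authoritative):

# purge one user's rows from a date partition
ice.rewrite_partition(
    "d=2023-06-07",                                    # partition to rewrite
    "select * from _rows where user_id != 'user_a'",   # keep everything except user_a
)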

Pre-installing DuckDB extensions

DuckDB uses the httpfs extension. See how to pre-install it into your runtime here: https://duckdb.org/docs/extensions/overview.html#downloading-extensions-directly-from-s3

and see the extension_directory setting (default $HOME/.duckdb/): https://duckdb.org/docs/sql/configuration.html

You may see an example of this in the example Dockerfile.

Merging

Merging takes a max_file_size. This is the max file size that is considered for merging, as well as a threshold for when merging will start. This means that the actual final merged file size (by logic) is in theory 2*max_file_size, however due to parquet compression it hardly ever gets that close.

For example if a max size is 10MB, and during a merge we have a 9MB file, then come across another 9MB file, then the threshold of 10MB is exceeded (18MB total) and those files will be merged. However, with compression that final file might be only 12MB in size.

Concurrent merges

Concurrent merges won't break anything due to the isolation level employed in the meta store transactions; however, there is a chance that competing merges result in conflicts, and when one is detected the conflicting merge will exit. If you successfully merged files, you can choose to immediately call merge again (or after a short delay, like 5 seconds) to ensure that lock contention stays low.

However, running concurrent merges in opposite directions is highly suggested.

For example in the use case where a partition might look like y=YYYY/m=MM/d=DD then you should merge in DESC order frequently (say once every 15 seconds). This will keep the hot partitions more optimized so that queries on current data don't get too slow. These should have smaller file count and size requirements, so they can be fast, and reduce the lock time of files in the meta store.

You should run a second, slower merge interval in ASC order that fully optimizes older partitions. These merges can be much larger in file size and count, as they are less likely to conflict with active queries. Say this is run every 5 or 10 minutes.

Tombstone cleanup

Using the remove_inactive_parts method, you can delete files with some minimum age that are no longer active. This helps keep S3 storage down.

For example, you might run this every 10 minutes to delete files that were marked inactive at least 2 hours ago.
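A hedged sketch of that schedule, using the remove_inactive_parts method named above (passing the minimum age in milliseconds is an assumption):

import time

TWO_HOURS_MS = 2 * 60 * 60 * 1000

while True:
    # delete data/log files that have carried a tombstone for at least 2 hours
    ice.remove_inactive_parts(TWO_HOURS_MS)
    time.sleep(10 * 60)  # run every 10 minutes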

Custom Merge Query (ADVANCED USAGE)

You can optionally provide a custom merge query to achieve functionality such as aggregate-on-merge or replace-on-merge as found in the variety of ClickHouse engine tables such as the AggregatingMergeTree and ReplacingMergeTree.

This can also be used alongside double-writing (to different partition prefixes) to create materialized views!

WARNING: If you do not retain your merged files, bugs in merges can permanently corrupt data. Only customize merges if you know exactly what you are doing!

This is achieved through the custom_merge_query function. You should not provide any parameters to this query. All queries use DuckDB.

The default query is:

select *
from source_files

If you do not use the source_files alias, the ? placeholder must be included; it is the list of files being merged.

source_files is just an alias for read_parquet(?, hive_partitioning=1), and will be string-replaced if it exists. Note that the hive_partitioning columns are virtual and do not appear in the merged parquet file, so they do not need to be selected.
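Wiring in a custom merge query might look like the following sketch, reusing the quick start's connection arguments and assuming custom_merge_query is accepted as a constructor parameter:

ice = IceDBv3(
    part_func,
    ['user_id'],
    "us-east-1",
    "user",
    "password",
    "http://localhost:9000",
    s3c,
    "dan-mbp",
    s3_use_path=True,
    custom_merge_query=(
        "select user_id, sum(clicks) as clicks, "
        "gen_random_uuid()::TEXT as _row_id "
        "from source_files group by user_id"
    ),
)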

See examples:

"Seeding" rows for aggregations

Because this is not a "SQL-native merge" like systems such as ClickHouse, we do have to keep in mind how to format and prepare rows for merging. The ideal way to do this is to understand how the system works, and either pre-format the rows, use the format_row function, or (if user-defined) use the custom_insert_query discussed below.

For example if we are keeping a running count, we need to prepare each row with an initial cnt = 1, and merges will use sum(cnt) instead. The best way to think about this is literally concatenating multiple sub-tables of the same schema.

You can prepare rows by either:

  • Modifying them before insert
  • Using the format_row param (safe copying by default)
  • Using the custom_insert_query param (use if user-defined)

The examples above cover different ways you can prepare rows for different scenarios.

Another way to "dynamically seed" rows is to use a (DuckDB) query like:

select user_id, event, sum(ifnull(cnt, 1)) as cnt
from (
    select null::int as user_id, null::varchar as event, null::bigint as cnt
    -- can also be used: select * from values (null, null, null) fake_table(user_id, event, cnt) 
    union all by name
    select *
    from read_parquet([{}])
)
where event is not null
group by user_id, event
order by cnt desc

This will put NULL where the cnt column doesn't yet exist, and pre-populate with 1.

In ClickHouse this is more elegant by pre-defining the schema (tracked by IceDB!) and the input_format_parquet_allow_missing_columns setting:

SELECT COUNT(a), COUNT(b)
FROM file('data.parquet', Parquet, 'a UInt32, b UInt32')
SETTINGS input_format_parquet_allow_missing_columns = 1

Custom Insert Query (ADVANCED USAGE)

This function is run at insert/schema introspection time, after any format_row function (if exists). The actual rows are available at _rows. All queries use DuckDB.

The default insert query is:

select *
from _rows
order by {your sort order}

If you want to have a materialized view that uses count(), as we've seen in the example we need to seed rows with an initial value. It's much easier if we can allow users to define an insert function to prepare the rows than doing so from python:

select *, 1::BIGINT as cnt
from _rows
order by events, time DESC

This insert query, unlike the format_row function, is safe to take as input from users.

Note: it's always best to explicitly declare the type, as DuckDB uses int32 by default here when we probably want int64.

Another example is flattening JSON before inserting:

select *
EXCLUDE properties -- remove the old properties
, to_json(properties) as properties -- replace the properties with stringified JSON
from _rows
order by event, time

De-duplicating Data on Merge

By default, no attempt is made to de-duplicate data. You can provide deterministic (or pre-defined) row IDs and deduplicate during merge.

For example, if you wanted merges to take any (but only a single) value for a given _row_id, you might use:

select
    any_value(user_id),
    any_value(properties),
    any_value(timestamp),
    _row_id
from source_files
group by _row_id

Note that this will only deduplicate within a single merged parquet file; to guarantee single rows you must still employ deduplication in your analytical queries.

Replacing Data on Merge

If you wanted to replace rows with the most recent version, you could write a custom merge query that looks like:

select
    argMax(user_id, timestamp),
    argMax(properties, timestamp),
    max(timestamp),
    _row_id
from source_files
group by _row_id

Like deduplication, you must handle this in your queries too if you want to guarantee getting the single latest row.

Aggregating Data on Merge

If you are aggregating, you must include a new _row_id. If you are replacing, this should come from choosing the correct row to replace.

Example aggregation merge query:

select
    user_id,
    sum(clicks) as clicks,
    gen_random_uuid()::TEXT as _row_id
from source_files
group by user_id

This data set will reduce the number of rows over time by aggregating them by user_id.

Pro-Tip: Handling Counts

Counting is a bit trickier because you would normally have to pivot from count() when merging a never-before-merged file to sum() with files that have been merged at least once to account for the new number. The trick to this is instead adding a counter column with value 1 every time you insert a new row.

Then, when merging, you simply sum(counter) as counter to keep a count of the number of rows that match a condition.


icedb's Issues

Merge endpoint

/merge endpoint (can optionally take in a partition?) to check for files eligible for merging.

For now we can just hard code sizes and counts, but eventually they need to be configurable.

It will run SQL queries to check what files should be merged, download and merge them into a new file, disable the old files, and insert the new file.

All merges need to happen within a single partition.

Need to ensure we keep track of the new columns available in the file.

Benchmark/Comparison to direct parquet files

With DuckDB, compare to doing a HIVE partition glob match.

Also compare to Clickhouse with the same resources parsing the parquet files.

Use JSON with consistent columns so we can directly compare performance of using the meta store.

Faceting/Dynamic indexing (with MVs)

Facets are kind of taken from Datadog's terminology, but this would allow for dynamic "indexing" of columns not in the partition strategy.

We can add additional columns to the meta store called facet_keys and facet_values that keep track of the known keys and values of additional columns inside the parquet file.

Schema like:

facets JSONB

Facet values should be stored in arrays like:

{
  "some.known.path": [1, 2, 'a']
}

A secondary GIN index will then allow us to track the values so that they can be considered in queries for increased filtering. This would allow a sort of "indexing" on additional columns without double-writing (i.e. a second table).

Facets will have to be defined, and will not backfill on previous data. The known facets will need to be stored in the DB as well in a new table.

Facets can have any data type, since they are JSONB columns. We will have to match query predicates to these facets similar to #45

This should be exposed entirely as python functions so that any query engine can be used, and as long as interception of the predicate can occur then faceting can be supported (otherwise I guess really ugly functions could be used)

Golang icedb get_files example

The get_files function is very simple, just a query wrapper against the meta store, so it should be trivial to self-implement in any language. It can also be excessive to initialize duckdb when you only intend to read (e.g. the clickhouse function binding having issues with creating the duckdb directory).

Create an example of a golang program that is then bound and queried from clickhouse.

otel tracing support

If using qyrn and this, need warnings about it creating recursive tracing. We could have it skip certain tables, or have it directly insert rather than go through the tracing so it doesn't trace storing traces.

Basic auth support

Optional username and password specified in env vars to enable basic auth for all endpoints but health check

Merge/Tombstone clean coordination example

Support coordination for concurrent merges to prevent duplicate data.

Maybe we make a class to implement that will handle the locking and such, with examples?

The class would need:

  • Lock files
  • Release lock

Example options:

  • Redis
  • etcd
  • Zookeeper/CH Keeper
  • Postgres/CRDB
  • C*/Scylla with LWT

Merging locks should be by table

Streaming inserts

A way to stream rows rather than load the body all at once before processing

IceDB Read operation can get current state from log

Read in the state and determine the list of currently active files and the current merged schema. Have an option for filtering on prefix match (maybe a lambda function, but this would have to read all files' schemas)?

Schema should be stored in duckdb format syntax because that should be more backward compatible with other systems as it uses generic sql types.

If type collisions are detected, we need to error log and abort reading.

Metadata without meta store?

If we can properly elect a single leader to manage metadata (raft, etcd, keeper, etc.) then we can actually process metadata in an append-only log right in S3. Because S3 is consistent, we can always read the latest version of the base plus any additional appends on top. Then every once in a while we can truncate that.

Because of this, when a query runs we can just read from S3 to find the relevant info for what parts are alive at any given time. Even if updates occur during a query, we get a single snapshot during a single list call, or during subsequent lists; as long as we take the latest state, we can update mid-flight.

Insert makes a file per partition

Rather than inserting to the same file, we should calculate the partition of each row and have a parquet accumulator for each partition. Then we write a file per partition.

Log created and managed in S3

  • Create the format of the log file (and append files, probably the same thing with higher timestamps, log is just truncated operations)
  • Management of the log
  • Reading in and materializing state
  • Writing some basic tests to sanity check expected state
  • Tracks schema as well

Insert Partitioner

Need to be able to:

  1. Create a "partition schema" for a given namespace
  2. Create the partition on insert for the file to be written, and insert into the path

The partition should always start with and include the namespace.

Some brainstorms for the interface:

  1. Being able to partition a timestamp column (string or ms int) to multiple components (year, month, day, etc.)
  2. Being able to modulo a number

Example:

[{
  "func": "toYear",
  "args": ["timestamp"],
  "as": "year"
}, {
  "func": "toMonth",
  "args": ["timestamp"],
  "as": "month"
}, {
  "func": "toDay",
  "args": ["timestamp"],
  "as": "day"
}]

The above functions can either take in a timestamp in ms, a timestamp formatted like JS ISO time, or you can just pass in now() as the arg and it will take the time of insert.

Would result in ns={namespace}/year={year}/month={month}/day={day}/{randomID}.parquet

The partitions are all virtual columns (only in the file path).

Partitioning should be immutable by namespace (it cannot be changed).


Another idea is to have the partitioning provided on every insert request. While this is redundant, it does save the DB lookup.
This is dangerously easy to change, however, so we'd have to be very clear about communicating that it needs to be carefully handled. This is the easiest and most scalable to implement now, so we will probably go for this.

Concurrent merge locks

Rather than holding a transaction open we are relying on timeouts right now. A better way would be to use a lock within the DB that is just for merging, and lists files that are under merge. It should have a timeout as well but allows for concurrent merges because a subsequent concurrent merge can ignore those files.

Stream merging

Stream file merges instead of loading all at once.

Not sure this will be possible with the current parquet package used.

Custom merge query

Allow for customization of the merge query so that during merges aggregations can be performed, rows that are replaced can be dropped, etc.

This allows for super easy self-implementation of the same functionality as aggregating merge trees and replacing merge trees in clickhouse.

more elegant "table" support with single-bucket design

Should be able to manage multiple tables. A table can simply be a prefix in the bucket path, so we could support that.

Current workaround is to hard-code it into the partition function, then for something like the duckdb table macro, you can add an extra parameter that is the prefix of time-based partition.

S3 security via query interception

Since relying on IAM only works with solutions like minio, we can intercept S3 read conditions.

For example, if our IAM allows reading of all objects in a bucket (path), but we have the tenant ID as the prefix for an object path, we can intercept the reads to that S3 bucket and ensure that only valid files would be read (otherwise cancel the query).

DuckDB has an inherent issue here that we'd be looking at only allowing S3 querying to a single bucket at a time, unlike clickhouse which can use the s3 table engine and read from multiple private buckets per-query.

For example a tenant would not be able to join data on their own private s3 bucket.

Merge and insert with duckdb

Might be faster to use duckdb to merge the files; we would need to write a new schema transform where we grab the schema of each file as well, but with the http cache enabled that should be a nearly instant query after the merge is done.

Meta store as a class

We can make the meta store a class that can be implemented (and provide pre-created postgres/crdb one). This allows for easy integration with other systems.

For example maybe someone wants to use FoundationDB instead, which also provides the needed guarantees.

They really need to follow the guarantees though.

lambda interface

Abstract the merge and insert handler to a function (away from echo) and make a lambda-handler compatible version

Lambda can in theory use the container right now but that's not optimal because of the lack of shared resources for insert handling.

Query Interface

Waiting on PRs to be merged in DuckDB.

Probably make in Python so we don't have to worry about configuring nodejs memory limits and such.

Track schema (and changes) in log file

Schema changes are detected, need to cache previous log state in memory for a few seconds on ingestion nodes (inconsistency is ok as long as data type is the same because they will be compacted and deduplicated if same column name)

  • Track schema on ingest
  • Take note of schema changes
  • Cache current schema in memory
  • Have background process for refreshing known in-memory schema from log (set interval, 0 is off)

IceDB v3

Version 3 of IceDB

What changes:

  • No need for meta store, S3 has consistent mutations so we can use a log. Do this via a truncated state with multiple append files that are eventually truncated. This works because each append is a full mutation (either merge or write) so nothing can get caught between operations when listing (or if they do, it's not a problem)
    • Risk that this could create a TON of append files and slow down reads, need to make sure that writes are infrequent and highly batched
  • Needs coordination; if we delegate that responsibility or make it dynamic (we have multiple options) then we can do either raft directly or rely on something like etcd or even the Kubernetes API
    • If we are using etcd for coordination, should we just use it instead of the log in s3? Guess we can omit it if not coordinating merges (e.g. single host enabled for merging)
    • Maybe just delegate coordination to the developer, maybe we make a class they can implement?
  • Reads will first parse the log to find what files are "active", and use those in query
  • Query engines (DuckDB, ClickHouse) can natively trim down what files to list via Hive partitioning/_file virtual naming. ClickHouse would need ClickHouse/ClickHouse#23297 at a minimum if using s3 table func
  • Schema tracked in same log, API for adding columns manually as well (to protect against things like number ambiguity).
    • Partition schema should be tracked intentionally at table creation as part of the schema log
    • Possibly have header for file that mentions what byte schema starts, what byte file info starts
  • Support materialized views natively?
    • Could have the sql statement run on the DF that was used to copy to the parquet file with their transformations that they want, and have their custom merge function
    • Need to consider the security of both custom merge functions and MVs. Maybe we leave this up to the developer to consider (running a tenant process in isolation to make sure they can't see other's data/secrets)

What is still not addressed:

  • Query engines like DuckDB and ClickHouse could have this custom table integrated, but ultimately there will still be something that needs to figure out the list of files to pass down
    • See notes for s3 proxy that does timestamp snapshots of file system with virtual bucket or known IPs

Tasks

merge optional partition prefix

an optional parameter to use partition prefix when finding files to merge, so you can do different "tables" via partition prefix strategy

part removal method

Method for removing parts that are not active and older than some number of ms

Example of read only access

As long as insert or merge is not called, IceDB will work just fine with read only access to the meta store to find files.

Make an example of how read only access to meta store and S3 can be used.

Materialized view example

Can write a SQL query that will process the entire df on a given target "table" (target of write), apply that SQL, and write it out to another "table".

Need to also make sure we support merges and custom merge queries as well. Maybe that needs to be a map instead and we need to give a list of tables for icedb to keep track of (add mat views to that)

Need to emphasize security implications in docs

Schema Tracking

We need to keep track of all of the columns available for a given namespace.

A simple way to do this is to have another table that is just namespace, column, which also has the type of either string, number, or list(x...).

Log merges don't rewrite entire state

Currently we rewrite the entire state into the new log file when we merge. This creates a lot of log amplification unless we are tombstone cleaning frequently.

An alternative is to keep track of which alive files came from which log file, and then the newly merged log file is just the merge of the log files involved in the merge. This will reduce log write amplification.

The current benefit of full rewrites is that we could tombstone clean more aggressively, it's simpler to parse, and in less frequent merge settings could result in faster operations due to not needing to list and read so many files individually.

However when more log files exist this will exponentially slow down parsing of the current state as nearly every file will have redundant info.

Table predicate interception

Rather than having a table function like icedb(start_year, start_month...) we can instead intercept queries to a normal table, and look at the predicates.

select *
from icedb
where year >= 2022
and year <= 2023

Could be intercepted in python, and rewritten to be:

select *
from icedb(start_year:=2022, end_year:=2023)

This should be achievable with https://github.com/paultiq/fabduckdb to intercept the query, parse the AST, and transpile the meta store query.

Control unique row ID

Can choose a top-level event property to indicate an existing row ID to deduplicate on. This value will be placed into the _row_id column.

Otherwise if None, we will randomly generate one.
