risinglightdb / risinglight
An educational OLAP database system.
License: Apache License 2.0
We need a pretty (and fancy) printout of the plan tree to help develop and debug the optimizer.
D explain select v2 from t where 3 > v1;
┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             v2            │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│          SEQ_SCAN         │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             t             │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             v2            │
│             v1            │
└───────────────────────────┘
Projection: ((1 + 2) + 3):UInt32
Expression: 6:UInt32 (Before Projection)
ReadDataSource: scan partitions: [1], scan schema: [dummy:UInt8], statistics: [read_rows: 1, read_bytes: 1]
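Both outputs above boil down to a pre-order walk of the plan tree. A minimal sketch of such a printer, using a toy PlanNode type (not RisingLight's actual plan representation) and plain indentation instead of boxes:

/// A toy plan node, for illustration only.
struct PlanNode {
    name: &'static str,
    fields: Vec<String>,
    children: Vec<PlanNode>,
}

/// Recursively print a node, its fields, and its children,
/// one indentation level deeper per child.
fn explain(node: &PlanNode, level: usize, out: &mut String) {
    let indent = "  ".repeat(level);
    out.push_str(&format!("{}{}\n", indent, node.name));
    for field in &node.fields {
        out.push_str(&format!("{}  {}\n", indent, field));
    }
    for child in &node.children {
        explain(child, level + 1, out);
    }
}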
Developing with async_stream is unfriendly to developers: we never get suggestions from rust-analyzer, and the stack trace is hard to read.
13: core::iter::traits::iterator::Iterator::for_each
at /rustc/497ee321af3b8496eaccd7af7b437f18bab81abf/library/core/src/iter/traits/iterator.rs:727:9
14: <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<T,I>>::spec_extend
at /rustc/497ee321af3b8496eaccd7af7b437f18bab81abf/library/alloc/src/vec/spec_extend.rs:40:17
15: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter
at /rustc/497ee321af3b8496eaccd7af7b437f18bab81abf/library/alloc/src/vec/spec_from_iter_nested.rs:56:9
16: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
at /rustc/497ee321af3b8496eaccd7af7b437f18bab81abf/library/alloc/src/vec/spec_from_iter.rs:33:9
17: <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter
at /rustc/497ee321af3b8496eaccd7af7b437f18bab81abf/library/alloc/src/vec/mod.rs:2485:9
18: core::iter::traits::iterator::Iterator::collect
at /rustc/497ee321af3b8496eaccd7af7b437f18bab81abf/library/core/src/iter/traits/iterator.rs:1739:9
19: risinglight::array::data_chunk::DataChunk::get_row_by_idx
at ./src/array/data_chunk.rs:46:9
20: risinglight::executor::nested_loop_join::NestedLoopJoinExecutor::execute::{{closure}}
at /home/skyzh/.cargo/registry/src/mirrors.sjtug.sjtu.edu.cn-7a04d2510079875b/async-stream-0.3.2/src/lib.rs:237:9
21: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
at /rustc/497ee321af3b8496eaccd7af7b437f18bab81abf/library/core/src/future/mod.rs:80:19
22: <async_stream::async_stream::AsyncStream<T,U> as futures_core::stream::Stream>::poll_next
at /home/skyzh/.cargo/registry/src/mirrors.sjtug.sjtu.edu.cn-7a04d2510079875b/async-stream-0.3.2/src/async_stream.rs:53:17
23: <core::pin::Pin<P> as futures_core::stream::Stream>::poll_next
at /home/skyzh/.cargo/registry/src/mirrors.sjtug.sjtu.edu.cn-7a04d2510079875b/futures-core-0.3.17/src/stream.rs:120:9
24: <core::pin::Pin<P> as futures_core::stream::Stream>::poll_next
at /home/skyzh/.cargo/registry/src/mirrors.sjtug.sjtu.edu.cn-7a04d2510079875b/futures-core-0.3.17/src/stream.rs:120:9
25: <&mut S as futures_core::stream::Stream>::poll_next
at /home/skyzh/.cargo/registry/src/mirrors.sjtug.sjtu.edu.cn-7a04d2510079875b/futures-core-0.3.17/src/stream.rs:104:9
26: <async_stream::next::Next<S> as core::future::future::Future>::poll
at /home/skyzh/.cargo/registry/src/mirrors.sjtug.sjtu.edu.cn-7a04d2510079875b/async-stream-0.3.2/src/next.rs:30:9
27: risinglight::executor::projection::ProjectionExecutor::execute::{{closure}}
at ./src/executor/projection.rs:13:9
The backtrace never reports the exact line where the panic happens -- the content of try_stream has been rewritten by the procedural macro! Therefore, I propose manually expanding the try_stream macro.
async_stream provides two utilities: a thread-local channel implementation to transfer the yielded value from the stream to the caller, and an AsyncStream to synchronize between the stream generator and the receiver function. For example, the InsertExecutor is expanded as follows:
mod insert {
    use super::*;
    use crate::array::DataChunk;
    use crate::catalog::TableRefId;
    use crate::storage::{Storage, Table, Transaction};
    use crate::types::ColumnId;
    use std::sync::Arc;

    /// The executor of `insert` statement.
    pub struct InsertExecutor<S: Storage> {
        pub table_ref_id: TableRefId,
        pub column_ids: Vec<ColumnId>,
        pub storage: Arc<S>,
        pub child: BoxedExecutor,
    }

    impl<S: Storage> InsertExecutor<S> {
        pub fn execute(self) -> impl Stream<Item = Result<DataChunk, ExecutorError>> {
            {
                let (mut __yield_tx, __yield_rx) = ::async_stream::yielder::pair();
                ::async_stream::AsyncStream::new(__yield_rx, async move {
                    let table = match self.storage.get_table(self.table_ref_id) {
                        ::core::result::Result::Ok(v) => v,
                        ::core::result::Result::Err(e) => {
                            __yield_tx.send(::core::result::Result::Err(e.into())).await;
                            return;
                        }
                    };
                    let mut txn = match table.write().await {
                        ::core::result::Result::Ok(v) => v,
                        ::core::result::Result::Err(e) => {
                            __yield_tx.send(::core::result::Result::Err(e.into())).await;
                            return;
                        }
                    };
                    {
                        let mut __pinned = self.child;
                        let mut __pinned =
                            unsafe { ::core::pin::Pin::new_unchecked(&mut __pinned) };
                        loop {
                            let chunk = match ::async_stream::reexport::next(&mut __pinned).await {
                                ::core::option::Option::Some(e) => e,
                                ::core::option::Option::None => break,
                            };
                            {
                                match txn
                                    .append(match chunk {
                                        ::core::result::Result::Ok(v) => v,
                                        ::core::result::Result::Err(e) => {
                                            __yield_tx
                                                .send(::core::result::Result::Err(e.into()))
                                                .await;
                                            return;
                                        }
                                    })
                                    .await
                                {
                                    ::core::result::Result::Ok(v) => v,
                                    ::core::result::Result::Err(e) => {
                                        __yield_tx
                                            .send(::core::result::Result::Err(e.into()))
                                            .await;
                                        return;
                                    }
                                };
                            }
                        }
                    }
                    match txn.commit().await {
                        ::core::result::Result::Ok(v) => v,
                        ::core::result::Result::Err(e) => {
                            __yield_tx.send(::core::result::Result::Err(e.into())).await;
                            return;
                        }
                    };
                    __yield_tx
                        .send(::core::result::Result::Ok(DataChunk::single()))
                        .await;
                })
            }
        }
    }
}
This looks nearly identical to the original code.
There are two further problems to solve:
- async_stream is subject to change. Someday the AsyncStream struct might have different functionality and a different constructor, so we would need to pin async_stream to the exact version we want, instead of relying on semver.
- The ? operator no longer works in the manually expanded code. However, we might use a custom macro to expand error handling into Ok => get value, Err => send message and return (see the sketch below).
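As a sketch of the second point: a hypothetical unwrap_or_yield macro (an illustrative name, not an existing async_stream API) could replace the ? operator inside the manually expanded stream, mirroring the Ok/Err pattern in the expansion above:

macro_rules! unwrap_or_yield {
    // Expand to the same Ok/Err match the procedural macro generates.
    // Must be invoked inside the async block, so that `.await` and
    // `return` are valid in the expansion.
    ($yield_tx:expr, $result:expr) => {
        match $result {
            Ok(v) => v,
            Err(e) => {
                $yield_tx.send(Err(e.into())).await;
                return;
            }
        }
    };
}

// Usage inside the expanded `execute`:
// let table = unwrap_or_yield!(__yield_tx, self.storage.get_table(self.table_ref_id));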
The only blocker might be #96, and I'll implement this soon. After that, almost all queries can be run on the disk engine, and we can find bugs in the storage engine prior to our release. For example:
create table t(v1 int not null primary key, v2 int not null, v3 int not null) # supported
create table t(v1 int not null, v2 int not null, v3 int not null, primary key(v1)) # not supported
Currently, everything is ingested in a single batch. We should ingest and yield DataChunks little by little.
sqlparser is a widely-used SQL parser crate in Rust. Compared to the current postgres-parser:
- sqlparser is standalone, while postgres-parser depends on llvm and Postgres.
- sqlparser is more active and widely used (799 vs 70 stars).
- sqlparser generates an elegant, well-documented AST, while postgres-parser generates a more verbose AST which needs additional transformation (~1.5k lines now).
- postgres-parser is fully compatible with PG, but sqlparser is not. However, this is not critical for an educational system.
- postgres-parser has a memory leak, which makes it totally unusable.
We plan to migrate our parser from postgres-parser to sqlparser.
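For illustration, parsing with sqlparser takes only a few lines (assuming the PostgreSqlDialect; which dialect we settle on is an open question):

use sqlparser::dialect::PostgreSqlDialect;
use sqlparser::parser::Parser;

fn main() {
    let sql = "SELECT v2 FROM t WHERE 3 > v1";
    // `parse_sql` returns one `Statement` AST node per statement.
    let ast = Parser::parse_sql(&PostgreSqlDialect {}, sql).unwrap();
    println!("{:#?}", ast);
}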
We can make a TPCHScanExecutor which generates the data of a TPC-H table, e.g.
INSERT INTO my_table SELECT * FROM gen.tpch.xxxx
... and we need a new RowSetWriter to write what's inside the builder to disk.
Parser:
- Parsing arithmetic expressions (+, -, * and /) @wangrunji0408
- Use sqlparser as the new parser @wangrunji0408
Binder:
- Binding arithmetic expressions (+, -, * and /) @MingjiHan99 @wangrunji0408
- 1.0 + 1 -> 1.0 + (1 cast as double) @wangrunji0408
Executor:
- ProjectionExecutor
- CreateTableExecutor and InsertExecutor @MingjiHan99
Storage:
- Implement the on-disk storage system (Segment and Block) @MingjiHan99

As we are adding more and more functions, it is getting very tedious to have so many match branches for statically dispatching functions. Maybe we can use the for_all_variants macro from RisingWave:
https://github.com/singularity-data/risingwave/blob/master/rust/common/src/array/mod.rs#L190
/// `for_all_variants` includes all variants of our array types. If you added a new array
/// type inside the project, be sure to add a variant here.
///
/// Every tuple has four elements, where
/// `{ enum variant name, function suffix name, array type, builder type }`
///
/// There are typically two ways of using this macro, pass token or pass no token.
/// See the following implementations for example.
#[macro_export]
macro_rules! for_all_variants {
    ($macro:tt $(, $x:tt)*) => {
        $macro! {
            [$($x),*],
            { Int16, int16, I16Array, I16ArrayBuilder },
            { Int32, int32, I32Array, I32ArrayBuilder },
            { Int64, int64, I64Array, I64ArrayBuilder },
            { Float32, float32, F32Array, F32ArrayBuilder },
            { Float64, float64, F64Array, F64ArrayBuilder },
            { UTF8, utf8, UTF8Array, UTF8ArrayBuilder },
            { Bool, bool, BoolArray, BoolArrayBuilder },
            { Decimal, decimal, DecimalArray, DecimalArrayBuilder },
            { Interval, interval, IntervalArray, IntervalArrayBuilder }
        }
    };
}

/// Define `ArrayImpl` with macro.
macro_rules! array_impl_enum {
    ([], $( { $variant_name:ident, $suffix_name:ident, $array:ty, $builder:ty } ),*) => {
        /// `ArrayImpl` embeds all possible array in `array` module.
        #[derive(Debug)]
        pub enum ArrayImpl {
            $( $variant_name($array) ),*
        }
    };
}
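For example, defining ArrayImpl then becomes a single invocation (the "pass no token" way mentioned in the doc comment):

for_all_variants! { array_impl_enum }

// which expands to:
// pub enum ArrayImpl {
//     Int16(I16Array),
//     Int32(I32Array),
//     ...
//     Interval(IntervalArray),
// }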
It would be very helpful if we could inspect the internal state of our storage engine using SQL queries, e.g.
SELECT rowset_id, size FROM internal.storage.rowsets WHERE table_id = 1;
I'll draft a detailed design doc about the storage engine, and add more docstrings to our codebase.
As our SQL layer is missing some features, and it's relatively hard to turn our benchmarks into SQL queries, we plan to add a new tool like "secondary-bench" to benchmark the performance and verify our storage engine's correctness on a large dataset. It would be very similar to RocksDB's db_bench tool, with the following functionality:
secondary-bench filltable <table name> <schema>
secondary-bench scan <table name> <column>
secondary-bench sort-scan <table name> <column>
secondary-bench compact <table name>
Now that we are using async everywhere in our system, we need to select a file I/O API. We need to be able to do positioned reads, which tokio does not support. We have several choices:
- Call std's read_at (https://doc.rust-lang.org/std/os/unix/fs/trait.FileExt.html) on a blocking thread, and send the result back through a channel (see the sketch after this list).
- Put a Mutex over Tokio's File, which means we can never concurrently read several blocks from a single file.
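A minimal sketch of the first option (read_block is a hypothetical helper, not our current API):

use std::os::unix::fs::FileExt;
use std::sync::Arc;

/// Offload the positioned read to tokio's blocking pool. `read_exact_at`
/// takes an absolute offset and does not move a cursor, so several blocks
/// of one file can be read concurrently.
async fn read_block(file: Arc<std::fs::File>, offset: u64, len: usize) -> std::io::Result<Vec<u8>> {
    tokio::task::spawn_blocking(move || {
        let mut buf = vec![0u8; len];
        file.read_exact_at(&mut buf, offset)?;
        Ok(buf)
    })
    .await
    .expect("blocking read task panicked")
}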
Currently, select count(x) from table acts as counting rows, which is not expected, and count(*) is simply not supported. For example,
> select * from t
+------+
| 1 |
| 2 |
| NULL |
+------+
> select count(v1) from t
+---+
| 3 |
+---+
# expected: 2
> select count(*) from t
thread 'main' panicked at 'not yet implemented: bind expression: Wildcard', src/binder/expression/mod.rs:85:18
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
# expected: 3
Therefore, we need to add full count support. This could be split into two steps (see the sketch below for the intended semantics):
- Implement the Count aggregator in https://github.com/singularity-data/risinglight/tree/main/src/executor/aggregation, and replace RowCount with Count in https://github.com/singularity-data/risinglight/blob/main/src/binder/expression/agg_call.rs, so that select count(x) will work correctly.
- Support select count(*).
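A minimal sketch of the intended semantics (assuming DataValue::Null is how our types module represents NULL):

struct Count {
    result: i64,
}

impl Count {
    /// count(x): only non-NULL inputs are counted.
    fn update(&mut self, value: &DataValue) {
        if !matches!(value, DataValue::Null) {
            self.result += 1;
        }
    }

    /// count(*): every row is counted, regardless of NULLs.
    fn update_row(&mut self) {
        self.result += 1;
    }
}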
After supporting InputRef, we can implement the following queries:
select avg(a) from t
select sum(a) + sum(b) from t
I feel that there might be something wrong in my implementation. For example, when I was implementing sorted scan, the logic that decides whether SeqScanExecutor should use a sorted scan ended up in the logical planner (#121).
When I was discussing with @pleiadesian where to generate InputRef from ColumnRef, I also found it hard to determine which stage this step belongs to.
Is there any spec or general convention about which stage should do what?
The overhead of clone will be greatly reduced.
Currently, RisingLight cannot be compiled or run on Windows, because we are using the Unix-only read_at extension. As our students might use Windows as their development environment, we need to add new storage backends (for reading).
Basically, all reads are handled in the Column structure: https://github.com/singularity-data/risinglight/blob/main/src/storage/secondary/column.rs
It's better to change file: Arc<std::fs::File> into an enum, e.g.
pub enum ColumnReadableFile {
    /// For `read_at`
    #[cfg(unix)]
    PositionedRead(Arc<std::fs::File>),
    /// For `file.lock().seek().read()`
    NormalRead(Arc<Mutex<tokio::fs::File>>),
    // In the future, we can even add minio / S3 file backends
}
And we should refactor the whole code path to use ColumnReadableFile instead of Arc<std::fs::File> throughout the storage system.
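A hedged sketch of how the read path could dispatch on the enum (read_block is a hypothetical helper, and tokio's Mutex is assumed for NormalRead):

impl ColumnReadableFile {
    pub async fn read_block(&self, offset: u64, len: usize) -> std::io::Result<Vec<u8>> {
        match self {
            #[cfg(unix)]
            Self::PositionedRead(file) => {
                use std::os::unix::fs::FileExt;
                let file = file.clone();
                // Positioned reads need no lock, so they can run concurrently.
                tokio::task::spawn_blocking(move || {
                    let mut buf = vec![0u8; len];
                    file.read_exact_at(&mut buf, offset)?;
                    Ok(buf)
                })
                .await
                .expect("blocking read task panicked")
            }
            Self::NormalRead(file) => {
                use tokio::io::{AsyncReadExt, AsyncSeekExt};
                // Seek + read must happen under one lock, so reads serialize.
                let mut file = file.lock().await;
                file.seek(std::io::SeekFrom::Start(offset)).await?;
                let mut buf = vec![0u8; len];
                file.read_exact(&mut buf).await?;
                Ok(buf)
            }
        }
    }
}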
This is useful for batch writes and data import. We can later simply:
INSERT INTO my_table SELECT * FROM 'table.tbl'
As discussed in https://singularity-data.larksuite.com/docs/docusneKe7PxHGG96UrUPwqZo6e, we plan to use a unified trait for all storage engines.
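A hedged sketch of what such a trait could look like, inferred purely from how InsertExecutor calls into storage in the expansion above (get_table, write, append, commit); the names and error type are illustrative, and the authoritative design is in the linked doc:

use async_trait::async_trait;

pub trait Storage: Send + Sync + 'static {
    type TableType: Table;
    fn get_table(&self, id: TableRefId) -> Result<Self::TableType, StorageError>;
}

#[async_trait]
pub trait Table: Send + Sync {
    type TransactionType: Transaction;
    async fn write(&self) -> Result<Self::TransactionType, StorageError>;
}

#[async_trait]
pub trait Transaction: Send {
    async fn append(&mut self, chunk: DataChunk) -> Result<(), StorageError>;
    async fn commit(self) -> Result<(), StorageError>;
}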
Importing a file might take some time to complete. It's good to show the progress.
create table t(v1 int not null, v2 int not null, v3 double not null)
insert into t values(1,4,2.5), (2,3,3.2), (3,4,4.7), (4,3,5.1)
select sum(v1+v2),sum(v1+v3) from t;
bind error: binary operator types mismatch: Int(None) != Double
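The fix likely mirrors the 1.0 + 1 -> 1.0 + (1 cast as double) rule from the binder work above: when binding a binary expression, insert an implicit cast so both operands agree. A self-contained sketch with toy types (not RisingLight's actual binder API):

#[derive(Clone, Copy, PartialEq, Debug)]
enum TypeKind { Int, Double }

#[derive(Debug)]
enum BoundExpr {
    Column { ty: TypeKind },
    Cast { ty: TypeKind, child: Box<BoundExpr> },
}

impl BoundExpr {
    fn ty(&self) -> TypeKind {
        match self {
            BoundExpr::Column { ty } | BoundExpr::Cast { ty, .. } => *ty,
        }
    }
}

/// Unify operand types: Int op Double binds as Double op Double.
fn unify(lhs: BoundExpr, rhs: BoundExpr) -> (BoundExpr, BoundExpr) {
    let cast = |e: BoundExpr| BoundExpr::Cast { ty: TypeKind::Double, child: Box::new(e) };
    match (lhs.ty(), rhs.ty()) {
        (TypeKind::Int, TypeKind::Double) => (cast(lhs), rhs),
        (TypeKind::Double, TypeKind::Int) => (lhs, cast(rhs)),
        _ => (lhs, rhs),
    }
}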
#69 (comment)
As mentioned earlier, constructing a visibility bitmap for every unique group key incurs high time and space complexity. Instead, if we use the row-by-row update in the current implementation, we can avoid the cost of bitmap construction.
Originally posted by @pleiadesian in #69 (comment)
It is possible that users will ingest a large amount of data into the engine, so we need to periodically flush the memtable to disk. The flush threshold could be configured through StorageOptions, like target_rowset_size.
Currently, the in-memory representation of RisingLight's data is simply a vector of data chunks. When doing updates and deletions, this could be highly inefficient. We should find a way to optimize this.