Raft Engine

Raft Engine is a persistent embedded storage engine with a log-structured design similar to Bitcask. It is built for TiKV to store Multi-Raft logs.

Features

  • APIs for storing and retrieving protobuf log entries with consecutive indexes
  • Key-value storage for individual Raft Groups
  • Minimum write amplification
  • Collaborative garbage collection
  • Supports lz4 compression over log entries
  • Supports file system extension

Design

Raft Engine consists of two basic constructs: memtable and log file.

In memory, each Raft Group holds its own memtable, containing all the key value pairs and the file locations of all log entries. On storage, user writes are sequentially written to the active log file, which is rotated once it grows beyond a configurable size threshold. Different Raft Groups share the same log stream.

Write

Similar to RocksDB, Raft Engine provides atomic writes. Users can stash the changes into a log batch before submitting.

The writing of one log batch can be broken down into three steps:

  1. Optionally compress the log entries
  2. Write to log file
  3. Apply to memtable

At step 2, to group concurrent requests, each writing thread must enter a queue. The first in line automatically becomes the queue leader, responsible for writing the entire group to the log file.

Both synchronous and non-synchronous writes are supported. When any write in a batch is marked synchronous, the queue leader calls fdatasync() after writing. This way, buffered data is guaranteed to be flushed onto storage.

After its data is written, each writing thread proceeds to apply the changes to the memtables on its own.
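The write path above can be sketched with a self-contained toy model (`ToyBatch`, `ToyEngine`, and the in-memory "file" are invented for illustration; compression and the group-commit queue are left out):

```rust
use std::collections::BTreeMap;

// Toy model of the write path: stage changes in a batch, append the
// payloads to a log "file", then apply the file locations to a memtable.
struct ToyBatch {
    region_id: u64,
    entries: Vec<(u64, Vec<u8>)>, // (index, payload)
}

struct ToyEngine {
    log_file: Vec<u8>, // stands in for the active log file
    memtable: BTreeMap<(u64, u64), (usize, usize)>, // (region, index) -> (offset, len)
}

impl ToyEngine {
    fn write(&mut self, batch: &ToyBatch) {
        // Step 2: append every entry payload to the log file sequentially.
        let mut locations = Vec::new();
        for (index, payload) in &batch.entries {
            let offset = self.log_file.len();
            self.log_file.extend_from_slice(payload);
            locations.push((*index, offset, payload.len()));
        }
        // Step 3: apply the file locations to the memtable.
        for (index, offset, len) in locations {
            self.memtable.insert((batch.region_id, index), (offset, len));
        }
    }

    fn get(&self, region: u64, index: u64) -> Option<&[u8]> {
        let (offset, len) = *self.memtable.get(&(region, index))?;
        Some(&self.log_file[offset..offset + len])
    }
}
```

Reads consult only the memtable for locations and then fetch bytes from the log, which mirrors how the real engine serves entries.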

Garbage Collection

After changes are applied to the local state machine, the corresponding log entries can be logically compacted from Raft Engine. Because multiple Raft Groups share the same log stream, these truncated logs punch holes in the log files. During garbage collection, Raft Engine scans for these holes and compacts log files to free up storage space. Only at this point are the unneeded log entries physically deleted.

Raft Engine carries out garbage collection in a collaborative manner.

First, its timing is controlled by the user. Raft Engine consolidates and removes its log files only when the user voluntarily calls the purge_expired_files() routine. For reference, TiKV calls it every 10 seconds by default.

Second, it sends useful feedback to the user. Each time the GC routine is called, Raft Engine will examine itself and return a list of Raft Groups that hold particularly old log entries. Those log entries block the GC progress and should be compacted by the user.
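The feedback step can be modeled as a simple scan; this sketch (with an invented `groups_blocking_gc` helper) returns the Raft Groups whose oldest log entries still pin old files:

```rust
// Return the Raft Groups whose oldest log entry lives in a file at or
// below `threshold_file`, i.e. the groups pinning old log files and
// blocking garbage collection until the user compacts them.
fn groups_blocking_gc(oldest_file_per_group: &[(u64, u64)], threshold_file: u64) -> Vec<u64> {
    oldest_file_per_group
        .iter()
        .filter(|&&(_, oldest_file)| oldest_file <= threshold_file)
        .map(|&(group, _)| group)
        .collect()
}
```

The user would then call its compaction logic for the returned groups before the next GC cycle.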

Using this crate

Put this in your Cargo.toml:

[dependencies]
raft-engine = "0.4"

Available Cargo features:

  • scripting: Compiles with Rhai. This enables script debugging utilities including unsafe_repair.
  • nightly: Enables nightly-only features including test.
  • internals: Re-exports key components internal to Raft Engine. Enabled when building for docs.rs.
  • failpoints: Enables fail point testing powered by tikv/fail-rs.
  • swap: Use SwappyAllocator to limit the memory usage of Raft Engine. The memory budget can be configured with "memory-limit". Depends on the nightly feature.
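For example, to opt into one of these features (swap here, which also requires a nightly toolchain), the dependency line can be extended:

```toml
[dependencies]
raft-engine = { version = "0.4", features = ["swap"] }
```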

See some basic use cases under the examples directory.

Contributing

Contributions are always welcome! Here are a few tips for making a PR:

  • All commits must be signed off (with git commit -s) to pass the DCO check.
  • Tests are automatically run against the changes; some of them can be run locally:
# run tests with nightly features
make
# run tests on stable toolchain
make WITH_STABLE_TOOLCHAIN=force
# filter a specific test case
make test EXTRA_CARGO_ARGS=<testname>
  • For changes that might induce performance effects, please quote the targeted benchmark results in the PR description. In addition to micro-benchmarks, there is a standalone stress test tool which you can use to demonstrate the system performance.
cargo +nightly bench --all-features <bench-case-name>
cargo run --release --package stress -- --help

License

Copyright (c) 2017-present, PingCAP, Inc. Released under the Apache 2.0 license. See LICENSE for details.


raft-engine's Issues

Support to trigger rewrite by garbage ratio

Now the rewrite routine is triggered by the size of the append log queue. If we support triggering the rewrite routine by the garbage ratio of the append log queue instead, this engine can be more suitable for some situations, for example a normal HashMap + WAL.
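A garbage-ratio trigger could be sketched as follows (`QueueStats` and its fields are hypothetical, not the engine's real bookkeeping):

```rust
// Trigger a rewrite when the fraction of dead bytes in the append queue
// exceeds a configured ratio, instead of when the total size does.
struct QueueStats {
    total_bytes: u64,
    dead_bytes: u64, // bytes belonging to already-compacted entries
}

fn should_rewrite(stats: &QueueStats, garbage_ratio_threshold: f64) -> bool {
    if stats.total_bytes == 0 {
        return false; // empty queue: nothing to rewrite
    }
    stats.dead_bytes as f64 / stats.total_bytes as f64 >= garbage_ratio_threshold
}
```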

Relax the lock during writing

The current write procedure:

sort memtables
lock all memtables
    lock pipe log -> append pipe log -> unlock pipe log
    write memtables
unlock memtables

At least it can be improved to

lock pipe log
    append pipe log
    lock all memtables
unlock pipe log
write memtables
unlock memtables

And maybe there are more potential improvements.
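The improved ordering can be expressed directly with lock-guard scopes in Rust (a schematic with plain `Mutex<Vec<u8>>` stand-ins, not the actual engine code):

```rust
use std::sync::Mutex;

// Schematic of the relaxed locking order: hold the pipe log lock only for
// the append, but acquire the memtable lock before releasing it so apply
// order still matches append order.
fn write(pipe_log: &Mutex<Vec<u8>>, memtable: &Mutex<Vec<u8>>, payload: &[u8]) {
    let memtable_guard = {
        let mut pipe = pipe_log.lock().unwrap(); // lock pipe log
        pipe.extend_from_slice(payload);         // append pipe log
        memtable.lock().unwrap()                 // lock memtables
    }; // pipe log guard dropped here: unlock pipe log
    let mut mem = memtable_guard;
    mem.extend_from_slice(payload);              // write memtables
} // unlock memtables
```

Because the memtable lock is taken while the pipe log lock is still held, no other writer can slip between the append and the apply.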

Recycle log files

File allocation incurs a big overhead, as #100 showed. We could reduce that overhead by reusing old log files. But directly recycling is not safe, because old log items could resurface under certain conditions. One way to overcome this is to take the file id into account when calculating the checksum of each log batch.
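Folding the file id into the per-batch checksum can be sketched with the standard library hasher (illustrative only; the real engine uses its own checksum over the batch):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Mix the file id into the per-batch checksum so a stale record left over
// in a recycled file fails verification under the new file id.
fn batch_checksum(file_id: u64, payload: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    file_id.hash(&mut h);
    payload.hash(&mut h);
    h.finish()
}

fn verify(file_id: u64, payload: &[u8], checksum: u64) -> bool {
    batch_checksum(file_id, payload) == checksum
}
```

A record written under file id 7 and later read back from the same blocks after the file is recycled as id 8 no longer verifies, so recovery can safely treat it as garbage.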

Release Roadmap

When will this repo be used in TiKV? Is there a guide for using it in a self-built TiKV?

Make Raft Engine production ready for TiKV

Development Task

As of now Raft Engine is already wired into TiKV, and can be enabled by TiKV's raft-engine configuration group. But we haven't announced this feature publicly, and do not encourage using it in production.

This task aims to change that situation and make Raft Engine ready for production. We will be tracking the progress here and the detailed design in this internal doc.

Staff: @tabokie @MrCroxx

Time estimation: currently shooting for late 2021

Task breakdown:

Interface

  • API for storage access
    • Basic file system interface for creating IO wrapper: #91
    • Strongly typed file system interface: #96
  • Customized storage layer for TiKV: tikv/tikv#10937
  • API for customized thread pool (optional)

Observability

Performance

  • Fast recovery: 6x faster
    • New log format for recovering indexes: #83
    • Parallel recovery: #97
  • Fast log batch encoding: 3x faster
    • Streaming encoding: #91
    • Reduce allocation: #98
  • Optimize file operation: 2.5x throughput, #100
  • Benchmark and best practice for usage within TiKV
    • Known issue: IOPS explosion, Fix: #116

Safety

  • Improve handling of IO error: #131
  • Unit test

Tools

  • Standalone performance benchmark: #84
  • Data administration tool: #99

Integration test

  • Benchmark with TPC-C, sysbench
  • High pressure workload with random failures

Misc

  • Remove entry cache: #77
  • Update README
  • Compile with stable Rust (optional)
  • Build on Windows (optional)

Change `PipeLog` to a trait

So that we can use a black-hole implementation or a simple in-memory implementation to benchmark Raft Engine without any disk I/O.
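A minimal version of the idea might look like this (the trait shape and names are hypothetical, not the crate's actual `PipeLog` signature):

```rust
// A pluggable log-storage trait plus a black-hole implementation that
// discards writes, letting benchmarks exclude disk I/O entirely.
trait PipeLog {
    fn append(&mut self, bytes: &[u8]) -> u64; // returns the record's offset
    fn total_size(&self) -> u64;
}

struct BlackHolePipeLog {
    offset: u64,
}

impl PipeLog for BlackHolePipeLog {
    fn append(&mut self, bytes: &[u8]) -> u64 {
        let at = self.offset;
        self.offset += bytes.len() as u64; // track offsets, keep no data
        at
    }
    fn total_size(&self) -> u64 {
        self.offset
    }
}
```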

Add a benchmark framework

Raft Engine is under active development and changes every day, so a benchmark framework can help us find performance degradations.

Add administration tool for recovery from data corruption

Development Task

This task attempts to add a data administration tool similar to ldb offered by RocksDB.

Raft Engine is unable to start when it encounters data corruption (checksum mismatch, unexpected EOF). But users might employ data redundancy at the application level. So we should allow users to manually decide how to deal with corrupted data, and recover the data files to a consistent state once the affected Raft Groups are properly handled.

Workload estimation: M (two man-weeks)

Task breakdown:

  • Lock file: multiple processes can't operate on the same db directory simultaneously: #102
  • Dump: output operations in log files
  • Consistency check*: scan for holes in raft group and truncate corrupted file tails: #132
  • Unsafe recover**: recover to consistent state by modifying log files
  • A standalone binary that encapsulates the two methods as sub-commands: #144

* Ideally users should be able to extract the remaining log entries for more complex recovery logic, but that is too cumbersome for this phase.

** For now, we only offer three options for unsafe recovery:

  1. Remove the entire raft group in question
  2. Remove log entries before a given index, making the rest of log entries in consistent state
  3. Fill in empty messages to repair log entry holes
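Option 3 amounts to synthesizing placeholder entries for the missing indexes; here is a sketch of the hole-finding step (`holes_to_fill` is an invented helper):

```rust
// Given the sorted indexes actually present in a raft group, produce the
// missing indexes in [first..=last] that need empty filler entries.
fn holes_to_fill(present: &[u64]) -> Vec<u64> {
    let (first, last) = match (present.first(), present.last()) {
        (Some(&f), Some(&l)) => (f, l),
        _ => return Vec::new(), // no entries, nothing to repair
    };
    let mut missing = Vec::new();
    let mut iter = present.iter().peekable();
    for idx in first..=last {
        if iter.peek() == Some(&&idx) {
            iter.next(); // index present, keep going
        } else {
            missing.push(idx); // hole: synthesize an empty message here
        }
    }
    missing
}
```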

[Feature Request] Support Multiple Paths

Currently, TiFlash already supports using multiple paths to store data. This is necessary in many cases; for example, users tend to attach many slower and cheaper disks to a single TiFlash machine, and due to CPU and RAM limitations, deploying one TiFlash node per disk is not practical. So the preferred deployment mode is one TiFlash instance with many disks. With multiple disks supported, we can utilize the IOPS and bandwidth of all of them.

Since TiFlash uses TiKV as the Raft front end to receive Raft logs, we need raft-engine to also support multiple disks.

Interrupted write causes corruption

At this line, we use write_all to append some bytes:

self.writer.write_all(buf)?;

If this write is interrupted, we directly bubble up its error. But some portion of the data might already have been written. In this case, self.written becomes inconsistent with the underlying writer's internal offset, and the fractured write remains as a phantom record.
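One way to keep the tracked offset consistent is to roll back to the last known-good length when `write_all` fails partway. A self-contained sketch with an invented `FlakyWriter` that simulates the interruption:

```rust
use std::io::{self, Write};

// A writer that accepts only `cap` bytes, then errors: it simulates an
// interrupted write that leaves a partial (phantom) record behind.
struct FlakyWriter {
    buf: Vec<u8>,
    cap: usize,
}

impl Write for FlakyWriter {
    fn write(&mut self, data: &[u8]) -> io::Result<usize> {
        let room = self.cap.saturating_sub(self.buf.len());
        if room == 0 {
            return Err(io::Error::new(io::ErrorKind::Other, "simulated write failure"));
        }
        let n = room.min(data.len());
        self.buf.extend_from_slice(&data[..n]);
        Ok(n)
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

// Append a record; on failure, truncate back so no partial record remains.
fn append_record(w: &mut FlakyWriter, record: &[u8]) -> io::Result<()> {
    let committed = w.buf.len();
    if let Err(e) = w.write_all(record) {
        w.buf.truncate(committed); // discard the fractured tail
        return Err(e);
    }
    Ok(())
}
```

With a real file the rollback would be a seek plus truncate to the committed offset rather than a `Vec::truncate`.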

Meet a problem while upgrade tikv to v5.1.0 and enable raft-engine

hi,

I'm trying to upgrade my TiKV cluster to v5.1.0 to test raft-engine's performance, but I ran into a problem.

My old cluster uses two disks to store raftdb and rocksdb (by mounting /var/lib/tikv/raft on another disk A). But when I try to do the upgrade, I find that raft-engine uses the /var/lib/tikv/raft-engine dir. If I remount disk A to /var/lib/tikv/raft-engine, TiKV cannot dump the old raftdb's logs.

So my question is: how can I upgrade my cluster and use raft-engine with two disks without losing my old Raft logs?

Thx~

introduce file manifest

Memtable recovery logic is over-complicated because we have to handle intermediate states during GC. It would be much easier if we could maintain the file states in a persistent manifest.

append_compact_purge example never compacts

Found in #237

In this code snippet:

let mut e = entry.clone();
e.index = state.last_index + 1;
batch.add_entries::<MessageExtTyped>(region, &[e]).unwrap();
batch
    .put_message(region, b"last_index".to_vec(), &state)
    .unwrap();
engine.write(&mut batch, false).unwrap();
if state.last_index % compact_offset == 0 {
    let rand_compact_offset = rand_compacts.next().unwrap();
    if state.last_index > rand_compact_offset {
        let compact_to = state.last_index - rand_compact_offset;
        engine.compact_to(region, compact_to);
        println!("[EXAMPLE] compact {} to {}", region, compact_to);
    }
}

state.last_index is not updated in sync with the entry index, so entries will never be compacted.

Question: Rewritten file can not be rewritten again?

I noticed that raft-engine has an interesting feature:

During garbage collection, Raft Engine scans for these holes and compacts log files to free up storage space.

Reading the code, it seems that after these *.raftlog files are compacted (rewritten), they become *.rewrite files.
And there is no chance for them to become *.raftlog files again, which indicates that *.rewrite files cannot be rewritten again?

Reduce memory footprint

Memtables can take up a lot of memory when log entries are not compacted in time.

  • Add memory tracing
  • Reuse file block handle for entry indexes inside the same log batch.
  • Change entry offset and entry length to u32
  • Store entry offset in file instead of in memory
  • Implement a mmap-based allocator for memtable
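The payoff of the u32 change can be checked with `std::mem::size_of` (the two index layouts are illustrative, not the engine's actual entry index type):

```rust
// Illustrative entry-index layouts: narrowing entry offset and length
// from u64 to u32 shrinks each in-memory index record.
struct WideIndex {
    file_id: u64,
    entry_offset: u64,
    entry_len: u64,
}

struct NarrowIndex {
    file_id: u64,
    entry_offset: u32,
    entry_len: u32,
}

fn sizes() -> (usize, usize) {
    (
        std::mem::size_of::<WideIndex>(),
        std::mem::size_of::<NarrowIndex>(),
    )
}
```

With millions of cached entry indexes, shaving a few words per record adds up quickly.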

Raft Engine seems to deadlock

I'm loading data into TiKV, and one of the nodes is somehow stuck. The stack trace looks like the following.

  70   Thread 0x7f3faa67e700 (LWP 22442) "background-1" 0x00007f3fb3ac154d in __lll_lock_wait () from /lib64/libpthread.so.0
  32   Thread 0x7f3f837df700 (LWP 22496) "raftstore-4-0" 0x00007f3fb3ac154d in __lll_lock_wait () from /lib64/libpthread.so.0
  31   Thread 0x7f3f835de700 (LWP 22497) "raftstore-4-1" 0x00007f3fb3ac154d in __lll_lock_wait () from /lib64/libpthread.so.0

Thread 70 (Thread 0x7f3faa67e700 (LWP 22442)):
#0  0x00007f3fb3ac154d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f3fb3abce9b in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007f3fb3abcd68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000055dd465fc5fa in raft_engine::purge::PurgeManager$LT$E$C$W$C$P$GT$::purge_expired_files::ha4e7409244ce5c8d () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#4  0x000055dd46619659 in _$LT$raft_log_engine..engine..RaftLogEngine$u20$as$u20$engine_traits..raft_engine..RaftEngine$GT$::purge_expired_files::h6267e943f2a6fbfa () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#5  0x000055dd46dfe7ba in raftstore::store::worker::raftlog_gc::Runner$LT$EK$C$ER$C$R$GT$::flush::h2c1d38c097dfcf3d () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#6  0x000055dd46b7367c in _$LT$raftstore..store..worker..raftlog_gc..Runner$LT$EK$C$ER$C$R$GT$$u20$as$u20$tikv_util..worker..pool..Runnable$GT$::run::h64bbe6d2c6cb9fa9 () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#7  0x000055dd46d5e7a3 in _$LT$core..future..from_generator..GenFuture$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h3b0d350c770ce605 () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#8  0x000055dd47b2b881 in _$LT$yatp..task..future..Runner$u20$as$u20$yatp..pool..runner..Runner$GT$::handle::hc59a7a64f171a303 ()
#9  0x000055dd46ca7cef in yatp::pool::worker::WorkerThread$LT$T$C$R$GT$::run::h33ca1869d4641979 () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#10 0x000055dd46be7c65 in std::sys_common::backtrace::__rust_begin_short_backtrace::h40fdd7e5cdbd32b8 () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#11 0x000055dd470ef5d5 in std::panicking::try::do_call::h19dc3bce6576f8aa () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#12 0x000055dd46bfbe4d in core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::hc737d6d377959610 () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#13 0x000055dd473389d7 in call_once<(),FnOnce<()>,alloc::alloc::Global> () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/alloc/src/boxed.rs:1546
#14 call_once<(),alloc::boxed::Box<FnOnce<()>, alloc::alloc::Global>,alloc::alloc::Global> () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/alloc/src/boxed.rs:1546
#15 std::sys::unix::thread::Thread::new::thread_start::hfbe13ead469fd0bc () at library/std/src/sys/unix/thread.rs:71
#16 0x00007f3fb3abaea5 in start_thread () from /lib64/libpthread.so.0
#17 0x00007f3fb30c38dd in clone () from /lib64/libc.so.6

Thread 32 (Thread 0x7f3f837df700 (LWP 22496)):
#0  0x00007f3fb3ac154d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f3fb3abce9b in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007f3fb3abcd68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000055dd46600e33 in raft_engine::engine::Engine$LT$E$C$W$C$P$GT$::apply_to_memtable::haf3ea2cc1eee62a5 () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#4  0x000055dd46601c29 in raft_engine::engine::Engine$LT$E$C$W$C$P$GT$::write::h775bbd18d2b90939 () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#5  0x000055dd466190e5 in _$LT$raft_log_engine..engine..RaftLogEngine$u20$as$u20$engine_traits..raft_engine..RaftEngine$GT$::consume_and_shrink::h1fa296df8acad832 () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#6  0x000055dd46afc60d in raftstore::store::fsm::store::RaftPoller$LT$EK$C$ER$C$T$GT$::handle_raft_ready::h1ed29fbb2ac0ec36 () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#7  0x000055dd468ee4f0 in _$LT$raftstore..store..fsm..store..RaftPoller$LT$EK$C$ER$C$T$GT$$u20$as$u20$batch_system..batch..PollHandler$LT$raftstore..store..fsm..peer..PeerFsm$LT$EK$C$ER$GT$$C$raftstore..store..fsm..store..StoreFsm$LT$EK$GT$$GT$$GT$::end::h72af608d6cce3a73 () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#8  0x000055dd46b86e6d in batch_system::batch::Poller$LT$N$C$C$C$Handler$GT$::poll::h10865c3245dddabb () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#9  0x000055dd46be9467 in std::sys_common::backtrace::__rust_begin_short_backtrace::h6f5de66a8bb8712e () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#10 0x000055dd470f3c04 in std::panicking::try::do_call::hade66f03795540dd () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#11 0x000055dd46bf91ac in core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h4f7954cf4e4bd9b9 () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#12 0x000055dd473389d7 in call_once<(),FnOnce<()>,alloc::alloc::Global> () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/alloc/src/boxed.rs:1546
#13 call_once<(),alloc::boxed::Box<FnOnce<()>, alloc::alloc::Global>,alloc::alloc::Global> () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/alloc/src/boxed.rs:1546
#14 std::sys::unix::thread::Thread::new::thread_start::hfbe13ead469fd0bc () at library/std/src/sys/unix/thread.rs:71
#15 0x00007f3fb3abaea5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f3fb30c38dd in clone () from /lib64/libc.so.6

Thread 31 (Thread 0x7f3f835de700 (LWP 22497)):
#0  0x00007f3fb3ac154d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f3fb3abce9b in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007f3fb3abcd68 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000055dd46600e33 in raft_engine::engine::Engine$LT$E$C$W$C$P$GT$::apply_to_memtable::haf3ea2cc1eee62a5 () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#4  0x000055dd46601c29 in raft_engine::engine::Engine$LT$E$C$W$C$P$GT$::write::h775bbd18d2b90939 () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#5  0x000055dd466190e5 in _$LT$raft_log_engine..engine..RaftLogEngine$u20$as$u20$engine_traits..raft_engine..RaftEngine$GT$::consume_and_shrink::h1fa296df8acad832 () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#6  0x000055dd46afc60d in raftstore::store::fsm::store::RaftPoller$LT$EK$C$ER$C$T$GT$::handle_raft_ready::h1ed29fbb2ac0ec36 () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#7  0x000055dd468ee4f0 in _$LT$raftstore..store..fsm..store..RaftPoller$LT$EK$C$ER$C$T$GT$$u20$as$u20$batch_system..batch..PollHandler$LT$raftstore..store..fsm..peer..PeerFsm$LT$EK$C$ER$GT$$C$raftstore..store..fsm..store..StoreFsm$LT$EK$GT$$GT$$GT$::end::h72af608d6cce3a73 () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#8  0x000055dd46b86e6d in batch_system::batch::Poller$LT$N$C$C$C$Handler$GT$::poll::h10865c3245dddabb () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#9  0x000055dd46be9467 in std::sys_common::backtrace::__rust_begin_short_backtrace::h6f5de66a8bb8712e () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#10 0x000055dd470f3c04 in std::panicking::try::do_call::hade66f03795540dd () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#11 0x000055dd46bf91ac in core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h4f7954cf4e4bd9b9 () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/core/src/slice/mod.rs:568
#12 0x000055dd473389d7 in call_once<(),FnOnce<()>,alloc::alloc::Global> () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/alloc/src/boxed.rs:1546
#13 call_once<(),alloc::boxed::Box<FnOnce<()>, alloc::alloc::Global>,alloc::alloc::Global> () at /rustc/16bf626a31cb5b121d0bca2baa969b4f67eb0dab/library/alloc/src/boxed.rs:1546
#14 std::sys::unix::thread::Thread::new::thread_start::hfbe13ead469fd0bc () at library/std/src/sys/unix/thread.rs:71
#15 0x00007f3fb3abaea5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f3fb30c38dd in clone () from /lib64/libc.so.6

The only configuration change is:

[raft-engine]
enable = true

Change `Engine` generic type

While working on the issue "Update the version of raft-engine as it is out of date", we found that raft-engine can't export the Engine::open method. See below.

TiKV

    pub fn new(config: RaftEngineConfig) -> Result<Self> {
        Ok(RaftLogEngine(
            RawRaftEngine::open(config).map_err(transfer_error)?,
        ))
    }

Raft-engine

pub type Engine<M, FileBuilder = file_builder::DefaultFileBuilder> = engine::Engine<M, FileBuilder>;

impl<M> Engine<M, DefaultFileBuilder, FilePipeLog<DefaultFileBuilder>>
where
    M: MessageExt,
{
    pub fn open(
        cfg: Config,
    ) -> Result<Engine<M, DefaultFileBuilder, FilePipeLog<DefaultFileBuilder>>> {
        Self::open_with_listeners(cfg, vec![])
    }
}

We need to change

pub type Engine<M, FileBuilder = file_builder::DefaultFileBuilder> = engine::Engine<M, FileBuilder>;

to

pub type Engine<M, FileBuilder = file_builder::DefaultFileBuilder, FilePipe = file_pipe_log::FilePipeLog<file_builder::DefaultFileBuilder>> = engine::Engine<M, FileBuilder, FilePipe>;

Speed up read performance with AIO

  1. Expand the FileSystem interface to provide async tasking capability:
trait AsyncContext {
  fn wait() -> Result<()>;
}

trait FileSystem {
  type AsyncIoContext: AsyncContext;
  fn new_async_context() -> Self::AsyncIoContext;
  fn read(ctx: &mut Self::AsyncIoContext, offset: u64, size: usize) -> IoResult<usize>;
}
  2. Modify the read_entry implementation to generate read requests for all file blocks, then wait for their completion.

For best kernel compatibility, use AIO to implement the first version. Reference: https://docs.rs/nix/latest/nix/sys/aio/fn.aio_suspend.html

Stabilize this crate

Unstable features currently in use:

  • shrink_to: #192
  • btree_drain_filter: #190
  • generic_associated_types: #189
  • test: #188

After they are removed,

  • Add CI build on stable toolchain

Support secondary log directory

It can be useful to configure Raft Engine on a separate (smaller) disk. And when that disk is full, we should provide a way for raft-engine to place newly created log files on the main disk volume.

Key value tombstone is lost during rewrite

Bug Report

Description

During an Append queue rewrite, key value pairs of the targeted Raft Groups are scanned out and rewritten to the Rewrite queue. After that, the data files in the Append queue are deleted.

However, in this process, key value tombstones (delete-key operations) can't be detected and rewritten. Suppose a rewrite is triggered after a Put and before a later Delete: the old key value is stored in the Rewrite queue, but the subsequent deletion marker is purged from the Append queue. When the engine restarts, the old key value will resurface.

Right now, TiKV doesn't use the kv delete API, so this bug is of minor urgency.

Unify File abstraction

In order to support file system extensions, e.g. encryption and rate limiting, we introduced the FileBuilder trait in #96. It is built on top of std::io::Read and Write, which don't provide the low-level functionality required by a high-performance log writer. So we kept the old LogFd implementation alongside the new abstraction.

Moving forward, it's better to unify both into a new set of abstractions:

pub trait FileSystem: Send + Sync {
  type Handle: Clone + Send + Sync;
  type Reader: Seek + Read + Send;
  type Writer: Seek + Write + WriteExt + Send;

  fn create<P: AsRef<Path>>(&self, path: P) -> Self::Handle;
  fn open<P: AsRef<Path>>(&self, path: P) -> Self::Handle;
  fn new_reader(&self, handle: &Self::Handle) -> Self::Reader;
  fn new_writer(&self, handle: &Self::Handle) -> Self::Writer;
  fn file_size(&self, handle: &Self::Handle) -> IoResult<usize>;
}

pub trait WriteExt {
  fn finish(&self) -> IoResult<()>;
}

Enable log recycling by default

Description

According to PR #224, we have made Raft Engine support log recycling with the configuration enable-log-recycle: true. The testing results shown in that PR have also proved that this feature has positive effects on write workloads.
So, starting from version 0.3, we want to enable it by default.

Related changes

The default values of format-version and enable-log-recycle should be modified respectively:

  • The default value of format-version should be changed to 2 (originally 1);
  • The default value of enable-log-recycle should be changed to true (originally false).

Raft Group tombstone is rewritten in the wrong order

Bug Report

Description

When a Raft Group is deleted (via a Clean command), a tombstone is written to the log file and added to the MemTableAccessor. During rewrite, the log file holding that tombstone is purged, but the tombstone in the memtable can still be retrieved and rewritten to the Rewrite queue.

However, consider an Append file containing these operations: [append1 (..), clean, append2 (..)]. During rewrite, the tombstone is actually rewritten in the wrong order: [(..), append2 (..), clean]. This will cause the newly recreated Raft Group to disappear after restart.

TiKV guarantees to not recreate a region with the same ID, so this bug is of low severity.

Support `dump` subcommand in ctl

As #148 has been merged, this issue aims to support the dump subcommand, that is:

Dump: dump out all operations in log files

ctl dump --file /path/to/file [--raft id1,id2,id3]
ctl dump --path /path path [--raft id1,id2,id3]

For more detail, please see #144

Support synchronize WAL in parallel

Write to multiple WALs concurrently to leverage hardware parallelism.

Raft Engine provides a WriteToken that maps to a WAL stream underneath. If WriteToken is not provided, it will write to the default WAL stream.

User guarantees there will be no concurrent writing to the same region using different tokens. But writes to a region are allowed to be sent to different WALs at different time.

There are two ways to establish a total order between logs across multiple log streams:

(1) Raft Engine is aware of the Raft term of the log entries. Key-values use the term of the latest log in the region.

(2) Raft Engine internally manages a sequence number for each region.
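Option (2) could look like a per-region counter stamped onto every record, plus a replay step that sorts by sequence number (all names here are invented):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Per-region sequence numbers give one region's records a total order
// even when they are written to different WAL streams.
struct SeqAllocator {
    next: Mutex<HashMap<u64, u64>>, // region id -> next sequence number
}

impl SeqAllocator {
    fn allocate(&self, region: u64) -> u64 {
        let mut map = self.next.lock().unwrap();
        let seq = map.entry(region).or_insert(0);
        let current = *seq;
        *seq += 1;
        current
    }
}

// During recovery, records gathered from all streams for one region are
// replayed in sequence-number order.
fn replay_order(mut records: Vec<(u64, &'static str)>) -> Vec<&'static str> {
    records.sort_by_key(|&(seq, _)| seq);
    records.into_iter().map(|(_, op)| op).collect()
}
```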

Remove prometheus dependency

Clients might not want to, or might not be able to, use Prometheus to monitor system statistics. We should replace the Prometheus metrics with lightweight counters, and expose public interfaces to query those metrics.

Counters and gauges can be replaced with atomics. Histograms can be replaced with the HdrHistogram crate.
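A lightweight counter backed by an atomic might look like this (a sketch of the proposal, not an existing raft-engine API):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Lightweight counter replacing a Prometheus metric: cheap to bump on
// the hot path, queryable through a public getter.
pub struct Counter {
    value: AtomicU64,
}

impl Counter {
    pub const fn new() -> Self {
        Counter { value: AtomicU64::new(0) }
    }
    pub fn inc_by(&self, n: u64) {
        self.value.fetch_add(n, Ordering::Relaxed);
    }
    pub fn get(&self) -> u64 {
        self.value.load(Ordering::Relaxed)
    }
}

// A global metric the embedding application can poll on its own schedule.
pub static WRITE_BYTES: Counter = Counter::new();
```

The embedding application decides how to export `get()` values, whether to Prometheus, a log line, or nothing at all.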

Add ctl tool for Raft Engine

We need a ctl tool to do data administration work mentioned in #99.

Some subcommands I have in mind:

  • Dump: dump out all operations in log files
    • ctl dump --file /path/to/file [--raft id1,id2,id3]
    • ctl dump --path /path path [--raft id1,id2,id3]
  • Consistency check: check log files for logical errors
    • ctl check --path /path
  • Truncate: unsafely truncate log entries in log files
    • ctl truncate --path /path --mode front/back/all --queue append/rewrite/all [--raft id1,id2,id3]
  • Autofill: repair log entry holes by fill in empty message
    • ctl autofill --path /path --queue append/rewrite/all [--raft id1,id2,id3]

We can use structopt to build a skeleton for this tool and fill in the implementation later.

Support compression stream

Currently compression is done per write. But there are a lot of duplicated bytes between Raft log entries; if compression is enabled over a stream, the compression ratio can be optimized further.

Speed up the write implementation

Here is a flame graph:
enable.svg.zip
We can see that when writing a log batch into Raft Engine, there are two main parts:

  • applying to the memtable, and
  • encoding the log batch into bytes (compression included).

Maybe we can improve them.

[Bug] First log has no records when `target-file-size < LogFileFormat::encoded_len()`

Referring to PR #269: after we refactored the sync logic, we found that the first log file (file_seq == 1) is always empty when setting target-file-size < LogFileFormat::encoded_len().

The root cause is that we check whether to rotate before each real write, during the processing of append:

if active_file.writer.offset() >= self.target_file_size {
    if let Err(e) = self.rotate_imp(&mut active_file) {

And if we set target-file-size to an abnormal value, such as 1, the startup of the Pipe will automatically generate the first log file with sequence number 1. Any append to this Pipe will not be added to the first file, but to the next one, file_seq == 2.

overwrite panic during recovery

Panic message: attempt to overwrite compacted entries in xx.

This panic should be elided during parallel recovery. Consider this case:

file A: [index=1][index=2]
file B: [index=3][index=2][index=3]

It would raise a false panic if we recover file B alone and try to merge it with file A afterwards.

Workaround:

  1. Disable parallel recovery by setting recovery_threads to 1
  2. Repair the log files with this script:
fn filter_append(id, first, count, rewrite_count, queue, ifirst, ilast) {
  if id == rid && first + count - 1 >= ifirst {
    return 2;  // discard existing
  }
  0
}
fn filter_compact(id, first, count, rewrite_count, queue, compact_to) {
  0
}
fn filter_clean(id, first, count, rewrite_count, queue) {
  0
}

Log batch decoding and encoding should consider format version

As the title says, the VERSION of log files currently does not make any difference. In practice, different VERSIONs should change the reading and writing behavior of log files.
For further development, we want to implement this so that the VERSION becomes significant.
