tikv / raft-rs

Raft distributed consensus algorithm implemented in Rust.

License: Apache License 2.0

Rust 100.00%

distributed-consensus-algorithms distributed-systems raft rust

raft-rs's Introduction

Raft

Problem and Importance

When building a distributed system one principal goal is often to build in fault-tolerance. That is, if one particular node in a network goes down, or if there is a network partition, the entire cluster does not fall over. The cluster of nodes taking part in a distributed consensus protocol must come to agreement regarding values, and once that decision is reached, that choice is final.

Distributed Consensus Algorithms often take the form of a replicated state machine and log. Each state machine accepts inputs from its log, and represents the value(s) to be replicated, for example, a hash table. They allow a collection of machines to work as a coherent group that can survive the failures of some of its members.

Two well-known Distributed Consensus Algorithms are Paxos and Raft. Paxos is used in systems like Chubby by Google, and Raft is used in systems like tikv and etcd. Raft is generally seen as more understandable and simpler to implement than Paxos.

Design

Raft replicates the state machine through logs. If you can ensure that all the machines have the same sequence of logs, after applying all logs in order, the state machine will reach a consistent state.

A complete Raft model contains 4 essential parts:

Consensus Module, the core consensus algorithm module;
Log, the place to keep the Raft logs;
State Machine, the place to save the user data;
Transport, the network layer for communication.

Note: This Raft implementation in Rust includes the core Consensus Module only, not the other parts. The core Consensus Module in the Raft crate is customizable, flexible, and resilient. You can directly use the Raft crate, but you will need to build your own Log, State Machine and Transport components.
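To make the replicated state machine idea concrete, here is a minimal sketch that does not use the raft crate at all. The `Command` and `KvStateMachine` types are hypothetical and exist only for illustration. It shows that two nodes applying the same log in the same order end up with the same hash table.

```rust
use std::collections::HashMap;

/// Hypothetical log entry: one command for a key-value state machine.
#[derive(Clone, Debug, PartialEq)]
enum Command {
    Put(String, String),
    Delete(String),
}

/// Hypothetical state machine: a hash table built only from log entries.
#[derive(Debug, Default, PartialEq)]
struct KvStateMachine {
    data: HashMap<String, String>,
    last_applied: usize,
}

impl KvStateMachine {
    /// Apply every entry that has not been applied yet, strictly in log order.
    fn apply_log(&mut self, log: &[Command]) {
        for cmd in log.iter().skip(self.last_applied) {
            match cmd {
                Command::Put(k, v) => {
                    self.data.insert(k.clone(), v.clone());
                }
                Command::Delete(k) => {
                    self.data.remove(k);
                }
            }
        }
        // Never move backwards if a shorter log slice is passed in.
        self.last_applied = self.last_applied.max(log.len());
    }
}
```

A node that applies the log in two steps reaches the same state as a node that applies it in one step, because only the order of entries matters.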

Using the raft crate

You can use raft with either rust-protobuf or Prost to encode/decode gRPC messages. We use rust-protobuf by default. To use Prost, build (or depend on) Raft using the prost-codec feature and without default features.

Developing the Raft crate

Raft is built using the latest version of stable Rust, using the 2018 edition. Minimum supported version is 1.44.0.

Using rustup you can get started this way:

rustup component add clippy
rustup component add rustfmt

In order to have your PR merged, the following must finish without error:

cargo test --all && \
cargo clippy --all --all-targets -- -D clippy::all   && \
cargo fmt --all -- --check

You may optionally want to install cargo-watch to allow for automated rebuilding while editing:

cargo watch -s "cargo check"

Modifying Protobufs

See instructions in the proto subdirectory.

Benchmarks

We use Criterion for benchmarking.

It's currently an ongoing effort to build an appropriate benchmarking suite. If you'd like to help out please let us know! Interested?

You can run the benchmarks by installing gnuplot then running:

cargo bench

You can check target/criterion/report/index.html for plots and charts relating to the benchmarks.

You can check the performance between two branches:

git checkout master
cargo bench --bench benches -- --save-baseline master
git checkout other
cargo bench --bench benches -- --baseline master

This will report relative increases or decreases for each benchmark.

Acknowledgments

Thanks etcd for providing the amazing Go implementation!

Projects using the Raft crate

TiKV, a distributed transactional key value database powered by Rust and Raft.

Links for Further Research


raft-rs's Issues

Promote Learner need not panic on failure condition.

Currently promote_learner can panic on a failure case.

https://github.com/pingcap/raft-rs/blob/1e0741a9aea3b242d33dadc46245ad611e7e7aa4/src/progress.rs#L129-L136

This issue would be solved if promote_learner() instead returned a value such as Result<()> so the calling code can handle the situation.
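A minimal sketch of that change might look like the following. The `ProgressSet`, `Progress` and `PromoteError` types here are simplified stand-ins written for this example, not the crate's real definitions, and the error variants are assumptions.

```rust
use std::collections::HashMap;
use std::fmt;

/// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq)]
enum PromoteError {
    /// The id is not a known learner.
    NotExists(u64),
    /// The id is already a voter.
    AlreadyVoter(u64),
}

impl fmt::Display for PromoteError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PromoteError::NotExists(id) => write!(f, "learner {} does not exist", id),
            PromoteError::AlreadyVoter(id) => write!(f, "node {} is already a voter", id),
        }
    }
}

impl std::error::Error for PromoteError {}

/// Simplified stand-in for a peer's replication progress.
#[derive(Debug, Default, Clone, PartialEq)]
struct Progress {
    matched: u64,
}

/// Simplified stand-in for the set of voters and learners.
#[derive(Default)]
struct ProgressSet {
    voters: HashMap<u64, Progress>,
    learners: HashMap<u64, Progress>,
}

impl ProgressSet {
    /// Move a learner into the voter set, reporting failure instead of panicking.
    fn promote_learner(&mut self, id: u64) -> Result<(), PromoteError> {
        if self.voters.contains_key(&id) {
            return Err(PromoteError::AlreadyVoter(id));
        }
        match self.learners.remove(&id) {
            Some(pr) => {
                self.voters.insert(id, pr);
                Ok(())
            }
            None => Err(PromoteError::NotExists(id)),
        }
    }
}
```

The caller can then decide whether a failed promotion is fatal, retryable, or safe to ignore.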

Reduce some internal panics.

The Raft library has several internal panics. This is partially due to being fairly old code with Go origins, and partially due to it being previously internal to TiKV.

This issue, broadly, invites you to tackle any panic!() or .unwrap() you find in the library.

Notable targets to get you started:

fn insert_learner in src/raft.rs.
fn insert_voter in src/raft.rs.

You are welcome to tackle any potential panics though. Please link them back to this issue for overall tracking.

Your PR may take some time to merge as we would like to stage public API changes. The next targeted API change is in 0.5.0.

raft: add a learner nodes function and separate learners from nodes

Refer etcd-io/etcd#9116

A README and description would be nice

I'm learning from this project at Fosdem. It would be nice if this project was easier to find. First steps would be a README.md and a project description.

Configure Rustfmt so it doesn't emit warnings.

Right now cargo +stable fmt raises the following warning:

Warning: can't set `use_try_shorthand = true`, unstable features are only available in nightly channel.

This is a harmless warning, and doesn't make CI red, but it'd be great to not see it.

raft: fix Ready.MustSync logic

Refer etcd-io/etcd#10106

It seems that we already use the Raft hard state, not the ready's, but we need to port the test.

Btw, it seems that we don't consider the number of Entries for MustSync, see https://github.com/etcd-io/etcd/blob/b9b75f81e56241b0f573ba31b3c3a018290e9343/raft/node.go#L584

Do you know why? @BusyJay

Raft::new() returns a Result

Depends on:

Currently calling Raft::new() will panic if it is provided an invalid configuration.

https://github.com/pingcap/raft-rs/blob/18f1cd8bb1a452dd9a2e3a8c0b507fc138c1862d/src/raft.rs#L390-L391

The consequence of this is that the user of the Raft is not provided an opportunity to handle certain types of error.

Fixing this issue would include changing Raft::new() to return a Result<Raft<T>> instead of a Raft<T>. In order to do this you may need to add new error variants:

https://github.com/pingcap/raft-rs/blob/18f1cd8bb1a452dd9a2e3a8c0b507fc138c1862d/src/errors.rs#L19-L22.
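As a rough illustration of the shape this could take, the sketch below validates a configuration and propagates the failure with `?`. The `Config`, `Raft` and `Error` types are reduced to a few fields for the example, and the `ConfigInvalid` variant is an assumed name for the new error variant.

```rust
/// Hypothetical error type with one variant for invalid configurations.
#[derive(Debug, PartialEq)]
enum Error {
    ConfigInvalid(String),
}

/// Reduced configuration: only the fields the sketch validates.
struct Config {
    id: u64,
    election_tick: usize,
    heartbeat_tick: usize,
}

impl Config {
    /// Return an error instead of panicking when a setting is unusable.
    fn validate(&self) -> Result<(), Error> {
        if self.id == 0 {
            return Err(Error::ConfigInvalid("invalid node id".to_owned()));
        }
        if self.heartbeat_tick == 0 {
            return Err(Error::ConfigInvalid(
                "heartbeat tick must be greater than 0".to_owned(),
            ));
        }
        if self.election_tick <= self.heartbeat_tick {
            return Err(Error::ConfigInvalid(
                "election tick must be greater than heartbeat tick".to_owned(),
            ));
        }
        Ok(())
    }
}

/// Reduced Raft state, built only from a valid configuration.
struct Raft {
    id: u64,
    election_timeout: usize,
    heartbeat_timeout: usize,
}

impl Raft {
    /// Fallible constructor: the caller sees configuration errors.
    fn new(c: &Config) -> Result<Raft, Error> {
        c.validate()?;
        Ok(Raft {
            id: c.id,
            election_timeout: c.election_tick,
            heartbeat_timeout: c.heartbeat_tick,
        })
    }
}
```

With this shape, an application can log the message and exit cleanly, or fall back to default settings, rather than unwinding from inside the library.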

Module shuffling and organizing

Reorganize some of the modules to have their own files and directories.

In general:

Files should contain no more than one logical unit. (Eg. the RawNode struct and implementations.)
Files should contain in order:
- Module level docs (if existing)
- use statements.
- Any constants. (Eg INVALID_ID)
- Any short utility functions that are not associated with the structure.
- Any minor structures/enums that are trivial and don't warrant their own file. Eg enum Foo { Bar, Baz }. (If it has implementations beyond 1-2 functions split it off)
- The structure itself.
- Any implementations for the structure.
- A test module if it's small (less than 5 tests, otherwise make it a $NAME/{mod, tests}.rs pair).

Changes for this PR should only include moving code and updating related use statements etc. No logic changes should happen here, they deserve their own PR/issue. This is to ease review.

Rationale

In TiKV we've been seeing a lot of long build times and one optimization to this is splitting up large files. Since raft is a considerably smaller crate it's easy to test our ideas here. At the same time, Raft has some very large files which may be both intimidating and confusing to navigate for potential contributors.

Benefits

No run time performance changes are expected.
Build time performance should strictly improve.
Increased ease of navigation

Process

In some cases, we should split files because they hold large structure definitions that can be split up.
For example, src/progress.rs contains the structs Progress and ProgressSet, as well as Inflights.

Split modules with several non-trivial structures. In our example, src/progress.rs becomes src/progress/{mod, progress, progress_set, inflights}.rs

There are also modules that are very large due to large test suites etc. They can be split up. For example, src/raft.rs contains a large #[test] module.

Extract large test suites into their own files. In our example all of the #[test] code can go in a tests file. src/raft.rs becomes src/raft/{mod, tests}.rs.

raft: fix deadlock during PreVote migration process

refer etcd-io/etcd#8525

RFC: responsive leadership transfer

Summary

This RFC proposes using an index to trace each leadership transfer. If the follower fails to start an election, it should send a response back to the leader.

Motivation

When a leader transfers leadership to a follower, the follower may or may not start an election. The leader can't know precisely what happened, so it stops serving reads and writes, waits for an election timeout, and then tries to retain leadership if no one campaigns. We have observed unexpectedly high latency when a leadership transfer fails. Note that the failure doesn't have to be caused by the network; it can also be caused by slow application of logs. For example, a newly promoted voter may not start a campaign if the conf change is applied locally.

Detailed design

We can introduce an index to trace every leadership transfer. Every time a leadership transfer happens, the index increases by 1. The index is also sent with the transfer command. If a follower checks its own state and decides not to campaign, it should send back a TransferLeaderResponse to tell the leader its decision. When the leader finds that a rejected response's index matches its own latest transfer index, it aborts the leadership transfer immediately.
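The bookkeeping on the leader side could be sketched roughly as follows. Only the name TransferLeaderResponse comes from the proposal; the `Leader` struct, its fields and both methods are hypothetical names chosen for this example.

```rust
/// Response a follower sends when it decides whether to campaign.
#[derive(Debug, Clone, PartialEq)]
struct TransferLeaderResponse {
    from: u64,
    transfer_index: u64,
    reject: bool,
}

/// Hypothetical leader-side state for tracking transfers.
#[derive(Default)]
struct Leader {
    /// Incremented once per leadership transfer attempt.
    transfer_index: u64,
    /// The follower currently being transferred to, if any.
    lead_transferee: Option<u64>,
}

impl Leader {
    /// Begin a transfer and return the index to send with the command.
    fn start_transfer(&mut self, to: u64) -> u64 {
        self.transfer_index += 1;
        self.lead_transferee = Some(to);
        self.transfer_index
    }

    /// Abort the transfer only if the rejection refers to the latest attempt.
    /// Returns true when the transfer was aborted.
    fn handle_response(&mut self, resp: &TransferLeaderResponse) -> bool {
        let is_current = resp.transfer_index == self.transfer_index
            && self.lead_transferee == Some(resp.from);
        if resp.reject && is_current {
            self.lead_transferee = None;
            return true;
        }
        false
    }
}
```

Matching on the index means a late rejection from an earlier attempt cannot cancel a newer transfer.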

Unresolved questions

What if the transfer command is dropped due to network failure? It may be hard to handle all situations, but it should at least work as expected when the infrastructure works as expected.

Document EntryConfChange context

If the storage does not contain any data, raft sends the initial node list as ConfChanges and uses the peer context as context for the ConfChange. That is not documented anywhere and very confusing, especially if you expect your own context type in the ConfChange entry.
This behavior should be documented somewhere.

consider degrading log version from 0.4 to 0.3

It shows Unknown for the file name in the TiKV log, which makes debugging hard for us.

raft: Really avoid scanning raft log in becomeLeader

refer etcd-io/etcd#9887

Lease based read-only request without `check_quorum` behaves wrong.

A lease-based read-only request without check_quorum behaves wrong, at least on an outdated leader in a minority partition.

In step_leader for handling MsgReadIndex messages.

https://github.com/pingcap/raft-rs/blob/3799cfa482e288a0fc4a3a6c365c6681afd85b3a/src/raft.rs#L1429-L1449

I think line 1437 should use read_index instead of self.raft_log.committed, if
I'm not wrong (and if I'm wrong, the read_index variable should be created in
the else branch to prevent confusion).

If my understanding of the code is correct, then lease-based read requests allow
you to read up to the commit index iff self.check_quorum is set, as this would
make sure that the leader still has the majority behind it and return INVALID_INDEX
if not (which, by the way, seems kind of strange: shouldn't Config::validate at least
warn on using lease-based reads without check_quorum?).
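One way to act on the last point is to reject the combination at validation time. The sketch below is an assumption about how such a check could look; `ReadOnlyOption`, the reduced `Config` and the error message are all written for this example.

```rust
/// How read-only requests are served (names assumed for the sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ReadOnlyOption {
    /// Confirm leadership with a quorum round trip before serving the read.
    Safe,
    /// Trust the leader lease and serve the read locally.
    LeaseBased,
}

/// Reduced configuration with only the two relevant settings.
struct Config {
    check_quorum: bool,
    read_only_option: ReadOnlyOption,
}

impl Config {
    /// Lease-based reads are only safe when the leader steps down after
    /// losing contact with a quorum, so refuse the unsafe combination.
    fn validate(&self) -> Result<(), String> {
        if self.read_only_option == ReadOnlyOption::LeaseBased && !self.check_quorum {
            return Err(
                "read_only_option == LeaseBased requires check_quorum == true".to_owned(),
            );
        }
        Ok(())
    }
}
```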

add a prelude to export commonly used structs, functions, etc.

Like this:

use raft::prelude::*;

The prelude may contain Storage, RawNode, Ready, Progress, Status, Error, all Raft protobuf messages, etc.
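The layout of such a prelude could be sketched as below. Everything lives in a stand-in module named `raft_sketch` so the example is self-contained; the item bodies are placeholders, and only the re-export pattern is the point.

```rust
/// Stand-in for the crate root, so the sketch compiles on its own.
mod raft_sketch {
    pub mod storage {
        /// Placeholder for the storage trait.
        pub trait Storage {
            fn first_index(&self) -> u64;
        }
    }

    pub mod raw_node {
        /// Placeholder for the Ready struct.
        #[derive(Debug, Default)]
        pub struct Ready {
            pub committed: u64,
        }
    }

    pub mod errors {
        /// Placeholder for the error enum.
        #[derive(Debug, PartialEq)]
        pub enum Error {
            StepPeerNotFound,
        }
    }

    /// The prelude only re-exports; it defines nothing itself.
    pub mod prelude {
        pub use super::errors::Error;
        pub use super::raw_node::Ready;
        pub use super::storage::Storage;
    }
}

// A single glob import brings the commonly used items into scope.
use raft_sketch::prelude::*;

/// Hypothetical user type implementing the re-exported trait.
struct MemStorage;

impl Storage for MemStorage {
    fn first_index(&self) -> u64 {
        1
    }
}
```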

Tested examples in src/lib.rs Rustdocs

#69 introduces examples for usage to src/lib.rs. However most of them are not actually compiled or tested.

We would like to remove the ignore from the code fences so that the examples are tested.

Please use hiding in the examples to keep the redundant setup code from getting in the way.

@Hoverbear is very happy to help mentor this issue. Just ping. :)

Ready is Read-Only, should only have getters.

The Ready struct should not allow values to be mutated, it should only provide immutable getters.

https://github.com/pingcap/raft-rs/blob/1e0741a9aea3b242d33dadc46245ad611e7e7aa4/src/raw_node.rs#L84-L121

It's possible to do this by writing functions that return immutable borrows of the contained values, or using a crate like https://crates.io/crates/getset which will write them for you.
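A hand-written version of the first option might look like this sketch. The `Entry` and `Ready` types are cut down to a few fields for illustration, and the `new` constructor is an assumption that stands in for however the library builds a Ready internally.

```rust
mod ready {
    /// Cut-down log entry for the sketch.
    #[derive(Debug, Clone, PartialEq)]
    pub struct Entry {
        pub index: u64,
        pub term: u64,
    }

    /// Fields are private, so code outside this module cannot mutate them.
    #[derive(Debug, Default)]
    pub struct Ready {
        entries: Vec<Entry>,
        committed_entries: Option<Vec<Entry>>,
        must_sync: bool,
    }

    impl Ready {
        /// Hypothetical constructor used only by the library itself.
        pub fn new(
            entries: Vec<Entry>,
            committed_entries: Option<Vec<Entry>>,
            must_sync: bool,
        ) -> Self {
            Ready { entries, committed_entries, must_sync }
        }

        /// Entries to be saved to stable storage, as an immutable borrow.
        pub fn entries(&self) -> &[Entry] {
            &self.entries
        }

        /// Entries to be applied to the state machine, if any.
        pub fn committed_entries(&self) -> Option<&[Entry]> {
            self.committed_entries.as_deref()
        }

        /// Whether the hard state and entries must be synced to disk.
        pub fn must_sync(&self) -> bool {
            self.must_sync
        }
    }
}
```

Because every accessor takes `&self` and returns a shared borrow or a copy, callers can read a Ready but cannot change it.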

raft: Avoid scanning raft log in becomeLeader

Refer etcd-io/etcd#9073

Use Stable Rust

As part of #1 we'd like to use Stable Rust.

Currently the only feature we use is:

https://github.com/pingcap/raft-rs/blob/1badec3bfa0c8f25da196050d1ba015c9efc5d4a/src/lib.rs#L32

Disabling this yields:

error: `impl Trait` in return position is experimental (see issue #34511)
  --> src/progress.rs:91:34
   |
91 |     pub fn iter<'a>(&'a self) -> impl Iterator<Item = (&'a u64, &'a Progress)> {
   |                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = help: add #![feature(conservative_impl_trait)] to the crate attributes to enable

error: `impl Trait` in return position is experimental (see issue #34511)
  --> src/progress.rs:96:42
   |
96 |     pub fn iter_mut<'a>(&'a mut self) -> impl Iterator<Item = (&'a u64, &'a mut Progress)> {
   |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = help: add #![feature(conservative_impl_trait)] to the crate attributes to enable

In order to stabilize we need to change these.
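One possible change, sketched below, is to name the concrete iterator types instead of using `impl Trait`. This assumes the progress map is a single HashMap, which may not match the real ProgressSet; the struct definitions here are simplified for the example.

```rust
use std::collections::hash_map::{HashMap, Iter, IterMut};

/// Simplified stand-in for a peer's replication progress.
#[derive(Debug, Default, PartialEq)]
struct Progress {
    matched: u64,
}

/// Simplified stand-in that keeps all peers in one map.
#[derive(Default)]
struct ProgressSet {
    prs: HashMap<u64, Progress>,
}

impl ProgressSet {
    /// Same items as before, `(&u64, &Progress)`, but with a nameable type.
    fn iter<'a>(&'a self) -> Iter<'a, u64, Progress> {
        self.prs.iter()
    }

    /// Same items as before, `(&u64, &mut Progress)`.
    fn iter_mut<'a>(&'a mut self) -> IterMut<'a, u64, Progress> {
        self.prs.iter_mut()
    }
}
```

If the real iterator is built from adapters whose type is awkward to spell, returning a boxed trait object is another option, at the cost of an allocation and dynamic dispatch.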

Skip broadcasting heartbeat to learner

Since learners don't vote, it's meaningless to send a heartbeat to a learner that is already up to date. If the learner is not up to date, a heartbeat is still needed to update its progress and commit index.
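The rule can be sketched as a small filter over the leader's view of its peers. The `Progress` fields and both function names below are assumptions made for this example.

```rust
use std::collections::BTreeMap;

/// Simplified view of one peer as seen by the leader.
struct Progress {
    /// Highest log index known to be replicated on the peer.
    matched: u64,
    is_learner: bool,
}

/// Voters always get heartbeats; learners only while they lag behind.
fn needs_heartbeat(pr: &Progress, leader_last_index: u64) -> bool {
    !pr.is_learner || pr.matched < leader_last_index
}

/// Ids of the peers that should receive a heartbeat, in ascending order.
fn heartbeat_targets(
    prs: &BTreeMap<u64, Progress>,
    self_id: u64,
    leader_last_index: u64,
) -> Vec<u64> {
    prs.iter()
        .filter(|(id, pr)| **id != self_id && needs_heartbeat(pr, leader_last_index))
        .map(|(id, _)| *id)
        .collect()
}
```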

Missing add node check quorum test case

ref etcd-io/etcd#7830

Delay broadcast MsgAppend

When a proposal is made, the leader broadcasts it to all followers immediately, so that the proposal reaches followers as fast as possible. However, messages are only sent out when a ready is handled, so under high load this may produce too many MsgAppend messages, which is unfriendly to channels.

So the broadcast can be postponed until a ready is generated. That way, log entries can be batched into the fewest messages. The same mechanism can be applied to MsgAppendResponse too. Heartbeat messages don't have to be batched, as they are usually not frequent enough.
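A toy model of the batching idea is sketched below. The `Leader` struct, its `pending` buffer and the `MsgAppend` shape are invented for this example and ignore terms, indexes and per-follower progress.

```rust
/// Simplified append message: a batch of entry payloads for one follower.
#[derive(Debug, Clone, PartialEq)]
struct MsgAppend {
    to: u64,
    entries: Vec<Vec<u8>>,
}

/// Hypothetical leader that buffers proposals until a ready is generated.
struct Leader {
    followers: Vec<u64>,
    pending: Vec<Vec<u8>>,
}

impl Leader {
    /// Proposing only appends to the buffer; nothing is broadcast yet.
    fn propose(&mut self, data: Vec<u8>) {
        self.pending.push(data);
    }

    /// Generating a ready drains the buffer into one message per follower.
    fn ready(&mut self) -> Vec<MsgAppend> {
        if self.pending.is_empty() {
            return Vec::new();
        }
        let entries = std::mem::take(&mut self.pending);
        self.followers
            .iter()
            .map(|&to| MsgAppend { to, entries: entries.clone() })
            .collect()
    }
}
```

Three proposals followed by one ready produce one message per follower instead of three.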

Insert Learner/Voter should not panic.

Currently these two functions have no return, yet they panic on an otherwise reasonable mistake which could be handled by application code.

https://github.com/pingcap/raft-rs/blob/1e0741a9aea3b242d33dadc46245ad611e7e7aa4/src/progress.rs#L104-L120

This issue would be solved with them returning a Result<()> with an error condition stating the voter/learner is already present, or some other #[must_use] value so that the user knows they need to handle this return value.
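For illustration, the two functions could share one membership check, as in this sketch. The `Error::Exists` variant and the set-based `ProgressSet` are assumptions for the example, not the crate's actual definitions.

```rust
use std::collections::HashSet;

/// Hypothetical error: the id is already present in the named set.
#[derive(Debug, PartialEq)]
enum Error {
    Exists(u64, &'static str),
}

/// Simplified stand-in that tracks only membership.
#[derive(Default)]
struct ProgressSet {
    voters: HashSet<u64>,
    learners: HashSet<u64>,
}

impl ProgressSet {
    /// Shared check used by both insert functions.
    fn check_absent(&self, id: u64) -> Result<(), Error> {
        if self.voters.contains(&id) {
            return Err(Error::Exists(id, "voters"));
        }
        if self.learners.contains(&id) {
            return Err(Error::Exists(id, "learners"));
        }
        Ok(())
    }

    /// Result is #[must_use], so ignoring the outcome triggers a warning.
    fn insert_voter(&mut self, id: u64) -> Result<(), Error> {
        self.check_absent(id)?;
        self.voters.insert(id);
        Ok(())
    }

    fn insert_learner(&mut self, id: u64) -> Result<(), Error> {
        self.check_absent(id)?;
        self.learners.insert(id);
        Ok(())
    }
}
```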

deny missing docs

Deny missing_docs for the entire crate.

raft: provide protection against unbounded Raft log growth

Introduce a new, optional max_uncommitted_entries configuration. This config limits the max number of uncommitted entries that may be appended to a leader's log. Once this limit is exceeded, proposals will begin to return ProposalDropped errors.

You can refer to etcd-io/etcd#10167

This issue can be closed by:

Modifying the Config to have this new configuration setting.
Modify Raft::handle_append_entries to check the limit. You can return the existing ProposalDropped error.
Adding a test to tests/integration_cases/test_raft.rs similar to the one in the etcd PR.
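The limit check itself could be sketched as follows. This is a simplified model that counts entries rather than bytes and drops a proposal that would push the uncommitted count past the limit; the `Leader` struct and its methods are hypothetical, and the real change may differ in both respects.

```rust
/// Error returned when a proposal is refused.
#[derive(Debug, PartialEq)]
enum Error {
    ProposalDropped,
}

/// Hypothetical leader state reduced to the indexes the check needs.
struct Leader {
    /// `None` means the limit is disabled.
    max_uncommitted_entries: Option<usize>,
    last_index: u64,
    committed: u64,
}

impl Leader {
    /// Number of entries appended but not yet committed.
    fn uncommitted(&self) -> usize {
        (self.last_index - self.committed) as usize
    }

    /// Append `n` proposed entries unless that would exceed the limit.
    fn propose(&mut self, n: usize) -> Result<(), Error> {
        if let Some(max) = self.max_uncommitted_entries {
            if self.uncommitted() + n > max {
                return Err(Error::ProposalDropped);
            }
        }
        self.last_index += n as u64;
        Ok(())
    }

    /// Advance the commit index, never past the log and never backwards.
    fn commit_to(&mut self, index: u64) {
        self.committed = index.min(self.last_index).max(self.committed);
    }
}
```

Once entries are committed the leader accepts proposals again, so the limit acts as back-pressure rather than a permanent failure.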

As always, if you'd like to tackle this, or need any help, please feel free to reach out! We'd love to get you involved.

raft: reuse "last index" in "appendEntry"

refer etcd-io/etcd#9282

Don't reset election_elapsed if request vote will be rejected.

For now, a candidate always resets election_elapsed when it receives a message with a higher term. However, if the vote is about to be rejected, resetting election_elapsed postpones the campaign by at least an election timeout.
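The proposed behaviour can be sketched as below: the term is still updated on a higher-term request, but the timer is reset only when the vote is granted. The `Node` struct and the function signature are simplified assumptions for this example.

```rust
/// Hypothetical node state reduced to what vote handling needs.
struct Node {
    term: u64,
    vote: Option<u64>,
    election_elapsed: usize,
    last_term: u64,
    last_index: u64,
}

impl Node {
    /// Handle a vote request; returns true if the vote is granted.
    fn handle_request_vote(
        &mut self,
        from: u64,
        term: u64,
        cand_last_term: u64,
        cand_last_index: u64,
    ) -> bool {
        if term < self.term {
            return false;
        }
        if term > self.term {
            // Adopt the higher term, but leave election_elapsed untouched.
            self.term = term;
            self.vote = None;
        }
        let can_vote = self.vote.is_none() || self.vote == Some(from);
        // The candidate's log must be at least as up to date as ours.
        let up_to_date = (cand_last_term, cand_last_index) >= (self.last_term, self.last_index);
        if can_vote && up_to_date {
            self.vote = Some(from);
            // Only a granted vote resets the election timer.
            self.election_elapsed = 0;
            true
        } else {
            false
        }
    }
}
```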

Adopt the DCO over a CLA.

In an effort to reduce friction in accepting community contributions, I would propose experimenting with using the Developer's Certificate of Origin over a Contributor License Agreement for this particular repository.

The DCO is as follows:

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

As you can see this is quite simple and does not include many scary legal terms. We can have this done via a signed-off-by.

We can use the available integration to enforce this easily.

Rationale from Other Companies

Chef
Docker

Alternatives

In lieu of this, we could simply not require any. Or keep using a CLA. All options are valid.

Determine a Rust version tracking strategy.

Currently tikv, the main consumer of this crate, tracks a specific nightly release, with infrequent jumps. This is a very acceptable strategy for a binary, however it's not ideal for a small library like raft. If we were to require a specific version of Rust we may cause downstream issues.

I see a couple possible strategies:

Track nightly but don't adopt changes which break compatibility with tikv. This can be done by running a pair of builds during CI, one on tikv's version and one on master.
Stabilize on stable so everyone can use the crate.

Convert `Storage::entries`'s `max_size` argument to `Option<u64>`

Currently the value is just a u64 with u64::MAX defined as NO_LIMIT.
This no-limit value is not mentioned in the documentation. Something like the following enum would suit better:

enum Size {
    Fixed(u64),
    NoLimit,
}

I could provide a PR, if wanted.

Edit (brson): per discussion, let's change this to Option<u64> with a doc-comment explaining that None means "no limit". The Storage trait is in src/storage.rs.
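To show how `None` as "no limit" reads at a call site, here is a small sketch of a size-limiting helper. The function name `limit_size`, the use of raw byte vectors as entries, and the choice to always keep at least one entry are assumptions made for this example.

```rust
/// Return the longest prefix of `entries` whose total size fits `max_size`.
///
/// `None` means "no limit". When a limit is given, at least one entry is
/// kept (if any exist) so that a single oversized entry can still make progress.
fn limit_size(entries: &[Vec<u8>], max_size: Option<u64>) -> &[Vec<u8>] {
    let max = match max_size {
        None => return entries,
        Some(m) => m,
    };
    let mut total = 0u64;
    let mut count = 0;
    for e in entries {
        total += e.len() as u64;
        // Stop before the entry that would exceed the limit, except the first.
        if count > 0 && total > max {
            break;
        }
        count += 1;
    }
    &entries[..count]
}
```

Compared with a magic `u64::MAX` constant, the `Option` makes the unlimited case visible in the type and impossible to confuse with a real size.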

Don't campaign anymore if quorum rejects vote for log gap

Apparently, if a peer's vote requests get rejected by a quorum because of a log gap, then the peer can't become leader by simply retrying. So we can disable campaigning for such peers until a leader is elected.

Maybe related to #14.

raft 1.0 release

To release 1.0, we need to solve the following issues:

Add examples
Enhance documentation
Review all APIs for proper access level

raft: clarify candidate message handling, test candidate to follower transition with message from leader

Refer etcd-io/etcd#9345

raft: let raft step return error when proposal is dropped to allow fail-fast

Refer etcd-io/etcd#9067

Learner should respond to request vote but don't accept votes

Consider the following case: a cluster has three nodes A, B, C, and A is the leader. Learners D and E are then added to the cluster and later promoted to voters, but D is isolated from A, so D doesn't know that it has been promoted. When A and B both crash, the cluster still has three voters C, D, E, which should be able to form a quorum. However, D can't vote in the current implementation, since it doesn't know it is no longer a learner.

One solution to this problem is to let a learner respond to vote requests and let the candidate check whether the response is from a valid peer. Because we never let a voter step back to being a learner, once a candidate considers a node a voter, that node is definitely a voter from that point on. So the candidate's check should be safe. In the situation described above, the quorum can be formed again, and the cluster can recover automatically.
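The candidate-side check could be sketched as a quorum count that ignores responses from nodes outside the candidate's own voter set. The function name and the use of plain id sets are assumptions for this example.

```rust
use std::collections::HashSet;

/// Count only grants from nodes the candidate itself considers voters.
///
/// A responder may still believe it is a learner; what matters is the
/// candidate's configuration, since voters are never demoted.
fn has_quorum(voters: &HashSet<u64>, granted_from: &[u64]) -> bool {
    let valid: HashSet<u64> = granted_from
        .iter()
        .copied()
        .filter(|id| voters.contains(id))
        .collect();
    valid.len() >= voters.len() / 2 + 1
}
```

In the scenario above with nodes numbered A=1 through E=5, candidate C knows all five voters, so grants from C, D and E reach the quorum of three even though D still thinks it is a learner.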

/cc @xiang90 @siddontang any thoughts?

RFC: Follower Snapshot

For now, a snapshot is always sent from the leader to a follower, which is not always efficient. For example, consider 5 nodes in two data centers, (1, 2) and (3, 4, 5). If 5 is the leader and 1 needs a snapshot, then the data has to be transferred across data centers.

But in fact any node in the cluster can send a snapshot once the requested logs are applied. So 1 can proactively request a snapshot from 2, so that the data is transferred within the same data center.

Supporting proactive snapshot requests is also useful for recovering from corrupted snapshot files.

Requirement for making a release to crates.io

Before publishing this crate, we need to resolve the following issues.

Closes #16.

raft: introduce/fix TestNodeWithSmallerTermCanCompleteElection

refer etcd-io/etcd#8288

Introduce Prevote

refer: etcd-io/etcd#9352

Should `become_pre_candidate` call `reset`?

Currently become_pre_candidate does not call self.reset(term) like the other become_* functions:

https://github.com/pingcap/raft-rs/blob/a6671bed1d37805667afa022a84825477fc17e9b/src/raft.rs#L821-L832

https://github.com/pingcap/raft-rs/blob/a6671bed1d37805667afa022a84825477fc17e9b/src/raft.rs#L835-L842

https://github.com/pingcap/raft-rs/blob/a6671bed1d37805667afa022a84825477fc17e9b/src/raft.rs#L807-L814

https://github.com/pingcap/raft-rs/blob/a6671bed1d37805667afa022a84825477fc17e9b/src/raft.rs#L799-L800

This created an interesting situation in tikv/tikv#3133: since self.reset(term) is not called, leader_id != INVALID_ID still holds, and that condition is what we were using in check_stale_state to detect whether the leader was missing.

The thesis on prevote:

If desired, Raft’s basic leader election algorithm can be extended with an
additional phase to prevent such disruptions, forming the Pre-Vote algorithm.
In the Pre-Vote algorithm, a candidate only increments its term if it first
learns from a majority of the cluster that they would be willing to grant the
candidate their votes (if the candidate’s log is sufficiently up-to-date, and
the voters have not received heartbeats from a valid leader for at least a
baseline election timeout). This was inspired by ZooKeeper’s algorithm [42],
in which a server must receive a majority of votes before it calculates a new
epoch and sends NewEpoch messages (however, in ZooKeeper servers do not
solicit votes, other servers offer them).

The Pre-Vote algorithm solves the issue of a partitioned server disrupting
the cluster when it rejoins. While a server is partitioned, it won’t be able
to increment its term, since it can’t receive permission from a majority of
the cluster. Then, when it rejoins the cluster, it still won’t be able to
increment its term, since the other servers will have been receiving regular
heartbeats from the leader. Once the server receives a heartbeat from the
leader itself, it will return to the follower state (in the same term).

We recommend the Pre-Vote extension in deployments that would benefit from
additional robustness. We also tested it in various leader election scenarios
in AvailSim, and it does not appear to significantly harm election
performance.

Publish crate

Many people are interested in raft. Publishing an early version might bring new people into the development process.

A full benchmarking suite

It'd be great to have a full, robust benchmarking suite.

I recommend using Criterion for benchmarking since we can use it fairly unobtrusively, even on stable.

raft: Clarify conditions for granting votes and prevotes

refer etcd-io/etcd#9204

Consider priority during election

The nodes of a cluster may differ: some nodes may be in a faraway data center, or some may have poor hardware. In such cases we would want them to have less chance of becoming leader.

After #63, we support configuring a different randomized timeout range for each node, so a node with a larger timeout is less likely to become leader. However, this is still not reliable, because as long as a node has enough logs, it can still become leader even if other, preferred nodes have the same logs.

So I propose adding a priority for every node. A follower votes for a candidate only when one of the following prerequisites is met:

Candidate has more logs;
Candidate has the same logs and its priority is not less than the follower.

This policy is expected to work well when the nodes that are preferred as leaders can form a quorum. In that case a non-preferred node can become leader only when it has more logs. However, a node is usually non-preferred precisely because it takes longer to catch up on the log, due to high network latency or poor performance as described above.
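The two prerequisites can be written as a single comparison, as in this sketch. The function name, the `(last_term, last_index)` tuple representation of a log position, and the numeric priority are assumptions made for this example.

```rust
use std::cmp::Ordering;

/// Decide whether a follower grants its vote under the priority rule.
///
/// Log positions are `(last_term, last_index)` pairs, compared term first.
fn should_grant(
    cand_log: (u64, u64),
    cand_priority: u64,
    my_log: (u64, u64),
    my_priority: u64,
) -> bool {
    match cand_log.cmp(&my_log) {
        // The candidate has more logs: always grant.
        Ordering::Greater => true,
        // Same logs: grant only if its priority is not lower than ours.
        Ordering::Equal => cand_priority >= my_priority,
        // The candidate is behind: the normal Raft rule rejects it.
        Ordering::Less => false,
    }
}
```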

RFC: Support Follower Replication

While discussing #135 we determined that supporting replication between two followers, for instance in the same datacenter, to reduce WAN overhead, may be a desirable feature.

This RFC proposes the feature to be added after #135, assuming there will be some structural changes to Raft to support follower to follower communication cleanly.

Ensure leader is in ProgressStateReplicate

For the details of this issue, refer to etcd-io/etcd#10279.

To resolve this issue:

Refer to the etcd code and this crate's codebase to ensure such functionality does not already exist.
Either leave a comment here why this change is not needed, or implement the change and send a PR.
If you can add a test (either way) we would greatly appreciate that as well.

This issue is mentored, and if you'd like to take this on you can leave a comment here. We'll try to support you the best we can. :)

Remove clippy related features.
Remove --dev feature.
Ensure all documentation suggests using cargo clippy.