
uwplse / verdi-raft


An implementation of the Raft distributed consensus protocol, verified in Coq using the Verdi framework

License: BSD 2-Clause "Simplified" License

Shell 1.48% Makefile 0.37% Python 4.75% Coq 89.26% OCaml 2.60% Awk 0.21% C 0.04% Ruby 1.18% Perl 0.11%
raft verdi coq proof distributed-systems consensus key-value

verdi-raft's Introduction

Verdi Raft


Raft is a distributed consensus algorithm that is designed to be easy to understand and is equivalent to Paxos in fault tolerance and performance. Verdi Raft is a verified implementation of Raft in Coq, constructed using the Verdi framework. Included is a verified fault-tolerant key-value store using Raft.

Meta

Optional requirements

  • Executable vard key-value store:
  • Client for vard:
  • Integration testing of vard:
  • Unit testing of unverified vard code:

Building and installation instructions

We recommend installing the dependencies of Verdi Raft via opam:

opam repo add coq-extra-dev https://coq.inria.fr/opam/extra-dev
opam install coq-struct-tact coq-cheerios coq-verdi

Then, run make in the root directory. This will compile the Raft implementation and proof interfaces, and check all the proofs. To speed up proof checking on multi-core machines, use make -jX, where X is at least the number of cores on your machine.

To build the vard key-value store program in extraction/vard, you first need to install its requirements. Then, run make vard in the root directory. If the Coq implementation has been compiled as above, this simply compiles the extracted OCaml code to a native executable; otherwise, the implementation is extracted to OCaml and compiled without checking any proofs.

Files

The Raft and RaftProofs subdirectories of theories contain the implementation and verification of Raft. For each proof interface file in Raft, there is a corresponding proof file in RaftProofs. The files in the Raft subdirectory include:

  • Raft.v: an implementation of Raft in Verdi
  • RaftRefinementInterface.v: an application of the ghost-variable transformer to Raft which tracks several ghost variables used in the verification of Raft
  • CommonTheorems.v: several useful theorems about functions used by the Raft implementation
  • OneLeaderPerTermInterface.v: a statement of Raft's election safety property. See also the corresponding proof file in RaftProofs.
    • CandidatesVoteForSelvesInterface.v, VotesCorrectInterface.v, and CroniesCorrectInterface.v: statements of properties used by OneLeaderPerTermProof.v
  • LogMatchingInterface.v: a statement of Raft's log matching property. See also LogMatchingProof.v in RaftProofs
    • LeaderSublogInterface.v, SortedInterface.v, and UniqueIndicesInterface.v: statements of properties used by LogMatchingProof.v

The file EndToEndLinearizability.v in RaftProofs uses the proofs of all proof interfaces to show Raft's linearizability property.

The vard Key-Value Store

vard is a simple key-value store implemented using Verdi. vard is specified and verified against Verdi's state-machine semantics in the VarD.v example system distributed with Verdi. When the Raft transformer is applied, vard can be run as a strongly-consistent, fault-tolerant key-value store along the lines of etcd.

After running make vard in the root directory, OCaml code for vard is extracted, compiled, and linked against a Verdi shim and some vard-specific serialization/debugging code, to produce a vard.native binary in extraction/vard.

Running make bench-vard in extraction/vard will produce some benchmark numbers, which are largely meaningless on localhost (multiple processes writing and fsync-ing to the same disk while communicating over loopback do not accurately model real-world deployments). Running make debug will get you a tmux session where you can play around with a vard cluster in debug mode; look in bench/vard.py for a simple Python vard client.

As the name suggests, vard is designed to be comparable to the etcd key-value store (although it currently supports many fewer features). To that end, we include a very simple etcd "client" which can be used for benchmarking. Running make bench-etcd will run the vard benchmarks against etcd (although see above for why these results are not particularly meaningful). See below for instructions to run both stores on a cluster in order to get a more useful performance comparison.

Running vard on a cluster

vard accepts the following command-line options:

-me NAME             name for this node
-port PORT           port for client commands
-dbpath DIRECTORY    directory for storing database files
-node NAME,IP:PORT   node in the cluster
-debug               run in debug mode

Note that vard node names are integers starting from 0.

For example, to run vard on a cluster with IP addresses 192.168.0.1, 192.168.0.2, 192.168.0.3, client (input) port 8000, and port 9000 for inter-node communication, use the following:

# on 192.168.0.1
$ ./vard.native -dbpath /tmp/vard-8000 -port 8000 -me 0 -node 0,192.168.0.1:9000 \
                -node 1,192.168.0.2:9000 -node 2,192.168.0.3:9000

# on 192.168.0.2
$ ./vard.native -dbpath /tmp/vard-8000 -port 8000 -me 1 -node 0,192.168.0.1:9000 \
                -node 1,192.168.0.2:9000 -node 2,192.168.0.3:9000

# on 192.168.0.3
$ ./vard.native -dbpath /tmp/vard-8000 -port 8000 -me 2 -node 0,192.168.0.1:9000 \
                -node 1,192.168.0.2:9000 -node 2,192.168.0.3:9000

When the cluster is set up, a benchmark can be run as follows:

# on the client machine
$ python2 bench/setup.py --service vard --keys 50 \
                         --cluster "192.168.0.1:8000,192.168.0.2:8000,192.168.0.3:8000"
$ python2 bench/bench.py --service vard --keys 50 \
                         --cluster "192.168.0.1:8000,192.168.0.2:8000,192.168.0.3:8000" \
                         --threads 8 --requests 100

Running etcd on a cluster

We can compare numbers for vard and etcd running on the same cluster as follows:

# on 192.168.0.1
$ etcd --name=one \
 --listen-client-urls http://192.168.0.1:8000 \
 --advertise-client-urls http://192.168.0.1:8000 \
 --initial-advertise-peer-urls http://192.168.0.1:9000 \
 --listen-peer-urls http://192.168.0.1:9000 \
 --data-dir=/tmp/etcd \
 --initial-cluster "one=http://192.168.0.1:9000,two=http://192.168.0.2:9000,three=http://192.168.0.3:9000"

# on 192.168.0.2
$ etcd --name=two \
 --listen-client-urls http://192.168.0.2:8000 \
 --advertise-client-urls http://192.168.0.2:8000 \
 --initial-advertise-peer-urls http://192.168.0.2:9000 \
 --listen-peer-urls http://192.168.0.2:9000 \
 --data-dir=/tmp/etcd \
 --initial-cluster "one=http://192.168.0.1:9000,two=http://192.168.0.2:9000,three=http://192.168.0.3:9000"

# on 192.168.0.3
$ etcd --name=three \
 --listen-client-urls http://192.168.0.3:8000 \
 --advertise-client-urls http://192.168.0.3:8000 \
 --initial-advertise-peer-urls http://192.168.0.3:9000 \
 --listen-peer-urls http://192.168.0.3:9000 \
 --data-dir=/tmp/etcd \
 --initial-cluster "one=http://192.168.0.1:9000,two=http://192.168.0.2:9000,three=http://192.168.0.3:9000"

# on the client machine
$ python2 bench/setup.py --service etcd --keys 50 \
                         --cluster "192.168.0.1:8000,192.168.0.2:8000,192.168.0.3:8000"
$ python2 bench/bench.py --service etcd --keys 50 \
                         --cluster "192.168.0.1:8000,192.168.0.2:8000,192.168.0.3:8000" \
                         --threads 8 --requests 100

verdi-raft's People

Contributors

ahmet-celik, dwoos, fajb, hackedy, herbelin, jfehrle, justinads, mernst, mrhaandi, palmskog, ppedrot, skyskimmer, spiliopoulos, steveanton, vbgl, wilcoxjay, ztatlock


verdi-raft's Issues

vard cluster members might not use the same cluster size

Since the Raft cluster size is set at runtime based on the passed parameters, strange errors can occur in vard if different cluster members use a different number of -node parameters. Even with cluster size set at compile time, this could be an issue if different deployments are compiled with different cluster size parameters.

One way to mitigate these kinds of errors is to let the shim perform some initial messaging before Raft communication starts, e.g., get confirmation that other nodes were configured with the same cluster size as the present node.
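
A rough sketch of what such a pre-flight check could look like on the OCaml side, assuming hypothetical point-to-point send/receive primitives exposed by the shim (send_to and recv_from below are placeholders, not existing functions):

  (* Sketch: refuse to start Raft if any peer reports a different cluster
     size. [send_to] and [recv_from] stand in for whatever point-to-point
     primitives the shim could expose; they are assumptions, not real APIs. *)
  let check_cluster_size ~send_to ~recv_from (peers : int list) (my_size : int) : unit =
    List.iter (fun p -> send_to p (string_of_int my_size)) peers;
    List.iter
      (fun p ->
         let theirs = int_of_string (recv_from p) in
         if theirs <> my_size then
           failwith
             (Printf.sprintf "node %d reports cluster size %d, but this node has %d"
                p theirs my_size))
      peers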

Node in singleton cluster never becomes leader

I'm trying to run the benchmarks against a single-node system:

$ ./vard.native -dbpath /tmp/vard-8000 -port 8000 -me 0 -node 0,127.0.0.1:9000 -debug
unordered shim running setup for VarD
unordered shim ready for action
client 115512982 connected on 127.0.0.1:49446
client 115512982 disconnected: client closed socket

The client logged above is the following invocation:

python2 bench/setup.py --service vard --keys 50 --cluster 127.0.0.1:8000
Traceback (most recent call last):
  File "bench/setup.py", line 34, in <module>
    main()
  File "bench/setup.py", line 27, in main
    host, port = Client.find_leader(args.cluster)
  File "/Users/tschottdorf/tla/verdi-raft/extraction/vard/bench/vard.py", line 27, in find_leader
    raise cls.NoLeader
vard.NoLeader

I haven't dug deeper but I did verify that I can run the benchmarks against a three-node cluster (everything running on the same machine). So, perhaps I'm silly or there is a problem with the edge case of a single-node system.

Raft specification leader staleness

Hi,

I was going through the Raft spec and the following lines in handleAppendEntriesReply looked inverted to me

https://github.com/uwplse/verdi-raft/blob/master/raft/Raft.v#L244-L249

 Definition handleAppendEntriesReply (me : name) state src term entries (result : bool)
  : raft_data * list (name * msg) :=
....
....
    else if currentTerm state <? term then
      (* follower behind, ignore *)
      (state, [])
    else
      (* leader behind, convert to follower *)
      (advanceCurrentTerm state term, []).

It seems to me that if currentTerm state is less than the follower's term, then the leader is behind and we should call advanceCurrentTerm to update its state. Similarly, in the final case the follower is behind and we should leave state unchanged, although calling advanceCurrentTerm on it would not make a difference.

Crash during update of snapshot causes loss of data

From @pfons on April 13, 2016 0:29

A crash of the server while it executes the function that writes a snapshot to disk (updating the existing snapshot) can cause loss of data and prevent the server from recovering correctly afterwards.

This bug is more serious than issue #50 because it can lead to loss of data. Loss of data can happen because the server, when it crashes while executing the function save, deletes/truncates the existing disk snapshot before it safely writes the new snapshot to disk.

This problem can be reproduced by simulating a crash immediately after the snapshot file is opened with O_TRUNC (save function in Shim.ml) and before the write is actually made, for example, by adding the statement assert(env.saves < 10000);.

It is probably harder to fix this bug than issue #50 because a correct implementation needs to ensure that several steps (i.e., replacing the old snapshot with the new snapshot and truncating the log) are atomic despite crashes.
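
One standard approach is to write the new snapshot to a temporary file, flush it to disk, and only then rename it over the old snapshot, so the old snapshot is never truncated before the new one is durable. A minimal sketch, assuming the snapshot is serialized with Marshal as in the OCaml shim (the file layout and function name are illustrative):

  (* Sketch: crash-safe snapshot replacement. The existing snapshot file is
     never truncated; it is replaced atomically by rename(2) only after the
     new snapshot has been flushed to disk. *)
  let save_snapshot (path : string) (state : 'a) : unit =
    let tmp = path ^ ".tmp" in
    let fd = Unix.openfile tmp [Unix.O_WRONLY; Unix.O_CREAT; Unix.O_TRUNC] 0o640 in
    let oc = Unix.out_channel_of_descr fd in
    Marshal.to_channel oc state [];
    flush oc;
    Unix.fsync fd;          (* ensure the bytes have reached the disk *)
    close_out oc;
    Unix.rename tmp path    (* atomic replacement on POSIX file systems *)

For full durability the parent directory would also need to be fsync-ed after the rename, and the same write-then-rename discipline would have to cover the coupled step of truncating the log once the new snapshot is in place.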

Copied from original issue: uwplse/verdi#39

Server is unable to recover when disk log is incomplete due to a crash while writing an entry

From @pfons on April 13, 2016 0:17

A crash during a write to the disk log (by M.to_channel in function save on file Shim.ml) can cause a partial entry to be appended to the log. When this happens the server that crashed is not able to recover and produces the following error when starting:
"Fatal error: exception Failure("input_value: truncated object")"

This problem can be simulated by stopping the server and subsequently trimming the last byte from the log. We expect partial writes during crashes to be more likely to occur when the log write crosses disk block boundaries.
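
One way to make recovery tolerate a torn tail is to treat a deserialization failure at the end of the log as end-of-log rather than a fatal error, replaying only the entries that were written completely. A minimal sketch, assuming entries were appended with Marshal.to_channel as in Shim.ml (the entry type is left abstract):

  (* Sketch: read marshaled entries until EOF or a truncated trailing object.
     Complete entries are returned in order; a partial final entry is dropped,
     since it was never acknowledged as durable. *)
  let read_log_entries (ic : in_channel) : 'a list =
    let rec go acc =
      match Marshal.from_channel ic with
      | entry -> go (entry :: acc)
      | exception End_of_file -> List.rev acc
      | exception Failure _ -> List.rev acc  (* "input_value: truncated object" *)
    in
    go []

The recovery code would then also want to truncate the file back to the end of the last complete entry before appending new entries.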

Copied from original issue: uwplse/verdi#38

Proposal: verify Raft handlers down to machine code level

This is a more ambitious version of uwplse/verdi#123, where one uses VST to refine the Raft handlers to C code, which CompCert can compile to verified machine code. As a first step, the verified code could interact with the existing Verdi shim that Verdi Raft uses via OCaml's C interface. One could also implement a standalone shim in C by taking inspiration from the OCaml shim.

Proposal: store log count in log file

Using the verified log transformer, the current number of log entries is stored as a nat in a separate file (Count). Having a separate file can be avoided by letting the snapshot_interval be a fixed-size "machine" integer, e.g., of type int31, and storing the current number of log entries as such an integer at the head of the Log file. This would require having an operation that replaces the head of a file while leaving the rest intact. Note that overflow is impossible because one can prove that the count is always less than snapshot_interval.
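
A rough sketch of how the fixed-width header could look on the extracted OCaml side, assuming a 32-bit big-endian count stored at offset 0 of the Log file and an OCaml version with Bytes.set_int32_be (the function names are illustrative):

  (* Sketch: keep the entry count as a fixed-width header at the start of the
     Log file, so it can be rewritten in place without a separate Count file.
     Error handling is reduced to asserts for brevity. *)
  let write_count (fd : Unix.file_descr) (count : int) : unit =
    let buf = Bytes.create 4 in
    Bytes.set_int32_be buf 0 (Int32.of_int count);
    ignore (Unix.lseek fd 0 Unix.SEEK_SET);
    assert (Unix.write fd buf 0 4 = 4)

  let read_count (fd : Unix.file_descr) : int =
    let buf = Bytes.create 4 in
    ignore (Unix.lseek fd 0 Unix.SEEK_SET);
    assert (Unix.read fd buf 0 4 = 4);
    Int32.to_int (Bytes.get_int32_be buf 0)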

Client-server marshaling allows users to inject commands and to crash the server

From @pfons on April 13, 2016 2:37

The server-client communication protocol relies on spaces to delimit the arguments of requests and newlines to delimit the requests themselves, which are sent over TCP connections. Because no validation is performed on the characters of the command arguments (keys and values) and the meta-characters (newlines and spaces) are not escaped, users of vard.py can provide specially crafted input to: 1) crash the server; 2) inject commands that are executed by the servers and cause subsequent requests to return wrong results.

To reproduce the bug it suffices to create a client application that issues the following vard.py library calls:

  1. Crash server:
    GET("key1 - - \n")

During the execution of the GET, the leader crashes; before it terminates, it produces the following error message:

client disconnected: received invalid input
Fatal error: exception Unix.Unix_error(Unix.EBADF, "send", "")
  2. Inject commands:
PUT(key1,key1) = key1
PUT(key2,key2) = key2
PUT(key3,key3) = key3
GET(key1 - - \n132201621 216857 GET key2) = key1
GET(key1) = key2
GET(key2) = key1
GET(key3) = key2

Note that the last three GET operations produce an incorrect result.
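
A minimal mitigation on the server side, under the current space/newline-delimited wire format, is to reject keys and values that contain the delimiter characters before they reach the state machine; a more robust fix would length-prefix each field. A sketch (not the shim's actual parsing code):

  (* Sketch: refuse arguments containing the protocol's meta-characters, so a
     crafted key or value cannot be re-parsed as extra commands on the wire. *)
  let valid_arg (s : string) : bool =
    s <> ""
    && not (String.contains s ' ')
    && not (String.contains s '\n')
    && not (String.contains s '\r')

  let parse_put (key : string) (value : string) : (string * string, string) result =
    if valid_arg key && valid_arg value then Ok (key, value)
    else Error "invalid characters in key or value"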

Copied from original issue: uwplse/verdi#42

Transient system call errors during recovery cause inconsistent re-initialization

From @pfons on April 13, 2016 2:29

During recovery, when opening the snapshot file, the server presumes that any error it encounters means that no snapshot was created (function get_initial_state in file Shim.ml). However, errors while opening a file can be caused by transient OS problems such as insufficient kernel memory (ENOMEM) or exceeding the system-wide limit on open files (ENFILE). If such an error occurs during recovery, the server will silently discard part of the persistent state (the disk snapshot) while still reading the rest of the persistent state (the disk log), which will lead to safety problems.

The following sequence of steps should reproduce the bug:
a) issue client PUT requests so that snapshots and log entries are written to disk (~1000 requests)
b) stop all servers
c) remove all the permissions of the respective snapshot files (chmod 000 verdi-snapshot-900*).
d) restart all servers
e) issue one GET client request

In our tests, after this sequence of events the GET client request after recovery (step e) returns a result as if the key value store had not been populated (step a).

Apart from having replicas forget about all their state, it may be possible to create test cases where the replicas partially forget about their state given that, after recovery, replicas discard the snapshot but not the disk log.
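
A more conservative recovery policy would treat only a genuinely missing snapshot file (ENOENT) as "no snapshot" and abort startup on any other error instead of silently reinitializing. A sketch of that policy (not the shim's actual get_initial_state):

  (* Sketch: only ENOENT means "no snapshot yet"; any other failure while
     opening the snapshot aborts startup, so a transient OS error cannot
     silently discard persistent state. *)
  let open_snapshot_opt (path : string) : in_channel option =
    match Unix.openfile path [Unix.O_RDONLY] 0o640 with
    | fd -> Some (Unix.in_channel_of_descr fd)
    | exception Unix.Unix_error (Unix.ENOENT, _, _) -> None
    | exception Unix.Unix_error (err, _, _) ->
        Printf.eprintf "fatal: cannot open snapshot %s: %s\n%!"
          path (Unix.error_message err);
        exit 1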

Copied from original issue: uwplse/verdi#40

Trouble Building in Coq 8.10-8.12.2

I'm trying to build verdi-raft for a benchmark set I'm using, and I'm having some trouble.

My tool supports the released versions of Coq 8.10.0 - 8.12.2.

With Coq 8.10 and 8.11, I can install all the dependencies of verdi-raft through opam fine, but when I try to build verdi-raft itself, I get a failure of the lia tactic:

File "./raft/CommonTheorems.v", line 87, characters 11-15:
Error: Tactic failure:  Cannot find witness.

If I revert a few commits back, omega is used instead of lia, which I figured might work better on older coq versions, but then I get errors that omega was never imported.

If instead I try to build with Coq 8.12.2, I find that opam won't install the dependencies:

(base) [5] sanchezstern@swarm033> opam install coq-verdi
Sorry, no solution found: there seems to be a problem with your request.

No solution found, exiting

This surprised me because there are commits in verdi-raft that claim to support up to Coq 8.14. But the StructTact and Cheerios dependencies both seem to list coq constraints:

  "coq" {(>= "8.6.1" & < "8.12~") | (= "dev")`)

Which requires that the Coq version be strictly less than 8.12, or be "dev". The opam file for StructTact on GitHub appears to relax this constraint a bit, but the one indexed by opam in extra-dev still has the hard less-than-8.12 constraint (as does the opam file for Cheerios in both places).

Basically, the gist of my question is, is there a recommended procedure for building verdi-raft now on non-dev coq?

vard should signal when recovery from file fails

The current implementation of vard and shim tries to read state from snapshot (and command log) on startup, and if that fails, loads the initial state silently. A better behavior is to signal when the snapshot and/or command log files are present, but cannot be used to build the initial state.
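
A sketch of the suggested behavior, with the recovery entry point and state type left abstract (not the shim's actual code): start from the initial state only when the file is absent, and fail loudly when it exists but cannot be read.

  (* Sketch: distinguish "file absent" (fine: start fresh) from "file present
     but unreadable" (fail loudly instead of silently losing state). *)
  let load_or_fail (path : string) (initial : 'a) : 'a =
    if not (Sys.file_exists path) then initial
    else
      let ic = open_in_bin path in
      match Marshal.from_channel ic with
      | state -> close_in ic; state
      | exception _ ->
          close_in_noerr ic;
          prerr_endline ("fatal: " ^ path ^ " exists but could not be read");
          exit 1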

Server crashes when trying to produce large packets because of buffer overflow

From @pfons on April 13, 2016 0:11

We found a bug that causes a stack overflow on the leader when a lagging follower tries to recover. The overflow seems to occur within the recursive function “restore_from_log” (Shim.ml) when a very large packet is constructed and before the leader actually tries to send it.

This problem can be reproduced through the following process:
a) start 3 servers;
b) execute one client request;
c) stop a follower server;
d) execute many client requests (in our tests, at least 521,932 requests).
e) restart the server that was stopped

Here’s a sample output produced by the leader when it crashes:

   [Term 1] Sending 50 entries to 2 (currently have 521932 entries), commitIndex=521882
   [Term 1] Sending 521881 entries to 3 (currently have 521932 entries), commitIndex=521882
   [Term 1] Received AppendEntriesReply 50 entries true, commitIndex 521883
  Fatal error: exception Stack overflow
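
Since the recursion depth tracks how far behind the follower is, one mitigation is to make the replay loop tail-recursive (and, independently, to cap how many entries go into a single AppendEntries message). A minimal tail-recursive sketch with the handler application left abstract (this is not the shim's actual restore_from_log):

  (* Sketch: tail-recursive replay of log entries. Because [loop] calls itself
     in tail position, stack usage stays constant no matter how many entries
     must be replayed. Equivalent to List.fold_left. *)
  let restore_from_log (apply_entry : 'state -> 'entry -> 'state)
      (init : 'state) (entries : 'entry list) : 'state =
    let rec loop state = function
      | [] -> state
      | e :: rest -> loop (apply_entry state e) rest
    in
    loop init entries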

Copied from original issue: uwplse/verdi#37

Allow any decidable type for request and client ids

Currently, both Raft request ids and client ids are hardcoded to nat. For some purposes, it would be more appropriate to use some other decidable type, e.g., string, for either of these ids. The most attractive solution is to parameterize the Raft transformer on any decidable types for request and client ids.
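
On the extracted OCaml side this roughly corresponds to abstracting over a module that supplies the id type together with its decidable equality, instead of fixing ids to nat. An illustrative functor-style sketch of the idea (not the proposed Coq interface):

  (* Sketch: any type with decidable equality can serve as a client/request id. *)
  module type ID = sig
    type t
    val eq_dec : t -> t -> bool
  end

  module StringId : ID with type t = string = struct
    type t = string
    let eq_dec = String.equal
  end

  module MakeClientTable (Id : ID) = struct
    (* association list keyed by client id; lookups use the supplied equality *)
    type 'v t = (Id.t * 'v) list
    let empty : 'v t = []
    let find (id : Id.t) (tbl : 'v t) : 'v option =
      match List.find_opt (fun (k, _) -> Id.eq_dec k id) tbl with
      | Some (_, v) -> Some v
      | None -> None
  end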

Clients can livelock system

Due to how the current shim uses file descriptors and the select system call to multiplex between inputs and network communication, clients can prevent node communication from happening by constantly sending requests. This is possible because timeouts only occur when no reads on file descriptors are pending.
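
One way to restore fairness is to give Unix.select an explicit upper bound derived from the next Raft timeout and to serve only a bounded number of client requests per iteration, so peer traffic and timeouts still get a turn under constant client load. A rough sketch of such an event loop (the handler functions and fd accessors are placeholders, not the shim's API):

  (* Sketch: bound both the select timeout and the per-iteration client work,
     so a flood of client requests cannot starve Raft timeouts or peer traffic. *)
  let event_loop ~client_fds ~node_fds ~next_timeout_in
      ~handle_client ~handle_node ~handle_timeout =
    let max_client_batch = 32 in
    let rec loop () =
      let timeout = max 0.0 (next_timeout_in ()) in
      let ready, _, _ =
        Unix.select (node_fds () @ client_fds ()) [] [] timeout in
      if ready = [] then handle_timeout ()     (* select timed out *)
      else begin
        (* serve peers first, then at most max_client_batch clients *)
        List.iter handle_node
          (List.filter (fun fd -> List.mem fd (node_fds ())) ready);
        List.iteri
          (fun i fd -> if i < max_client_batch then handle_client fd)
          (List.filter (fun fd -> List.mem fd (client_fds ())) ready);
        if next_timeout_in () <= 0.0 then handle_timeout ()
      end;
      loop ()
    in
    loop ()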
