
redislabs / redisraft


A Redis Module that makes it possible to create a consistent Raft cluster from multiple Redis instances.

License: Other

C 62.12% Shell 2.50% Python 32.54% Dockerfile 0.21% HCL 1.08% CMake 1.43% Jinja 0.11% Smarty 0.01%

redisraft's Introduction

RedisRaft

โš ๏ธ RedisRaft is still being developed and is not yet ready for any real production use. Please do not use it for any mission critical purpose at this time.

Strongly-Consistent Redis Deployments

RedisRaft is a Redis module that implements the Raft Consensus Algorithm, making it possible to create strongly-consistent clusters of Redis servers.

The Raft algorithm is provided by a standalone Raft library. This is a fork of the original library created by Willem-Hendrik Thiart, which is now actively maintained by Redis Ltd.

Main Features

  • Strong consistency (in the language of CAP, this system prioritizes consistency and partition-tolerance).
  • Support for most Redis data types and commands
  • Dynamic cluster configuration (adding / removing nodes)
  • Snapshots for log compaction
  • Configurable quorum or fast reads

Getting Started

Building

To compile the module, you will need:

  • Build essentials (a compiler, GNU make, etc.)
  • CMake
  • GNU autotools (autoconf, automake, libtool).
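
On Debian or Ubuntu, for example, these prerequisites can typically be installed with a command like the following (package names may vary by distribution):

sudo apt-get install build-essential cmake autoconf automake libtool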

To build:

git clone https://github.com/RedisLabs/redisraft.git
cd redisraft
mkdir build && cd build
cmake ..
make

redisraft.so will be created under the project directory.

Creating a RedisRaft Cluster

RedisRaft requires Redis built from the 'unstable' branch. Build Redis first:

git clone https://github.com/redis/redis  
cd redis
make 
make install

To create a three-node cluster, start the first node:

redis-server \
    --port 5001 --dbfilename raft1.rdb \
    --loadmodule <path-to>/redisraft.so \
    --raft.log-filename raftlog1.db \
    --raft.addr localhost:5001

Then initialize the cluster:

redis-cli -p 5001 raft.cluster init

Now start the second node, and run the RAFT.CLUSTER JOIN command to join it to the existing cluster:

redis-server \
    --port 5002 --dbfilename raft2.rdb \
    --loadmodule <path-to>/redisraft.so \
    --raft.log-filename raftlog2.db \
    --raft.addr localhost:5002

redis-cli -p 5002 RAFT.CLUSTER JOIN localhost:5001

Now add the third node in the same way:

redis-server \
    --port 5003 --dbfilename raft3.rdb \
    --loadmodule <path-to>/redisraft.so \
    --raft.log-filename raftlog3.db \
    --raft.addr localhost:5003

redis-cli -p 5003 RAFT.CLUSTER JOIN localhost:5001

To query the cluster state, run the INFO raft command:

redis-cli -p 5001 INFO raft

Now you can start using this RedisRaft cluster. All supported Redis commands will be executed in a strongly-consistent manner using the Raft protocol.
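
For example, ordinary commands submitted through redis-cli are executed through the Raft protocol before being acknowledged (a minimal illustration, assuming the node on port 5001 is still the leader; the key and value are arbitrary):

redis-cli -p 5001 SET mykey somevalue
redis-cli -p 5001 GET mykey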

Documentation

Please consult the documentation for more information.

License

RedisRaft is licensed under the Redis Source Available License 2.0 (RSALv2) or the Server Side Public License v1 (SSPLv1).

redisraft's People

Contributors

amiramm, danni-m, dipach, doflink, fadidahanna, gavrie, jhershberg, lliujj, maguec, natoscott, sjpotter, tezc, yaeltzirulnikov, yossigo, zouyonghao


redisraft's Issues

Is this module in Preview?

We should identify that this module is in Preview (and not fully released) in the README if that's the case.

Assert error in quorum_msg_id

Occasionally, a leader may abort with an assertion error:

redis-server: src/raft_server.c:1619: quorum_msg_id: Assertion `num_voting_nodes == msg_ids_count' failed.

Some interim observations:

  • This happens with current_term == 1
  • Leader's own node is flagged non-voting and inactive

Nodes:

(gdb) p *(raft_node_private_t *) $r->nodes[0]
$10 = {udata = 0x0, next_idx = 1, match_idx = 0, flags = 44, id = 350551569, last_acked_term = 0, last_acked_msgid = 0}
(gdb) p *(raft_node_private_t *) $r->nodes[1]
$11 = {udata = 0x7ffaff81ca80, next_idx = 1962, match_idx = 1961, flags = 54, id = 314351675, last_acked_term = 1, last_acked_msgid = 277}
(gdb) p *(raft_node_private_t *) $r->nodes[2]
$12 = {udata = 0x7ffaff81c000, next_idx = 2133, match_idx = 2132, flags = 54, id = 1275574903, last_acked_term = 1, last_acked_msgid = 295}

Raft state:

$13 = {current_term = 1, voted_for = -1, log_impl = 0x7ffb027fec20 <RaftLogImpl>, log = 0x7ffb027fee20 <redis_raft>, commit_idx = 2132, last_applied_idx = 2132, state = 3,
  timeout_elapsed = 0, nodes = 0x7ffb02d9d7e0, num_nodes = 3, election_timeout = 1000, election_timeout_rand = 1677, request_timeout = 200, 
  current_leader = 0x7ffaff80a030, cb = {send_requestvote = 0x7ffb02594847 <raftSendRequestVote>, send_appendentries = 0x7ffb02594b25 <raftSendAppendEntries>,
    send_snapshot = 0x7ffb0259dec6 <raftSendSnapshot>, applylog = 0x7ffb025950fc <raftApplyLog>, persist_vote = 0x7ffb02594fab <raftPersistVote>,
    persist_term = 0x7ffb0259504f <raftPersistTerm>, log_get_node_id = 0x7ffb02595306 <raftLogGetNodeId>,
    node_has_sufficient_logs = 0x7ffb0259532e <raftNodeHasSufficientLogs>, notify_membership_event = 0x7ffb02595498 <raftNotifyMembershipEvent>,
    notify_state_event = 0x7ffb025957b7 <raftNotifyStateEvent>, log = 0x7ffb025952d0 <raftLog>}, udata = 0x7ffb027fee20 <redis_raft>, node = 0x7ffaff80a030,
  voting_cfg_change_log_idx = -1, connected = 1, snapshot_in_progress = 0, snapshot_flags = 1, snapshot_last_idx = 2087, snapshot_last_term = 1,
  saved_snapshot_last_idx = 1856, saved_snapshot_last_term = 1, msg_id = 295, read_queue_head = 0x7ffb02c11370, read_queue_tail = 0x7ffb02c11370}

Removing node in an uninitialised cluster crashes the server

Running redis server with redisraft using:
redis-server --port 5001 --dbfilename raft1.rdb --loadmodule redisraft.so id=1 raft-log-filename=raftlog1.db addr=localhost:5001
Running redis-cli -p 5001 RAFT.NODE remove 1 causes the server to crash:
=== REDIS BUG REPORT START: Cut & paste starting from here ===
47734:M 18 Feb 2019 15:23:12.982 # Redis 5.0.0 crashed by signal: 11
47734:M 18 Feb 2019 15:23:12.982 # Crashed running the instruction at: 0x10e73c3f1
47734:M 18 Feb 2019 15:23:12.982 # Accessing address: 0x40
47734:M 18 Feb 2019 15:23:12.982 # Failed assertion: (:0)
------ STACK TRACE ------
EIP:
0 redisraft.so 0x000000010e73c3f1 raft_get_node + 17
Backtrace:
0 redis-server 0x000000010e56b52e logStackTrace + 110
1 redis-server 0x000000010e56b8fd sigsegvHandler + 253
2 libsystem_platform.dylib 0x00007fff6e3c0f5a _sigtramp + 26
3 ??? 0x0000000000000001 0x0 + 1
4 redisraft.so 0x000000010e72960d cmdRaftNode + 749
5 redis-server 0x000000010e59aa33 RedisModuleCommandDispatcher + 147
6 redis-server 0x000000010e51d08c call + 220
7 redis-server 0x000000010e51da54 processCommand + 1556
8 redis-server 0x000000010e52e07f processInputBuffer + 495
9 redis-server 0x000000010e514e9c aeProcessEvents + 732
10 redis-server 0x000000010e5151bb aeMain + 43
11 redis-server 0x000000010e520cce main + 1726
12 libdyld.dylib 0x00007fff6e0b2015 start + 1
------ INFO OUTPUT ------
Server
redis_version:5.0.0
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:546b15d3f28a6ead
redis_mode:standalone
os:Darwin 17.6.0 x86_64
arch_bits:64
multiplexing_api:kqueue
atomicvar_api:atomic-builtin
gcc_version:4.2.1
process_id:47734
run_id:af2696412bfd898df44578cdd50ef7037a6d9d56
tcp_port:5001
uptime_in_seconds:6
uptime_in_days:0
hz:10
configured_hz:10
lru_clock:6992320
executable:/Users/yaeltzirulnikov/code-redisraft/redisraft/redis-server
config_file:
Clients
connected_clients:1
client_recent_max_input_buffer:2
client_recent_max_output_buffer:0
blocked_clients:0
Memory
used_memory:1063312
used_memory_human:1.01M
used_memory_rss:3457024
used_memory_rss_human:3.30M
used_memory_peak:1063312
used_memory_peak_human:1.01M
used_memory_peak_perc:100.11%
used_memory_overhead:1037014
used_memory_startup:987328
used_memory_dataset:26298
used_memory_dataset_perc:34.61%
allocator_allocated:1028928
allocator_active:3419136
allocator_resident:3419136
total_system_memory:17179869184
total_system_memory_human:16.00G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:3.32
allocator_frag_bytes:2390208
allocator_rss_ratio:1.00
allocator_rss_bytes:0
rss_overhead_ratio:1.01
rss_overhead_bytes:37888
mem_fragmentation_ratio:3.36
mem_fragmentation_bytes:2428096
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:49686
mem_aof_buffer:0
mem_allocator:libc
active_defrag_running:0
lazyfree_pending_objects:0
Persistence
loading:0
rdb_changes_since_last_save:0
rdb_bgsave_in_progress:0
rdb_last_save_time:1550496186
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:0
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0
Stats
total_connections_received:1
total_commands_processed:2
instantaneous_ops_per_sec:0
total_net_input_bytes:55
total_net_output_bytes:11919
instantaneous_input_kbps:0.01
instantaneous_output_kbps:6.61
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
expired_stale_perc:0.00
expired_time_cap_reached_count:0
evicted_keys:0
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
Replication
role:master
connected_slaves:0
master_replid:83f7339d0e5e71681a975b0b989d56044e369d06
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
CPU
used_cpu_sys:0.016633
used_cpu_user:0.008875
used_cpu_sys_children:0.000000
used_cpu_user_children:0.000000
Commandstats
cmdstat_command:calls=1,usec=868,usec_per_call=868.00
cmdstat_config:calls=1,usec=46,usec_per_call=46.00
Cluster
cluster_enabled:0
Keyspace
------ CLIENT LIST OUTPUT ------
id=5 addr=127.0.0.1:57163 fd=17 name= age=1 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=38 qbuf-free=32730 obl=0 oll=0 omem=0 events=r cmd=raft.node
------ CURRENT CLIENT INFO ------
id=5 addr=127.0.0.1:57163 fd=17 name= age=1 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=38 qbuf-free=32730 obl=0 oll=0 omem=0 events=r cmd=raft.node
argv[0]: 'RAFT.NODE'
argv[1]: 'remove'
argv[2]: '1'
------ REGISTERS ------
47734:M 18 Feb 2019 15:23:12.985 #
RAX:0000000000000000 RBX:000000005c6ab1c0
RCX:0000000000000001 RDX:0000000000000001
RDI:0000000000000000 RSI:0000000000000001
RBP:00007ffee16ef520 RSP:00007ffee16ef500
R8 :0000000000000000 R9 :0000000000000001
R10:0000000000000008 R11:0000000000000001
R12:0000000000000000 R13:0000000000000000
R14:00007fdd04006400 R15:0000000000000001
RIP:000000010e73c3f1 EFL:0000000000010202
CS :000000000000002b FS:0000000000000000 GS:0000000000000000
47734:M 18 Feb 2019 15:23:12.985 # (00007ffee16ef50f) -> 0000000000000000
47734:M 18 Feb 2019 15:23:12.985 # (00007ffee16ef50e) -> 0000000000000001
47734:M 18 Feb 2019 15:23:12.985 # (00007ffee16ef50d) -> 00007fff6e26a1b3
47734:M 18 Feb 2019 15:23:12.985 # (00007ffee16ef50c) -> 00007ffee16ef5e0
47734:M 18 Feb 2019 15:23:12.985 # (00007ffee16ef50b) -> 00007fffa6bde2cc
47734:M 18 Feb 2019 15:23:12.985 # (00007ffee16ef50a) -> 0000000000100000
47734:M 18 Feb 2019 15:23:12.985 # (00007ffee16ef509) -> 00000000000fffff
47734:M 18 Feb 2019 15:23:12.985 # (00007ffee16ef508) -> 0000000007000001
47734:M 18 Feb 2019 15:23:12.985 # (00007ffee16ef507) -> 00000000000000c0
47734:M 18 Feb 2019 15:23:12.985 # (00007ffee16ef506) -> 00007ffee16ef5b0
47734:M 18 Feb 2019 15:23:12.985 # (00007ffee16ef505) -> 000000010e72960d
47734:M 18 Feb 2019 15:23:12.985 # (00007ffee16ef504) -> 00007ffee16ef6d0
47734:M 18 Feb 2019 15:23:12.985 # (00007ffee16ef503) -> 000000010e6055b8
47734:M 18 Feb 2019 15:23:12.985 # (00007ffee16ef502) -> 00007fdd04006400
47734:M 18 Feb 2019 15:23:12.985 # (00007ffee16ef501) -> 00000000000efe5c
47734:M 18 Feb 2019 15:23:12.985 # (00007ffee16ef500) -> 000000005c6ab1c0
------ DUMPING CODE AROUND EIP ------

Split brain and lost updates

(Splitting this out from #43; it's looking like these might be separate issues)

With Jepsen 6c063da, Redis f88f866, and Redis-Raft b9ee410, testing with network partitions, process crashes, pauses, and membership changes resulted in all kinds of unusual behavior, including what looks like split-brain and the loss of acknowledged writes.

lein run test-all --raft-version b9ee410 --time-limit 600 --nemesis partition,kill,pause,member --follower-proxy --test-count 16 --concurrency 4n

20200505T133553.000-0400.zip

In this test run, several keys exhibited split-brain anomalies: multiple timelines of a single key were concurrently observed by distinct subsets of nodes, resulting in the apparent loss of some, or all, committed writes.

Minutes later, multiple nodes observe a gap between their snapshot and initial log index. Logs also contained 0 entries. These happen more frequently in runs which do not exhibit any split-brain behavior, and may be a separate issue--see #43.

Several keys in this history exhibited split-brain behavior:

({:key 7,
  :values [[1 2 3 4 5 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 86 85 87 88 89 90 91 92]
           [1 2 3 4 5 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 86 85 87 88 89 90 91 112 114]]}
 {:key 31, :values [[1]
                    [72 73 75 76]]}
 {:key 32, :values [[53] [1]]}
 {:key 6, :values [[1 2 3 4 5 6 7 9 10 11 12 13 14 15 16 17 18 19 20 21 44 46 47 48 50 53]
                   [1 2 3 4 5 6 7 9 10 11 12 13 14 15 16 17 18 19 20 21 44 46 47 48 54 56]]}
 {:key 34, :values [[49]
                    [2]]}
 {:key 23, :values [[55 56 57 58 60 61 62 63 64 65 66]
                    [55 56 57 58 60 61 62 63 64 65 32 89]]})

Key 6, for example, diverged around 10:37:41 to 10:37:49, where nodes n2 and n3 started observing a history with element 50, while nodes n4 and n5 saw a history with element 54.

2020-05-05 13:37:41,343{GMT}	INFO	[jepsen worker 11] jepsen.util: 51	:ok	:txn	[[:r 6 [1 2 3 4 5 6 7 9 10 11 12 13 14 15 16 17 18 19 20 21 44 46 47 48 50]] [:append 32 6] [:r 31 [1 2 3 4 5 6 7 8 9]] [:append 32 7]]
2020-05-05 13:37:49,180{GMT}	INFO	[jepsen worker 13] jepsen.util: 113	:ok	:txn	[[:r 6 [1 2 3 4 5 6 7 9 10 11 12 13 14 15 16 17 18 19 20 21 44 46 47 48 54 56]]]

Key 7 diverged around 13:37:40 to 13:37:45: n4 and n5 diverged from n1, n2, and n3. Try, for instance, grep --color -e ':r 7 \[[0-9 ]* 92' jepsen.log vs grep --color -e ':r 7 \[[0-9 ]* 112' jepsen.log to see which reads observed which fork.

Key 31 gives us a more precise time boundary: at 10:37:45, reads of key 31 on node n4 started to observe an alternate timeline where all prior writes had not occurred:

2020-05-05 13:37:45,458{GMT}    INFO    [jepsen worker 1] jepsen.util: 41       :ok     :txn    [[:r 34 [2 4 5 1 8 3 6 7 9 10 11 16 17 18 14 15 12 13 21 19 23 20 24 22 25 26 27 28 29 30 31 32 37 38 33 36 34 35 39 42 40 41 45 46 47 51 43 44]] [:r 31 [1 2 3 4 5 6 7 8 9 16 15 19 22 23 24 25 21 28 29 31 32 33 34 36 38 39 42 44 43 46 48 49 52 54 55 56 57 59 60 58 61 62 63 64 66 65 67 68 70 69 71]]]
2020-05-05 13:37:45,535{GMT}	INFO	[jepsen worker 18] jepsen.util: 138	:ok	:txn	[[:append 34 66] [:append 31 76] [:r 31 [72 73 75 76]]]

Nodes n1, n2, and n3 continued to observe the original timeline of key 31 until 13:37:45.969, half a second later:

2020-05-05 13:37:45,969{GMT}	INFO	[jepsen worker 12] jepsen.util: 52	:ok	:txn	[[:r 31 [1 2 3 4 5 6 7 8 9 16 15 19 22 23 24 25 21 28 29 31 32 33 34 36 38 39 42 44 43 46 48 49 52 54 55 56 57 59 60 58 61 62 63 64 66 65 67 68 70 69 71 74 77 78 82 84 85]] [:append 34 107]]

Nodes n4 and n5 continued to observe the new timeline (beginning with [72 73 ...) through 13:37:51.630, when Jepsen moved on to a new key.

Raft should log all cluster events

Currently RAFT only logs the following two events:

23939:M 06 Apr 2020 18:18:31.513 * RedisRaft starting, arguments: addr=10.161.18.196:19921 raft-log-fsync=no follower-proxy=yes raft-log-filename=19921-raftlog.db

23939:06 Apr 18:18:47.130 Joined Raft cluster, node id: 912788334, dbid: f86f449a23b27f59693e68614eff391d

If I kill a node and one of the followers gets promoted, I would like to see the following events:

  1. The other nodes should log something similar to the following (prefixed with RedisRaft or something consistent, so I can filter these logs):
TIMESTAMP <redisraft> NodeId: 912788334 is no longer visible to dbid: f86f449a23b27f59693e68614eff391d
  2. On election:
TIMESTAMP <redisraft> NodeId: 912788334 has begun the election process
TIMESTAMP <redisraft> NodeId: 912788334 has been promoted to raft leader

DISCARD doesn't always discard

Calling MULTI puts a connection into a batch state, but calling DISCARD doesn't necessarily end that batch state. In particular, DISCARD frequently fails on followers without proxies enabled, returning MOVED, NOLEADER, or NOTLEADER. When this happens, the connection is still usable, but subsequent requests will execute within a batch context, rather than independently. This leads to correctness problems!

This makes it difficult for clients to reliably use MULTI, because a transaction wrapper (e.g. (with-txn conn (do-some-stuff-with conn))) has no way to guarantee that it has returned the connection to a non-batch context by the end of the block, unless it:

  1. Closes the connection altogether
  2. Blocks and retries (possibly indefinitely) until the connection accepts a DISCARD operation.
  3. Sets some kind of mutable state, ensuring that any future use of that connection issues a DISCARD prior to making a request.

MULTI & DISCARD modify in-memory state purely on the listening server, right? Would it be possible to guarantee that DISCARD always succeeds when a server is reachable?
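
A hedged sketch of the failure mode described above, on a follower without follower-proxy enabled (the exact replies are illustrative and may differ between versions):

127.0.0.1:5002> MULTI
OK
127.0.0.1:5002> DISCARD
(error) MOVED 127.0.0.1:5001
127.0.0.1:5002> SET x 1
QUEUED

The SET is still queued inside the stale MULTI context instead of executing independently, which is what makes the connection unsafe to reuse.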

Improve proxy connection management

Currently nodes use the same connection to exchange Raft related messages (AppendEntries, RequestVote, etc.) and handle follower proxy traffic (i.e. RAFT.ENTRY commands).

Ideally we may want to maintain separate connections for this purpose, and possibly use connection pools to improve proxy performance.

Note: we'll need to consider consistency when pushing client requests through, to avoid re-ordering and introducing consistency issues.

Automatically generate node IDs

Auto-generated node IDs will probably be safer and easier. A few notes:

  1. We'll need to store the node ID in the raft log header, as we can't expect the user to provide it as an argument anymore.
  2. For testing purposes, it's probably a good idea to still have the option to control the ID manually.

Crash (but also not a crash?) in RaftLogWriteEntry+0x85

With redis f88f866 (redisraft 2d1cf30), process kills can result in a node segfaulting, but... also maybe the process kept running? I'm not sure what to make of these log entries.

20200510T112604.000-0400.zip

214601:10 May 08:33:28.418 Raft log file size is 38850, initiating snapshot.
214601:10 May 08:33:28.418 Initiating snapshot.
214601:10 May 08:33:28.418 <raftlib> begin snapshot sli:14981 slt:80 slogs:234
214601:10 May 08:33:28.418 Snapshot scope: first_entry_idx=14981, current_idx=15026
214601:10 May 08:33:28.420 node:560516317: raft.c:479: <raftlib> recvd appendentries t:83 ci:15026 lc:15029 pli:15008 plt:80 #21
214612:C 10 May 2020 08:33:28.429 * DB saved on disk


=== REDIS BUG REPORT START: Cut & paste starting from here ===
214612:C 10 May 2020 08:33:28.438 # Redis 999.999.999 crashed by signal: 11
214612:C 10 May 2020 08:33:28.438 # Crashed running the instruction at: 0x7f60c1bb249f
214612:C 10 May 2020 08:33:28.438 # Accessing address: (nil)
214612:C 10 May 2020 08:33:28.438 # Failed assertion: <no assertion failed> (<no file>:0)

------ STACK TRACE ------
EIP:
/opt/redis/redisraft.so(RaftLogWriteEntry+0x85)[0x7f60c1bb249f]

Backtrace:
redis-server 0.0.0.0:6379(logStackTrace+0x37)[0x561e561f5e57]
redis-server 0.0.0.0:6379(sigsegvHandler+0xb0)[0x561e561f6580]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x110e0)[0x7f60c2b460e0]
/opt/redis/redisraft.so(RaftLogWriteEntry+0x85)[0x7f60c1bb249f]
/opt/redis/redisraft.so(RaftLogRewrite+0x9e)[0x7f60c1bb2aad]
/opt/redis/redisraft.so(initiateSnapshot+0x5c0)[0x7f60c1bad33b]
/opt/redis/redisraft.so(+0x2464e)[0x7f60c1ba864e]
/opt/redis/redisraft.so(uv__run_timers+0x58)[0x7f60c1bc56f8]
/opt/redis/redisraft.so(uv_run+0x7e)[0x7f60c1bc9b7e]
/opt/redis/redisraft.so(+0x24737)[0x7f60c1ba8737]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x74a4)[0x7f60c2b3c4a4]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f60c287ed0f]

------ INFO OUTPUT ------
# Server
redis_version:999.999.999
redis_git_sha1:f88f8661
redis_git_dirty:0
redis_build_id:e03d01ed22d54f5a
redis_mode:standalone
os:Linux 5.6.0-1-amd64 x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:6.3.0
process_id:214612
run_id:8a724e3ef721e32c7c617ae3a012f0b0754c69f5
tcp_port:6379
uptime_in_seconds:2
uptime_in_days:0
hz:10
configured_hz:10
lru_clock:12066504
executable:/opt/redis/redis-server
config_file:

# Clients
connected_clients:8
client_recent_max_input_buffer:4
client_recent_max_output_buffer:0
blocked_clients:3
tracking_clients:0

# Memory
used_memory:1372672
used_memory_human:1.31M
used_memory_rss:6799360
used_memory_rss_human:6.48M
used_memory_peak:1372672
used_memory_peak_human:1.31M
used_memory_peak_perc:103.74%
used_memory_overhead:1076190
used_memory_startup:830000
used_memory_dataset:296482
used_memory_dataset_perc:54.63%
allocator_allocated:2144136
allocator_active:2641920
allocator_resident:10711040
total_system_memory:135201345536
total_system_memory_human:125.92G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:1.23
allocator_frag_bytes:497784
allocator_rss_ratio:4.05
allocator_rss_bytes:8069120
rss_overhead_ratio:0.63
rss_overhead_bytes:-3911680
mem_fragmentation_ratio:5.14
mem_fragmentation_bytes:5476208
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:234150
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0

# Persistence
loading:0
rdb_changes_since_last_save:0
rdb_bgsave_in_progress:0
rdb_last_save_time:1589124808
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:0
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0
module_fork_in_progress:0
module_fork_last_cow_size:0

# Stats
total_connections_received:8
total_commands_processed:732
instantaneous_ops_per_sec:76
total_net_input_bytes:11846
total_net_output_bytes:1853
instantaneous_input_kbps:5.05
instantaneous_output_kbps:1.11
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
expired_stale_perc:0.00
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:0
evicted_keys:0
keyspace_hits:283
keyspace_misses:1
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
tracking_total_keys:0
tracking_total_items:0

# Replication
role:master
connected_slaves:0
master_replid:48be50ae093588329980597873dfc4702b0eb26d
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:0.000000
used_cpu_user:0.003208
used_cpu_sys_children:0.000000
used_cpu_user_children:0.000000

# Modules
module:name=redisraft,ver=1,api=1,filters=1,usedby=[],using=[],options=[]

# Commandstats
cmdstat_info:calls=1,usec=79,usec_per_call=79.00
cmdstat_raft:calls=121,usec=3882,usec_per_call=32.08
cmdstat_lrange:calls=284,usec=3494,usec_per_call=12.30
cmdstat_raft.ae:calls=3,usec=230,usec_per_call=76.67
cmdstat_save:calls=1,usec=8621,usec_per_call=8621.00
cmdstat_raft.requestvote:calls=1,usec=43,usec_per_call=43.00
cmdstat_rpush:calls=318,usec=473,usec_per_call=1.49
cmdstat_config:calls=3,usec=47,usec_per_call=15.67

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=249,expires=0,avg_ttl=0

------ CLIENT LIST OUTPUT ------
id=8 addr=192.168.122.1:47660 fd=16 name= age=2 idle=0 flags=b db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=raft user=default
id=11 addr=192.168.122.1:47666 fd=17 name= age=2 idle=2 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=NULL user=default
id=12 addr=192.168.122.1:47668 fd=18 name= age=2 idle=0 flags=b db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=raft user=default
id=38 addr=192.168.122.13:58300 fd=23 name= age=1 idle=1 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=NULL user=default
id=39 addr=192.168.122.12:42016 fd=24 name= age=1 idle=1 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=NULL user=default
id=40 addr=192.168.122.11:51572 fd=25 name= age=1 idle=0 flags=b db=0 sub=0 psub=0 multi=-1 qbuf=3481 qbuf-free=29287 obl=0 oll=0 omem=0 events=r cmd=raft.ae user=default
id=41 addr=192.168.122.14:36048 fd=26 name= age=1 idle=1 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=NULL user=default
id=7 addr=192.168.122.1:47658 fd=15 name= age=2 idle=2 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=NULL user=default

------ REGISTERS ------
214612:C 10 May 2020 08:33:28.439 # 
RAX:0000000000000000 RBX:0000000000003ab2
RCX:0000000000000a0d RDX:0000000000000005
RDI:00007f60b0004b2d RSI:00007f60b00047a0
RBP:00007f60c1b80f30 RSP:00007f60c1b80f00
R8 :00007f60b0004700 R9 :00007f60b00046c0
R10:0000000000000075 R11:00000000ffffffff
R12:00007f60c2324000 R13:00007f60c2324048
R14:00007f60c1383000 R15:0000000000000000
RIP:00007f60c1bb249f EFL:0000000000010206
CSGSFS:002b000000000033
214612:C 10 May 2020 08:33:28.439 # (00007f60c1b80f0f) -> 00007f60c1bad33b
214612:C 10 May 2020 08:33:28.439 # (00007f60c1b80f0e) -> 00007f60c1b812e0
214612:C 10 May 2020 08:33:28.439 # (00007f60c1b80f0d) -> 0000000000000001
214612:C 10 May 2020 08:33:28.439 # (00007f60c1b80f0c) -> 0000000000003a86
214612:C 10 May 2020 08:33:28.439 # (00007f60c1b80f0b) -> 00007f60c2222300
214612:C 10 May 2020 08:33:28.439 # (00007f60c1b80f0a) -> 0000000000000000
214612:C 10 May 2020 08:33:28.439 # (00007f60c1b80f09) -> 00007f60c1dfecc0
214612:C 10 May 2020 08:33:28.439 # (00007f60c1b80f08) -> 00007f60c1b810a0
214612:C 10 May 2020 08:33:28.439 # (00007f60c1b80f07) -> 00007f60c1bb2aad
214612:C 10 May 2020 08:33:28.439 # (00007f60c1b80f06) -> 00007f60c1b80f70
214612:C 10 May 2020 08:33:28.439 # (00007f60c1b80f05) -> 000000000000000f
214612:C 10 May 2020 08:33:28.439 # (00007f60c1b80f04) -> 0000000b00000000
214612:C 10 May 2020 08:33:28.439 # (00007f60c1b80f03) -> 00007f60c1dfecc0
214612:C 10 May 2020 08:33:28.439 # (00007f60c1b80f02) -> 0000000000003a86
214612:C 10 May 2020 08:33:28.439 # (00007f60c1b80f01) -> 00007f60c2222300
214612:C 10 May 2020 08:33:28.439 # (00007f60c1b80f00) -> 0000000000000000

------ MODULES INFO OUTPUT ------

------ FAST MEMORY TEST ------
*** Preparing to test memory region 561e56545000 (2523136 bytes)
*** Preparing to test memory region 561e58638000 (135168 bytes)
*** Preparing to test memory region 7f60a4000000 (135168 bytes)
*** Preparing to test memory region 7f60ac000000 (135168 bytes)
*** Preparing to test memory region 7f60b0000000 (139264 bytes)
*** Preparing to test memory region 7f60b7800000 (8388608 bytes)
*** Preparing to test memory region 7f60b8000000 (135168 bytes)
*** Preparing to test memory region 7f60bc57c000 (2621440 bytes)
*** Preparing to test memory region 7f60bc7fd000 (8388608 bytes)
*** Preparing to test memor2020-05-10 08:34:00 Jepsen starting redis-server --bind 0.0.0.0 --dbfilename redis.rdb --loadmodule /opt/redis/redisraft.so loglevel=debug raft-log-filename=raftlog.db raft-log-max-file-size=32000 raft-log-max-cache-size=1000000 follower-proxy=yes
y region 7f60bcffe000 (8388608 bytes)
*** Preparing to test memory region 7f60bd7ff000 (8388608 bytes)
*** Preparing to test memory region 7f60be000000 (8388608 bytes)
*** Preparing to test memory region 7f60be800000 (10485760 bytes)
*** Preparing to test memory region 7f60bf380000 (8388608 bytes)
*** Preparing to test memory region 7f60bfb81000 (8388608 bytes)
*** Preparing to test memory region 7f60c0382000 (8388608 bytes)
*** Preparing to test memory region 7f60c0b83000 (8388608 bytes)
*** Preparing to test memory region 7f60c1384000 (8388608 bytes)
*** Preparing to test memory region 7f60c1dff000 (4096 bytes)
*** Preparing to test memory region 7f60c1e00000 (8388608 bytes)
*** Preparing to test memory region 7f60c2b31000 (16384 bytes)
*** Preparing to test memory region 7f60c2d4e000 (16384 bytes)
*** Preparing to test memory region 7f60c3674000 (32768 bytes)
*** Preparing to test memory region 7f60c3687000 (4096 bytes)

Weirdly, it doesn't look like that segfault actually triggered a crash, because the node kept logging "applying log" messages:

*** Preparing to test memory region 7f60c3687000 (4096 bytes)
.214601:10 May 08:33:28.447 node:560516317: raft.c:479: <raftlib> recvd appendentries t:83 ci:15029 lc:15029 pli:15008 plt:80 #21
214601:10 May 08:33:28.447 node:560516317: raft.c:479: <raftlib> recvd appendentries t:83 ci:15029 lc:15030 pli:15026 plt:83 #5
O.O.O.O.O.214601:10 May 08:33:28.465 node:560516317: node.c:361: NodeAddPendingResponse: id=1, type=proxy, request_time=1589124808465
214601:10 May 08:33:28.465 node:560516317: node.c:361: NodeAddPendingResponse: id=2, type=proxy, request_time=1589124808465
214601:10 May 08:33:28.476 node:560516317: raft.c:479: <raftlib> recvd appendentries t:83 ci:15031 lc:15033 pli:15026 plt:83 #9
O.O.O.214601:10 May 08:33:28.502 node:560516317: raft.c:479: <raftlib> recvd appendentries t:83 ci:15035 lc:15033 pli:15029 plt:83 #9
214601:10 May 08:33:28.529 <raftlib> applying log: 14982, id: 122846204 size: 65
214601:10 May 08:33:28.529 <raftlib> applying log: 14983, id: 1563058141 size: 65
214601:10 May 08:33:28.529 <raftlib> applying log: 14984, id: 966673969 size: 121
214601:10 May 08:33:28.529 <raftlib> applying log: 14985, id: 451325546 size: 115

We also encountered duplicate elements and incompatible reads in this history, which might be #52. To reproduce, use jepsen.redis ddb68c89, and try:

lein run test-all --raft-version 2d1cf30 --time-limit 600 --nemesis kill --follower-proxy --test-count 40 --concurrency 4n

Converge to Redis Cluster mode.

Earlier versions used custom -MOVED and other error codes to redirect clients to leaders or indicate a cluster down / no leader condition.

Recently we introduced cluster_mode which is more wire-level compatible with the Redis Cluster protocol specification and thus leverages cluster support by existing clients. This includes:

  • Proper -MOVED reply which includes the hash slot
  • Support -CLUSTERDOWN and other standard Cluster errors
  • Support CLUSTER SLOTS to advertise follower nodes as replicas (and if shard groups are used, also foreign shard groups).
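
For reference, a standard Redis Cluster redirect includes the hash slot of the key that triggered it; a reply of this form is what cluster-aware clients expect:

(error) MOVED 3999 127.0.0.1:6381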

We should converge to always use Redis Cluster mode as there is no advantage in maintaining both. We'll need to consider the aspects of doing this:

  • Compatibility with tests
  • Compatibility with Jepsen (it uses proxy mode so not a big issue, but might affect error handling)
  • Documentation

Aborted reads with NOLEADER

With redis f88f866 (raft dfd91d4), it looks like the NOLEADER error code might sometimes indicate a successful commit, rather than a failure. Is this intended behavior? I didn't observe it in prior builds, but... here's a case involving partitions, process kills, pauses, and member changes where it seems to have occurred.

20200312T151542.000-0400.zip

For instance, take this pair of transactions:

{:type :fail,
 :f :txn,
 :value [[:append 295 223]],
 :process 562,
 :time 597134938964,
 :error :noleader,
 :index 123270}
{:type :ok,
 :f :txn,
 :value [[:r 297 [150]]
        [:r 295 [223 228 229 233]]
        [:r 295 [223 228 229 233]]],
 :process 617,
 :time 597289181022,
 :index 123334}

The first transaction appended 223 to key 295, and got a NOLEADER error in response. The second transaction went on to read 223 as the first element of the list.

This is especially strange because we have earlier reads of key 295, and those reads thought the value at that time was [60], then [], then [223 228 ...]. At least one of these versions must be a dirty read, and I suspect there's a lost update problem here too.

{:type :ok, :f :txn, :value [[:r 295 [60]] [:append 297 3]], :process 404, :time 594802102644, :index 122571}
...
{:type :ok, :f :txn, :value [[:r 297 []] [:r 297 []] [:r 295 []] [:r 295 []]], :process 617, :time 597075025988, :index 123245}
...
{:type :ok, :f :txn, :value [[:r 297 [150]] [:r 295 [223 228 229 233]] [:r 295 [223 228 229 233]]], :process 617, :time 597289181022, :index 123334}

Is it necessary to check `RedisModule_RegisterCommandFilter` in `setRaftizeMode`?

... since it's already checked in redisraft.c:

    /* Sanity check Redis version */
    if (RedisModule_SubscribeToServerEvent == NULL ||
            RedisModule_RegisterCommandFilter == NULL) {
        RedisModule_Log(ctx, REDIS_NOTICE, "Redis Raft requires Redis 6.0 or newer!");
        return REDISMODULE_ERR;
    }

that is before call to setRaftizeMode:

RRStatus setRaftizeMode(RedisRaftCtx *rr, RedisModuleCtx *ctx, bool flag)
{
    /* Is it supported by Redis? */
    if (!RedisModule_RegisterCommandFilter) {
        /* No; but if flag is false no harm is done. */
        return flag ? RR_ERROR : RR_OK;
    }

So the if (!RedisModule_RegisterCommandFilter) is always false, and the bool raftize_all_commands in RedisRaftConfig is not actually used.

Remove raftize configuration option.

The use of command filtering hooks to automatically run all commands through the Raft subsystem should be the only mode of operation.

This will require some adaptations (and cleanups) of many tests.

Async log interface on leader

When appending entries to the local log, the leader can handle the operation asynchronously while also sending AppendEntries messages to cut latency down.

Possible lost updates with partitions and membership changes

This code is all REAL new so I'm not confident that this isn't some bug in Jepsen--treat this as tentative for now. It looks like we can lose short windows of updates when a membership change occurs during a network partition. This occurred with redis f88f866 (redisraft d589127), and proxy followers enabled.

For instance, in this run, a partition begins at 10:56:41, isolating n1 from all other nodes.

2020-03-06 10:56:41,004{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:start-partition	[:isolated {"n2" #{"n1"}, "n5" #{"n1"}, "n4" #{"n1"}, "n3" #{"n1"}, "n1" #{"n2" "n5" "n4" "n3"}}]

Process 34 appends 53 to the end of key 20, and happens to read it, observing [...48 50 53]:

2020-03-06 10:56:46,922{GMT}	INFO	[jepsen worker 4] jepsen.util: 34	:ok	:txn	[[:append 20 53] [:r 20 [12 21 22 25 39 29 30 31 42 43 44 45 46 47 48 50 53]] [:append 19 119]]
11:23

Process 34 appends 55

2020-03-06 10:56:49,171{GMT}	INFO	[jepsen worker 4] jepsen.util: 34	:ok	:txn	[[:append 19 120] [:r 19 [7 2 32 43 59 53 64 67 75 81 90 106 107 108 109 112 113 115 116 117 118 119 120]] [:append 20 55]]

We remove node n4 from the cluster. Both n1 and n4 believe they're currently leaders, so we ask n1 to do it. Note that n1 is presently isolated from the rest of the cluster, but returns "OK" for the part request. We then kill n4 and wipe its data files, which means the only node which could have recent data would be... n5.

2020-03-06 10:56:51,109{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:leave	"n4"
2020-03-06 10:56:51,222{GMT}	INFO	[jepsen nemesis] jepsen.redis.nemesis: Current membership
 ({:id 1583530776, :role :leader, :node "n1"}
 {:id 1211184293, :node "n5", :role :follower}
 {:id 1939788657, :node "n4", :role :leader})
...
2020-03-06 10:56:53,863{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:leave	["n4" {"n1" "OK"}]

Ops fail for a bit, then we join n3 to the cluster. Some nodes still think n4 is a part of the cluster--likely because we told n1 to remove it, and n1 hasn't had a chance to inform anyone else, because it's partitioned. Node n1 still thinks it's a leader.

2020-03-06 10:57:03,968{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:join	"n3"
...
2020-03-06 10:57:04,080{GMT}	INFO	[jepsen nemesis] jepsen.redis.nemesis: Current membership
 ({:id 1583530776, :role :leader, :node "n1"}
 {:id 1211184293, :node "n5", :role :candidate}
 {:id 1939788657, :node "n4"})
...
2020-03-06 10:57:05,496{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:join	["n3" {"n3" "OK"}]

Things go on in noleader/nocluster for a bit, until we resolve a network partition:

2020-03-06 10:57:15,497{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:stop-partition	nil
...
2020-03-06 10:57:15,702{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:stop-partition	:network-healed

We have more noleader/timeout/conn-refused failures, then process 50, just before the end of the test, manages to execute a read of key 20

2020-03-06 10:57:23,497{GMT}	INFO	[jepsen worker 0] jepsen.util: 50	:invoke	:txn	[[:r 22 nil] [:append 21 53] [:r 20 nil] [:append 21 54]]
...
2020-03-06 10:57:25,516{GMT}	INFO	[jepsen worker 0] jepsen.util: 50	:ok	:txn	[[:r 22 []] [:append 21 53] [:r 20 [12 21 22 25 39 29 30 31 42 43 44 45 46 47 48 63]] [:append 21 54]]

It observes key 20 = [... 48 63]--we lost the updates of 50 and 53!

I suspect, based on this behavior, that node removal is asynchronous; that OK response only indicates local leader acknowledgement of the membership change, NOT a transition to joint consensus mode. I'm not sure whether that's actually a problem or not, because even if we hadn't gotten to joint consensus yet, n4 is down, and should have been required for quorum operations on the original cohort. Maybe earlier nemesis changes got us into a messed-up state? I'll keep digging.

We encountered this problem using jepsen.redis 6f4253e905588f51eb1b1e186266607a39fdf635:

lein run test-all --raft-version d589127 -w append --time-limit 600 --nemesis-interval 10 --follower-proxy --nemesis member,partition

but I haven't been able to reproduce it since. Still exploring.

Stale reads in normal operation

It looks as though redis f88f866 & redisraft d589127 exhibit stale reads when processes are allowed to pause. For instance, take a look at this test run

lein run test-all --raft-version d589127 -w append --time-limit 120 --nemesis-interval 1 --nemesis pause --test-count 5

[Screenshot from 2020-03-09 21-45-58]

Let:
T1 = {:type :ok, :f :txn, :value [[:r 3 [10]]], :process 3, :time 17433307265, :index 1248}
T2 = {:type :ok, :f :txn, :value [[:r 3 [10]] [:r 3 [10]] [:append 3 17]], :process 4, :time 16707553261, :index 1182}

Then:

  • T1 < T2, because T1 did not observe T2's append of 17 to 3.
  • However, T2 < T1, because T2 completed at index 1182, 0.724 seconds before the invocation of T1, at index 1247: a contradiction!

uv_poll_stop: Assertion `!uv__is_closing(handle)' failed.

With redis f88f866 and redisraft 2d1cf30, testing with membership changes, node pauses, crashes, and partitions, occasionally nodes crash with the following assertion:

76761:09 May 10:22:10.877 <raftlib> received entry t:59 id: 580797296 idx: 4067
76761:09 May 10:22:10.882 node:590852739: raft.c:479: <raftlib> sending appendentries: ci:4067 comi:4066 t:59 lc:4066 pli:4066 plt:59 msgid:79 #1
76761:09 May 10:22:10.882 node:590852739: node.c:361: NodeAddPendingResponse: id=1126, type=raft, request_time=1589044930882
76761:09 May 10:22:10.882 node:98688395: raft.c:479: <raftlib> sending appendentries: ci:4067 comi:4066 t:59 lc:4066 pli:4066 plt:59 msgid:79 #1
76761:09 May 10:22:10.882 node:98688395: node.c:361: NodeAddPendingResponse: id=1127, type=raft, request_time=1589044930882
76761:09 May 10:22:10.882 node:1029979987: node.c:74: Node connection established.
redis-server: src/unix/poll.c:112: uv_poll_stop: Assertion `!uv__is_closing(handle)' failed.

This error looks recoverable; a subsequent restart went OK, and the rest of the cluster kept serving requests.

20200509T131903.000-0400.zip

To reproduce, use Jepsen.redis 6c063da0503e9430a0354d37da8a4da430d1671e, and run:

lein run test-all --raft-version 2d1cf30 --time-limit 600 --nemesis partition,kill,pause,member --follower-proxy --test-count 16 --concurrency 4n

Node.entries Makefile compilation issue

Hi. I've been getting these errors running make -B on the Makefile.
I'm not too familiar with the C language, so there might be something I'm missing here.

node.c:164:5: warning: implicit declaration of function 'LIST_FOREACH_SAFE'; did you mean 'LIST_FOREACH'? [-Wimplicit-function-declaration]
     LIST_FOREACH_SAFE(node, &node_list, entries, tmp) {
     ^~~~~~~~~~~~~~~~~
     LIST_FOREACH
node.c:164:41: error: 'entries' undeclared (first use in this function)
     LIST_FOREACH_SAFE(node, &node_list, entries, tmp) {
                                         ^~~~~~~
node.c:164:41: note: each undeclared identifier is reported only once for each function it appears in
node.c:164:55: error: expected ';' before '{' token
     LIST_FOREACH_SAFE(node, &node_list, entries, tmp) {
                                                       ^
<builtin>: recipe for target 'node.o' failed
make: *** [node.o] Error 1

Built with:

git submodule init
git submodule update
make -B
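
For what it's worth, LIST_FOREACH_SAFE is a BSD extension to sys/queue.h that glibc's header does not define, which is a likely cause of this error. Assuming the standard header location, a quick check on the build machine would be:

grep -q LIST_FOREACH_SAFE /usr/include/sys/queue.h && echo "LIST_FOREACH_SAFE is defined" || echo "LIST_FOREACH_SAFE not provided by the system queue.h"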

Adding a node with the same id as a previously existing node should be forbidden

After creating a cluster with 2 nodes, running "RAFT.NODE remove 2", and then starting another server (with a new log) using uid 2, attempting to run "RAFT.CLUSTER JOIN localhost:5001" crashes the new server with the error:
"Assertion failed: (node), function raft_handle_append_cfg_change, file src/raft_server.c, line 1227."

Resolve ambiguous type conversions with the help of clang-tidy

Currently there are many implicit type conversions in the code, e.g. between unsigned long and int.
When compiling with clang, the behavior is sometimes different than with gcc (such as the recently fixed unsigned int connect_oks, which was changed to the correct unsigned long int).

We should make sure all remaining such issues, as identified by clang-tidy, are fixed.

It would be even better to create typedefs for commonly used integer types (like size_t for example) to avoid this in the future.
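
As a sketch, these conversions can usually be surfaced with clang-tidy's narrowing-conversion checks, given a compilation database generated by CMake (the paths and check selection here are illustrative, not a project convention):

cd build
cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON ..
clang-tidy -p . --checks='-*,bugprone-narrowing-conversions' ../*.c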

Total data loss on failover

With Redis f88f866 and Redisraft 1b3fbf6, every failover results in the complete loss of all data in the cluster. This can occur in healthy clusters due to sporadic leader elections, or be induced by pausing or killing nodes, network partitions, etc. redis monitor does show leader-executed commands on followers, but I'm not clear whether that's being proxied from the Raft leader, replicated but not actually applied to the local state machine, applied but then destroyed somehow, or what.

For example:

$ ssh n1 /opt/redis/redis-cli set x foo
OK
$ ssh n1 /opt/redis/redis-cli set y bar
OK
$ ssh n1 /opt/redis/redis-cli set z baz
OK
$ ssh n1 /opt/redis/redis-cli set t xyzzy

^C

Here, a leader election occurred during our set of t. We retry on n2...

$ ssh n1 /opt/redis/redis-cli set t xyzzy
MOVED 192.168.122.12:6379

$ ssh n2 /opt/redis/redis-cli set t xyzzy
OK

And lo and behold, all the keys we wrote before are gone.

$ ssh n2 /opt/redis/redis-cli keys "'*'"
t

If we kill n2 as a leader, even that disappears.

$ ssh n2 killall -9 redis-server
$ ssh n3 /opt/redis/redis-cli keys "'*'"
__raft_snapshot_info__

Add backtrace support

When an error value is returned (e.g. RAFT_ERR_SHUTDOWN) it's hard to know where exactly it came from. Some places (especially in the libraft code) log an explicit message before returning the error, but not all of them.

We should add backtrace support (like Redis itself has) to make it easier to debug failures.

Split brain & lost update with mixed faults

In version f88f866, process pauses, crashes, partitions, and membership changes can reliably put clusters into what looks like a split-brain state leading to apparent data loss. Take, for example, this history.

A little bit before the anomaly, we start a partition, isolating [n1 n2 n5 | n3 n4]

2020-03-18 00:48:27,685{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :start-partition        :majority
...
2020-03-18 00:48:27,788{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :start-partition        [:isolated {"n4" #{"n2" "n5" "n1"}, "n3" #{"n2" "n5" "n1"}, "n2" #{"n4" "n3"}, "n5" #{"n4" "n3"}, "n1" #{"n4" "n3"}}]

We kill and restart all nodes:

2020-03-18 00:48:38,321{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :kill   :all
...
2020-03-18 00:48:40,527{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :kill   {"n1" "", "n2" "", "n3" "", "n4" "", "n5" ""}
...
2020-03-18 00:48:42,528{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :start  :all
...
2020-03-18 00:48:42,735{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :start  {"n1" "", "n2" "", "n3" "", "n4" "", "n5" ""}

Now watch what happens to key 81:

2020-03-18 00:48:43,740{GMT}    INFO    [jepsen worker 14] jepsen.util: 239     :ok     :txn    [[:r 82 [69 70]] [:r 81 [176 177 178]]]
...
2020-03-18 00:48:43,808{GMT}    INFO    [jepsen worker 15] jepsen.util: 265     :ok     :txn    [[:r 81 [171 172 176 177 178]]]
...
2020-03-18 00:48:44,222{GMT}    INFO    [jepsen worker 14] jepsen.util: 239     :ok     :txn    [[:append 82 125] [:r 81 [176 177 178 208]]]
...
2020-03-18 00:48:44,335{GMT}    INFO    [jepsen worker 5] jepsen.util: 230      :ok     :txn    [[:r 82 [68 71 72 102 125]] [:r 81 [171 172 176 177 178 208]] [:r 81 [171 172 176 177 178 208]]]
...
2020-03-18 00:48:44,343{GMT}    INFO    [jepsen worker 10] jepsen.util: 285     :ok     :txn    [[:r 81 [171 172 176 177 178 208]] [:r 83 [4]] [:r 83 [4]] [:r 81 [171 172 176 177 178 208]]]

Worker 14, talking to n5, observed key 81 as [176 177 178], but worker 15, on node n1, saw [171 172 176 177 178]. It kinda looks like n5 somehow... lost the writes of 171 and 172, but managed to apply later updates of 176, 177, and 178. Both n1 and n5 go on to append 208, which suggests that somehow both nodes are still alive, communicating, and accepting the same writes, even though their state has diverged. REAL weird.

Same thing happens with key 82, at the same time.

2020-03-18 00:48:43,740{GMT}    INFO    [jepsen worker 14] jepsen.util: 239     :ok     :txn    [[:r 82 [69 70]] [:r 81 [176 177 178]]]
...
2020-03-18 00:48:44,020{GMT}    INFO    [jepsen worker 14] jepsen.util: 239     :ok     :txn    [[:r 83 [4]] [:r 82 [69 70 102]]]
...
2020-03-18 00:48:44,335{GMT}    INFO    [jepsen worker 5] jepsen.util: 230      :ok     :txn    [[:r 82 [68 71 72 102 125]] [:r 81 [171 172 176 177 178 208]] [:r 81 [171 172 176 177 178 208]]]

Acknowledged writes of 69, 70, and 102 (all performed by process 14, on node n5) were visible to process 14, but not to any other process. Every subsequent operation observed some extension of [68 71 72 102 125 ...], without 69, 70, or 102. Something feels off here!

Logging cleanup

Need to have a round of logging cleanups:

  • Unify log prefix with Redis format for easier parsing
  • Some missing newlines
  • Re-consider using RM_Log()?

Crossed wires with follower-proxy

It looks like clusters with follower-proxy, network partitions, and crashes (or possibly fewer faults; I'm trying to narrow it down) can occasionally get into states where one specific Redis node gets mixed up and sends inappropriate responses to clients. I think what we're seeing is responses intended for other clients getting dispatched to the wrong place.

lein run test-all --raft-version dfd91d4 -w append --time-limit 600 --nemesis-interval 1 --nemesis standard --test-count 10 --concurrency 5n --follower-proxy

20200313T233858.000-0400.zip

We resolve a network partition:

2020-03-13 23:41:36,284{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :stop-partition nil

The first sign of weirdness comes from worker 5, executing process 130. It tries to execute a MULTI transaction, opens a fresh connection, and gets a single "211" instead of a vector of results from an LRANGE command.

2020-03-13 23:41:41,025{GMT}    INFO    [jepsen worker 5] jepsen.util: 130      :invoke :txn    [[:r 47 nil] [:r 47 nil] [:append 46 215]]
2020-03-13 23:41:41,026{GMT}    INFO    [jepsen worker 5] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.So
cket 0x5c701a81 Socket[addr=n1/192.168.122.11,port=6379,localport=41566]], :spec {:host n1, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x73be0bac java.io.DataInputStream@73be0bac], :out #object[java.io.BufferedOutputStream 0x6900a5e8 java.io.BufferedOutputStream@6900a5e8]}}, :in-txn? #object[clojure.lang.Atom 0x7407179f {:status :ready, :val false}], :spec {:host n1, :port 6379, :timeout-
ms 10000}}
2020-03-13 23:41:41,117{GMT}    WARN    [jepsen worker 5] jepsen.core: Process 130 crashed
clojure.lang.ExceptionInfo: throw+: {:type :unexpected-read-type, :key 47, :value "211"}

Concurrently, worker 15/process 140 does the same thing:

2020-03-13 23:41:41,045{GMT}    INFO    [jepsen worker 15] jepsen.util: 140     :invoke :txn    [[:r 46 nil] [:append 45 226] [:append 46 216] [:append 46 217]]
2020-03-13 23:41:41,046{GMT}    INFO    [jepsen worker 15] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.S
ocket 0x2925d9b7 Socket[addr=n1/192.168.122.11,port=6379,localport=41580]], :spec {:host n1, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x12c9c95b java.io.DataInputStream@12c9c95b], :out #object[java.io.BufferedOutputStream 0x15deb95b java.io.BufferedOutputStream@15deb95b]}}, :in-txn? #object[clojure.lang.Atom 0x747af9c6 {:status :ready, :val false}], :spec {:host n1, :port 6379, :timeout
-ms 10000}}
2020-03-13 23:41:41,123{GMT}    WARN    [jepsen worker 15] jepsen.core: Process 140 crashed
clojure.lang.ExceptionInfo: throw+: {:type :unexpected-read-type, :key 46, :value "211"}

It's particularly weird that both of them got "211" here.

Concurrently, process 20/worker 145 does a NON multi read--just a regular old single LRANGE by itself, on a fresh connection, and gets [2 3 ["226"]] which is... VERY weird.

2020-03-13 23:41:41,107{GMT}    INFO    [jepsen worker 20] jepsen.util: 145     :invoke :txn    [[:r 46 nil]]
2020-03-13 23:41:41,108{GMT}    INFO    [jepsen worker 20] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.S
ocket 0x6ec17aa1 Socket[addr=n1/192.168.122.11,port=6379,localport=41584]], :spec {:host n1, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x530146a8 java.io.DataInputStream@530146a8], :out #object[java.io.BufferedOutputStream 0xcdfc07b java.io.BufferedOutputStream@cdfc07b]}}, :in-txn? #object[clojure.lang.Atom 0x3c20d3b9 {:status :ready, :val false}], :spec {:host n1, :port 6379, :timeout-m
s 10000}}
2020-03-13 23:41:41,257{GMT}    WARN    [jepsen worker 20] jepsen.core: Process 145 crashed
clojure.lang.ExceptionInfo: throw+: {:type :unexpected-read-type, :key 46, :value [2 3 ["226"]]}

Same thing happens to worker 10's request with process 160. Fresh connection, a MULTI transaction, and it gets the vector [2 3]--it is a vector, but it's of integers, not strings, like we expected. Fresh connection here too.

2020-03-13 23:41:40,927{GMT}    INFO    [jepsen worker 10] jepsen.util: 160     :invoke :txn    [[:r 47 nil] [:append 46 211] [:append 47 111] [:append 46 212]]
2020-03-13 23:41:40,928{GMT}    INFO    [jepsen worker 10] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.S
ocket 0x274a045 Socket[addr=n1/192.168.122.11,port=6379,localport=41548]], :spec {:host n1, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x657968b9 java.io.DataInputStream@657968b9], :out #object[java.io.BufferedOutputStream 0x6f11d2be java.io.BufferedOutputStream@6f11d2be]}}, :in-txn? #object[clojure.lang.Atom 0x29e75eed {:status :ready, :val false}], :spec {:host n1, :port 6379, :timeout-
ms 10000}}
2020-03-13 23:41:41,308{GMT}    WARN    [jepsen worker 10] jepsen.core: Process 160 crashed
clojure.lang.ExceptionInfo: throw+: {:type :unexpected-read-type, :key 45, :value [2 3]}

Immediately after this, we detect the completion of a node remove operation that we'd been waiting on for a while: node n1 finishes removing n5 from the cluster.

2020-03-13 23:41:41,314{GMT}    INFO    [clojure-agent-send-off-pool-10] jepsen.redis.db: :done-waiting-for-node-removal {:raft
 {:role :follower,
  :num_voting_nodes "2",
  :leader_id 813301272,
  :is_voting "yes",
  :node_id 1597309644,
  :num_nodes 2,
  :state :up,
  :nodes
  ({:id 813301272,
    :state "connected",
    :voting "yes",
    :addr "192.168.122.12",
    :port 6379,
    :last_conn_secs 94,
    :conn_errors 0,
    :conn_oks 1}),
  :current_term 114},
 :log
 {:log_entries 1243,
  :current_index 1243,
  :commit_index 1243,
  :last_applied_index 1243,
  :file_size 173406,
  :cache_memory_size 136972,
  :cache_entries 1090},
 :snapshot {:snapshot_in_progress "no"},
 :clients {:clients_in_multi_state 0}}

Node n2 starts removing n3:

2020-03-13 23:41:41,349{GMT}    INFO    [jepsen node n2] jepsen.redis.db: n2 :removing n3 (id: 859783830)

Worker 5, process 180, hits a type error as well--it expects a list of elements from an LRANGE, but gets a single Long instead.

2020-03-13 23:41:41,322{GMT}    INFO    [jepsen worker 5] jepsen.util: 180      :invoke :txn    [[:r 46 nil] [:append 47 135] [:append 47 136] [:r 46 nil]]
2020-03-13 23:41:41,323{GMT}    INFO    [jepsen worker 5] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.Socket 0x7a37e654 Socket[addr=n1/192.168.122.11,port=6379,localport=41632]], :spec {:host n1, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x1fcb0d1f java.io.DataInputStream@1fcb0d1f], :out #object[java.io.BufferedOutputStream 0x24bb596f java.io.BufferedOutputStream@24bb596f]}}, :in-txn? #object[clojure.lang.Atom 0x335f2349 {:status :ready, :val false}], :spec {:host n1, :port 6379, :timeout-ms 10000}}
2020-03-13 23:41:41,365{GMT}    WARN    [jepsen worker 5] jepsen.core: Process 180 crashed
java.lang.IllegalArgumentException: Don't know how to create ISeq from: java.lang.Long

Worker 0, executing process 200, goes to perform a MULTI transaction. It opens a new connection to do so...

2020-03-13 23:41:42,021{GMT}    INFO    [jepsen worker 0] jepsen.util: 200      :invoke :txn    [[:r 49 nil] [:append 48 21]]
2020-03-13 23:41:42,021{GMT}    INFO    [jepsen worker 0] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.Socket 0x25b8659b Socket[addr=n1/192.168.122.11,port=6379,localport=41722]], :spec {:host n1, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x55c08913 java.io.DataInputStream@55c08913], :out #object[java.io.BufferedOutputStream 0x4e607e5b java.io.BufferedOutputStream@4e607e5b]}}, :in-txn? #object[clojure.lang.Atom 0x60177bc0 {:status :ready, :val false}], :spec {:host n1, :port 6379, :timeout-ms 10000}}

That request succeeds with what looks like normal results.

2020-03-13 23:41:42,121{GMT}    INFO    [jepsen worker 0] jepsen.util: 200      :ok     :txn    [[:r 49 [111 130 131 132 133 135 136 137 138 153 154 156 166 169 170 175 176 179 180 181]] [:append 48 21]]

Weirdly, this conflicts with other reads of key 49, so I'm not sure if we got the right thing here or not.

Worker 0 then tries to perform a single read of key 48. This doesn't involve an EXEC, so it should have returned a single list of elements. Instead it gets a vector of vectors, which... what?

2020-03-13 23:41:42,257{GMT}    INFO    [jepsen worker 0] jepsen.util: 200      :invoke :txn    [[:r 48 nil]]
...
2020-03-13 23:41:42,373{GMT}    WARN    [jepsen worker 0] jepsen.core: Process 200 crashed
clojure.lang.ExceptionInfo: throw+: {:type :unexpected-read-type, :key 48, :value [["2" "4" "5" "6" "8" "9" "14" "15" "17" "21"] ["111" "130" "131" "132" "133" "135" "136" "137" "138" "153" "154" "156" "166" "169" "170" "175" "176" "179" "180" "181" "194"] ["3" "11" "12" "13" "21"]]}

Eyeballing the vectors it returned, it looks like this is a read of keys [? 49 ?] respectively.

Then worker 0 (logical process 225) goes to perform a transaction:

2020-03-13 23:41:42,446{GMT}    INFO    [jepsen worker 0] jepsen.util: 225      :invoke :txn    [[:r 49 nil] [:r 48 nil] [:r 47 nil] [:append 47 211]]

225 is a new process, so it opens a fresh connection to perform its first request. Just to confirm, yes, this is a new port and connection object!

2020-03-13 23:41:42,447{GMT}    INFO    [jepsen worker 0] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.Socket 0x417158d7 Socket[addr=n1/192.168.122.11,port=6379,localport=41762]], :spec {:host n1, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x6d4fbc73 java.io.DataInputStream@6d4fbc73], :out #object[java.io.BufferedOutputStream 0x5c7b23f7 java.io.BufferedOutputStream@5c7b23f7]}}, :in-txn? #object[clojure.lang.Atom 0x59d46b31 {:status :ready, :val false}], :spec {:host n1, :port 6379, :timeout-ms 10000}}

It performs a MULTI, then a read (LRANGE) of key 49, read of 48, read of 47, and appends (RPUSH) 211 to key 47. It EXECs the transaction, and gets back a vector of responses. The first response should have been a list of values for key 49, but instead we obtained "2", which... what?

2020-03-13 23:41:42,552{GMT}    WARN    [jepsen worker 0] jepsen.core: Process 225 crashed
clojure.lang.ExceptionInfo: throw+: {:type :unexpected-read-type, :key 49, :value "2"}
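
For reference, the transaction this client runs is roughly equivalent to the following redis-cli session (a sketch only; the real test drives Redis through the Carmine client, and the key names here are the test's logical keys):

redis-cli -h n1 -p 6379
MULTI
LRANGE 49 0 -1
LRANGE 48 0 -1
LRANGE 47 0 -1
RPUSH 47 211
EXEC

A healthy EXEC reply would be a four-element array: three lists of string elements, plus an integer giving the new length of key 47. Getting a bare "2" back for the read of key 49 is what trips the :unexpected-read-type check.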

Later, we'll read key 49 again and get a weird response:

2020-03-13 23:41:41,979{GMT}    INFO    [jepsen worker 20] jepsen.util: 270     :ok     :txn    [[:append 49 5] [:r 47 [2 4 5 6]] [:r 49 [3 11 12]] [:append 49 6]]

This read is clearly messed up because we just appended 5 to key 49, and didn't observe it. Also, the append of 12 to key 49 happened after this, and failed. It definitely can't be reading 49.

I feel like this points to wires getting crossed either inside Redis or the Carmine library--or maybe I'm somehow STILL using the library wrong. I think... what I should do next is get a Wireshark dump of the traffic back and forth, and try to figure out what's going on at a protocol level.
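
If I do go the packet-capture route, something like this on the node (assuming tcpdump is available there) should be enough to capture the RESP traffic for later inspection in Wireshark:

tcpdump -i any -w /tmp/redis-6379.pcap 'tcp port 6379'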

Performance testing problem

Hi,
I used a cluster of three machines, each with disk IOPS of 1432, but I only measured 50-100 ops/s for RedisRaft SET operations. Is this normal?

Loss of committed writes, duplicate elements, dueling histories

With process crashes and network partitions, redis f88f866 (redisraft 2d1cf30) can occasionally lose operations which were visible to subsequent reads and writes. This bug might be able to cause the loss of acknowledged writes, but I can't tell for sure from this history, because the write in question returned an indeterminate (crashed) result rather than being acknowledged successfully.

20200510T091809.000-0400.zip

Process 526 appends 25 to key 297, talking to node n2, which results in an indeterminate failure when the connection hits EOF.

2020-05-10 09:26:55,332{GMT}    INFO    [jepsen worker 6] jepsen.util: 526      :invoke :txn    [[:r 298 nil] [:r 298 nil] [:append 297 25] [:r 293 nil]]
2020-05-10 09:26:55,386{GMT}    INFO    [jepsen worker 6] jepsen.util: 526      :info   :txn    [[:r 298 nil] [:r 298 nil] [:append 297 25] [:r 293 nil]]    :eof

A read on n4 observes element 25 six seconds later:

2020-05-10 09:27:01,223{GMT}    INFO    [jepsen worker 8] jepsen.util: 548      :ok     :txn    [[:append 298 92] [:r 297 [1 3 2 4 6 5 7 8 9 10 11 12 13 14 15 16 17 18 20 21 19 22 23 24 26 25]] [:append 298 93] [:append 293 106]]

n2 confirms:

2020-05-10 09:27:01,672{GMT}    INFO    [jepsen worker 1] jepsen.util: 521      :ok     :txn    [[:append 298 127] [:r 297 [1 3 2 4 6 5 7 8 9 10 11 12 13 14 15 16 17 18 20 21 19 22 23 24 26 25 27]] [:r 299 [1 2 3 4 5 6 7 8 9 10 11 13 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 34 33 35 36 37 38 39 40 44 54 55 56 57]]]

As does n1:

2020-05-10 09:27:03,843{GMT}    INFO    [jepsen worker 10] jepsen.util: 570     :ok     :txn    [[:r 300 [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 27 28 30 31 34 35 36 37 39 40 41 42 43 44 46 47 48 50 51 54 57 58 62 63 64 65 66 67 69 70 74 76 78 80 81 82 83 84 85 87 91 93 94 96 103 104 105 107 108 109]] [:append 300 110] [:r 297 [1 3 2 4 6 5 7 8 9 10 11 12 13 14 15 16 17 18 20 21 19 22 23 24 26 25 27 29 32 35 36 37 39 40 41 42 44 45 46 47]] [:append 300 111]]

The final read of 25 completes against n2 just under two seconds later.

2020-05-10 09:27:03,859{GMT}    INFO    [jepsen worker 11] jepsen.util: 511     :ok     :txn    [[:r 299 [1 2 3 4 5 6 7 8 9 10 11 13 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 34 33 35 36 37 38 39 40 44 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 78 79 82 83 84 86 89 90 91 92 96 99 101 105 107 108 109 110 111 112 113 118]] [:append 299 119] [:r 297 [1 3 2 4 6 5 7 8 9 10 11 12 13 14 15 16 17 18 20 21 19 22 23 24 26 25 27 29 32 35 36 37 39 40 41 42 44 45 46 47 48]] [:r 300 [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 27 28 30 31 34 35 36 37 39 40 41 42 43 44 46 47 48 50 51 54 57 58 62 63 64 65 66 67 69 70 74 76 78 80 81 82 83 84 85 87 91 93 94 96 103 104 105 107 108 109 110 111 112 113 114 116 117]]]

We go through a period of noleader, notleader, and EOFs as jepsen kills n4:

2020-05-10 09:27:03,776{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :kill   :primaries
...
2020-05-10 09:27:06,085{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :kill   {"n4" ""}

And restarts all nodes three seconds later:

2020-05-10 09:27:09,085{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :start  :all
2020-05-10 09:27:09,291{GMT}    INFO    [jepsen nemesis] jepsen.util: :nemesis  :info   :start  {"n1" "", "n2" "", "n3" "", "n4" "", "n5" ""}

Node n4's startup looks fairly normal, but immediately after startup, we're able to perform our first read of key 297, and it turns out that the write of 25 has been deleted from n1's history. This is especially weird, because elements which were written after 25 are still present!

2020-05-10 09:27:09,828{GMT}    INFO    [jepsen worker 0] jepsen.util: 560      :ok     :txn    [[:r 297 [1 3 2 4 6 5 7 8 9 10 11 12 13 14 15 16 17 18 20 21 19 22 23 24 26 27 29 32 35 36 37 39 40 41 42 44 45 46 47 48 106]] [:r 306 [110]] [:r 308 [100 103]]]

It's also gone from n2:

2020-05-10 09:27:10,073{GMT}    INFO    [jepsen worker 16] jepsen.util: 576     :ok     :txn    [[:r 308 [100 103 109 110 111 112 115 116 117 118 120 121]] [:append 289 87] [:r 297 [1 3 2 4 6 5 7 8 9 10 11 12 13 14 15 16 17 18 20 21 19 22 23 24 26 27 29 32 35 36 37 39 40 41 42 44 45 46 47 48 106]] [:r 307 [30 31 33]]]

And n4:

2020-05-10 09:27:10,984{GMT}    INFO    [jepsen worker 18] jepsen.util: 538     :ok     :txn    [[:r 309 [1 2 4 5 6 9 10 13 14 15 16 19 17 20 21 22 23 26 27 28 29 31 35 42 44 45 52 59 60 63 64 67 69 70 72 73 75 80 81]] [:r 297 [1 3 2 4 6 5 7 8 9 10 11 12 13 14 15 16 17 18 20 21 19 22 23 24 26 27 29 32 35 36 37 39 40 41 42 44 45 46 47 48 106 113 114 117 118 119 120]] [:r 309 [1 2 4 5 6 9 10 13 14 15 16 19 17 20 21 22 23 26 27 28 29 31 35 42 44 45 52 59 60 63 64 67 69 70 72 73 75 80 81]] [:r 309 [1 2 4 5 6 9 10 13 14 15 16 19 17 20 21 22 23 26 27 28 29 31 35 42 44 45 52 59 60 63 64 67 69 70 72 73 75 80 81]]]

So... not only did the node we wrote to forget about the write, but every node that we know saw the write forgot about it as well--but they forgot about it in a way that somehow preserved later writes. Since they're proxying, this suggests we've got two primaries with competing histories.

Two subsequent restarts of n4 go fine, but forty seconds later, restarting Redis on n4 leads to a crash with an assertion error about an appendEntries term which conflicts with a committed entry; I've filed that bug as #53, and it might be some kind of delayed manifestation of whatever caused this anomaly.

You can reproduce this problem on Jepsen.redis ddb68c8961a8f2b1cb0b37f263bd1eb0930ab5b8 by running:

lein run test-all --raft-version 2d1cf30 --time-limit 600 --nemesis partition,kill --follower-proxy --test-count 40 --concurrency 4n

Transient empty values on process restart

It looks as though redis-raft (redis f88f866, redisraft d589127) can return empty reads when processes crash and restart. For instance, in this test, killing all nodes, then restarting all nodes, then killing n1, n3, and n4, allows n2, five seconds later, to return an empty list [] for key 0, rather than [4 5 7 26 30 31 48 49 84 86 87 89]:

2020-03-09 17:50:31,885{GMT}	INFO	[jepsen worker 0] jepsen.util: 0	:ok	:txn	[[:append 0 89] [:r 0 [4 5 7 26 30 31 48 49 84 86 87 89]]]
2020-03-09 17:50:32,155{GMT}	INFO	[jepsen worker 0] jepsen.util: 0	:ok	:txn	[[:append 0 97]]
2020-03-09 17:50:32,157{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:kill	:all
2020-03-09 17:50:34,368{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:kill	{"n1" "", "n2" "", "n3" "", "n4" "", "n5" ""}
2020-03-09 17:50:35,368{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:start	:all
2020-03-09 17:50:35,575{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:start	{"n1" "", "n2" "", "n3" "", "n4" "", "n5" ""}
2020-03-09 17:50:36,575{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:kill	:majority
2020-03-09 17:50:38,784{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:kill	{"n1" "", "n3" "", "n4" ""}
2020-03-09 17:50:39,784{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:start	:all
2020-03-09 17:50:39,991{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:start	{"n1" "", "n2" "", "n3" "", "n4" "", "n5" ""}
2020-03-09 17:50:40,418{GMT}	INFO	[jepsen worker 1] jepsen.util: 6	:ok	:txn	[[:r 0 []]]

The correct values show up in immediately following reads on the same node, so I don't think the data is gone per se. It might just be that Redis is allowed to serve requests before Raft initialization has completed?

2020-03-09 17:50:40,662{GMT}	INFO	[jepsen worker 1] jepsen.util: 6	:ok	:txn	[[:append 2 115] [:r 1 [14 20 29 30 37 38 44 47 50 52 53 54 56 77 82 96 97 129]] [:r 0 [4 5 7 26 30 31 48 49 84 86 87 89 97]] [:r 0 [4 5 7 26 30 31 48 49 84 86 87 89 97]]]

raft_node_set_voting: Assertion `raft_node_is_voting(me_)' failed.

Think I found a crash with redis f88f866 and redis-raft 2d1cf30: 20200509T103735.000-0400.zip. In this case, node n2 crashed while running, logging:

32254:09 May 07:38:22.157 node:1328432520: raft.c:479: <raftlib> received appendentries response SUCCESS ci:1566 rci:1563 msgid:71
32254:09 May 07:38:22.157 node:93331471: node.c:377: NodeDismissPendingResponse: id=1341, type=raft, latency=57
32254:09 May 07:38:22.157 node:93331471: raft.c:479: <raftlib> received appendentries response SUCCESS ci:1566 rci:1563 msgid:71
32254:09 May 07:38:22.157 <raftlib> received entry t:8 id: 905278827 idx: 1567
32254:09 May 07:38:22.174 <raftlib> received entry t:8 id: 15706788 idx: 1568
32254:09 May 07:38:22.194 <raftlib> received entry t:8 id: 848016822 idx: 1569
redis-server: src/raft_node.c:124: raft_node_set_voting: Assertion `raft_node_is_voting(me_)' failed.

On restarting the node about a minute later, it crashed again:

32521:09 May 07:39:20.060 Snapshot configuration loaded. Raft state:
32521:09 May 07:39:20.060   node <unknown?>
32521:09 May 07:39:20.060   node id=93331471,addr=192.168.122.11,port=6379
32521:09 May 07:39:20.060   node id=1328432520,addr=192.168.122.14,port=6379
32521:09 May 07:39:20.060   node id=1645702678,addr=192.168.122.13,port=6379
redis-server: src/raft_node.c:124: raft_node_set_voting: Assertion `raft_node_is_voting(me_)' failed.

A subsequent restart, roughly 36 seconds later, looks to have been successful, so maybe this was a transient problem imposed by... other nodes' states? Network conditions?

You can reproduce this with jepsen.redis 6c063da0503e9430a0354d37da8a4da430d1671e by running:

lein run test-all --raft-version 2d1cf30 --time-limit 600 --nemesis partition,kill,pause,member --follower-proxy --test-count 16 --concurrency 4n

Spurious NOLEADER in healthy clusters

I'm not entirely sure what's up with this case--it looks like the cluster started up and formed normally, we joined all nodes to n1, n1 was the leader, then... immediately thereafter, n1 started insisting that there was no leader, and rejecting requests, while the rest of the cluster went on normally.

20200311T174200.000-0400.zip

(Attached: latency-raw plot.)

To start, n1 is the leader:

 ({:id 958289600, :role :leader, :node "n1"}
 {:id 183312529, :node "n2", :role :follower}
 {:id 561617232, :node "n3", :role :follower}
 {:id 314213252, :node "n4", :role :follower}
 {:id 1472507146, :node "n5", :role :follower})

n1 executes a few transactions:

2020-03-11 17:55:40,336{GMT}	INFO	[jepsen worker 15] jepsen.util: 15	:ok	:txn	[[:append 2 2] [:r 2 [2]] [:r 1 []]]

But a couple seconds later, it flips to NOTLEADER, then NOLEADER:

2020-03-11 17:42:13,330{GMT}	INFO	[jepsen worker 15] jepsen.util: 15	:fail	:txn	[[:r 2 nil]]	:notleader
2020-03-11 17:42:13,330{GMT}	INFO	[jepsen worker 30] jepsen.util: 30	:fail	:txn	[[:r 1 nil]]	:notleader
2020-03-11 17:42:13,336{GMT}	INFO	[jepsen worker 40] jepsen.util: 40	:fail	:txn	[[:r 0 nil] [:r 1 nil] [:r 0 nil] [:append 2 23]]	:noleader
2020-03-11 17:42:13,336{GMT}	INFO	[jepsen worker 25] jepsen.util: 25	:fail	:txn	[[:append 1 3] [:append 0 5]]	:noleader

After a few hundred seconds of this, we start getting socket timeouts on n1:

2020-03-11 17:42:22,466{GMT}	INFO	[jepsen worker 43] jepsen.util: 43	:info	:txn	[[:append 2 4]]	:socket-timeout

Is it possible for redisraft to work with other Redis modules?

I have another Redis module that extends Redis with my own business logic.

Is it possible to load the two modules together, like this:

redis-server --load redisraft x-module


It seems we can't, because the two modules each register their own commands. What's the best practice for combining redisraft with my business module?
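
For what it's worth, Redis itself accepts multiple --loadmodule arguments, so the invocation is not the obstacle. A minimal sketch (the module paths are placeholders):

redis-server --port 6379 \
    --loadmodule <path-to>/redisraft.so \
    --loadmodule <path-to>/x-module.so

Whether the modules then cooperate is a separate question: as noted above, redisraft only replicates the commands it knows about, so commands registered by another module would presumably bypass the Raft log.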

Store data on only 3 nodes

Let's assume we have a RedisRaft cluster with 9 nodes. Currently the same data will be saved on all 9 nodes, or am I wrong?
Do you plan to let users change the replication factor?
I think for most users it would be enough for the data to be stored on 3 nodes; this way a cluster could store much more data.

Server crashes with callRaftPeriodic: Assertion `ret == 0' failed.

Under normal operation (e.g. without membership changes, partitions, etc) and with proxy mode enabled, redis f88f866 (redis-raft b9ee410) servers can spontaneously crash, logging:

22371:05 May 05:46:37.558 node:1060682460: raft.c:342: raft_recv_appendentries_response failed, error -2
22464:C 05 May 2020 05:46:37.560 * DB saved on disk
22371:05 May 05:46:37.649 Snapshot created, 4 log entries rewritten to log.
22371:05 May 05:46:37.649 Snapshot operation completed successfuly.
22371:05 May 05:46:37.654 Log rewrite complete, 0 entries appended (from idx 9174).
22371:05 May 05:46:37.655 <raftlib> end snapshot base:9173 commit-index:9174 current-index:9178
22371:05 May 05:46:38.745 node:319092381: raft.c:479: <raftlib> recvd appendentries t:11 ci:9178 lc:9176 pli:9174 plt:9 #2
22371:05 May 05:46:38.748 <raftlib> becoming follower
22371:05 May 05:46:38.748 State change: Node is now a follower, term 11
22371:05 May 05:46:38.748 <raftlib> randomize election timeout to 1922
22371:05 May 05:46:38.748 node:319092381: node.c:361: NodeAddPendingResponse: id=14638, type=proxy, request_time=1588682798748
22371:05 May 05:46:38.748 node:1598587288: raft.c:479: <raftlib> node requested vote: 1585459200 replying: not granted
22371:05 May 05:46:38.748 node:1060682460: raft.c:479: <raftlib> node requested vote: 1585459296 replying: not granted
22371:05 May 05:46:38.748 node:201926373: raft.c:479: <raftlib> node requested vote: 1640037424 replying: not granted
22371:05 May 05:46:38.755 node:319092381: node.c:377: NodeDismissPendingResponse: id=14638, type=proxy, latency=7
22371:05 May 05:46:38.795 node:319092381: node.c:361: NodeAddPendingResponse: id=14639, type=proxy, request_time=1588682798795
22371:05 May 05:46:38.806 node:319092381: node.c:377: NodeDismissPendingResponse: id=14639, type=proxy, latency=11
22371:05 May 05:46:38.848 node:319092381: raft.c:479: <raftlib> recvd appendentries t:11 ci:9178 lc:9179 pli:9174 plt:9 #5
redis-server: raft.c:784: callRaftPeriodic: Assertion `ret == 0' failed.

It looks like this node was a leader at 05:45:16, took a snapshot, discovered another node won an election and stepped down, just prior to the ret == 0 assertion failing. The node crashes at that point; active clients get an EOF on the socket, and subsequent connections are refused. Here's a complete log from Jepsen 6c063da0503e9430a0354d37da8a4da430d1671e:

lein run test-all --raft-version b9ee410 --time-limit 600 --nemesis none --follower-proxy --test-count 16

20200505T084148.000-0400.zip

This particular line suggests that maybe... the snapshot process exited with some sort of unexpected state. I wonder if maybe the stepdown occurring immediately after/during the snapshot might have been to blame...

assert(ret == 0);

Possible segfault?

With redis f88f866 and redis-raft 2d1cf30, tests with process pauses, crashes, network partitions, and membership changes could rarely cause nodes to exit with a nil assertion failure, no stacktrace, and signal 11, which is SIGSEGV, right?

20200509T141124.000-0400.zip

84255:09 May 11:14:05.742 node:1701768179: raft.c:479: <raftlib> recvd appendentries t:15 ci:4836 lc:4836 pli:4836 plt:15 #1
84255:09 May 11:14:05.750 node:1701768179: node.c:361: NodeAddPendingResponse: id=744, type=proxy, request_time=1589048045750
84255:09 May 11:14:05.757 node:1701768179: raft.c:479: <raftlib> recvd appendentries t:15 ci:4837 lc:4836 pli:4836 plt:15 #3
84255:09 May 11:14:05.770 node:964805122: node.c:74: Node connection established.
84255:09 May 11:14:05.770 node:644665335: node.c:74: Node connection established.


=== REDIS BUG REPORT START: Cut & paste starting from here ===
84255:M 09 May 2020 11:14:05.770 # Redis 999.999.999 crashed by signal: 11
84255:M 09 May 2020 11:14:05.770 # Crashed running the instruction at: 0x7fac51395854
84255:M 09 May 2020 11:14:05.770 # Accessing address: (nil)
84255:M 09 May 2020 11:14:05.770 # Failed assertion: <no assertion failed> (<no file>:0)

------ STACK TRACE ------

I recognize this is incredibly unhelpful, and I am SO sorry to file this. Maybe my cluster is cursed.

Performance improvement

Perform a round of benchmarking and profiling to come up with a performance improvement plan.
Some issues are already known; see #76 and #35.
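
As a starting point for the benchmarking pass, something along these lines might do (a sketch only; redis-benchmark ships with Redis, and this assumes a test node listening on the default port):

redis-benchmark -h 127.0.0.1 -p 6379 -t set,get -n 100000 -c 50

combined with profiling the redis-server process (e.g. with perf) while the load runs.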

RPUSH commands infinitely repeated with follower-proxy

With Redis f88f866 and Redisraft 1b3fbf6, executing a single RPUSH command against a fresh cluster creates an infinite loop, wherein that RPUSH command is executed repeatedly, gradually ballooning the size of the value, state machine, and log. It also seems to cause some (all?) redis-cli requests against the node which executed the RPUSH to hang indefinitely.

For example, try this command with a fresh cluster:

/opt/redis/redis-cli rpush "a-list" "x"
(integer) 1

redis-cli monitor shows the Raft log applying endless RPUSH operations:

1583259973.634023 [0 ?:0] "RAFT" "rpush" "a-list" "x"
1583259973.634067 [0 ?:0] "RAFT" "rpush" "a-list" "x"
1583259973.634087 [0 ?:0] "RAFT" "rpush" "a-list" "x"
1583259973.634111 [0 ?:0] "RAFT" "rpush" "a-list" "x"
1583259973.634153 [0 ?:0] "RAFT" "rpush" "a-list" "x"
1583259973.634181 [0 ?:0] "RAFT" "rpush" "a-list" "x"
1583259973.634208 [0 ?:0] "RAFT" "rpush" "a-list" "x"

In this Jepsen run, we execute multiple RPUSH operations, each with a distinct integer value: you can see their consequences interleaved in reads:

[:r 0 [1 1 1 1 1 2 3 1 1 1 1 1 1 1 4 1 1 1 1 1 1 1 1 1 5 9 10 6 2 2 2 3 3 2 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 1 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 7 1 1 1 1 1 1 1 1 1 1 1 8]]

This bug appears linked to follower-proxy=yes; running without follower proxies seems to resolve the issue.

Resharding support

The initial sharding support (#73) provides the basic functionality but does not offer a way to handle re-sharding and migration of keys between shardgroups. This will initially require substantial design work.

Subtasks:

Snapshot delivery refactoring

Snapshots are currently delivered as a single bulk:

  • Generated as an RDB file
  • Loaded into a single memory buffer
  • Delivered as a single bulk Redis command
  • Buffered by the recipient entirely as an argument
  • Stored on disk and loaded as RDB.

This has several drawbacks:

  • A lot of memory is consumed (temporarily), depending on the size of the dataset.
  • The RAFT.LOADSNAPSHOT command can take a long time to be delivered, and thus cannot be used as a heartbeat message.

AE prev conflicts with committed entry

With process crashes and partitions, we observed the following crash on startup of node n4:

20200510T091809.000-0400.zip

2020-05-10 06:27:47 Jepsen starting redis-server --bind 0.0.0.0 --dbfilename redis.rdb --loadmodule /opt/redis/redisraft.so loglevel=debug raft-log-filename=raftlog.db raft-log-max-file-size=32000 raft-log-max-cache-size=1000000 follower-proxy=yes
197382:C 10 May 2020 06:27:47.277 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
197382:C 10 May 2020 06:27:47.277 # Redis version=999.999.999, bits=64, commit=f88f8661, modified=0, pid=197382, just started
197382:C 10 May 2020 06:27:47.277 # Configuration loaded
197382:M 10 May 2020 06:27:47.277 * Increased maximum number of open files to 10032 (it was originally set to 1024).
197382:M 10 May 2020 06:27:47.278 * Running mode=standalone, port=6379.
197382:M 10 May 2020 06:27:47.278 # Server initialized
197382:M 10 May 2020 06:27:47.278 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
197382:M 10 May 2020 06:27:47.278 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
197382:M 10 May 2020 06:27:47.278 * <redisraft> RedisRaft starting, arguments: loglevel=debug raft-log-filename=raftlog.db raft-log-max-file-size=32000 raft-log-max-cache-size=1000000 follower-proxy=yes
197382:M 10 May 2020 06:27:47.279 * Module 'redisraft' loaded from /opt/redis/redisraft.so
197382:M 10 May 2020 06:27:47.279 * Loading RDB produced by version 999.999.999
197382:M 10 May 2020 06:27:47.279 * RDB age 12 seconds
197382:M 10 May 2020 06:27:47.279 * RDB memory usage when created 1.54 Mb
197382:M 10 May 2020 06:27:47.280 * DB loaded from disk: 0.001 seconds
197382:M 10 May 2020 06:27:47.280 * Ready to accept connections
197382:10 May 06:27:47.779 Loading: Redis loading complete, snapshot LOADED
197382:10 May 06:27:47.779 Loading: Snapshot: applied term=177 index=12482
197382:10 May 06:27:47.779 Loading: Snapshot config: node id=290960629 [192.168.122.11:6379], active=1, voting=1
197382:10 May 06:27:47.779 Loading: Snapshot config: node id=1114280924 [192.168.122.14:6379], active=1, voting=1
197382:10 May 06:27:47.779 Loading: Snapshot config: node id=1919582493 [192.168.122.15:6379], active=1, voting=1
197382:10 May 06:27:47.779 Loading: Snapshot config: node id=1347621502 [192.168.122.12:6379], active=1, voting=1
197382:10 May 06:27:47.779 Loading: Snapshot config: node id=1823052095 [192.168.122.13:6379], active=1, voting=1
197382:10 May 06:27:47.779 Snapshot configuration loaded. Raft state:
197382:10 May 06:27:47.779   node <unknown?>
197382:10 May 06:27:47.779   node id=290960629,addr=192.168.122.11,port=6379
197382:10 May 06:27:47.779   node id=1919582493,addr=192.168.122.15,port=6379
197382:10 May 06:27:47.779   node id=1347621502,addr=192.168.122.12,port=6379
197382:10 May 06:27:47.780   node id=1823052095,addr=192.168.122.13,port=6379
197382:10 May 06:27:47.780 Loading: Log loaded, 0 entries, snapshot last term=177, index=12482
197382:10 May 06:27:47.780 Loading: Log starts from snapshot term=177, index=12482
197382:10 May 06:27:47.780 Raft Log: loaded current term=183, vote=1114280924
197382:10 May 06:27:47.780 Raft state after applying log: log_count=0, current_idx=12482, last_applied_idx=12482
197382:10 May 06:27:47.888 node:1823052095: node.c:74: Node connection established.
197382:10 May 06:27:47.888 node:1347621502: node.c:74: Node connection established.
197382:10 May 06:27:47.888 node:1919582493: node.c:74: Node connection established.
197382:10 May 06:27:47.888 node:290960629: node.c:74: Node connection established.
197382:10 May 06:27:48.696 <raftlib> becoming follower
197382:10 May 06:27:48.696 State change: Node is now a follower, term 184
197382:10 May 06:27:48.696 <raftlib> randomize election timeout to 1574
197382:10 May 06:27:48.696 node:290960629: raft.c:479: <raftlib> node requested vote: -1728024480 replying: not granted
197382:10 May 06:27:48.708 node:290960629: raft.c:479: <raftlib> recvd appendentries t:184 ci:12483 lc:12482 pli:12576 plt:177 #1
197382:10 May 06:27:48.708 node:290960629: raft.c:479: <raftlib> AE no log at prev_idx 12576
197382:10 May 06:27:48.709 node:290960629: raft.c:479: <raftlib> recvd appendentries t:184 ci:12483 lc:12482 pli:12483 plt:177 #94
197382:10 May 06:27:48.709 node:290960629: raft.c:479: <raftlib> AE no log at prev_idx 12483
197382:10 May 06:27:48.711 node:290960629: raft.c:479: <raftlib> recvd appendentries t:184 ci:12483 lc:12482 pli:12482 plt:177 #95
197382:10 May 06:27:49.401 node:290960629: raft.c:479: <raftlib> recvd appendentries t:184 ci:12578 lc:12577 pli:12482 plt:177 #95
197382:10 May 06:27:50.401 node:290960629: raft.c:479: <raftlib> recvd appendentries t:184 ci:12673 lc:12637 pli:12577 plt:184 #61
197382:10 May 06:27:50.401 node:290960629: raft.c:479: <raftlib> AE term doesn't match prev_term (ie. 177 vs 184) ci:12673 comi:12577 lcomi:12637 pli:12577
197382:10 May 06:27:50.401 node:290960629: raft.c:479: <raftlib> AE prev conflicts with committed entry
redis-server: raft.c:784: callRaftPeriodic: Assertion `ret == 0' failed.

The error appeared recoverable; a subsequent restart of the node did not encounter it, and we did not observe any safety impact from this crash. However, a write-loss event (#52) did happen roughly forty seconds prior to this assertion error, which raises the possibility that the two events are somehow related.

You can reproduce this problem on Jepsen.redis ddb68c8961a8f2b1cb0b37f263bd1eb0930ab5b8 by running:

lein run test-all --raft-version 2d1cf30 --time-limit 600 --nemesis partition,kill --follower-proxy --test-count 40 --concurrency 4n

Panic on startup: log snapshot index mismatch

With Jepsen 6c063da, Redis f88f866, and Redis-Raft b9ee410, testing with network partitions, process crashes, pauses, and membership changes resulted in nodes panicking on restart, due to a mismatch between log initial indices and snapshots.

lein run test-all --raft-version b9ee410 --time-limit 600 --nemesis partition,kill,pause,member --follower-proxy --test-count 16 --concurrency 4n

20200505T133553.000-0400.zip

In this test run, several keys exhibited split-brain anomalies: multiple timelines of a single key were concurrently observed by distinct subsets of nodes, resulting in the apparent loss of some, or all, committed writes. This issue (possibly a separate problem) is described in #44.

Minutes later, multiple nodes observed a gap between their snapshot and initial log index when restarted by Jepsen. Logs also contained 0 entries. Immediately prior to the panics, Jepsen killed n2, n3, and n5, recovered from a network partition, isolated n2, n3, and n4 away from n1 and n5, then started all nodes.

...
2020-05-05 13:39:18,798{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:kill	:majority
2020-05-05 13:39:21,005{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:kill	{"n2" "", "n3" "", "n5" ""}
2020-05-05 13:39:24,005{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:resume	:all
2020-05-05 13:39:24,109{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:resume	{"n1" "", "n2" "", "n3" "", "n4" "", "n5" ""}
2020-05-05 13:39:27,215{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:join	"n5"
2020-05-05 13:39:27,224{GMT}	WARN	[jepsen nemesis] jepsen.core: Process :nemesis crashed
2020-05-05 13:39:27,224{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:join	"n5"	indeterminate: Command exited with non-zero status 124 on node n4:
2020-05-05 13:39:30,329{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:leave	"n1"
2020-05-05 13:39:30,337{GMT}	WARN	[jepsen nemesis] jepsen.core: Process :nemesis crashed
2020-05-05 13:39:30,337{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:leave	"n1"	indeterminate: Command exited with non-zero status 124 on node n4:
2020-05-05 13:39:33,337{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:stop-partition	nil
2020-05-05 13:39:33,544{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:stop-partition	:network-healed
2020-05-05 13:39:36,544{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:start-partition	:majority
2020-05-05 13:39:36,649{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:start-partition	[:isolated {"n5" #{"n2" "n4" "n3"}, "n1" #{"n2" "n4" "n3"}, "n2" #{"n5" "n1"}, "n4" #{"n5" "n1"}, "n3" #{"n5" "n1"}}]
2020-05-05 13:39:39,649{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:start	:all
2020-05-05 13:39:39,856{GMT}	INFO	[jepsen nemesis] jepsen.util: :nemesis	:info	:start	{"n1" "", "n2" "", "n3" "", "n4" "", "n5" ""}

At 10:39:39, Jepsen started n2, and it immediately panicked with Log initial index (1482) does not match snapshot last index (1400), aborting.

2020-05-05 10:39:39 Jepsen starting redis-server --bind 0.0.0.0 --dbfilename redis.rdb --loadmodule /opt/redis/redisraft.so loglevel=debug raft-log-filename=raftlog.db raft-log-max-file-size=32000 raft-log-max-cache-size=1000000 follower-proxy=yes
30457:C 05 May 2020 10:39:39.772 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
30457:C 05 May 2020 10:39:39.772 # Redis version=999.999.999, bits=64, commit=f88f8661, modified=0, pid=30457, just started
30457:C 05 May 2020 10:39:39.772 # Configuration loaded
30457:M 05 May 2020 10:39:39.773 * Increased maximum number of open files to 10032 (it was originally set to 1024).
30457:M 05 May 2020 10:39:39.773 * Running mode=standalone, port=6379.
30457:M 05 May 2020 10:39:39.773 # Server initialized
30457:M 05 May 2020 10:39:39.773 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
30457:M 05 May 2020 10:39:39.773 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
30457:M 05 May 2020 10:39:39.774 * <redisraft> RedisRaft starting, arguments: loglevel=debug raft-log-filename=raftlog.db raft-log-max-file-size=32000 raft-log-max-cache-size=1000000 follower-proxy=yes
30457:M 05 May 2020 10:39:39.774 * Module 'redisraft' loaded from /opt/redis/redisraft.so
30457:M 05 May 2020 10:39:39.775 * Loading RDB produced by version 999.999.999
30457:M 05 May 2020 10:39:39.775 * RDB age 115 seconds
30457:M 05 May 2020 10:39:39.775 * RDB memory usage when created 1.58 Mb
30457:M 05 May 2020 10:39:39.775 * DB loaded from disk: 0.000 seconds
30457:M 05 May 2020 10:39:39.775 * Ready to accept connections
30457:05 May 10:39:40.276 Loading: Redis loading complete, snapshot LOADED
30457:05 May 10:39:40.276 Loading: Snapshot: applied term=4 index=1400
30457:05 May 10:39:40.276 Snapshot configuration loaded. Raft state:
30457:05 May 10:39:40.276   node <unknown?>
30457:05 May 10:39:40.276 Loading: Log loaded, 0 entries, snapshot last term=4, index=1482
30457:05 May 10:39:40.276 Loading: Log starts from snapshot term=4, index=1482
30457:05 May 10:39:40.276 

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
REDIS RAFT PANIC
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Log initial index (1482) does not match snapshot last index (1400), aborting.

Subsequent restarts of n2 panicked as well.

n3 was also started at 10:39:39, and panicked too:

30592:05 May 10:40:16.888 Loading: Snapshot: applied term=4 index=1402
30592:05 May 10:40:16.888 Snapshot configuration loaded. Raft state:
30592:05 May 10:40:16.888   node <unknown?>
30592:05 May 10:40:16.888 Loading: Log loaded, 0 entries, snapshot last term=4, index=1478
30592:05 May 10:40:16.888 Loading: Log starts from snapshot term=4, index=1478
30592:05 May 10:40:16.888 

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
REDIS RAFT PANIC
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Log initial index (1478) does not match snapshot last index (1402), aborting.

Two minutes later, at 10:41:35, Jepsen started n4, and it too panicked:

30876:05 May 10:41:35.684 Loading: Snapshot: applied term=4 index=1173
30876:05 May 10:41:35.684 Snapshot configuration loaded. Raft state:
30876:05 May 10:41:35.684   node <unknown?>
30876:05 May 10:41:35.684 Loading: Log loaded, 0 entries, snapshot last term=4, index=1390
30876:05 May 10:41:35.684 Loading: Log starts from snapshot term=4, index=1390
30876:05 May 10:41:35.684 

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
REDIS RAFT PANIC
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Log initial index (1390) does not match snapshot last index (1173), aborting.

n1 and n5 did not panic.

EntryCacheAppend: Assertion `cache->start_idx + cache->len == idx' failed

Found another crash in 2d1cf30, with redis f88f866:

20200510T023432.000-0400.zip

147415:09 May 23:38:38.585 node:1529561842: node.c:361: NodeAddPendingResponse: id=2216, type=proxy, request_time=1589092718585
147415:09 May 23:38:38.588 node:1529561842: raft.c:479: <raftlib> recvd appendentries t:80 ci:6853 lc:6965 pli:6848 plt:79 #117
redis-server: log.c:63: EntryCacheAppend: Assertion `cache->start_idx + cache->len == idx' failed.

This one seems less frequent; first time I've seen it in about 40 runs. No observed safety impact. To reproduce (slowly), try jepsen.redis ddb68c8961a8f2b1cb0b37f263bd1eb0930ab5b8, and run:

lein run test-all --raft-version 2d1cf30 --time-limit 600 --nemesis partition,kill,pause,member --follower-proxy --test-count 50 --concurrency 4

Empty reads with membership changes, crashes

Still working on narrowing this down, but I have two cases with version f88f866 where it looks like Redis started returning empty values for LRANGE reads after starting up all nodes. This is fairly infrequent, so these clusters have been through the wringer. I'm trying to narrow down behavior and get a more reproducible test pattern.

20200323T201814.000-0400.zip

Add an explicit config for in-memory logs.

In the past, not providing a log file name implied the log should be kept in memory only.
We now have a default log file name (for safety reasons), so we need an explicit configuration parameter to enable pure in-memory mode, as reported by @maguec.

Refresh configuration interface

The current configuration interface requires a refresh, and we need to do that as early as possible before approaching beta or GA releases.

Things to do:

  • Move from an argument=value to an ARGUMENT <value> format, which is a bit easier and more intuitive, and similar to the approach taken by other modules (see the sketch after this list).
  • Reconsider and improve names where applicable (and don't forget to update documentation).
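
As a rough illustration of the first point (keeping today's parameter names purely as examples, not as a decision on final naming), the module arguments would move from something like:

redis-server --loadmodule /opt/redis/redisraft.so \
    raft-log-filename=raftlog.db follower-proxy=yes

to something like:

redis-server --loadmodule /opt/redis/redisraft.so \
    RAFT-LOG-FILENAME raftlog.db FOLLOWER-PROXY yes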
