It looks like clusters with follower-proxy, network partitions, and crashes (or possibly fewer faults; I'm trying to narrow it down) can occasionally get into states where one specific Redis node gets mixed up and sends inappropriate responses to clients. I think what we're seeing is responses intended for other clients getting dispatched to the wrong place.
lein run test-all --raft-version dfd91d4 -w append --time-limit 600 --nemesis-interval 1 --nemesis standard --test-count 10 --concurrency 5n --follower-proxy
20200313T233858.000-0400.zip
We resolve a network partition:
2020-03-13 23:41:36,284{GMT} INFO [jepsen nemesis] jepsen.util: :nemesis :info :stop-partition nil
The first sign of weirdness comes from worker 5, executing process 130. It tries to execute a MULTI transaction, opens a fresh connection, and gets a single "211" instead of a vector of results from an LRANGE command.
2020-03-13 23:41:41,025{GMT} INFO [jepsen worker 5] jepsen.util: 130 :invoke :txn [[:r 47 nil] [:r 47 nil] [:append 46 215]]
2020-03-13 23:41:41,026{GMT} INFO [jepsen worker 5] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.Socket 0x5c701a81 Socket[addr=n1/192.168.122.11,port=6379,localport=41566]], :spec {:host n1, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x73be0bac java.io.DataInputStream@73be0bac], :out #object[java.io.BufferedOutputStream 0x6900a5e8 java.io.BufferedOutputStream@6900a5e8]}}, :in-txn? #object[clojure.lang.Atom 0x7407179f {:status :ready, :val false}], :spec {:host n1, :port 6379, :timeout-ms 10000}}
2020-03-13 23:41:41,117{GMT} WARN [jepsen worker 5] jepsen.core: Process 130 crashed
clojure.lang.ExceptionInfo: throw+: {:type :unexpected-read-type, :key 47, :value "211"}
Concurrently, worker 15/process 140 does the same thing:
2020-03-13 23:41:41,045{GMT} INFO [jepsen worker 15] jepsen.util: 140 :invoke :txn [[:r 46 nil] [:append 45 226] [:append 46 216] [:append 46 217]]
2020-03-13 23:41:41,046{GMT} INFO [jepsen worker 15] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.Socket 0x2925d9b7 Socket[addr=n1/192.168.122.11,port=6379,localport=41580]], :spec {:host n1, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x12c9c95b java.io.DataInputStream@12c9c95b], :out #object[java.io.BufferedOutputStream 0x15deb95b java.io.BufferedOutputStream@15deb95b]}}, :in-txn? #object[clojure.lang.Atom 0x747af9c6 {:status :ready, :val false}], :spec {:host n1, :port 6379, :timeout-ms 10000}}
2020-03-13 23:41:41,123{GMT} WARN [jepsen worker 15] jepsen.core: Process 140 crashed
clojure.lang.ExceptionInfo: throw+: {:type :unexpected-read-type, :key 46, :value "211"}
It's particularly weird that both of them got "211" here.
Concurrently, worker 20/process 145 does a non-MULTI read (just a regular old single LRANGE by itself) on a fresh connection, and gets [2 3 ["226"]], which is... VERY weird.
2020-03-13 23:41:41,107{GMT} INFO [jepsen worker 20] jepsen.util: 145 :invoke :txn [[:r 46 nil]]
2020-03-13 23:41:41,108{GMT} INFO [jepsen worker 20] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.Socket 0x6ec17aa1 Socket[addr=n1/192.168.122.11,port=6379,localport=41584]], :spec {:host n1, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x530146a8 java.io.DataInputStream@530146a8], :out #object[java.io.BufferedOutputStream 0xcdfc07b java.io.BufferedOutputStream@cdfc07b]}}, :in-txn? #object[clojure.lang.Atom 0x3c20d3b9 {:status :ready, :val false}], :spec {:host n1, :port 6379, :timeout-ms 10000}}
2020-03-13 23:41:41,257{GMT} WARN [jepsen worker 20] jepsen.core: Process 145 crashed
clojure.lang.ExceptionInfo: throw+: {:type :unexpected-read-type, :key 46, :value [2 3 ["226"]]}
Same thing happens to worker 10's request with process 160: a fresh connection, a MULTI transaction, and it gets the vector [2 3]. It is a vector, but of integers rather than the strings we expected.
2020-03-13 23:41:40,927{GMT} INFO [jepsen worker 10] jepsen.util: 160 :invoke :txn [[:r 47 nil] [:append 46 211] [:append 47 111] [:append 46 212]]
2020-03-13 23:41:40,928{GMT} INFO [jepsen worker 10] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.Socket 0x274a045 Socket[addr=n1/192.168.122.11,port=6379,localport=41548]], :spec {:host n1, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x657968b9 java.io.DataInputStream@657968b9], :out #object[java.io.BufferedOutputStream 0x6f11d2be java.io.BufferedOutputStream@6f11d2be]}}, :in-txn? #object[clojure.lang.Atom 0x29e75eed {:status :ready, :val false}], :spec {:host n1, :port 6379, :timeout-ms 10000}}
2020-03-13 23:41:41,308{GMT} WARN [jepsen worker 10] jepsen.core: Process 160 crashed
clojure.lang.ExceptionInfo: throw+: {:type :unexpected-read-type, :key 45, :value [2 3]}
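All four crashes above come from the same kind of check: a read should come back as a list of strings, and anything else gets flagged. Here's a rough Python sketch of that check, just to make the shapes explicit. This is not the actual jepsen.redis code; `read_shape` is a name I made up for illustration, and the values are copied from the errors above.

```python
def read_shape(value):
    """Describe the shape of a reply that was supposed to be an LRANGE result."""
    if isinstance(value, list):
        if all(isinstance(x, str) for x in value):
            return "list-of-strings"   # what a read should look like
        return "list-of-other"         # a vector, but not purely of strings
    return type(value).__name__        # a scalar where a list was expected


# Values copied from the :unexpected-read-type errors above, keyed by process.
observed = {
    130: "211",
    140: "211",
    145: [2, 3, ["226"]],
    160: [2, 3],
}

shapes = {process: read_shape(value) for process, value in observed.items()}
for process, shape in shapes.items():
    print(process, shape)
```

Processes 130 and 140 got a bare string, while 145 and 160 got vectors whose elements are not all strings, so none of the four looks like a plain LRANGE reply.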
Immediately after this, we detect the completion of a node remove operation that we'd been waiting on for a while: node n1 finishes removing n5 from the cluster.
2020-03-13 23:41:41,314{GMT} INFO [clojure-agent-send-off-pool-10] jepsen.redis.db: :done-waiting-for-node-removal {:raft
 {:role :follower,
  :num_voting_nodes "2",
  :leader_id 813301272,
  :is_voting "yes",
  :node_id 1597309644,
  :num_nodes 2,
  :state :up,
  :nodes
  ({:id 813301272,
    :state "connected",
    :voting "yes",
    :addr "192.168.122.12",
    :port 6379,
    :last_conn_secs 94,
    :conn_errors 0,
    :conn_oks 1}),
  :current_term 114},
 :log
 {:log_entries 1243,
  :current_index 1243,
  :commit_index 1243,
  :last_applied_index 1243,
  :file_size 173406,
  :cache_memory_size 136972,
  :cache_entries 1090},
 :snapshot {:snapshot_in_progress "no"},
 :clients {:clients_in_multi_state 0}}
Node n2 starts removing n3:
2020-03-13 23:41:41,349{GMT} INFO [jepsen node n2] jepsen.redis.db: n2 :removing n3 (id: 859783830)
Worker 5, process 180, hits a type error as well--it expects a list of elements from an LRANGE, but gets a single Long instead.
2020-03-13 23:41:41,322{GMT} INFO [jepsen worker 5] jepsen.util: 180 :invoke :txn [[:r 46 nil] [:append 47 135] [:append 47 136] [:r 46 nil]]
2020-03-13 23:41:41,323{GMT} INFO [jepsen worker 5] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.Socket 0x7a37e654 Socket[addr=n1/192.168.122.11,port=6379,localport=41632]], :spec {:host n1, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x1fcb0d1f java.io.DataInputStream@1fcb0d1f], :out #object[java.io.BufferedOutputStream 0x24bb596f java.io.BufferedOutputStream@24bb596f]}}, :in-txn? #object[clojure.lang.Atom 0x335f2349 {:status :ready, :val false}], :spec {:host n1, :port 6379, :timeout-ms 10000}}
2020-03-13 23:41:41,365{GMT} WARN [jepsen worker 5] jepsen.core: Process 180 crashed
java.lang.IllegalArgumentException: Don't know how to create ISeq from: java.lang.Long
Worker 0, executing process 200, goes to perform a MULTI transaction. It opens a new connection to do so...
2020-03-13 23:41:42,021{GMT} INFO [jepsen worker 0] jepsen.util: 200 :invoke :txn [[:r 49 nil] [:append 48 21]]
2020-03-13 23:41:42,021{GMT} INFO [jepsen worker 0] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.Socket 0x25b8659b Socket[addr=n1/192.168.122.11,port=6379,localport=41722]], :spec {:host n1, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x55c08913 java.io.DataInputStream@55c08913], :out #object[java.io.BufferedOutputStream 0x4e607e5b java.io.BufferedOutputStream@4e607e5b]}}, :in-txn? #object[clojure.lang.Atom 0x60177bc0 {:status :ready, :val false}], :spec {:host n1, :port 6379, :timeout-ms 10000}}
That request succeeds with what looks like normal results.
2020-03-13 23:41:42,121{GMT} INFO [jepsen worker 0] jepsen.util: 200 :ok :txn [[:r 49 [111 130 131 132 133 135 136 137 138 153 154 156 166 169 170 175 176 179 180 181]] [:append 48 21]]
Weirdly, this conflicts with other reads of key 49, so I'm not sure if we got the right thing here or not.
Worker 0 then tries to perform a single read of 48. This doesn't involve an EXEC, so it should have returned a single value. Instead it gets a vector of vectors, which... what?
2020-03-13 23:41:42,257{GMT} INFO [jepsen worker 0] jepsen.util: 200 :invoke :txn [[:r 48 nil]]
...
2020-03-13 23:41:42,373{GMT} WARN [jepsen worker 0] jepsen.core: Process 200 crashed
clojure.lang.ExceptionInfo: throw+: {:type :unexpected-read-type, :key 48, :value [["2" "4" "5" "6" "8" "9" "14" "15" "17" "21"] ["111" "130" "131" "132" "133" "135" "136" "137" "138" "153" "154" "156" "166" "169" "170" "175" "176" "179" "180" "181" "194"] ["3" "11" "12" "13" "21"]]}
Eyeballing the vectors it returned, it looks like this is a read of keys [? 49 ?], respectively.
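The eyeballing can be made mechanical. Since these lists are only ever appended to, an earlier read of a key should be a prefix of any later read of it. A quick Python sketch of that comparison, using process 200's earlier :ok read of key 49 and the three vectors from the error; `is_prefix` is just an illustrative helper, not something from the test suite:

```python
def is_prefix(earlier, later):
    """True if `earlier` is a prefix of `later` (append-only lists only grow)."""
    return later[:len(earlier)] == earlier


# Key 49 as read by process 200's earlier :ok transaction.
key_49_earlier = [111, 130, 131, 132, 133, 135, 136, 137, 138, 153,
                  154, 156, 166, 169, 170, 175, 176, 179, 180, 181]

# The three vectors returned for the single read of key 48.
reply = [
    ["2", "4", "5", "6", "8", "9", "14", "15", "17", "21"],
    ["111", "130", "131", "132", "133", "135", "136", "137", "138", "153",
     "154", "156", "166", "169", "170", "175", "176", "179", "180", "181",
     "194"],
    ["3", "11", "12", "13", "21"],
]

# Compare each vector (converted to integers) against the known read of key 49.
matches = [is_prefix(key_49_earlier, [int(x) for x in vec]) for vec in reply]
print(matches)
```

Only the middle vector extends the earlier read of key 49, which is why I'm guessing [? 49 ?]. This says nothing about which keys the first and third vectors belong to.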
Then worker 0 (logical process 225) goes to perform a transaction:
2020-03-13 23:41:42,446{GMT} INFO [jepsen worker 0] jepsen.util: 225 :invoke :txn [[:r 49 nil] [:r 48 nil] [:r 47 nil] [:append 47 211]]
225 is a new process, so it opens a fresh connection to perform its first request. Just to confirm, yes, this is a new port and connection object!
2020-03-13 23:41:42,447{GMT} INFO [jepsen worker 0] jepsen.redis.append: :conn {:pool #jepsen.redis.client.SingleConnectionPool{:conn #taoensso.carmine.connections.Connection{:socket #object[java.net.Socket 0x417158d7 Socket[addr=n1/192.168.122.11,port=6379,localport=41762]], :spec {:host n1, :port 6379, :timeout-ms 10000}, :in #object[java.io.DataInputStream 0x6d4fbc73 java.io.DataInputStream@6d4fbc73], :out #object[java.io.BufferedOutputStream 0x5c7b23f7 java.io.BufferedOutputStream@5c7b23f7]}}, :in-txn? #object[clojure.lang.Atom 0x59d46b31 {:status :ready, :val false}], :spec {:host n1, :port 6379, :timeout-ms 10000}}
It performs a MULTI, then a read (LRANGE) of key 49, read of 48, read of 47, and appends (RPUSH) 211 to key 47. It EXECs the transaction, and gets back a vector of responses. The first response should have been a list of values for key 49, but instead we obtained "2", which... what?
2020-03-13 23:41:42,552{GMT} WARN [jepsen worker 0] jepsen.core: Process 225 crashed
clojure.lang.ExceptionInfo: throw+: {:type :unexpected-read-type, :key 49, :value "2"}
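For reference, here is roughly what that transaction should look like on the wire, as a Python sketch. The encoder follows the standard Redis protocol framing for commands (an array of bulk strings). The exact arguments are my assumption: I'm guessing reads are `LRANGE key 0 -1` and that keys are sent as bare numbers, which may not match what the client actually sends.

```python
def resp_command(*args):
    """Encode one command as a RESP array of bulk strings."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = str(arg).encode()
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


# Assumed command sequence for process 225's transaction.
txn = [
    ("MULTI",),
    ("LRANGE", 49, 0, -1),   # assumption: whole-list read
    ("LRANGE", 48, 0, -1),
    ("LRANGE", 47, 0, -1),
    ("RPUSH", 47, 211),
    ("EXEC",),
]

wire = b"".join(resp_command(*cmd) for cmd in txn)
print(wire.decode().replace("\r\n", "\\r\\n "))

# Reply shapes a well-behaved server sends back, in order:
# +OK for MULTI, +QUEUED for each queued command, then for EXEC one array
# holding three arrays of bulk strings (the LRANGEs) and one integer (RPUSH).
expected_exec_shape = ["list", "list", "list", "integer"]
```

Against that expected shape, a bare "2" in the first slot of the EXEC reply can't be an LRANGE result for key 49.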
Another read of key 49, logged slightly earlier at 23:41:41,979, also gets a weird response:
2020-03-13 23:41:41,979{GMT} INFO [jepsen worker 20] jepsen.util: 270 :ok :txn [[:append 49 5] [:r 47 [2 4 5 6]] [:r 49 [3 11 12]] [:append 49 6]]
This read is clearly messed up because we just appended 5 to key 49, and didn't observe it. Also, the append of 12 to key 49 happened after this, and failed. It definitely can't be reading 49.
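The first half of that argument is a check you can do from the transaction alone: assuming a MULTI/EXEC transaction runs atomically, a read of a key that follows an append to the same key in the same transaction must end with the appended value. A small Python sketch of that rule (the function name and tuple encoding are mine, made up for illustration):

```python
def own_append_violations(txn):
    """Find reads that fail to end with this transaction's own prior append."""
    last_append = {}   # key -> most recent value this txn appended
    problems = []
    for op, key, value in txn:
        if op == "append":
            last_append[key] = value
        elif op == "r" and key in last_append:
            if not value or value[-1] != last_append[key]:
                problems.append((key, last_append[key], value))
    return problems


# Process 270's :ok transaction from the log line above.
txn = [
    ("append", 49, 5),
    ("r", 47, [2, 4, 5, 6]),
    ("r", 49, [3, 11, 12]),
    ("append", 49, 6),
]

violations = own_append_violations(txn)
print(violations)
```

The read of key 49 is flagged because it doesn't end in 5. The read of key 47 isn't checked, since this transaction never appended to 47 before reading it.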
I feel like this points to wires getting crossed either inside Redis or the Carmine library--or maybe I'm somehow STILL using the library wrong. I think... what I should do next is get a Wireshark dump of the traffic back and forth, and try to figure out what's going on at a protocol level.
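To read a capture at the protocol level, it helps to have something that turns raw reply bytes back into values. Below is a minimal Python sketch of a decoder for the classic Redis reply types (simple strings, errors, integers, bulk strings, arrays). The sample bytes are hand-written for illustration, not taken from a real capture.

```python
def parse_reply(buf, pos=0):
    """Parse one RESP2 reply from buf at pos; return (value, next_pos)."""
    end = buf.index(b"\r\n", pos)
    kind, body = buf[pos:pos + 1], buf[pos + 1:end]
    pos = end + 2
    if kind == b"+":                      # simple string, e.g. +OK
        return body.decode(), pos
    if kind == b"-":                      # error reply
        return ("error", body.decode()), pos
    if kind == b":":                      # integer, e.g. an RPUSH result
        return int(body), pos
    if kind == b"$":                      # bulk string; -1 means nil
        n = int(body)
        if n == -1:
            return None, pos
        return buf[pos:pos + n].decode(), pos + n + 2
    if kind == b"*":                      # array, e.g. an LRANGE or EXEC result
        n = int(body)
        if n == -1:
            return None, pos
        items = []
        for _ in range(n):
            item, pos = parse_reply(buf, pos)
            items.append(item)
        return items, pos
    raise ValueError("unknown reply type: %r" % kind)


# Hand-written example: an EXEC reply holding one list and one integer.
sample = b"*2\r\n*2\r\n$1\r\n2\r\n$1\r\n4\r\n:3\r\n"
value, consumed = parse_reply(sample)
print(value)
```

Running this over each direction of a single TCP stream, one reply at a time, would show whether the odd values are already on the wire from n1 or only appear after the client library has processed them.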