Thanks. At first glance, it looks like there are concurrent pre-vote calls to the same node from different threads, and that caused the race condition. This should not happen; let me spend more time on it.
2021.01.29 23:13:38.105010 [ 21 ] {} <Warning> RaftInstance: [PRE-VOTE INIT] my id 3, my role follower, term 2, log idx 338, log term 2, priority (target 1 / mine 1)
2021.01.29 23:13:38.105069 [ 21 ] {} <Information> RaftInstance: send req 3 -> 2, type pre_vote_request
2021.01.29 23:13:38.105110 [ 21 ] {} <Notice> RaftInstance: socket to node2:44445 is not opened yet
2021.01.29 23:13:38.352662 [ 19 ] {} <Warning> RaftInstance: [PRE-VOTE INIT] my id 3, my role follower, term 2, log idx 338, log term 2, priority (target 1 / mine 1)
2021.01.29 23:13:38.352725 [ 19 ] {} <Information> RaftInstance: send req 3 -> 2, type pre_vote_request
2021.01.29 23:13:38.352768 [ 19 ] {} <Notice> RaftInstance: socket to node2:44445 is not opened yet
I haven't used the asio library before, but it's unclear to me why this is true: https://github.com/eBay/NuRaft/blob/master/src/asio_service.cxx#L1021-L1022. We set the flag in the `send` method and remove it after the asynchronous read (`response_read` or `ctx_read`, https://github.com/eBay/NuRaft/blob/master/src/asio_service.cxx#L1330-L1400), but there is no explicit synchronization. Why can't we call `send` again before we have received the response to the previous call?
Hi @alesapin,
The purpose of `busy_flag` is not synchronization but debugging: detecting the bug and aborting. NuRaft does not send a message to a peer before it gets the response (including errors) to the previous message, so there should be only one in-flight message per peer at a time. That is controlled by the `make_busy()` function in `peer`:
NuRaft/include/libnuraft/peer.hxx, lines 120 to 123 in c621949
If you get caught by this `assert`, that means multiple messages were sent to the same peer at the same time, and that should be a bug. Can you please share detailed information such as logs?
Thanks.
Sure, logs from the failed instance: https://gist.github.com/alesapin/9d860e183a323280ab2600f93ac195c1. Settings:
nuraft::raft_params params;
params.heart_beat_interval_ = 100;
params.election_timeout_lower_bound_ = 200;
params.election_timeout_upper_bound_ = 400;
params.reserved_log_items_ = 5000;
params.snapshot_distance_ = 5000;
params.client_req_timeout_ = 10000;
params.auto_forwarding_ = true;
params.return_method_ = nuraft::raft_params::blocking;
I've noticed that `auto_forwarding_` is not recommended, but the error can be reproduced without it:
trace https://gist.github.com/alesapin/8208bdb7f6f192bcaac4ad2e4faaccf3, logs https://gist.github.com/alesapin/a4a990ea55e840722af5fc2eb86276fd.
BTW, node1 has priority 3, node2 has priority 2, and node3 has priority 1.
I've also noticed that `peer` is not the only class that uses `asio_rpc_client`. For example, `send_msg_to_leader` with `auto_forwarding_ = true` also calls `send`, but without any explicit synchronization, so a race condition is possible there too. According to the asio docs, it's recommended to use a strand when multiple reads/writes on the same socket can happen from different threads. So I think making `send` thread-safe (at least at the socket level) is a good idea. I've updated my PR #171; maybe it will be useful.
Hi @alesapin, thanks for the heads-up.
Adding synchronization to Asio itself doesn't help, as the upper layers (asio_service, peer, raft_server) do not consider the case of sending a message before the previous response arrives. It would "hide" the problem, but the root cause (Raft sending overlapping messages) would still be there, just with lower probability.
Since you said "I've noticed that `auto_forwarding_` is not recommended, but the error can be reproduced without it", I was investigating it excluding the auto-forwarding case. We are not using auto-forwarding at eBay (hence it is not tested and not recommended), and it has the problem you mentioned above. That needs to be improved separately, but if the socket race happens even without auto-forwarding, it is a different issue.
From the log you shared, I found a few things.
- Peers 1 and 2 were not responding for a long time, hence this server (peer 3) attempted to reconnect, but it did so too frequently, and sometimes there were multiple reconnect requests at the same time. This is not expected behavior and needs to be fixed, as it is the direct root cause of the "socket race". I will upload the fix for it soon.
- The log below looks very strange. The number 34359738370 comes from the `get_quorum_for_election()` function, which simply returns `<the number of peers> / 2 + 1`. There is no way for this value to overflow, as you are not using a custom quorum size (correct me if I missed something). Can you please check how this happened? Perhaps there was memory corruption.
2021.01.29 23:12:46.794646 [ 66 ] {} <Error> RaftInstance: total 1 nodes (including this node) responded for pre-vote (term 2, live 0, dead 1), at least 34359738370 nodes should respond. failure count 139917149601793
- The current log misses information about `rpc_client`, so it was hard to track the dependency between events. I pushed a commit for debugging: greensky00@ac7b729. Could you please share the log of the socket race with this commit, and without the auto-forwarding option? It would be much appreciated.
Thanks.
Thanks, I suspected the invariant that "upper layers (asio_service, peer, raft_server) are not considering the case of sending messages before getting the previous response", but I could not clearly figure it out from the code. Maybe at least a comment in `asio_service` would be helpful. I'll close my PR and think about a fix for `auto_forwarding_`; it should be quite easy to avoid the race.
I will upload the fix for it soon.
OK, waiting for the fix :)
There is nowhere to make this value overflow as you don't use custom quorum size (correct me if I missed something). Can you please check how this happened? Probably there was a memory corruption.
I've pasted all non-default settings in #169 (comment). I'll try to run the same test with AddressSanitizer; if there is memory corruption, we will catch it for sure.
so could you please share the log of socket race with this commit, and without the auto-forwarding option
OK, I'll remove my `strand` commits and cherry-pick your logging improvements.
Reproduced the error with your patch:
Fatal trace: https://gist.github.com/alesapin/54d374fa7a8fc46b20051f460115a5e3
Log: https://gist.github.com/alesapin/07539dd06680a1ecd51e28032e479d88
The suspicious message is still here:
2021.01.31 16:01:52.662809 [ 50 ] {} <Error> RaftInstance: total 1 nodes (including this node) responded for pre-vote (term 2, live 0, dead 1), at least 34359738370 nodes should respond. failure count 140376711102638
The error also reproduces with AddressSanitizer in a release build, but no memory corruption is detected:
2021.01.31 17:20:24.034188 [ 40 ] {} <Fatal> RaftInstance: socket 0x616000396c98 is already in use, race happened on connection to node1:44444
2021.01.31 17:20:24.034805 [ 41 ] {} <Fatal> RaftInstance: socket 0x616000396c98 is already idle, race happened on connection to node1:44444
Also, the suspicious message disappeared (became normal):
2021.01.31 17:20:21.851833 [ 38 ] {} <Error> RaftInstance: total 1 nodes (including this node) responded for pre-vote (term 2, live 0, dead 1), at least 2 nodes should respond. failure count 151
2021.01.31 17:20:22.131467 [ 65 ] {} <Error> RaftInstance: total 1 nodes (including this node) responded for pre-vote (term 2, live 0, dead 1), at least 2 nodes should respond. failure count 152
2021.01.31 17:20:22.452196 [ 57 ] {} <Error> RaftInstance: total 1 nodes (including this node) responded for pre-vote (term 2, live 0, dead 1), at least 2 nodes should respond. failure count 153
2021.01.31 17:20:22.760966 [ 43 ] {} <Error> RaftInstance: total 1 nodes (including this node) responded for pre-vote (term 2, live 0, dead 1), at least 2 nodes should respond. failure count 154
2021.01.31 17:20:23.031649 [ 46 ] {} <Error> RaftInstance: total 1 nodes (including this node) responded for pre-vote (term 2, live 0, dead 1), at least 2 nodes should respond. failure count 155
It looks like UB or use of uninitialized memory, so I think I have to try the other sanitizers (MemorySanitizer and UBSan).
The strange big numbers in debug mode were caused by wrong conversion specifiers in the logging format string: #172.
Thanks @alesapin. You're right, the hex representation of 34359738370 is indeed `00 00 00 08 00 00 00 02`. I will merge your PR.
I pushed PR #173 to fix the potential race that I found. Hope this resolves your case. Thanks for bringing this issue.
+) Is there any specific reason why you set the heartbeat period to 100 ms? It might be too short for a common network environment, which can make the system unstable. We set it to around 1 second in our real deployments.
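As a sketch, the earlier settings adjusted to a roughly 1-second heartbeat, with the election timeout kept a few multiples above it, might look like the following fragment. The values are illustrative, not taken from this thread; only the `nuraft::raft_params` fields themselves appear in the settings shown earlier.

```cpp
// Config fragment (illustrative values, not a tested recommendation).
nuraft::raft_params params;
params.heart_beat_interval_          = 1000;  // ~1 s heartbeat, as in eBay's deployments
params.election_timeout_lower_bound_ = 2000;  // keep the election timeout well above
params.election_timeout_upper_bound_ = 4000;  // the heartbeat period
```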
Testing #173; at first glance, it looks like it helps.
Thanks! I'll close this issue.