Comments (14)

greensky00 commented on July 22, 2024

Thanks. At first glance, it looks like there are concurrent pre-vote calls to the same node from different threads, which caused the race condition. This should not happen; let me spend more time on it.

2021.01.29 23:13:38.105010 [ 21 ] {} <Warning> RaftInstance: [PRE-VOTE INIT] my id 3, my role follower, term 2, log idx 338, log term 2, priority (target 1 / mine 1)
2021.01.29 23:13:38.105069 [ 21 ] {} <Information> RaftInstance: send req 3 -> 2, type pre_vote_request
2021.01.29 23:13:38.105110 [ 21 ] {} <Notice> RaftInstance: socket to node2:44445 is not opened yet
2021.01.29 23:13:38.352662 [ 19 ] {} <Warning> RaftInstance: [PRE-VOTE INIT] my id 3, my role follower, term 2, log idx 338, log term 2, priority (target 1 / mine 1)
2021.01.29 23:13:38.352725 [ 19 ] {} <Information> RaftInstance: send req 3 -> 2, type pre_vote_request
2021.01.29 23:13:38.352768 [ 19 ] {} <Notice> RaftInstance: socket to node2:44445 is not opened yet

from nuraft.

alesapin commented on July 22, 2024

I haven't used the asio library before, but it's unclear to me why this is true: https://github.com/eBay/NuRaft/blob/master/src/asio_service.cxx#L1021-L1022. We set the flag in the send method and clear it after the asynchronous read (response_read or ctx_read) (https://github.com/eBay/NuRaft/blob/master/src/asio_service.cxx#L1330-L1400), but there is no explicit synchronization. Why can't we call send again before we have received the response to the previous call?

greensky00 commented on July 22, 2024

Hi @alesapin

The purpose of busy_flag is not synchronization but debugging: detecting a bug and aborting. NuRaft does not send a message to a peer before it gets the response (including errors) to the previous message, so there should be only one in-flight message per peer at a time. That is enforced by the make_busy() function in peer:

    bool make_busy() {
        bool f = false;
        return busy_flag_.compare_exchange_strong(f, true);
    }

If you hit this assert, it means multiple messages were sent to the same peer at the same time, which should be a bug. Could you please share detailed information such as logs?

Thanks.

alesapin commented on July 22, 2024

> Hi @alesapin
>
> The purpose of busy_flag is not synchronization but debugging: detecting a bug and aborting. NuRaft does not send a message to a peer before it gets the response (including errors) to the previous message, so there should be only one in-flight message per peer at a time. That is enforced by the make_busy() function in peer:
>
>     bool make_busy() {
>         bool f = false;
>         return busy_flag_.compare_exchange_strong(f, true);
>     }
>
> If you hit this assert, it means multiple messages were sent to the same peer at the same time, which should be a bug. Could you please share detailed information such as logs?
>
> Thanks.

Sure, here are logs from the failed instance: https://gist.github.com/alesapin/9d860e183a323280ab2600f93ac195c1. Settings:

    nuraft::raft_params params;
    params.heart_beat_interval_ = 100;
    params.election_timeout_lower_bound_ = 200;
    params.election_timeout_upper_bound_ = 400;
    params.reserved_log_items_ = 5000;
    params.snapshot_distance_ = 5000;
    params.client_req_timeout_ = 10000;
    params.auto_forwarding_ = true;
    params.return_method_ = nuraft::raft_params::blocking;

I've noticed that auto_forwarding_ is not recommended, but the error can be reproduced without it:
trace https://gist.github.com/alesapin/8208bdb7f6f192bcaac4ad2e4faaccf3, logs https://gist.github.com/alesapin/a4a990ea55e840722af5fc2eb86276fd.

alesapin commented on July 22, 2024

btw, node1 has priority 3, node2 has 2, and node3 has 1

alesapin commented on July 22, 2024

I've also noticed that peer is not the only class that uses asio_rpc_client. For example, send_msg_to_leader with auto_forwarding_ = true also calls send, but without any explicit synchronization, so a race condition is possible there too. According to the asio docs, it's recommended to use a strand if multiple threads can read/write the same socket. So I think making send thread-safe (at least at the socket level) is a good idea. I've updated my PR #171; maybe it will be useful.

greensky00 commented on July 22, 2024

Hi @alesapin, thanks for the heads-up.

Adding synchronization to asio itself doesn't help, as the upper layers (asio_service, peer, raft_server) do not consider the case of sending a message before the previous response has arrived. It would "hide" the problem, but the root cause (Raft sending overlapping messages) would still be there, just with lower probability.

Since you said

> I've noticed that auto_forwarding_ is not recommended, but the error can be reproduced without it:

I was investigating it excluding the auto-forwarding case. We do not use auto-forwarding at eBay (hence it is not well tested and not recommended), and it has the problem you mentioned above. That needs to be improved separately; but if the socket race happens even without auto-forwarding, it is a different issue.

From the log you shared, I found a few things.

  1. Peers 1 and 2 were not responding for a long time, hence this server (peer 3) attempted to reconnect, but too frequently, and sometimes there were multiple reconnect requests at the same time. This is not expected behavior and needs to be fixed, as it is the direct root cause of the "socket race". I will upload a fix for it soon.
  2. The log below looks very strange. The number 34359738370 comes from the get_quorum_for_election() function, which simply returns <the number of peers> / 2 + 1. There is no way for this value to overflow, as you don't use a custom quorum size (correct me if I missed something). Can you please check how this happened? Perhaps there was memory corruption.

     2021.01.29 23:12:46.794646 [ 66 ] {} <Error> RaftInstance: total 1 nodes (including this node) responded for pre-vote (term 2, live 0, dead 1), at least 34359738370 nodes should respond. failure count 139917149601793

  3. The current log misses information about rpc_client, so it was hard to track the dependency of each event. I pushed this commit for debugging:
     greensky00@ac7b729
     Could you please share the log of the socket race with this commit, and without the auto-forwarding option? It would be much appreciated.

Thanks.

alesapin commented on July 22, 2024

> upper layers (asio_service, peer, raft_server) are not considering the case of sending messages before getting the previous response

Thanks, I suspected this invariant but couldn't clearly figure it out from the code. Maybe at least a comment in asio_service would be helpful. I'll close my PR and think about a fix for auto_forwarding_; it should be quite easy to avoid the race.

> I will upload the fix for it soon.

Ok, waiting for a fix :)

> There is nowhere to make this value overflow as you don't use custom quorum size (correct me if I missed something). Can you please check how this happened? Probably there was a memory corruption.

I've pasted all non-default settings in #169 (comment). I'll try to run the same test with AddressSanitizer; if there is memory corruption, it will catch it for sure.

> so could you please share the log of socket race with this commit, and without the auto-forwarding option

Ok, I'll remove my strand commits and cherry-pick your logging improvements.

alesapin commented on July 22, 2024

Reproduced the error with your patch.
Fatal trace: https://gist.github.com/alesapin/54d374fa7a8fc46b20051f460115a5e3
Log: https://gist.github.com/alesapin/07539dd06680a1ecd51e28032e479d88
The suspicious message is still here:

2021.01.31 16:01:52.662809 [ 50 ] {} <Error> RaftInstance: total 1 nodes (including this node) responded for pre-vote (term 2, live 0, dead 1), at least 34359738370 nodes should respond. failure count 140376711102638              

The error also reproduces with AddressSanitizer in a release build, but no memory corruption is detected:

2021.01.31 17:20:24.034188 [ 40 ] {} <Fatal> RaftInstance: socket 0x616000396c98 is already in use, race happened on connection to node1:44444                           
2021.01.31 17:20:24.034805 [ 41 ] {} <Fatal> RaftInstance: socket 0x616000396c98 is already idle, race happened on connection to node1:44444                             

Also, the suspicious message disappeared (became normal):

2021.01.31 17:20:21.851833 [ 38 ] {} <Error> RaftInstance: total 1 nodes (including this node) responded for pre-vote (term 2, live 0, dead 1), at least 2 nodes should respond. failure count 151                                    
2021.01.31 17:20:22.131467 [ 65 ] {} <Error> RaftInstance: total 1 nodes (including this node) responded for pre-vote (term 2, live 0, dead 1), at least 2 nodes should respond. failure count 152                                    
2021.01.31 17:20:22.452196 [ 57 ] {} <Error> RaftInstance: total 1 nodes (including this node) responded for pre-vote (term 2, live 0, dead 1), at least 2 nodes should respond. failure count 153                                    
2021.01.31 17:20:22.760966 [ 43 ] {} <Error> RaftInstance: total 1 nodes (including this node) responded for pre-vote (term 2, live 0, dead 1), at least 2 nodes should respond. failure count 154                                    
2021.01.31 17:20:23.031649 [ 46 ] {} <Error> RaftInstance: total 1 nodes (including this node) responded for pre-vote (term 2, live 0, dead 1), at least 2 nodes should respond. failure count 155                                    

It looks like UB or uninitialized memory usage, so I think I have to try the other sanitizers (memory and undefined).

alesapin commented on July 22, 2024

The strange big numbers in debug mode were caused by wrong format specifiers in the logging format string: #172.

greensky00 commented on July 22, 2024

Thanks @alesapin. You're right: the hex representation of 34359738370 is indeed 00 00 00 08 00 00 00 02. I will merge your PR.

I pushed PR #173 to fix the potential race that I found. Hope this resolves your case. Thanks for bringing this issue.

greensky00 commented on July 22, 2024

+) Is there any specific reason why you set the heartbeat period to 100 ms? It might be too short for a typical network environment, which can make the system unstable. We set it to around 1 second in our real deployments.

alesapin commented on July 22, 2024

Testing #173; at first glance, it looks like it helps.

alesapin commented on July 22, 2024

Thanks! I'll close this issue.
