Comments (11)
Hi @sheepgrass
If log store compaction has happened at least once on the leader, all new servers joining the cluster will start by receiving the latest snapshot. You can manually call the `compact` function to trigger compaction:
NuRaft/include/libnuraft/log_store.hxx
Line 167 in 03de7d9
from nuraft.
NuRaft/src/handle_append_entries.cxx
Lines 484 to 489 in 03de7d9
I just encountered an issue related to this. For a newly joining node, the local log store is empty, so `req.get_last_log_idx()` is always larger than `log_store_->next_slot()` (which is 1, since the log store starts at 0). It will then always append at index 1, as below:
NuRaft/src/handle_append_entries.cxx
Lines 669 to 673 in 03de7d9
2020-10-20|03:36:56.257|0000711B|00007F92347F8700|DEBUG|[INIT] log_idx: 28, count: 0, log_store_->next_slot(): 1, req.log_entries().size(): 1
2020-10-20|03:36:56.257|0000711B|00007F92347F8700|DEBUG|[after SKIP] log_idx: 28, count: 0
2020-10-20|03:36:56.257|0000711B|00007F92347F8700|DEBUG|[after OVWR] log_idx: 28, count: 0
2020-10-20|03:36:56.257|0000711B|00007F92347F8700|TRACE|append at 1
2020-10-20|03:36:56.257|0000711B|00007F92347F8700|TRACE|virtual nuraft::ulong raft::RaftLogStore::append(nuraft::ptr<nuraft::log_entry>&): next_sequence_number=1
It seems that for the `log_idx > log_store_->next_slot()` case, we should pass the `log_idx` explicitly in the line below (i.e. `store_log_entry(entry, log_idx)`):
NuRaft/src/handle_append_entries.cxx
Line 673 in 03de7d9
Alternatively, changing the overwrite condition from `log_idx < log_store_->next_slot()` to `log_idx != log_store_->next_slot()` seems to solve the issue:
NuRaft/src/handle_append_entries.cxx
Lines 638 to 644 in 03de7d9
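To illustrate the failure mode being described, here is a toy model of the append path (a sketch only, not the actual NuRaft code; `toy_log_store` and `handle_entry` are hypothetical names): an entry is overwritten in place only when `log_idx < next_slot()`, and otherwise blindly appended at `next_slot()`, which ignores the requested index when the store is empty.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical in-memory stand-in for a log store whose append()
// always writes at next_slot().
struct toy_log_store {
    uint64_t start = 1;                 // index of the first entry
    std::vector<int> entries;           // payloads only, for illustration

    uint64_t next_slot() const { return start + entries.size(); }
    // Appends at next_slot() and returns the index the entry landed at.
    uint64_t append(int e) { entries.push_back(e); return next_slot() - 1; }
    void write_at(uint64_t idx, int e) { entries[idx - start] = e; }
};

// Simplified sketch of the decision discussed above: overwrite when
// log_idx < next_slot(), otherwise append -- which disregards log_idx.
uint64_t handle_entry(toy_log_store& s, uint64_t log_idx, int e) {
    if (log_idx < s.next_slot()) {      // overwrite path
        s.write_at(log_idx, e);
        return log_idx;
    }
    return s.append(e);                 // append path: log_idx is ignored
}
```

With an empty store (`next_slot() == 1`), an entry meant for index 28 lands at index 1, matching the `append at 1` trace above.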
from nuraft.
Hello @sheepgrass
If `req.get_last_log_idx()` is greater than `log_store_->next_slot()`, `log_term` should be 0, so that `log_okay` will be `false`.
NuRaft/src/handle_append_entries.cxx
Lines 500 to 503 in 03de7d9
Could you please elaborate on your situation (how `log_okay` became `true`)? If you can share your logs, that would be great.
I'm attaching the log from when an empty server joins a cluster where the leader's log has already been compacted:
2020-10-20T14:05:55.997_451-07:00 [58ed] [DEBG] Receive a join_cluster_request message from 1 with LastLogIndex=10, LastLogTerm=0, EntriesLength=1, CommitIndex=10 and Term=1 [raft_server.cxx:583, process_req()]
2020-10-20T14:05:55.997_472-07:00 [58ed] [INFO] got join cluster req from leader 1 [handle_join_leave.cxx:163, handle_join_cluster_req()]
...
2020-10-20T14:05:55.997_720-07:00 [58ed] [DEBG] Response back a join_cluster_response message to 1 with Accepted=1, Term=1, NextIndex=1 [raft_server.cxx:653, process_req()]
...
2020-10-20T14:05:56.099_215-07:00 [0de7] [DEBG] Receive a append_entries_request message from 1 with LastLogIndex=11, LastLogTerm=1, EntriesLength=0, CommitIndex=11 and Term=1 [raft_server.cxx:583, process_req()]
2020-10-20T14:05:56.099_225-07:00 [0de7] [TRAC] from peer 1, req type: 3, req term: 1, req l idx: 11 (0), req c idx: 11, my term: 1, my role: 1 [handle_append_entries.cxx:465, handle_append_entries()]
2020-10-20T14:05:56.099_239-07:00 [0de7] [INFO] [LOG XX] req log idx: 11, req log term: 1, my last log idx: 1, my log (11) term: 0 [handle_append_entries.cxx:528, handle_append_entries()]
2020-10-20T14:05:56.099_242-07:00 [0de7] [INFO] deny, req term 1, my term 1, req log idx 11, my log idx 1 [handle_append_entries.cxx:535, handle_append_entries()]
2020-10-20T14:05:56.099_248-07:00 [0de7] [DEBG] Response back a append_entries_response message to 1 with Accepted=0, Term=1, NextIndex=2 [raft_server.cxx:653, process_req()]
2020-10-20T14:05:56.099_682-07:00 [58ed] [DEBG] Receive a install_snapshot_request message from 1 with LastLogIndex=10, LastLogTerm=1, EntriesLength=1, CommitIndex=11 and Term=1 [raft_server.cxx:583, process_req()]
2020-10-20T14:05:56.099_727-07:00 [58ed] [INFO] save snapshot (idx 10, term 1) offset 0x0, first obj last obj [handle_snapshot_sync.cxx:412, handle_snapshot_sync_req()]
2020-10-20T14:05:56.099_759-07:00 [58ed] [INFO] sucessfully receive a snapshot (idx 10 term 1) from leader [handle_snapshot_sync.cxx:473, handle_snapshot_sync_req()]
2020-10-20T14:05:56.099_780-07:00 [58ed] [INFO] successfully compact the log store, will now ask the statemachine to apply the snapshot [handle_snapshot_sync.cxx:481, handle_snapshot_sync_req()]
...
2020-10-20T14:05:56.099_892-07:00 [58ed] [INFO] snapshot idx 10 term 1 is successfully applied, log start 11 last idx 10 [handle_snapshot_sync.cxx:517, handle_snapshot_sync_req()]
2020-10-20T14:05:56.099_907-07:00 [58ed] [DEBG] Response back a install_snapshot_response message to 1 with Accepted=1, Term=1, NextIndex=1 [raft_server.cxx:653, process_req()]
2020-10-20T14:05:56.100_459-07:00 [1a5c] [DEBG] Receive a append_entries_request message from 1 with LastLogIndex=10, LastLogTerm=1, EntriesLength=1, CommitIndex=11 and Term=1 [raft_server.cxx:583, process_req()]
2020-10-20T14:05:56.100_480-07:00 [1a5c] [TRAC] from peer 1, req type: 3, req term: 1, req l idx: 10 (1), req c idx: 11, my term: 1, my role: 1 [handle_append_entries.cxx:465, handle_append_entries()]
...
2020-10-20T14:05:56.100_504-07:00 [1a5c] [TRAC] [LOG OK] req log idx: 10, req log term: 1, my last log idx: 10, my log (10) term: 1 [handle_append_entries.cxx:528, handle_append_entries()]
2020-10-20T14:05:56.100_512-07:00 [1a5c] [DEBG] [INIT] log_idx: 11, count: 0, log_store_->next_slot(): 11, req.log_entries().size(): 1 [handle_append_entries.cxx:568, handle_append_entries()]
2020-10-20T14:05:56.100_518-07:00 [1a5c] [DEBG] [after SKIP] log_idx: 11, count: 0 [handle_append_entries.cxx:582, handle_append_entries()]
2020-10-20T14:05:56.100_524-07:00 [1a5c] [DEBG] [after OVWR] log_idx: 11, count: 0 [handle_append_entries.cxx:662, handle_append_entries()]
2020-10-20T14:05:56.100_531-07:00 [1a5c] [TRAC] append at 11 [handle_append_entries.cxx:671, handle_append_entries()]
from nuraft.
I think it's because I used the `skip_initial_election_timeout_` option.
from nuraft.
Hi @greensky00
I have tested again to see under which conditions `log_okay` becomes `true` in my case. The conditions below are fulfilled:
NuRaft/src/handle_append_entries.cxx
Lines 508 to 510 in 7501788
My logs are as below:
2020-10-21|08:58:33.699|00006B34|00007F0A85FF3700|INFO |snapshot idx 27 term 1 is successfully applied, log start 0 last idx 0
2020-10-21|08:58:33.699|00006B34|00007F0A85FF3700|DEBUG|Response back a install_snapshot_response message to 1 with Accepted=1, Term=1, NextIndex=426709208533041152
2020-10-21|08:58:33.700|00006B34|00007F0A85FF3700|TRACE|nuraft::cb_func::ReturnCode raft::RaftNode::OnRaftCallback(nuraft::cb_func::Type, nuraft::cb_func::Param*): type=1
2020-10-21|08:58:33.700|00006B34|00007F0A85FF3700|DEBUG|Receive a append_entries_request message from 1 with LastLogIndex=27, LastLogTerm=1, EntriesLength=1, CommitIndex=28 and Term=1
2020-10-21|08:58:33.700|00006B34|00007F0A85FF3700|TRACE|from peer 1, req type: 3, req term: 1, req l idx: 27 (1), req c idx: 28, my term: 1, my role: 1
2020-10-21|08:58:33.700|00006B34|00007F0A85FF3700|TRACE|(update) new target priority: 1
2020-10-21|08:58:33.701|00006B34|00007F0A85FF3700|TRACE|[LOG OK] req log idx: 27, req log term: 1, my last log idx: 0, my log (27) term: 0
2020-10-21|08:58:33.701|00006B34|00007F0A85FF3700|TRACE|nuraft::cb_func::ReturnCode raft::RaftNode::OnRaftCallback(nuraft::cb_func::Type, nuraft::cb_func::Param*): type=14
2020-10-21|08:58:33.701|00006B34|00007F0A85FF3700|DEBUG|[INIT] log_idx: 28, count: 0, log_store_->next_slot(): 1, req.log_entries().size(): 1
2020-10-21|08:58:33.701|00006B34|00007F0A85FF3700|DEBUG|[after SKIP] log_idx: 28, count: 0
FYI, I have changed the overwrite condition to `log_idx != log_store_->next_slot()` myself, and everything seems to be OK, though more tests may be needed.
from nuraft.
We want to avoid changing the existing logic unless `log_idx < log_store_->next_slot()` turns out to be the root cause of this problem.
From your log, it seems to me that the first log append was denied (`log_okay == false`), the snapshot installation then succeeded, but the problem happened in the following log append.
2020-10-21|08:58:33.699|00006B34|00007F0A85FF3700|INFO |snapshot idx 27 term 1 is successfully applied, log start 0 last idx 0
2020-10-21|08:58:33.699|00006B34|00007F0A85FF3700|DEBUG|Response back a install_snapshot_response message to 1 with Accepted=1, Term=1, NextIndex=426709208533041152
These logs indicate that your log store is incorrect. After applying the snapshot at index 27, the log store's start index and last index should be adjusted to 28 and 27, respectively. All subsequent Raft operations rely on the log store having been adjusted correctly.
I guess you missed implementing the feature below in your `compact` function:
NuRaft/include/libnuraft/log_store.hxx
Lines 161 to 162 in 7501788
Once `log_store_->compact(X)` is invoked on a log store whose last log index is smaller than `X`, subsequent `start_index()` and `next_slot()` calls should return `X+1`, even though log `X` doesn't exist.
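A minimal in-memory sketch of this contract (illustrative only; `mem_log_store` is a hypothetical class, not part of NuRaft): the start index is tracked explicitly instead of being derived from the first stored entry, so it survives compaction of all entries.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

struct mem_log_store {
    uint64_t start_idx = 1;             // persisted separately in a real store
    std::map<uint64_t, int> logs;       // index -> payload (illustration only)

    uint64_t start_index() const { return start_idx; }
    uint64_t next_slot() const {
        return logs.empty() ? start_idx : logs.rbegin()->first + 1;
    }
    void append(int e) { logs[next_slot()] = e; }

    // compact(X): forget everything up to X. Even if X is beyond the last
    // stored log, start_index()/next_slot() must afterwards report X + 1.
    bool compact(uint64_t x) {
        logs.erase(logs.begin(), logs.upper_bound(x));
        if (x + 1 > start_idx) start_idx = x + 1;
        return true;
    }
};
```

A real implementation would persist `start_idx` durably; the point is that the start index must not be inferred from whichever entry happens to be stored first.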
Could you please check it on your side? Thanks.
from nuraft.
Hi @greensky00
You're right. I missed the implementation of the said feature.
Since my log store implementation has no extra storage for the start log index, the index of the first stored log is taken as the start log index. After compaction there are no logs left at all, so the start log index is taken as 0. Adding separate storage for the start log index would be complicated.
Currently there seems to be no clear documentation on the relationship between the snapshot and the log store. My first intuition was that the snapshot and the log store are separate mechanisms, so that a snapshot could be taken without affecting the log store.
But after reading your comments and taking a detailed look at the code for the snapshot flow, it appears NuRaft must perform log store compaction after taking a snapshot. For me this is not ideal: if something goes wrong with the most recent snapshot (like file corruption), recovering the state becomes hard because the logs have already been compacted away.
It would be nice if a snapshot could be taken freely without log compaction (I know that log compaction can currently be triggered manually).
FYI, I have made the following changes to make NuRaft work for me (i.e. send a snapshot without needing a manual `compact()` call, and use overwrite to solve my start-log-index issue): changes
from nuraft.
Hi @sheepgrass
There are two kinds of snapshot operations, and log compaction is automatic in both cases:
1. Taking a snapshot: a local operation that happens when the number of logs reaches a certain threshold.
2. Receiving and installing a snapshot: this happens for a newly joining server or any follower lagging behind; the leader sends a snapshot and the follower receives and installs it.
For 1) taking a snapshot, log compaction is not mandatory: it can be delayed by setting `reserved_log_items_` to a non-zero value, so you don't need to worry about logs being discarded right after snapshot creation:
Lines 508 to 511 in cd78d97
At eBay, we keep logs covering up to a few days while a snapshot is taken almost every minute, precisely for the case you mentioned (crash and corruption).
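The delaying behavior above can be configured through `raft_params` (a sketch; it assumes the `snapshot_distance_` and `reserved_log_items_` fields of `nuraft::raft_params`, and the values are illustrative, not recommendations):

```cpp
#include "libnuraft/nuraft.hxx"

// Illustrative tuning: take a snapshot every 1000 committed logs, but keep
// the latest 100000 logs in the log store, so that compaction lags far
// behind snapshot creation (useful for recovering from a corrupted
// snapshot, as discussed above).
nuraft::raft_params make_params() {
    nuraft::raft_params params;
    params.snapshot_distance_  = 1000;
    params.reserved_log_items_ = 100000;
    return params;
}
```

With these (hypothetical) values, a snapshot at index N only allows logs up to roughly N - 100000 to be discarded.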
However, for 2) snapshot installation, the receiver's log store MUST be compacted. The reason is that logs prior to the received snapshot index may conflict with what the snapshot contains, which can cause many critical problems, including permanent data divergence. Please refer to the paper, the explanation of Figure 13, on page 12.
https://raft.github.io/raft.pdf
Here is a simple example. Let's say X(Y) denotes log index X whose term is Y, and S1 is the initial leader, holding logs 1 to 6:
S1: 1(1) 2(1) 3(1) 4(1) 5(1) 6(1)
Due to a network partition, S1 fails to replicate logs 4--6; in the meantime, S2 becomes the new leader and appends different data at logs 4--6.
S1: 1(1) 2(1) 3(1) 4(1) 5(1) 6(1)
S2: 1(1) 2(1) 3(1) 4(2) 5(2) 6(2)
Let's assume S2 creates a snapshot at index 6 and discards logs up to index 4.
S1: 1(1) 2(1) 3(1) 4(1) 5(1) 6(1)
S2: 5(2) 6(2)
After the network partition is resolved, S1 receives the snapshot (at index 6) from S2. If S1 does not compact its log store after the snapshot installation, the previous invalid logs 4--6 will remain there. That means the outcome of executing logs 1 through 6 will not match the snapshot data that S1 holds.
Now let's say S1 becomes the leader again, and consider another replica S3, which hasn't received log 6 from the previous leader S2.
S1: 1(1) 2(1) 3(1) 4(1) 5(1) 6(1)
S2: 5(2) 6(2)
S3: 1(1) 2(1) 3(1) 4(2) 5(2)
Now, S1 will send the (invalid) logs 4--6 to S3, and S3 will execute them. Hence the data between S3 and S1/S2 will diverge permanently. Moreover, this incident will be silent; nobody will notice it at the time.
S1: 1(1) 2(1) 3(1) 4(1) 5(1) 6(1)
S2: 5(2) 6(2)
S3: 1(1) 2(1) 3(1) 4(1) 5(1) 6(1)
Please note that this is just one example, and there can be many more cases that we cannot easily imagine.
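The danger described above can be sketched with a toy model (hypothetical helpers, not NuRaft code): an entry at the snapshot index whose term disagrees with the snapshot's last term is evidence of a conflicting history, and the correct receiver behavior is to discard every entry up to the snapshot index on install.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct entry { uint64_t idx; uint64_t term; };

// Stale entries at or below the snapshot index may disagree with the
// history the snapshot was built from; keeping them leaves conflicting
// state behind.
bool conflicts_with_snapshot(const std::vector<entry>& log,
                             uint64_t snap_idx, uint64_t snap_term) {
    for (const entry& e : log)
        if (e.idx == snap_idx && e.term != snap_term) return true;
    return false;
}

// Correct receiver behavior: discard every entry up to the snapshot index.
std::vector<entry> install_snapshot(std::vector<entry> log,
                                    uint64_t snap_idx) {
    std::vector<entry> kept;
    for (const entry& e : log)
        if (e.idx > snap_idx) kept.push_back(e);
    return kept;
}
```

In the S1/S2 example: S1's stale log has 6(1) while S2's snapshot at index 6 was built from 6(2), so retaining S1's entries keeps a conflicting history around.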
from nuraft.
Hi @greensky00
Thanks for your detailed reply~
Please help me check whether my understanding is correct:
For 1) taking a snapshot, if `reserved_log_items_` is set several times larger than `snapshot_distance_`, we can be sure that at least some valid snapshots plus logs will remain after compaction.
For 2) snapshot installation, the snapshot receiver must compact its log store at or before the snapshot index to remove invalid logs.
I can also think of a case where the leader goes down and reboots as a follower: after snapshot installation its logs are compacted, so no node in the cluster holds the full log history anymore, and from that point on we have to rely on snapshots.
from nuraft.
Hi @sheepgrass
Regarding 1) and 2) -- yes correct.
I'm a bit careful about discussing further details (as I'm not aware of your system), but is maintaining the full log important for you? The log cannot grow forever, so it has to be compacted someday, even without the snapshot process.
from nuraft.
Hi @greensky00
Actually, you're correct. The full log is not important if snapshots (or other state-recovery methods) can be relied upon. I am still developing my system, which is near its final stage, and I am trying to understand NuRaft's behavior to avoid issues in the future. Since I am now testing my implementation of the snapshot and log store, these issues looked strange to me, so I raised the question here. Thanks for your help in clarifying the logic~
from nuraft.