
Is it possible to get the last snapshot from the leader first, instead of appending entries from the beginning, when a node joins the cluster? (about nuraft, 11 comments, OPEN)

ebay commented on July 22, 2024
Is it possible to get the last snapshot from the leader first, instead of appending entries from the beginning, when a node joins the cluster?


Comments (11)

greensky00 commented on July 22, 2024

Hi @sheepgrass

If log store compaction has happened at least once on the leader, any new server joining the cluster will start by receiving the latest snapshot. You can manually call the compact function to trigger compaction:

virtual bool compact(ulong last_log_index) = 0;
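
For example, here is a minimal sketch of triggering compaction on the leader manually. my_log_store and sm stand for your own log_store and state_machine implementations; they are illustrative names, not NuRaft API:

// Hypothetical sketch: compact the leader's log store up to its last snapshot,
// so a newly joining server will receive the snapshot instead of the full log.
nuraft::ptr<nuraft::snapshot> snp = sm->last_snapshot();
if (snp) {
    my_log_store->compact(snp->get_last_log_idx());
}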


sheepgrass commented on July 22, 2024

// After a snapshot, req.get_last_log_idx() may be less than
// log_store_->next_slot() but equal to log_store_->next_slot() - 1.
//
// In this case, the log is okay if
// req.get_last_log_idx() == lastSnapshot.get_last_log_idx() &&
// req.get_last_log_term() == lastSnapshot.get_last_log_term()

@greensky00

I just encountered an issue related to this. For a newly joining node, the local log store is empty, so req.get_last_log_idx() would always be larger than log_store_->next_slot() (which is 1, as the log store starts at 0). It then always appends at index 1, as shown below:

// Append new log entries
while (cnt < req.log_entries().size()) {
p_tr("append at %zu\n", log_store_->next_slot());
ptr<log_entry> entry = req.log_entries().at( cnt++ );
ulong idx_for_entry = store_log_entry(entry);

2020-10-20|03:36:56.257|0000711B|00007F92347F8700|DEBUG|[INIT] log_idx: 28, count: 0, log_store_->next_slot(): 1, req.log_entries().size(): 1
2020-10-20|03:36:56.257|0000711B|00007F92347F8700|DEBUG|[after SKIP] log_idx: 28, count: 0
2020-10-20|03:36:56.257|0000711B|00007F92347F8700|DEBUG|[after OVWR] log_idx: 28, count: 0
2020-10-20|03:36:56.257|0000711B|00007F92347F8700|TRACE|append at 1
2020-10-20|03:36:56.257|0000711B|00007F92347F8700|TRACE|virtual nuraft::ulong raft::RaftLogStore::append(nuraft::ptrnuraft::log_entry&): next_sequence_number=1

It seems that for the log_idx > log_store_->next_slot() case, we should pass log_idx explicitly in the line below (i.e. store_log_entry(entry, log_idx)):

ulong idx_for_entry = store_log_entry(entry);

Alternatively, it seems that changing the overwrite condition from log_idx < log_store_->next_slot() to log_idx != log_store_->next_slot() would solve the issue:

// Dealing with overwrites (logs with different term).
while ( log_idx < log_store_->next_slot() &&
cnt < req.log_entries().size() )
{
ptr<log_entry> entry = req.log_entries().at(cnt);
p_in("overwrite at %zu\n", log_idx);
store_log_entry(entry, log_idx);


greensky00 commented on July 22, 2024

Hello @sheepgrass

If req.get_last_log_idx() is greater than or equal to log_store_->next_slot(), log_term stays 0, so log_okay should be false.

ulong log_term = 0;
if (req.get_last_log_idx() < log_store_->next_slot()) {
log_term = term_for_log( req.get_last_log_idx() );
}
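
For completeness, here is roughly how this check and the snapshot-based condition quoted later in this thread combine into log_okay. This is a paraphrased sketch assembled from the snippets in this thread, not verbatim upstream code; local_snp is the receiver's last local snapshot:

bool log_okay =
    req.get_last_log_idx() == 0 ||
    ( log_term &&
      req.get_last_log_term() == log_term ) ||
    ( local_snp &&
      local_snp->get_last_log_idx() == req.get_last_log_idx() &&
      local_snp->get_last_log_term() == req.get_last_log_term() );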

Could you please elaborate more on your situation (how log_okay became true)? If you can share your logs, that would be great.

I'm attaching the log from when an empty server joins a cluster whose leader's log is already compacted:

2020-10-20T14:05:55.997_451-07:00 [58ed] [DEBG] Receive a join_cluster_request message from 1 with LastLogIndex=10, LastLogTerm=0, EntriesLength=1, CommitIndex=10 and Term=1   [raft_server.cxx:583, process_req()]
2020-10-20T14:05:55.997_472-07:00 [58ed] [INFO] got join cluster req from leader 1  [handle_join_leave.cxx:163, handle_join_cluster_req()]
...
2020-10-20T14:05:55.997_720-07:00 [58ed] [DEBG] Response back a join_cluster_response message to 1 with Accepted=1, Term=1, NextIndex=1 [raft_server.cxx:653, process_req()]
...
2020-10-20T14:05:56.099_215-07:00 [0de7] [DEBG] Receive a append_entries_request message from 1 with LastLogIndex=11, LastLogTerm=1, EntriesLength=0, CommitIndex=11 and Term=1 [raft_server.cxx:583, process_req()]
2020-10-20T14:05:56.099_225-07:00 [0de7] [TRAC] from peer 1, req type: 3, req term: 1, req l idx: 11 (0), req c idx: 11, my term: 1, my role: 1 [handle_append_entries.cxx:465, handle_append_entries()]
2020-10-20T14:05:56.099_239-07:00 [0de7] [INFO] [LOG XX] req log idx: 11, req log term: 1, my last log idx: 1, my log (11) term: 0  [handle_append_entries.cxx:528, handle_append_entries()]
2020-10-20T14:05:56.099_242-07:00 [0de7] [INFO] deny, req term 1, my term 1, req log idx 11, my log idx 1   [handle_append_entries.cxx:535, handle_append_entries()]
2020-10-20T14:05:56.099_248-07:00 [0de7] [DEBG] Response back a append_entries_response message to 1 with Accepted=0, Term=1, NextIndex=2   [raft_server.cxx:653, process_req()]
2020-10-20T14:05:56.099_682-07:00 [58ed] [DEBG] Receive a install_snapshot_request message from 1 with LastLogIndex=10, LastLogTerm=1, EntriesLength=1, CommitIndex=11 and Term=1   [raft_server.cxx:583, process_req()]
2020-10-20T14:05:56.099_727-07:00 [58ed] [INFO] save snapshot (idx 10, term 1) offset 0x0, first obj last obj   [handle_snapshot_sync.cxx:412, handle_snapshot_sync_req()]
2020-10-20T14:05:56.099_759-07:00 [58ed] [INFO] sucessfully receive a snapshot (idx 10 term 1) from leader  [handle_snapshot_sync.cxx:473, handle_snapshot_sync_req()]
2020-10-20T14:05:56.099_780-07:00 [58ed] [INFO] successfully compact the log store, will now ask the statemachine to apply the snapshot [handle_snapshot_sync.cxx:481, handle_snapshot_sync_req()]
...
2020-10-20T14:05:56.099_892-07:00 [58ed] [INFO] snapshot idx 10 term 1 is successfully applied, log start 11 last idx 10    [handle_snapshot_sync.cxx:517, handle_snapshot_sync_req()]
2020-10-20T14:05:56.099_907-07:00 [58ed] [DEBG] Response back a install_snapshot_response message to 1 with Accepted=1, Term=1, NextIndex=1 [raft_server.cxx:653, process_req()]
2020-10-20T14:05:56.100_459-07:00 [1a5c] [DEBG] Receive a append_entries_request message from 1 with LastLogIndex=10, LastLogTerm=1, EntriesLength=1, CommitIndex=11 and Term=1 [raft_server.cxx:583, process_req()]
2020-10-20T14:05:56.100_480-07:00 [1a5c] [TRAC] from peer 1, req type: 3, req term: 1, req l idx: 10 (1), req c idx: 11, my term: 1, my role: 1 [handle_append_entries.cxx:465, handle_append_entries()]
...
2020-10-20T14:05:56.100_504-07:00 [1a5c] [TRAC] [LOG OK] req log idx: 10, req log term: 1, my last log idx: 10, my log (10) term: 1 [handle_append_entries.cxx:528, handle_append_entries()]
2020-10-20T14:05:56.100_512-07:00 [1a5c] [DEBG] [INIT] log_idx: 11, count: 0, log_store_->next_slot(): 11, req.log_entries().size(): 1  [handle_append_entries.cxx:568, handle_append_entries()]
2020-10-20T14:05:56.100_518-07:00 [1a5c] [DEBG] [after SKIP] log_idx: 11, count: 0  [handle_append_entries.cxx:582, handle_append_entries()]
2020-10-20T14:05:56.100_524-07:00 [1a5c] [DEBG] [after OVWR] log_idx: 11, count: 0  [handle_append_entries.cxx:662, handle_append_entries()]
2020-10-20T14:05:56.100_531-07:00 [1a5c] [TRAC] append at 11    [handle_append_entries.cxx:671, handle_append_entries()]


sheepgrass commented on July 22, 2024

I think it's because I used the skip_initial_election_timeout_ option


sheepgrass commented on July 22, 2024

Hi @greensky00

I have tested again to see under which conditions log_okay becomes true for me. The following condition is fulfilled:

( local_snp &&
local_snp->get_last_log_idx() == req.get_last_log_idx() &&
local_snp->get_last_log_term() == req.get_last_log_term() );

My logs are as below:

2020-10-21|08:58:33.699|00006B34|00007F0A85FF3700|INFO |snapshot idx 27 term 1 is successfully applied, log start 0 last idx 0
2020-10-21|08:58:33.699|00006B34|00007F0A85FF3700|DEBUG|Response back a install_snapshot_response message to 1 with Accepted=1, Term=1, NextIndex=426709208533041152
2020-10-21|08:58:33.700|00006B34|00007F0A85FF3700|TRACE|nuraft::cb_func::ReturnCode raft::RaftNode::OnRaftCallback(nuraft::cb_func::Type, nuraft::cb_func::Param*): type=1
2020-10-21|08:58:33.700|00006B34|00007F0A85FF3700|DEBUG|Receive a append_entries_request message from 1 with LastLogIndex=27, LastLogTerm=1, EntriesLength=1, CommitIndex=28 and Term=1
2020-10-21|08:58:33.700|00006B34|00007F0A85FF3700|TRACE|from peer 1, req type: 3, req term: 1, req l idx: 27 (1), req c idx: 28, my term: 1, my role: 1
2020-10-21|08:58:33.700|00006B34|00007F0A85FF3700|TRACE|(update) new target priority: 1
2020-10-21|08:58:33.701|00006B34|00007F0A85FF3700|TRACE|[LOG OK] req log idx: 27, req log term: 1, my last log idx: 0, my log (27) term: 0
2020-10-21|08:58:33.701|00006B34|00007F0A85FF3700|TRACE|nuraft::cb_func::ReturnCode raft::RaftNode::OnRaftCallback(nuraft::cb_func::Type, nuraft::cb_func::Param*): type=14
2020-10-21|08:58:33.701|00006B34|00007F0A85FF3700|DEBUG|[INIT] log_idx: 28, count: 0, log_store_->next_slot(): 1, req.log_entries().size(): 1
2020-10-21|08:58:33.701|00006B34|00007F0A85FF3700|DEBUG|[after SKIP] log_idx: 28, count: 0

FYI, I have changed the overwrite condition to log_idx != log_store_->next_slot() myself, and everything seems to be OK, though more tests may be needed.


greensky00 commented on July 22, 2024

We want to avoid changing existing logic unless log_idx < log_store_->next_slot() turns out to be the root cause of this problem.

From your log, it seems to me that the first log appending was denied (log_okay == false), and then installing a snapshot was successfully done, but the problem happened in the following log appending.

2020-10-21|08:58:33.699|00006B34|00007F0A85FF3700|INFO |snapshot idx 27 term 1 is successfully applied, log start 0 last idx 0
2020-10-21|08:58:33.699|00006B34|00007F0A85FF3700|DEBUG|Response back a install_snapshot_response message to 1 with Accepted=1, Term=1, NextIndex=426709208533041152

These logs indicate that your log store is incorrect. After applying the snapshot at index 27, the log store's start index and last index should be adjusted to 28 and 27, respectively. All following Raft operations rely on the assumption that the log store has been adjusted accordingly.

I guess you missed the implementation of the below feature in the compact function:

* If current maximum log index is smaller than given `last_log_index`,
* set start log index to `last_log_index + 1`.

Once log_store_->compact(X) is invoked on a log store whose last log index is smaller than X, subsequent start_index() and next_slot() calls should return X+1, even though log X doesn't exist.
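
A minimal sketch of a compact() implementation honoring this contract; entries_ (a map of index to log entry), start_idx_, and lock_ are members of a hypothetical in-memory log store, not part of NuRaft:

bool my_log_store::compact(ulong last_log_index) {
    std::lock_guard<std::mutex> guard(lock_);
    // Drop all entries up to and including last_log_index.
    for (ulong ii = start_idx_; ii <= last_log_index; ++ii) {
        entries_.erase(ii);
    }
    // Even if no log exists at last_log_index (e.g. the store is empty),
    // the start index must still advance past the compacted range, so that
    // start_index() and next_slot() return last_log_index + 1 afterwards.
    if (start_idx_ <= last_log_index) {
        start_idx_ = last_log_index + 1;
    }
    return true;
}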

Could you please check it on your side? Thanks.


sheepgrass commented on July 22, 2024

Hi @greensky00

You're right. I missed the implementation of the said feature.

Since my log store implementation has no extra storage for the start log index, the index of the first log is taken as the start log index. After compaction there are actually no logs left in the log store, so the start log index is taken as 0. It would be complicated to add an extra start log index.

Currently, it seems there is no clear documentation about the relationship between the snapshot and the log store. My first intuition was that the snapshot and the log store are separate mechanisms, where a snapshot can be taken without affecting the log store.

But after reading your comments and taking a detailed look at the code related to the snapshot flow, I see that NuRaft must perform log store compaction after taking a snapshot. For me this is not ideal: if something goes wrong with the most recent snapshot (like file corruption), it will be hard to recover the state because the logs have already been compacted away.

It would be nice if a snapshot could be taken freely without log compaction (and I know that log compaction can currently be triggered manually).

FYI, I have made the following changes to make NuRaft work for me (i.e. sending a snapshot without the need for a manual compact() call, and using overwrite to solve my start log index issue): changes


greensky00 commented on July 22, 2024

Hi @sheepgrass

There are two kinds of snapshot operations, and log compaction is automatic in both cases:

  1. Taking a snapshot: it is a local operation when the number of logs reaches a certain threshold.
  2. Receiving and installing a snapshot: this happens for a newly joining server or any follower lagging behind. The leader sends a snapshot and the follower receives and installs it.

For 1) taking a snapshot, log compaction does not have to happen immediately: it can be delayed by setting reserved_log_items_ to a non-zero value. You don't need to worry about logs being discarded right after the snapshot creation:

ulong compact_upto = new_snp->get_last_log_idx() -
(ulong)params->reserved_log_items_;
p_db("log_store_ compact upto %ld", compact_upto);
log_store_->compact(compact_upto);
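
For instance, a minimal configuration sketch (the numbers are purely illustrative):

nuraft::raft_params params;
// Take a snapshot roughly every 5000 appended log entries.
params.snapshot_distance_ = 5000;
// Keep the most recent 100000 log entries even after a snapshot is taken,
// so recent history is not discarded immediately.
params.reserved_log_items_ = 100000;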

At eBay, we keep logs covering up to a few days, while a snapshot is taken almost every minute, to handle the case you mentioned (crash and corruption).

However, for 2) snapshot installation, the log store of the receiver MUST BE compacted. The reason is that the logs prior to the received snapshot index may conflict with what the snapshot has, which results in many critical problems, including permanent data divergence. Please refer to the explanation of Figure 13 on page 12 of the paper:
https://raft.github.io/raft.pdf

Here is a simple example. Let's say

X(Y): log index X whose term is Y

and S1 is the initial leader who has logs 1 to 6:

S1: 1(1) 2(1) 3(1) 4(1) 5(1) 6(1)

Due to a network partition, S1 fails to replicate logs 4--6, and in the meantime, S2 becomes the new leader and appends different data to logs 4--6.

S1: 1(1) 2(1) 3(1) 4(1) 5(1) 6(1)
S2: 1(1) 2(1) 3(1) 4(2) 5(2) 6(2)

Let's assume S2 creates a snapshot at index 6 and discards logs up to index 4.

S1: 1(1) 2(1) 3(1) 4(1) 5(1) 6(1)
S2:                     5(2) 6(2)

After the network partition is resolved, S1 receives the snapshot (at index 6) from S2. If S1 does not compact its log store after the snapshot installation, the previous invalid logs 4--6 will remain there. That means the outcome of executing logs 1 through 6 will not match the snapshot data that S1 has.

Then let's say S1 becomes the leader again, and consider another replica S3, which hasn't received log 6 from the previous leader S2.

S1: 1(1) 2(1) 3(1) 4(1) 5(1) 6(1)
S2:                     5(2) 6(2)
S3: 1(1) 2(1) 3(1) 4(2) 5(2)

Now, S1 will send its (invalid) logs 4--6 to S3, and S3 will execute them. Hence S3's data will permanently diverge from that of S1 and S2. Moreover, this incident will be silent; nobody will notice it at the time.

S1: 1(1) 2(1) 3(1) 4(1) 5(1) 6(1)
S2:                     5(2) 6(2)
S3: 1(1) 2(1) 3(1) 4(1) 5(1) 6(1)

Please note that this is just one example, and there can be many more cases that we cannot easily imagine.


sheepgrass commented on July 22, 2024

Hi @greensky00

Thanks for your detailed reply~

Please help me check if my understanding is correct:

For 1) taking a snapshot, if reserved_log_items_ is set to a value several times larger than snapshot_distance_, we can be sure that at least some valid snapshots + logs remain after compaction.

For 2) snapshot installation, the snapshot receiver must compact its log store up to the snapshot index to remove invalid logs.

I can think of a case where the leader goes down and reboots as a follower: after snapshot installation its logs are compacted, so no node in the cluster contains the full log history any more, and we have to rely on snapshots from that point on.


greensky00 commented on July 22, 2024

Hi @sheepgrass

Regarding 1) and 2) -- yes correct.

I'm a bit hesitant to discuss further details (as I'm not aware of your system), but is maintaining a full log important for you? The log cannot grow forever, so it has to be compacted someday, even without the snapshot process.


sheepgrass commented on July 22, 2024

Hi @greensky00

Actually, you're correct. The full log is not important if snapshots (or other state recovery methods) can be used reliably. I am still developing my system, which is nearly at the final stage, and I am trying to understand the behavior of NuRaft to avoid issues in the future. Since I am now testing my implementation of the snapshot and log store, these issues looked strange to me, so I raised the question here. Thanks for your help in clarifying the logic~

