willemt / raft

raft's Introduction

C implementation of the Raft consensus protocol, BSD licensed.

See raft.h for full documentation.

See ticketd for a real-life example of this library in use.

Networking is out of scope for this project. The implementor will need to do all the plumbing. The library doesn't assume a network layer with ordering or duplicate detection. This means you could use UDP for transmission.

There are no dependencies; however, https://github.com/willemt/linked-list-queue is required for testing.

Building

make tests

Quality Assurance

We use the following methods to ensure that the library is safe:

virtraft2

This cluster simulator checks the following:

  • Log Matching (servers must have matching logs)
  • State Machine Safety (applied entries have the same ID)
  • Election Safety (only one valid leader per term)
  • Current Index Validity (does the current index have an existing entry?)
  • Entry ID Monotonicity (entries aren't appended out of order)
  • Committed entry popping (committed entries are not popped from the log)
  • Log Accuracy (does the server's log mirror an independent log?)
  • Deadlock detection (does the cluster continuously make progress?)

Chaos generated by virtraft2:

  • Random bi-directional partitions between nodes
  • Message dropping
  • Message duplication
  • Membership change injection
  • Random compactions

Run the simulator using:

make test_virtraft

virtraft2 supersedes virtraft.

Single file amalgamation

The source has been amalgamated into a single raft.h header file. Use clib to download the source into your project's deps folder, i.e.:

brew install clib
clib install willemt/raft_amalgamation

The file is stored in the deps folder as shown below:

deps/raft/raft.h

How to integrate with this library

See ticketd for an example of how to integrate with this library.

If you don't have access to coroutines, it's easiest to use two separate threads: one for handling Raft peer traffic, and another for handling client traffic.

Be aware that this library is not thread safe. You will need to ensure that calls into the library are serialized, e.g. by holding a mutex around every call.
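For example, a minimal sketch of serializing access with a libuv mutex (the handler below and its plumbing are illustrative, not part of the library's API):

#include <uv.h>
#include "raft.h"

/* guards every call into the raft library; set up once with uv_mutex_init */
static uv_mutex_t raft_lock;

/* called from the peer thread; the client thread takes the same lock */
static void handle_peer_appendentries(raft_server_t* raft, raft_node_t* node,
                                      msg_appendentries_t* ae)
{
    msg_appendentries_response_t response;

    uv_mutex_lock(&raft_lock);
    raft_recv_appendentries(raft, node, ae, &response);
    uv_mutex_unlock(&raft_lock);

    /* ... serialize and send the response back to the peer ... */
}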

Initializing the Raft server

Instantiate a new Raft server using raft_new.

void* raft = raft_new();

We tell the Raft server what the cluster configuration is by using the raft_add_node function. For example, if we have 5 servers [1] in our cluster, we call raft_add_node 5 [2] times (a full setup sketch follows below).

raft_add_node(raft, connection_user_data, node_id, peer_is_self);

Where:

  • connection_user_data is a pointer to user data.
  • node_id is the unique integer ID of the node. Peers use this to identify themselves. This SHOULD be a random integer.
  • peer_is_self is a boolean indicating that this call describes the current server.

[1]AKA "Raft peer"
[2]We also have to include the Raft server itself in the raft_add_node calls. When we call raft_add_node for the Raft server itself, we set peer_is_self to 1.
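For example, a 5-server cluster could be configured as below (the ID table and my_node_id are illustrative; in practice the IDs SHOULD be random integers):

void* raft = raft_new();

int node_ids[] = { 1, 2, 3, 4, 5 };
int my_node_id = 1; /* the ID of this server */
int i;

for (i = 0; i < 5; i++)
    /* peer_is_self is 1 only for the entry describing this server */
    raft_add_node(raft, NULL, node_ids[i], node_ids[i] == my_node_id);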

Calling raft_periodic() periodically

We need to call raft_periodic at regular intervals, passing the number of milliseconds elapsed since the last call.

raft_periodic(raft, 1000);

Example using a libuv timer:

static void __periodic(uv_timer_t* handle)
{
    server_t* sv = handle->data; /* set below via periodic_req->data */
    raft_periodic(sv->raft, PERIOD_MSEC);
}

uv_timer_t *periodic_req;
periodic_req = malloc(sizeof(uv_timer_t));
periodic_req->data = sv;
uv_timer_init(&peer_loop, periodic_req);
uv_timer_start(periodic_req, __periodic, 0, 1000);

Receiving the entry (ie. client sends entry to Raft cluster)

Our Raft application receives log entries from the client.

When this happens we need to:

  • Redirect the client to the Raft cluster leader (if necessary)
  • Append the entry to our log
  • Block until the log entry has been committed [3]
[3]When the log entry has been replicated across a majority of servers in the Raft cluster

Append the entry to our log

We call raft_recv_entry when we want to append the entry to the log.

msg_entry_response_t response;
int e = raft_recv_entry(raft, &entry, &response);

You should populate the entry struct with the log entry the client has sent. After the call completes, the response parameter is populated and can be passed to the raft_msg_entry_response_committed function to check whether the log entry has been committed.
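For example, a sketch of populating the entry from a client request (client_buf and client_len are hypothetical):

msg_entry_t entry = {};
entry.id = 1;                /* an ID used to track this entry */
entry.data.buf = client_buf; /* payload bytes received from the client */
entry.data.len = client_len;

msg_entry_response_t response;
int e = raft_recv_entry(raft, &entry, &response);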

Blocking until the log entry has been committed

When the server receives a log entry from the client, it has to block until the entry is committed. This is necessary as our Raft server has to replicate the log entry with the other peers of the Raft cluster.

The raft_recv_entry function does not block! This means you will need to implement the blocking functionality yourself.

The example below is from ticketd's client thread; it shows how we block on client requests. ticketd does the blocking by waiting on a condition variable, which is signalled by the peer thread. That separate thread is responsible for handling traffic between Raft peers.

msg_entry_response_t response;

e = raft_recv_entry(sv->raft, &entry, &response);
if (0 != e)
    return h2oh_respond_with_error(req, 500, "BAD");

/* block until the entry is committed */
int done = 0;
do {
    uv_cond_wait(&sv->appendentries_received, &sv->raft_lock);
    e = raft_msg_entry_response_committed(sv->raft, &response);
    switch (e)
    {
        case 0:
            /* not committed yet */
            break;
        case 1:
            done = 1;
            uv_mutex_unlock(&sv->raft_lock);
            break;
        case -1:
            uv_mutex_unlock(&sv->raft_lock);
            return h2oh_respond_with_error(req, 400, "TRY AGAIN");
    }
} while (!done);

Example from ticketd's peer thread. When an appendentries response is received from a Raft peer, we signal the client thread that an entry might have been committed.

e = raft_recv_appendentries_response(sv->raft, conn->node, &m.aer);
uv_cond_signal(&sv->appendentries_received);

Redirecting the client to the leader

When we receive a log entry from the client, it's possible that we are not the leader.

If we aren't currently the leader of the Raft cluster, we MUST send a redirect error message to the client, so that the client can connect directly to the leader in future connections. This makes future requests faster (ie. no redirects are required after the first one, until the leader changes).

We use the raft_get_current_leader_node function to look up the current leader.

Example of ticketd sending a 301 HTTP redirect response:

/* redirect to leader if needed */
raft_node_t* leader = raft_get_current_leader_node(sv->raft);
if (!leader)
{
    return h2oh_respond_with_error(req, 503, "Leader unavailable");
}
else if (raft_node_get_id(leader) != sv->node_id)
{
    /* send redirect */
    peer_connection_t* conn = raft_node_get_udata(leader);
    char leader_url[LEADER_URL_LEN];
    static h2o_generator_t generator = { NULL, NULL };
    static h2o_iovec_t body = { .base = "", .len = 0 };
    req->res.status = 301;
    req->res.reason = "Moved Permanently";
    h2o_start_response(req, &generator);
    snprintf(leader_url, LEADER_URL_LEN, "http://%s:%d/",
             inet_ntoa(conn->addr.sin_addr), conn->http_port);
    h2o_add_header(&req->pool,
                   &req->res.headers,
                   H2O_TOKEN_LOCATION,
                   leader_url,
                   strlen(leader_url));
    h2o_send(req, &body, 1, 1);
    return 0;
}

Function callbacks

You provide your callbacks to the Raft server using raft_set_callbacks.

The following callbacks MUST be implemented: send_requestvote, send_appendentries, applylog, persist_vote, persist_term, log_offer, and log_pop.

Example of function callbacks being set:

raft_cbs_t raft_callbacks = {
    .send_requestvote            = __send_requestvote,
    .send_appendentries          = __send_appendentries,
    .applylog                    = __applylog,
    .persist_vote                = __persist_vote,
    .persist_term                = __persist_term,
    .log_offer                   = __raft_logentry_offer,
    .log_poll                    = __raft_logentry_poll,
    .log_pop                     = __raft_logentry_pop,
    .log                         = __raft_log,
};

char* user_data = "test";

raft_set_callbacks(raft, &raft_callbacks, user_data);

send_requestvote()

For this callback we have to serialize a msg_requestvote_t struct, and then send it to the peer identified by node.

Example from ticketd showing how the callback is implemented:

static int __send_requestvote(
    raft_server_t* raft,
    void *udata,
    raft_node_t* node,
    msg_requestvote_t* m
    )
{
    peer_connection_t* conn = raft_node_get_udata(node);

    uv_buf_t bufs[1];
    char buf[RAFT_BUFLEN];
    msg_t msg = {
        .type              = MSG_REQUESTVOTE,
        .rv                = *m
    };
    __peer_msg_serialize(tpl_map("S(I$(IIII))", &msg), bufs, buf);
    int e = uv_try_write(conn->stream, bufs, 1);
    if (e < 0)
        uv_fatal(e);
    return 0;
}

send_appendentries()

For this callback we have to serialize a msg_appendentries_t struct, and then send it to the peer identified by node. This struct is more complicated to serialize because the m->entries array might be populated.

Example from ticketd showing how the callback is implemented:

static int __send_appendentries(
    raft_server_t* raft,
    void *user_data,
    raft_node_t* node,
    msg_appendentries_t* m
    )
{
    uv_buf_t bufs[3];
    int e;

    peer_connection_t* conn = raft_node_get_udata(node);

    char buf[RAFT_BUFLEN], *ptr = buf;
    msg_t msg = {
        .type              = MSG_APPENDENTRIES,
        .ae                = {
            .term          = m->term,
            .prev_log_idx  = m->prev_log_idx,
            .prev_log_term = m->prev_log_term,
            .leader_commit = m->leader_commit,
            .n_entries     = m->n_entries
        }
    };
    ptr += __peer_msg_serialize(tpl_map("S(I$(IIIII))", &msg), bufs, ptr);

    /* appendentries with payload */
    if (0 < m->n_entries)
    {
        tpl_bin tb = {
            .sz   = m->entries[0].data.len,
            .addr = m->entries[0].data.buf
        };

        /* list of entries */
        tpl_node *tn = tpl_map("IIIB",
            &m->entries[0].id,
            &m->entries[0].term,
            &m->entries[0].type,
            &tb);
        size_t sz;
        tpl_pack(tn, 0);
        tpl_dump(tn, TPL_GETSIZE, &sz);
        e = tpl_dump(tn, TPL_MEM | TPL_PREALLOCD, ptr, RAFT_BUFLEN);
        assert(0 == e);
        bufs[1].len = sz;
        bufs[1].base = ptr;
        e = uv_try_write(conn->stream, bufs, 2);
        if (e < 0)
            uv_fatal(e);

        tpl_free(tn);
    }
    else
    {
        /* keep alive appendentries only */
        e = uv_try_write(conn->stream, bufs, 1);
        if (e < 0)
            uv_fatal(e);
    }

    return 0;
}

applylog()

This callback is all that is needed to interface the FSM with the Raft library. Depending on your application, you might want to save the commit_idx to disk inside this callback.
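A minimal sketch, assuming the four-argument callback signature found in recent versions of raft.h (check your version); server_t, fsm_apply, and fsm_save_applied_idx are hypothetical application code:

static int __applylog(raft_server_t* raft, void *udata, raft_entry_t *ety,
                      raft_index_t entry_idx)
{
    server_t* sv = udata; /* application state passed to raft_set_callbacks */

    /* apply the entry's payload to the finite state machine */
    fsm_apply(sv, ety->data.buf, ety->data.len);

    /* optionally persist the applied index so a restart can skip
     * re-applying old entries */
    fsm_save_applied_idx(sv, entry_idx);

    return 0;
}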

persist_vote() & persist_term()

These callbacks simply save data to disk, so that when the Raft server is rebooted it starts from a valid state. This is necessary to ensure safety.
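A minimal sketch, assuming integer-valued callbacks (check raft.h for your version's exact signatures); the file paths are illustrative, and the fsync is what makes the state durable:

#include <stdio.h>
#include <unistd.h>

/* write an integer to a file and fsync it before returning */
static int __write_int_durably(const char *path, int value)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%d", value);
    fflush(f);
    fsync(fileno(f)); /* the value must hit disk before we return */
    return fclose(f);
}

static int __persist_vote(raft_server_t* raft, void *udata, const int voted_for)
{
    return __write_int_durably("voted_for", voted_for);
}

static int __persist_term(raft_server_t* raft, void *udata, const int term)
{
    return __write_int_durably("term", term);
}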

log_offer()

For this callback the user needs to add a log entry. The log MUST be synced to disk before this callback can return.

log_poll()

For this callback the user needs to remove the eldest log entry [4]. The log MUST be synced to disk before this callback can return.

This callback only needs to be implemented to support log compaction.

log_pop()

For this callback the user needs to remove the youngest log entry [5]. The log MUST be synced to disk before this callback can return.

[4]The log entry at the front of the log
[5]The log entry at the back of the log
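A minimal sketch of the three log callbacks above, assuming the (server, udata, entry, entry_idx) shape used by the ticketd-era API; the disk_log_* helpers are hypothetical and are expected to fsync before returning:

static int __raft_logentry_offer(raft_server_t* raft, void *udata,
                                 raft_entry_t *ety, int entry_idx)
{
    /* append the entry at entry_idx and sync it to disk */
    return disk_log_append(udata, entry_idx, ety);
}

static int __raft_logentry_poll(raft_server_t* raft, void *udata,
                                raft_entry_t *ety, int entry_idx)
{
    /* remove the oldest entry (the front of the log) and sync to disk */
    return disk_log_remove_front(udata, entry_idx);
}

static int __raft_logentry_pop(raft_server_t* raft, void *udata,
                               raft_entry_t *ety, int entry_idx)
{
    /* remove the newest entry (the back of the log) and sync to disk */
    return disk_log_remove_back(udata, entry_idx);
}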

Receiving traffic from peers

To receive Append Entries, Append Entries response, Request Vote, and Request Vote response messages, you need to deserialize the bytes into the message's corresponding struct.

The table below shows the structs that you need to deserialize-to or deserialize-from:

Message Type               Struct                          Function
Append Entries             msg_appendentries_t             raft_recv_appendentries
Append Entries response    msg_appendentries_response_t    raft_recv_appendentries_response
Request Vote               msg_requestvote_t               raft_recv_requestvote
Request Vote response      msg_requestvote_response_t      raft_recv_requestvote_response

Example of how we receive an Append Entries message, and reply to it:

msg_appendentries_t ae;
msg_appendentries_response_t response;
char buf_in[1024], buf_out[1024];
size_t len_in, len_out;

len_in = read(socket, buf_in, sizeof(buf_in));

deserialize_appendentries(buf_in, len_in, &ae);

e = raft_recv_appendentries(sv->raft, conn->node, &ae, &response);

serialize_appendentries_response(&response, buf_out, &len_out);

write(socket, buf_out, len_out);

Membership changes

Membership changes are driven through the Raft log. Adding a server to the cluster requires two log entries, while removing a server requires only one. Two entries are needed for adding because we must ensure that the new server's log is up to date before it can take part in voting.

It's highly recommended that a node added to the cluster is given a random node ID. This is especially important if the server was previously part of the cluster.

Adding a node

  1. Append the configuration change using raft_recv_entry. Make sure the entry has its type set to RAFT_LOGTYPE_ADD_NONVOTING_NODE.
  2. Once the node_has_sufficient_logs callback fires, append a configuration finalization log entry using raft_recv_entry. Make sure the entry has its type set to RAFT_LOGTYPE_ADD_NODE. A sketch of both steps follows below.
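A sketch of both steps (node_info and node_info_len stand in for a serialized address of the new node):

/* step 1: add the server as a non-voting node */
msg_entry_t entry = {};
entry.id = 1; /* illustrative entry ID */
entry.type = RAFT_LOGTYPE_ADD_NONVOTING_NODE;
entry.data.buf = node_info;
entry.data.len = node_info_len;

msg_entry_response_t response;
int e = raft_recv_entry(raft, &entry, &response);

/* step 2: once node_has_sufficient_logs fires, promote it to a voter */
entry.type = RAFT_LOGTYPE_ADD_NODE;
e = raft_recv_entry(raft, &entry, &response);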

Removing a node

  1. Append the configuration change using raft_recv_entry. Make sure the entry has its type set to RAFT_LOGTYPE_REMOVE_NODE.
  2. Once the RAFT_LOGTYPE_REMOVE_NODE configuration change entry is applied in the applylog callback, we shut down the server if it is the one being removed.

Membership callback

The notify_membership_event callback can be used to track nodes as they are added and removed as a result of configuration change log entries. A typical use case is to create and destroy connections to nodes, using connection information obtained from the configuration change log entry.
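A sketch of the idea; the callback signature and the RAFT_MEMBERSHIP_* values below are assumptions based on this description (check raft.h), and connection_open/connection_close are hypothetical application helpers:

static void __membership_event(raft_server_t* raft, void *udata,
                               raft_node_t *node, raft_entry_t *entry,
                               raft_membership_e type)
{
    if (RAFT_MEMBERSHIP_ADD == type)
        /* connection info travels inside the configuration change entry */
        raft_node_set_udata(node, connection_open(entry->data.buf,
                                                  entry->data.len));
    else if (RAFT_MEMBERSHIP_REMOVE == type)
        connection_close(raft_node_get_udata(node));
}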

Log Compaction

The log compaction method supported is "snapshotting for memory-based state machines" (Ongaro, 2014).

This library does not transfer snapshots itself: the send_snapshot callback only signals that a peer needs a snapshot, and the user must send it outside of this library. The implementor has to serialize and deserialize the snapshot.

The process works like this:

  1. Begin snapshotting with raft_begin_snapshot.
  2. Save the current membership details to the snapshot.
  3. Save the finite state machine to the snapshot.
  4. End snapshotting with raft_end_snapshot.
  5. When the send_snapshot callback fires, the user must propagate the snapshot to the peer.
  6. Once the peer has the snapshot, they call raft_begin_load_snapshot.
  7. The peer calls raft_add_node to add nodes as per the snapshot's membership info.
  8. The peer calls raft_node_set_voting on nodes as per the snapshot's membership info.
  9. The peer calls raft_node_set_active on nodes as per the snapshot's membership info.
  10. Finally, the peer calls raft_end_load_snapshot.

When a node receives a snapshot, it can itself reuse that snapshot for other nodes.
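A minimal sketch of steps 1-4 on the snapshotting node (the save_* helpers are hypothetical, and raft_begin_snapshot's argument list may differ between library versions):

int e = raft_begin_snapshot(raft);
if (0 == e)
{
    save_membership_to_snapshot(raft); /* step 2: node IDs, voting status */
    save_fsm_to_snapshot(sv);          /* step 3: serialize the state machine */
    e = raft_end_snapshot(raft);       /* step 4 */
}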

Roadmap

  • Batch friendly interfaces - we can speed up Raft by adding new APIs that support batching many log entries
  • Implementing linearizable semantics (Ongaro, 2014)
  • Processing read-only queries more efficiently (Ongaro, 2014)

References

Ongaro, D. (2014). Consensus: bridging theory and practice. Retrieved from https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf


raft's Issues

Unimplemented log_clear callback leads to unsynchronized logs

You mentioned in the README some of the callbacks that need to be implemented. Sadly, in our case we must also implement the log_clear callback for the case where we load a snapshot with log_load_from_snapshot, because the whole log should be cleared (it is empty afterwards) before applying the snapshot. Initially I thought this would be done via the log_pop and/or log_poll callbacks, but it isn't: only log_clear is implicitly called, via log_clear_entries, inside log_load_from_snapshot.

In our case we still had one entry in our own log while the raft-internal log was empty after loading a snapshot.

Either this is a bug in the documentation (because this callback is needed when working with callbacks) or the implementation of log_load_from_snapshot should handle the case where the callback is not provided.

In addition, the description of the callback is not very clear to me:

    /** Callback called for every existing log entry when clearing the log.
     * If memory was malloc'd in log_offer and the entry doesn't get a chance
     * to go through log_poll or log_pop, this is the last chance to free it.
     */

I do not see the requirement to handle single items individually. In my implementation I simply clear my whole log at once: I am using a std::queue with an element class smart enough to manage its internal memory, so I simply call log_.clear();.

Documents are old

There are some new breaking changes in the current version of raft, which is very different from the one used in ticketd. For example, log_get_node_id changed the way users should interact with the library.

Could you update the documentation, or update the raft code in ticketd?

leader election without log involved?

Hi Willem,

Your implementation of Raft is pretty cool for solving the replicated finite state machine problem. If I only focus on leader election, no log needs to be maintained (e.g., no raft_recv_entry() calls from clients), which means we only depend on the term parameter for the election. Could I still use the library just for leader election?

Thanks,
Shifeng

When handling appendentries_response, where do the non-raft fields come from?

Hi,

I've got my implementation to the point where the cluster is now stable, in that a leader can be elected and maintain its authority over the cluster. So far so good :)

However, now that I'm implementing raft_recv_appendentries_response I've hit something of a snag. There are two fields (highlighted in the header file as being non-Raft) that I don't know how to provide:

/* Non-Raft fields follow: */
/* Having the following fields allows us to do less book keeping in
 * regards to full fledged RPC */
/* This is the highest log IDX we've received and appended to our log */
int current_idx;
/* The first idx that we received within the appendentries message */
int first_idx;

What should I be putting in current_idx and first_idx and where should I get it from?

Inconsistent use of "connected" field

The struct field raft_server_private_t.connected is set in only one place, to RAFT_NODE_STATUS_CONNECTED, but the field is compared elsewhere against the constant RAFT_NODE_STATUS_DISCONNECTING.
Moreover, the connection enum has 2 more states.
So I suppose there is some missing functionality that would set the RAFT_NODE_STATUS_DISCONNECTING and RAFT_NODE_STATUS_CONNECTING values.

Batchize API

raft_recv_entry -> raft_recv_entries
raft_append_entry -> raft_append_entries
applylog -> apply_logs
log_offer -> offer_logs
log_poll -> poll_logs
log_pop -> pop_logs

Meaning of entry_idx in log_offer

I was assuming that the log_offer callback always appends to the log. However, the callback has an entry_idx parameter. Does this mean that log_offer might require overwriting an old entry in the log, or writing beyond the current head of the log?

On a related note: are the newest/oldest log entries for log_pop/log_poll determined by entry_idx, or the order in which log_offer is called for entries?

How to add SOVERSION to libraft.so?

raft is a dependency of our app; when installing, the major version of libraft is missing:

Error: Package: daos-server-0.9.0-2.2819.gcdecd456.el7.x86_64 (/daos-server-0.9.0-2.2819.gcdecd456.el7.x86_64)
Requires: libraft.so()(64bit)

The version of libraft.so is missing from this message; it should be libraft.so.$n.

How can we add the raft SOVERSION, probably in the Makefile?

.PHONY: shared
shared: $(OBJECTS)
	$(CC) $(OBJECTS) $(LDFLAGS) $(CFLAGS) -fPIC $(SHAREDFLAGS) -o $(BUILDDIR)/libraft.$(SHAREDEXT)

Thx.

Possible issues

I have done the following test:

  1. I applied one log entry to three nodes, and persisted it.
  2. I deleted the persisted data from one node (not the leader) and restarted raft (the other nodes stayed alive).
  3. The raft log is never applied on this node; it logs the following:
    raft: AE no log at prev_idx 1
    send: {"myid": 1, "message_type": "append_entries_response", "term": 1, "success": 0, "current_idx": 0, "first_idx": 2}

It then loops like this forever.

raft_send_appendentries_all logic

version 0.7.0
In the code below, when raft_send_appendentries returns non-zero, the loop returns early. So if node A is offline and B needs to send appendentries to A and C, the send callback for A returns -1 and the function returns at that point. Won't C then miss the heartbeat and start requesting votes?

int raft_send_appendentries_all(raft_server_t* me_)
{
    raft_server_private_t* me = (raft_server_private_t*)me_;
    int i, e;

    __log(me_, NULL, "%s:%d num node=%d", __FILE__, __LINE__, me->num_nodes);
    me->timeout_elapsed = 0;
    for (i = 0; i < me->num_nodes; i++)
    {
        if (me->node == me->nodes[i] || !raft_node_is_active(me->nodes[i]))
            continue;

        e = raft_send_appendentries(me_, me->nodes[i]);
        if (0 != e)
            return e;
    }

    return 0;
}

Once a leader always a leader...

Hi,

Okay, sorry for raising all these issues :)

This one I'm also a little unsure of. Basically I've got the nodes communicating, but I have not wired up ReceiveAppendEntries. So what I expect to see is that they hold a vote, one of them wins and becomes the leader, and then it starts sending AppendEntries. However, the other node can't see this, so it should time out and restart the voting process - which it does.

When the first node gets the first vote request, it votes no because it is already in that term. It then gets another vote request for the next term, which it correctly votes yes to.

But here's the weird thing - after it votes yes, it goes back to sending AppendEntries. This means that after a few terms all my nodes are sending them, as they all think they're the leader. Here's some of the output:

Node1 is requesting a vote on term 1
Response:  Term [1] Vote Granted [0]

Timer hit: Send append entries!

Node1 is requesting a vote on term 2
Response:  Term [1] Vote Granted [1]

Timer hit: Send append entries!

I would expect that after a leader grants a vote, it should step down as leader, right? I may actually have this totally wrong, and maybe this is not the behaviour I should expect.

If this is the right behaviour, then I guess we need to update:

int raft_recv_requestvote(raft_server_t* me_, int node, msg_requestvote_t* vr,

What do you think?

Cheers,

Pete

PS How do you enable the built in logging that I just noticed? :)

TestRaft_server_recv_requestvote_response_increase_votes_for_me problem

in test_server.c

void TestRaft_server_recv_requestvote_response_increase_votes_for_me(
    CuTest * tc
    )
{
    void *r = raft_new();
    raft_add_node(r, NULL, 1, 1);
    raft_add_node(r, NULL, 2, 0);
    raft_set_current_term(r, 1);
    CuAssertTrue(tc, 0 == raft_get_nvotes_for_me(r));

we vote for ourself here, so the vote count is 1

    raft_become_candidate(r);

    msg_requestvote_response_t rvr;
    memset(&rvr, 0, sizeof(msg_requestvote_response_t));
    rvr.term = 1;
    rvr.vote_granted = 1;

we receive a vote response, so the vote count increases to 2

    raft_recv_requestvote_response(r, raft_get_node(r, 2), &rvr);

maybe this should be 2:

    CuAssertTrue(tc, 1 == raft_get_nvotes_for_me(r));
}

I wonder if this is my misunderstanding or a mistake. Thanks!

Leader goes into infinite send_snapshot loop

When preparing to send AppendEntries, followers that lag behind last_snapshot_idx will have a snapshot sent instead of AE. However, this state will persist, as not sending AE means there's also no way to track their current index.

Currently the way to deal with this is for the application to manually call raft_node_set_next_idx() after the snapshot has been installed. I think ideally the library should deal with it, although the current behaviour may be good enough if documented.

null pointer dereference in raft_periodic

A call to raft_periodic with only a non-self node added causes the library to crash here:

    if (1 == raft_get_num_voting_nodes(me_) &&
        raft_node_is_voting(raft_get_my_node((void*)me)) &&
        !raft_is_leader(me_))
        raft_become_leader(me_);

because raft_get_my_node returns NULL.
A null pointer check should be added.

Better Leader Commit?

in raft_recv_appendentries_response(raft_server_t* me_, raft_node_t* node, msg_appendentries_response_t* r)

int votes = 1; 
int point = r->current_idx;
int i;
for (i = 0; i < me->num_nodes; i++)
{
    if (me->node == me->nodes[i] || !raft_node_is_voting(me->nodes[i]))
        continue;

    int match_idx = raft_node_get_match_idx(me->nodes[i]);

    if (0 < match_idx)
    {
        raft_entry_t* ety = raft_get_entry_from_idx(me_, match_idx);
        if (ety->term == me->current_term && point <= match_idx)
            votes++;
    }
}
if (raft_get_num_voting_nodes(me_) / 2 < votes && raft_get_commit_idx(me_) < point)
    raft_set_commit_idx(me_, point);

I added an if before the loop in the code, for two reasons:

  • we first check whether commit index < point, to avoid an unnecessary loop
  • the Raft paper says that if we are going to advance the leader's commit index to point, we only need to make sure that log[point].term == current_term

int point = r->current_idx;
raft_entry_t* ety = raft_get_entry_from_idx(me_, point);
if (raft_get_commit_idx(me_) < point && ety->term == me->current_term)
{
    int votes = 1;
    int i;
    for (i = 0; i < me->num_nodes; i++)
    {
        if (me->node == me->nodes[i] || !raft_node_is_voting(me->nodes[i]))
            continue;
        int match_idx = raft_node_get_match_idx(me->nodes[i]);
        if (match_idx >= point)
            votes++;
    }
    if (raft_get_num_voting_nodes(me_) / 2 < votes)
        raft_set_commit_idx(me_, point);
}

2 unimplemented functions from raft.h

Hi, I was wondering whether these two functions are considered obsolete or, conversely, prepared for future use.

/**
 * @return this server's node ID */
int raft_get_my_id(raft_server_t* me) { return 0; }

/**
 * @return 1 if node is leader; 0 otherwise */
int raft_node_is_leader(raft_node_t* node) { return 0; }

They have no implementation as of now.

FSM log entries are applied lazily

This might be considered an issue or simply a design choice, but the fact that raft_recv_appendentries_response does not invoke raft_apply_all after updating the commit index, and rather relies on the next call of raft_periodic to do so, means that a client call to raft_msg_entry_response_committed might return 1 even though the FSM hasn't yet applied the log entry.

It feels this might cause confusion for users, or timing-related issues; see for example the scenario I described in issue #12 of ticketd.

Please let me know if my reading is correct. Thanks!

heartbeat rpc might delete uncommitted logs?

Consider this situation: the leader sends a few log entries (first RPC) to a follower. Before the response returns to the leader, a heartbeat RPC (second RPC) arrives at the follower. The follower will delete the first RPC's entries, because the second RPC's prev_log_idx is the same as in the first RPC. When the first RPC's response reaches the leader, it says the entries replicated successfully, but in fact they have been deleted. So I think maybe we should not delete entries when handling a heartbeat.

In raft_server.c, raft_recv_appendentries, lines 385-389:

if (ae->n_entries == 0 && 0 < ae->prev_log_idx && ae->prev_log_idx + 1 < raft_get_current_idx(me_))
{
    assert(me->commit_idx < ae->prev_log_idx + 1);
    raft_delete_entry_from_idx(me_, ae->prev_log_idx + 1);
}

log_get_from_idx not including entries that have wrapped around?

Suppose that the circular log buffer is of size 3 and has the following layout:

[<entry>, NULL, <entry>]
    ^              ^
   back          front

base index is 3 (so <entry> at front has index 3 and <entry> at back has index 4)

then it seems that a call to log_get_from_idx(3) instead of returning an array of 2 entries which includes entries 3 and 4, returns an array of just 1 entry which includes only entry 3.

This can be reproduced with the following (failing) test:

void TestLog_get_from_idx_with_wrapping(CuTest * tc)
{
    void* queue = llqueue_new();
    void *r = raft_new();
    raft_cbs_t funcs = {
        .log_pop = __log_pop,
        .log_get_node_id = __logentry_get_node_id
    };
    raft_set_callbacks(r, &funcs, queue);

    void *l;
    raft_entry_t e1, e2, e3, e4;

    memset(&e1, 0, sizeof(raft_entry_t));
    memset(&e2, 0, sizeof(raft_entry_t));
    memset(&e3, 0, sizeof(raft_entry_t));
    memset(&e4, 0, sizeof(raft_entry_t));

    e1.id = 1;
    e2.id = 2;
    e3.id = 3;
    e4.id = 4;

    l = log_alloc(3);
    log_set_callbacks(l, &funcs, r);

    raft_entry_t* ety;

    /* append append append */
    CuAssertIntEquals(tc, 0, log_append_entry(l, &e1));
    CuAssertIntEquals(tc, 0, log_append_entry(l, &e2));
    CuAssertIntEquals(tc, 0, log_append_entry(l, &e3));
    CuAssertIntEquals(tc, 3, log_count(l));

    /* poll poll */
    CuAssertIntEquals(tc, log_poll(l, (void*)&ety), 0);
    CuAssertIntEquals(tc, ety->id, 1);
    CuAssertIntEquals(tc, log_poll(l, (void*)&ety), 0);
    CuAssertIntEquals(tc, ety->id, 2);
    CuAssertIntEquals(tc, 1, log_count(l));

    /* append */
    CuAssertIntEquals(tc, 0, log_append_entry(l, &e4));
    CuAssertIntEquals(tc, 2, log_count(l));

    /* get from index 3 */
    int n_etys;
    raft_entry_t* etys;
    etys = log_get_from_idx(l, 3, &n_etys);

    CuAssertPtrNotNull(tc, etys);
    CuAssertIntEquals(tc, n_etys, 2);

    CuAssertIntEquals(tc, etys[0].id, 3);
    CuAssertIntEquals(tc, etys[1].id, 4);
}

It seems the bug is in log_get_from_idx(), when it calculates the length of the array to be returned:

    if (i < me->back)
        logs_till_end_of_log = me->back - i;
    else
        logs_till_end_of_log = me->size - i;

In the else case the length should actually be me->size - i + me->back, to include the most recent entries that have wrapped around. However it's not clear how to return a pointer without performing some allocation, because of course pointers don't wrap around.

Does it make sense? Is this a bug?

wondering about the function raft_recv_entry

First, thanks for your excellent work. I wonder in what situations the function raft_recv_entry will be called? This function doesn't seem to be mentioned in the Raft paper.

when to release the raft_entry_data_t buf resource

Hi:
Thank you for this wonderful project!
I have a few questions about log entries that I don't quite understand.

typedef struct
{
    void *buf;

    unsigned int len;
} raft_entry_data_t;

raft.h uses raft_entry_data_t to store an entry. When should buf be released?
When are log_pop and log_poll called (when there are no conflicting entries)?
And if we use variable-length entries, how should we handle them?

TestRaft_leader_recv_entry_resets_election_timeout error

in test_server.c, the function TestRaft_leader_recv_entry_resets_election_timeout may contain an error.

void TestRaft_leader_recv_entry_resets_election_timeout(
    CuTest * tc)
{
    void *r = raft_new();
    raft_set_election_timeout(r, 1000);
    raft_set_state(r, RAFT_STATE_LEADER);

if we change 900 to 1, this test won't pass, because raft_periodic calls raft_send_appendentries_all when msec_since_last_period > request_timeout (default 200), which resets timeout_elapsed

    raft_periodic(r, 900);

    /* entry message */
    msg_entry_t mety = {};
    mety.id = 1;
    mety.data.buf = "entry";
    mety.data.len = strlen("entry");

add a new assert; this is true and the tests still pass

    CuAssertTrue(tc, 0 == raft_get_timeout_elapsed(r));

    /* receive entry */
    msg_entry_response_t cr;

this does not change timeout_elapsed at all

    raft_recv_entry(r, &mety, &cr);
    CuAssertTrue(tc, 0 == raft_get_timeout_elapsed(r));
}

Log compaction

Hi,

Our implementation is pretty complete now - just a few more hurdles!

We are running raft on embedded terminals that communicate locally over a wired/wireless network. Typically there would be 2-10 terminals participating in a cluster.

Each terminal applies entries from the log via applylog() to its local sqlite db, but there is at present no mechanism to ever truncate that log. As the system is used, the log will become unmanageably large - it is loaded on startup so we can provide log entries to followers if necessary, and loading it is already becoming slow.

In this application there is a concept of agreeing on a time period. This is a point at which all earlier log entries are redundant; they might be applied to the local db, but they are no longer relevant anyway. For example, entries from yesterday might be unimportant and candidates for removal.

My plan to compact the log is to append a log entry when this period is reached; this log entry will include the entry_idx. The other terminals commit this, and when they apply it they truncate all entries up to and including this entry.

Thoughts?

Neil

Unused raft_term_t term in log_load_from_snapshot

In raft_log.c

int log_load_from_snapshot(log_t *me_, raft_index_t idx, raft_term_t term)
{
    log_private_t* me = (log_private_t*)me_;

    log_clear_entries(me_);
    log_clear(me_);
    me->base = idx;

    return 0;
}

The term parameter is unused.

Could it safely be deleted from the source?

TestRaft_follower_recv_appendentries_delete_entries_if_conflict_with_new_entries problem

in test_server.c, the function TestRaft_follower_recv_appendentries_delete_entries_if_conflict_with_new_entries may have a problem.

void TestRaft_follower_recv_appendentries_delete_entries_if_conflict_with_new_entries(
    CuTest * tc)
{
    msg_appendentries_t ae;
    msg_appendentries_response_t aer;

    void *r = raft_new();
    raft_add_node(r, NULL, 1, 1);
    raft_add_node(r, NULL, 2, 0);

    raft_set_current_term(r, 1);

    char* strs[] = {"111", "222", "333"};

the term ID of the entries in the log is 1

    raft_entry_t *ety_appended = __entries_for_conflict_tests(tc, r, strs);

now pass an appendentries message that is newer

    msg_entry_t mety = {};

    memset(&ae, 0, sizeof(msg_appendentries_t));
    ae.term = 2;

prev_log_idx is 1 and prev_log_term is 1, so there is no conflict with the entries in the log; maybe ae.prev_log_term should be changed to 2

    ae.prev_log_idx = 1;
    ae.prev_log_term = 1;

this entry's term ID is 0, so it conflicts with the previous one

    /* include one entry */
    memset(&mety, 0, sizeof(msg_entry_t));
    char *str4 = "444";
    mety.data.buf = str4;
    mety.data.len = 3;
    mety.id = 4;
    ae.entries = &mety;
    ae.n_entries = 1;

    raft_recv_appendentries(r, raft_get_node(r, 2), &ae, &aer);
    CuAssertTrue(tc, 1 == aer.success);
    CuAssertTrue(tc, 2 == raft_get_log_count(r));
    CuAssertTrue(tc, NULL != (ety_appended = raft_get_entry_from_idx(r, 1)));
    CuAssertTrue(tc, !strncmp(ety_appended->data.buf, strs[0], 3));
}

I wonder if this is a mistake or my misunderstanding? Thanks!

Log callback called every time raft_periodic is called

Hi,

In my code I set election_timeout and request_timeout to, say, 5,000ms. I then call raft_periodic every 1,000ms. My own callback (the one that calls raft_periodic) prints PING! every time it is executed.

Previously, I would see five of these pings before the timeout was hit and raft started sending out vote requests. However, in the latest version, every time I call raft_periodic it also calls the log function.

So two questions:

  1. What does the log function actually do? It's not documented as far as I can tell...
  2. Why is it being called each time?

Thanks!

Cheers,

Pete

Dist folder with amalgamated source file per release

It would be nice if you created a dist folder per release with the amalgamated source file included, so that when upgrading between releases you simply need to add $(RaftDir)\dist as an include directory and off you go 🍻

why use raft_server_t* in all functions? instead just use raft_server_t

Hi Willemt,
I like your work very much. I have a question:

void raft_set_current_term(raft_server_t* me_, int term)
{
    raft_server_private_t* me = (void*)me_;
    me->current_term = term;
}

In functions like this, why not just use raft_server_t, since it is already void*?
I don't quite understand. Could you please explain?

Thank you

Empty AppendEntries requests are sent during non-idle periods

The current implementation unconditionally sends an empty AppendEntries request every time the heartbeat interval (request_timeout) expires.

However, sending an empty/heartbeat AppendEntries is only required if no actual, non-empty AppendEntries request was sent recently (where "recently" essentially means within request_timeout milliseconds).

Invalid term returned after calling raft_recv_requestvote

Hi,

When calling raft_recv_requestvote, the msg_requestvote_response_t that I get back has a garbled term. For example, I had a long-running app on term 180. I started up the test app, which began at term 1. The test app received a RequestVote message from the long-running app, but then replied with a No vote and term 428825 (or similar - I don't have it in front of me right now).

I haven't yet been able to track down where this number is coming from...

Cheers,

Pete

PS I sent you an email a week or so ago but didn't get a reply - not sure if it got stuck in a spam filter somewhere...

field is not used

The msg_appendentries_response_t.first_idx field is not used in any of the algorithm's actions, so it could safely be deleted from the sources.

RequestVote structure not fully initialised

Hi,
I'm new to the library, so please feel free to close this issue if it is not relevant. When sending a vote request, I noticed that the last log term was always a large random number. I was expecting it to be zero at this stage.

In this piece of code:

https://github.com/willemt/raft/blob/master/src/raft_server.c#L422

int raft_send_requestvote(raft_server_t* me_, int node)
{
    raft_server_private_t* me = (raft_server_private_t*)me_;
    msg_requestvote_t rv;

    __log(me_, "sending requestvote to: %d", node);

    rv.term = me->current_term;
    rv.last_log_idx = raft_get_current_idx(me_);
    if (me->cb.send_requestvote)
        me->cb.send_requestvote(me_, me->udata, node, &rv);
    return 0;
}

rv isn't zero'd and rv.last_log_term is never explicitly set.

Thanks in advance!

Cheers,

Pete

disk write and leader timeout

Synchronously writing the log to disk can block the leader's main loop and cause it to time out; the cluster then has to elect a new leader, which impacts availability. I think the leader should have a thread to flush the log to disk, with an index field indicating the point up to which the log has been flushed, and the commit index should be based on that. For followers, the log can be written synchronously.

--
update:

It could also be better to apply log entries asynchronously, so as not to block the main loop.

double null pointer checks with different error codes

The function raft_recv_appendentries_response contains two null pointer checks on node, with different error codes, and it's not clear which one applies:

if (!node)
    return -1;

...

/* stop processing, this is a node we don't have in our configuration */
if (!node)
    return 0;

Maybe the second check is redundant?

unused error codes

The functions raft_send_requestvote and raft_send_appendentries return 0 in every case, yet the callbacks they invoke return error codes.
Maybe the callbacks' return values should be propagated as the functions' return values?

Log interface decoupling

Here's a major refactoring step to consider.

Currently the server logic and the log implementation are quite coupled, to the point where the log implementation is in many cases responsible for triggering operations on the Raft side.

Decoupling would not just clean things up, it would also make it possible to replace the log implementation. For example, it may be useful to implement a disk-based log that does not require all entries to be in memory at all times and can fetch them from disk on demand.

TestRaft_server_recv_requestvote_response_increase_votes_for_me problem

in test_server.c, the function TestRaft_server_recv_requestvote_response_increase_votes_for_me may have an error.

void TestRaft_server_recv_requestvote_response_increase_votes_for_me(
    CuTest * tc
    )
{
    void *r = raft_new();
    raft_add_node(r, NULL, 1, 1);
    raft_add_node(r, NULL, 2, 0);

this sets the current term to 1

    raft_set_current_term(r, 1);
    CuAssertTrue(tc, 0 == raft_get_nvotes_for_me(r));

this action will increase the term to 2

    raft_become_candidate(r);
    msg_requestvote_response_t rvr;
    memset(&rvr, 0, sizeof(msg_requestvote_response_t));
    rvr.term = 1;
    rvr.vote_granted = 1;
    raft_recv_requestvote_response(r, raft_get_node(r, 2), &rvr);

this won't increase the vote count, because rvr.term = 1 < the current term

    CuAssertTrue(tc, 1 == raft_get_nvotes_for_me(r));
}

I wonder if this is a mistake or my misunderstanding?
