rjmcguire / ra

This project forked from rabbitmq/ra

A Raft implementation for Erlang and Elixir that strives to be efficient and make it easier to use multiple Raft clusters in a single system.

License: Other

Ra: a Raft Implementation for Erlang and Elixir

What is This

Ra is a Raft implementation by Team RabbitMQ. It is not tied to RabbitMQ and can be used in any Erlang or Elixir project. It is, however, heavily inspired by and geared towards RabbitMQ needs.

Design Goals

  • Low footprint: use as few resources as possible and avoid process tree explosion
  • Able to run thousands of ra clusters within an Erlang node
  • Provide adequate performance for use as a basis for a distributed data service

Project Maturity

This library is under heavy development. Breaking changes to the API and the on-disk storage format are likely.

Status

The following Raft features are implemented:

  • Leader election
  • Log replication
  • Cluster membership changes: one node (member) at a time
  • Log compaction (with limitations and RabbitMQ-specific extensions)
  • Snapshot installation

There are two storage backends:

  • ra_log_memory: an in-memory log backend useful for testing
  • ra_log_file: a disk-based backend

Use Cases

This library is primarily developed as the foundation for the replication layer for mirrored queues in a future version of RabbitMQ. The design it aims to replace uses a variant of Chain Based Replication, which has two major shortcomings:

  • Replication algorithm is linear
  • Failure recovery procedure requires expensive topology changes

Internals

Identity

Identity is a somewhat convoluted topic in ra consisting of multiple parts used for different aspects of the system.

  1. Cluster Id

    Each ra cluster is assigned an id that needs to be unique within the Erlang cluster it is running on.

  2. Node Id

    The node id is a tuple of a ra node's locally registered name and the Erlang node it resides on. This is the primary id used for membership in ra and needs to be a persistent, addressable id (one that can be used to send messages). A pid() would not work as it isn't persisted across process restarts. Although each ra node within a ra cluster is typically started on a separate Erlang node, ra supports nodes within the same cluster sharing Erlang nodes. Hence we cannot simply re-use the cluster id as the registered name.

  3. UID

    Each ra node also needs an id that is unique to the local Erlang node and unique across incarnations of ra clusters with the same cluster id. This is used for interactions with the write ahead log, segment and snapshot writer processes, which use the ra_directory to look up the current pid() for a given id. It is also, critically, used to provide an identity for the node on disk.

    This handles the case where a ra cluster is deleted and then re-created shortly afterwards with the same cluster id and node ids. In that case the write ahead log may still contain entries from the previous incarnation, so entries written by the previous incarnation could be mixed with entries written by the current one, which is unacceptable. Providing a unique local identity is therefore critical for correct operation. We suggest combining the locally registered name with a timestamp of some sort.

Example config:

Config = #{cluster_id => <<"ra-cluster-1">>,
           node_id => {ra_cluster_1, ra1@snowman},
           uid => <<"ra_cluster_1_1519808362841">>
           ...},
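
The suggestion of combining the registered name with a timestamp could be sketched roughly as follows. The module and function names (`ra_uid_sketch`, `make_uid/1,2`, `make_config/3`) are hypothetical and not part of Ra's API, and the millisecond timestamp is only one possible choice of suffix.

```erlang
-module(ra_uid_sketch).
-export([make_uid/1, make_uid/2, make_config/3]).

%% Build a UID from the locally registered name and the current
%% system time in milliseconds (an assumed choice of timestamp).
-spec make_uid(atom()) -> binary().
make_uid(Name) ->
    make_uid(Name, erlang:system_time(millisecond)).

%% Same as make_uid/1 but with an explicit timestamp, which keeps
%% the function deterministic and easy to test.
-spec make_uid(atom(), integer()) -> binary().
make_uid(Name, Timestamp) when is_atom(Name), is_integer(Timestamp) ->
    iolist_to_binary([atom_to_binary(Name, utf8), $_,
                      integer_to_binary(Timestamp)]).

%% Assemble the three identity fields into a map. Only the keys
%% shown in the example config are set here.
make_config(ClusterId, {Name, _ErlangNode} = NodeId, Timestamp) ->
    #{cluster_id => ClusterId,
      node_id => NodeId,
      uid => make_uid(Name, Timestamp)}.
```

Calling `ra_uid_sketch:make_uid(ra_cluster_1, 1519808362841)` yields the uid used in the example config. A timestamp suffix alone only distinguishes incarnations that are created at different times, so a real system may want additional entropy.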

Raft Extensions and Deviations

ra aims to fit well within the Erlang environment as well as provide good adaptive throughput. Therefore it deviates from the original Raft protocol in certain areas.

Replication

Log replication in Ra is mostly asynchronous, so no actual RPC calls are used. New entries are pipelined and followers reply after receiving a written event, which has a natural batching effect on the replies. Followers include 3 non-standard fields in their replies:

  • last_index, last_term - the index and term of the last fully written entry. The leader uses this to calculate the new commit_index.

  • next_index - this is the next index the follower expects. For successful replies it is not set, or is ignored by the leader. It is set for unsuccessful replies and is used by the leader to update its next_index for the follower and resend entries from this point.
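
As an illustration of how a leader might derive a commit index from the reported last_index values, here is a minimal sketch of the standard Raft majority rule. It is not taken from Ra's source: the module and function names are hypothetical, and the additional Raft requirement that the committed entry belongs to the leader's current term is left out for brevity.

```erlang
-module(ra_commit_sketch).
-export([commit_index/2]).

%% LeaderLastIndex: last fully written index on the leader itself.
%% FollowerLastIndexes: the last_index most recently reported by
%% each follower. Returns the highest index known to be written on
%% a majority of the cluster members.
-spec commit_index(non_neg_integer(), [non_neg_integer()]) ->
          non_neg_integer().
commit_index(LeaderLastIndex, FollowerLastIndexes) ->
    %% Sort all indexes in descending order.
    All = [LeaderLastIndex | FollowerLastIndexes],
    Sorted = lists:reverse(lists:sort(All)),
    %% With N members a majority is (N div 2) + 1, so the element
    %% at that position is written on at least a majority.
    Majority = length(Sorted) div 2 + 1,
    lists:nth(Majority, Sorted).
```

For a three-member cluster where the leader has written up to index 10 and the followers report 7 and 9, `ra_commit_sketch:commit_index(10, [7, 9])` returns 9.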

To avoid completely overwhelming a slow follower, the leader will only pipeline if the distance between the next_index and match_index is below some limit (currently set to 1000). Followers that are considered stale (i.e. the match_index is less than next_index - 1) are still sent an append entries message periodically, although less frequently than in a traditional Raft system. This is done to ensure follower liveness. In an idle system where all followers are in sync no further messages will be sent.
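
The two conditions in the paragraph above can be written down as simple predicates. This is a sketch only: the module, function and macro names are hypothetical, and whether Ra treats the limit as inclusive or exclusive is an assumption (a strict "below" is used here).

```erlang
-module(ra_pipeline_sketch).
-export([can_pipeline/2, can_pipeline/3, is_stale/2]).

%% Pipelining limit mentioned in the text.
-define(DEFAULT_LIMIT, 1000).

%% True if the leader may keep pipelining entries to a follower.
can_pipeline(NextIndex, MatchIndex) ->
    can_pipeline(NextIndex, MatchIndex, ?DEFAULT_LIMIT).

%% The distance between next_index and match_index must be below
%% the limit (assumed to be a strict comparison).
can_pipeline(NextIndex, MatchIndex, Limit) ->
    NextIndex - MatchIndex < Limit.

%% A follower is stale if its match_index is less than
%% next_index - 1, i.e. it has not confirmed everything sent so far.
is_stale(NextIndex, MatchIndex) ->
    MatchIndex < NextIndex - 1.
```

A follower for which `is_stale/2` returns false and which has nothing new to receive would get no further messages in an idle system.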

Failure detection

Ra doesn't use Raft's standard approach where the leader periodically sends append entries messages to enforce its leadership. Ra is designed to support potentially thousands of ra clusters within an Erlang cluster, and having all of them do their own failure detection has proven unstable and also uses the network unnecessarily. Because leaders do not send append entries unless there is an update to send, followers don't (typically) set election timers.

This leaves the question of how failures are detected and elections are triggered.

Ra tries to make as much use of native Erlang failure detection facilities as it can. The crash scenario is trivial to handle using Erlang monitors. Followers monitor leaders. If a follower receives a 'DOWN' message, as it would after a crash or a sustained network partition in which distributed Erlang detects that a node isn't replying, it sets a short, randomised election timeout.
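
The monitor-based approach can be sketched with plain Erlang processes. This is an illustration rather than Ra's implementation: the module name, the `Notify` argument, the message shapes and the 100 to 249 ms timeout range are all assumptions made for the example.

```erlang
-module(ra_monitor_sketch).
-export([start_follower/2]).

%% Spawn a follower that monitors LeaderPid. When the leader goes
%% down, the follower waits for a short, randomised timeout and
%% then sends {election_timeout, FollowerPid} to Notify.
start_follower(LeaderPid, Notify) ->
    spawn(fun() ->
                  Ref = erlang:monitor(process, LeaderPid),
                  follower_loop(Ref, Notify)
          end).

follower_loop(Ref, Notify) ->
    receive
        {'DOWN', Ref, process, _Pid, _Reason} ->
            %% Randomise the timeout (100..249 ms, an arbitrary
            %% range) so that followers are unlikely to time out
            %% at the same moment.
            Timeout = 100 + rand:uniform(150) - 1,
            erlang:send_after(Timeout, self(), election_timeout),
            follower_loop(Ref, Notify);
        election_timeout ->
            %% A real follower would become a candidate here.
            Notify ! {election_timeout, self()},
            ok
    end.
```

Because `erlang:monitor/2` delivers a 'DOWN' message immediately if the monitored process is already dead, the follower also reacts when it starts after the leader has crashed.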

This only works well in crash-stop scenarios. For network partition scenarios it would rely on distributed Erlang to detect the partition, which could easily take up to a minute and is too slow.

The ra application provides a node failure detector that monitors Erlang nodes. When it suspects an Erlang node is down it notifies the local ra nodes. If the suspected Erlang node hosts the currently known ra leader, the follower will start an election.
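
The follower's reaction to such a notification amounts to comparing the suspected Erlang node with the node part of the known leader's node id. A minimal sketch, with hypothetical module, function and return-value names:

```erlang
-module(ra_node_down_sketch).
-export([handle_node_suspected/2]).

%% SuspectNode: the Erlang node reported as suspected down.
%% LeaderId: the node id of the currently known leader, i.e.
%% {RegisteredName, ErlangNode}, or undefined if none is known.
%% The first clause matches only when the suspected node is the
%% Erlang node that hosts the leader.
handle_node_suspected(SuspectNode, {_Name, SuspectNode}) ->
    start_election;
handle_node_suspected(_SuspectNode, _LeaderId) ->
    no_action.
```

A notification about any other Erlang node, or one received while no leader is known, is ignored in this sketch.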

Copyright and License

(c) 2017, Pivotal Software Inc.

Dual-licensed under the ASL2 and MPL1.1. See LICENSE for details.
