Comments (9)

phillipCouto commented on May 22, 2024

Is the raft protocol really necessary if you have one master and multiple slaves using AXFR transfer requests?

Or you thinking of having a pool of masters that can be used for HA?

bluejekyll commented on May 22, 2024

Yes, I haven't fully designed this yet, but my thought here is that instead of having a dedicated master, there would be a Raft group. The protocol itself I was thinking of implementing directly over DNS, utilizing either DNS over TLS or DNSCrypt for IXFR rather than AXFR. The IXFR would effectively be the log transfer as described in the Raft protocol. Leader election could be performed, and actually recorded in the SOA for the zone. Because of long TTLs, it would probably make sense to use some multiple of the TTL for the leader election timeout, with a significant weight placed on the existing leader to continue as the leader.
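To make the timing idea concrete, here is a minimal sketch (Rust, with made-up names; none of this is existing trust-dns code) of deriving an election timeout from a zone's SOA TTL, weighted so the incumbent leader gets a chance to re-assert itself before followers call an election:

```rust
use std::time::Duration;

// Hypothetical SOA-derived timing for leader election; field and type
// names are illustrative, not trust-dns types.
struct SoaTimers {
    minimum_ttl: u32, // SOA MINIMUM / negative-caching TTL, in seconds
}

impl SoaTimers {
    // Election timeout as a multiple of the TTL, weighted so that the
    // incumbent leader re-asserts itself well before followers would
    // start a new election.
    fn election_timeout(&self, is_current_leader: bool) -> Duration {
        let base = Duration::from_secs(u64::from(self.minimum_ttl));
        let multiplier: u32 = if is_current_leader { 2 } else { 4 }; // assumed weights
        base * multiplier
    }
}

fn main() {
    let soa = SoaTimers { minimum_ttl: 300 };
    println!("leader refresh window:     {:?}", soa.election_timeout(true));
    println!("follower election timeout: {:?}", soa.election_timeout(false));
}
```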

Most of this is in my head, so I haven't fully designed it. The goal would be to basically have a zero-downtime architecture, with low maintenance associated with keeping the nodes in sync.

Initially though, I might delay the full Raft implementation in favor of a simpler NOTIFY and IXFR style syncing with a dedicated master. I designed the journal specifically to make IXFRs efficient.
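As a rough illustration of that simpler mode, a secondary receiving a NOTIFY could decide between an incremental and a full transfer by comparing serials; the types and names below are illustrative only, not the trust-dns API:

```rust
// Sketch of the secondary's decision on receiving a NOTIFY.
#[derive(Debug)]
enum TransferRequest {
    // Incremental transfer starting from our current serial (RFC 1995 style).
    Ixfr { from_serial: u32 },
    // Full zone transfer, e.g. when we have no local copy yet.
    Axfr,
}

fn on_notify(local_serial: Option<u32>, primary_serial: u32) -> Option<TransferRequest> {
    match local_serial {
        // Already up to date (proper serial arithmetic is glossed over here).
        Some(serial) if serial >= primary_serial => None,
        // We have a journal to diff against: ask for the increments only.
        Some(serial) => Some(TransferRequest::Ixfr { from_serial: serial }),
        // No local copy yet: fall back to a full transfer.
        None => Some(TransferRequest::Axfr),
    }
}

fn main() {
    println!("{:?}", on_notify(Some(2024052201), 2024052203));
    println!("{:?}", on_notify(None, 2024052203));
}
```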

phillipCouto commented on May 22, 2024

Ah, ok, makes sense! Election would only happen if the existing leader is no longer responding, or at least that is my assumption from other projects describing their Raft implementations. That way network traffic would mostly be pings or streams of updates unless the leader goes offline for some defined period; I'm not sure the TTL is the right value for that. The leader election would really just affect who is the authority to update the zone, correct? So the TTL could be a good starting point, but once trust-dns is in production I can see administrators wanting to tweak the timeout that triggers an election, to get faster recovery windows without increasing their DNS query load by reducing the TTL on the SOA.

bluejekyll commented on May 22, 2024

Election would only happen if the existing leader is no longer responding.

I would probably design it this way, as well. It's been a little while since I read the paper.

The leader election would really just affect who is the authority to update the zone correct?

Yes, this is all about Updates. A big goal for this project for me is to reduce the overhead of managing dynamic zones.

I can see a lot of options for adjusting TTLs, leader election, etc., but one thing that would probably deviate from the Raft design is that I might allow slaves to respond to queries, at least until the record's TTL has elapsed since the last ping from the current master. Or something along those lines.
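A hedged sketch of that freshness check, with hypothetical names (this is just the idea, not an implementation): a follower keeps answering only while the record's TTL hasn't elapsed since the last heartbeat from the master.

```rust
use std::time::{Duration, Instant};

// Should a follower still answer authoritatively for a record while the
// leader has been silent? Only while the data is still within the record
// TTL measured from the last contact with the leader.
fn may_answer(last_leader_ping: Instant, record_ttl: Duration, now: Instant) -> bool {
    now.duration_since(last_leader_ping) <= record_ttl
}

fn main() {
    let last_ping = Instant::now() - Duration::from_secs(120);
    println!("{}", may_answer(last_ping, Duration::from_secs(300), Instant::now())); // true
    println!("{}", may_answer(last_ping, Duration::from_secs(60), Instant::now()));  // false
}
```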

wkornewald commented on May 22, 2024

Is it really necessary to have Raft consensus here or could this be eventually consistent (e.g. via some gossip protocol)? The latter makes deployments easier (you don’t have to explicitly choose 3/5/7/9 primary consensus servers and it automatically works for geo-distributed setups), you can have a reliable two-server cluster (for smaller setups), and it has very high availability (in the worst case, updates will take longer to propagate; the system always self-heals even in situations where Raft can’t form a majority). If writes really mustn’t be lost it could be sufficient to have writes confirmed by at least 1 or 2 other servers and by default that value could be dynamically adjusted based on the currently available cluster size (for easier deployment out-of-the-box with default config).
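As a sketch of the dynamically adjusted confirmation count suggested here (purely illustrative numbers and names, not an implemented policy):

```rust
// How many other servers must confirm a write before it is acknowledged,
// scaled to the currently reachable cluster size.
fn required_confirmations(reachable_nodes: usize) -> usize {
    match reachable_nodes {
        0 | 1 => 0, // single node: nothing to confirm against
        2 | 3 => 1, // small cluster: one extra copy is enough
        _ => 2,     // larger clusters: wait for two other replicas
    }
}

fn main() {
    for n in [1, 2, 4, 9] {
        println!("{n} nodes -> wait for {} confirmations", required_confirmations(n));
    }
}
```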

bluejekyll commented on May 22, 2024

Thanks for the feedback. I think gossip could be really neat as an option. I think we'd still want a leader to be elected so that we can guarantee serializability across the systems.

What's the form of discovery you’re imagining with gossip? multicast?

wkornewald commented on May 22, 2024

What kind of use-case do you have in mind where serializability would be CRITICAL and eventual consistency would be unacceptable (even with optional support to wait for all replicas to update within some timeout)? I can't really think of any cases where e.g. concurrent updates would conflict and couldn't be resolved in a (for humans) meaningful way.

Again, my priority is maximum availability (AP of CAP), minimum resource usage, simple deployment/operation and implementation simplicity/correctness.

With Raft/Paxos you lose the AP guarantee at least for writes (reads could in theory still return potentially outdated results within a network partition). Maybe gossip could additionally be used to propagate updates from the leader via followers in case the servers can still reach each other indirectly under a weird network partition (maybe unlikely, though).

Regarding deployment simplicity, some algorithm could automatically and maximally spread the leaders over the whole cluster and across failure domains, so you don't have to explicitly choose/maintain the set of 3/5/7/9 nodes for the leader group. The leader group size could automatically adapt to the cluster size (a 6-server cluster could pick a leader group of 5, and if you lose 2 servers the leader group could reduce to 3 nodes to sustain yet one more loss; even with 2 nodes, one of the nodes could get two votes to have a 50% chance of sustaining a server loss). The set of leaders could automatically be moved to other servers on failure (i.e. the leader group consists of nodes A, C, G; if G fails, A and C agree to replace G with E, so you can sustain one more loss). This is especially practical if all nodes are full replicas of each other (i.e. there is no sharding involved) and thus each follower already has all the information needed to quickly promote to leader.

For the best placement of leaders, each node would be annotated with at least the outermost failure domain (i.e. the data center or continent or just the rack), or even better, the failure domain could be hierarchical to allow good automatic spreading (i.e. across as many continents as possible, and within that across as many data centers as possible, etc.) and provide better information for the leader-group-size selection algorithm (if you have 1000 nodes in two data centers you might want to pick fewer nodes than in the 9-nodes-over-9-data-centers case).

Well, this might be going too far, and I think some purely eventually consistent solution would work well enough. However, with Raft you need to make the leader group explicit, allow adding/removing leader group members at runtime, and manually restore an operational state in case the majority is lost during a catastrophic failure. Eventual consistency makes that a lot simpler, since you don't need to care about this stuff and can just replace failed nodes while the cluster keeps running.
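A toy sketch of the adaptive leader-group-size idea above (Rust, illustrative policy only; failure-domain placement and the two-vote tie-breaker are not modeled):

```rust
// Largest odd leader group that fits the cluster, so the group keeps a
// majority after as many losses as possible; capped so very large
// clusters don't get an oversized consensus group.
fn leader_group_size(cluster_size: usize) -> usize {
    match cluster_size {
        0 => 0,
        1 => 1,
        // Two nodes: the group is both of them (tie-breaking vote not modeled).
        2 => 2,
        n => {
            let odd = if n % 2 == 0 { n - 1 } else { n };
            odd.min(9)
        }
    }
}

fn main() {
    for n in [2, 4, 6, 9, 1000] {
        println!("{n} servers -> leader group of {}", leader_group_size(n));
    }
}
```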

The discovery could optionally be done via multicast (Serf also supports that), but since not every network supports multicast just specifying a few initial node IPs/hostnames should be sufficient and then all nodes would share their list of known nodes belonging to that cluster and constantly keep each other up to date. Or were you thinking of something different?

bluejekyll commented on May 22, 2024

What kind of use-case do you have in mind where serializability would be CRITICAL and eventual consistency would be unacceptable (even with optional support to wait for all replicas to update within some timeout)?

That's a great question. This depends entirely on what's being stored. I think it would be ok for the things I'm working on to have devices order some set of DNS endpoints in a consistent manner for themselves. In all the future use cases I'm considering, there would only be one originating source of data for dynamic updates, so it really only needs to be serializable from the data source's perspective.

my priority is maximum availability (AP of CAP)

Ok, that's great to know. For AP, gossip would be great. I just think it's important that where conflicts are detected we have some strong notion of how to resolve them. If the data is indeed persistent (say, the IP address of a non-mobile device) then the original data would always effectively be the same, modulo the TTL or potentially other options. I think we want to consider mobile devices as well though, where movement may cause things like IPs to be updated and change more frequently. I hate to use time as the disambiguator, but maybe there's a way we could incorporate serial numbers into the update scheme as a form of Lamport Clock (or maybe even Vector Clocks in some way?) to help with that.
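For example, a minimal Lamport-style counter over the update serial might look like the sketch below; treating the zone serial this way is an assumption for illustration, not an existing trust-dns mechanism:

```rust
// Lamport-style counter for ordering zone updates; `node_id` is a
// hypothetical tie-breaker so two nodes bumping the same serial still
// order deterministically.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct UpdateClock {
    serial: u64,
    node_id: u16,
}

impl UpdateClock {
    // Local event: bump our own counter.
    fn tick(&mut self) {
        self.serial += 1;
    }

    // On receiving a remote update, move past whatever we have seen.
    fn observe(&mut self, remote: UpdateClock) {
        self.serial = self.serial.max(remote.serial) + 1;
    }
}

fn main() {
    let mut a = UpdateClock { serial: 10, node_id: 1 };
    let b = UpdateClock { serial: 42, node_id: 2 };
    a.observe(b);
    a.tick();
    assert!(a > b); // later updates win when resolving a conflict
    println!("{a:?}");
}
```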

With Raft/Paxos you lose the AP guarantee at least for writes

I won't respond to all your points here, but I do agree with you about all of the downsides and have thought about similar options for coming up with resolutions to them.

The leader group size could automatically adapt to the cluster size (a 6-server cluster could pick a leader group of 5, and if you lose 2 servers the leader group could reduce to 3 nodes to sustain yet one more loss; even with 2 nodes, one of the nodes could get two votes to have a 50% chance of sustaining a server loss).

Yes, this is something I've been considering as well. It's difficult, though. If there is a regional network partition, it's entirely possible you end up splitting the group into subsets that are each still big enough to accept updates, which loses consistency.

(i.e. there is no sharding involved)

I've actually thought about sharding as a potential solution to the AP issue. Assuming you're not trying to maintain a single global Zone, it might be possible to offer a higher degree of availability for Zones if those Zones are managed in a regional manner. Meaning, there are leaders chosen for each Zone, and ideally that Zone is only updated within the same region (I'm using region loosely here; it could be a datacenter, or something bigger or smaller). That way the Zone could still receive updates during a network partition, where the network the Zone is in is also the network providing the data (obviously other networks would not receive updates during the partition, but would recover elegantly once the partition is resolved).
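A tiny sketch of that per-Zone "home region" rule (names and structure are hypothetical): a node only accepts updates for a Zone whose home region matches its own.

```rust
use std::collections::HashMap;

// A zone is only writable by nodes in the region that owns it, so it
// stays updatable inside that region during a partition.
fn accepts_update(zone_home: &HashMap<&str, &str>, zone: &str, local_region: &str) -> bool {
    zone_home.get(zone).map_or(false, |home| *home == local_region)
}

fn main() {
    let mut zone_home = HashMap::new();
    zone_home.insert("us-east.example.com.", "us-east");
    zone_home.insert("eu-west.example.com.", "eu-west");

    // A us-east node accepts updates for its own zone, not for eu-west's.
    println!("{}", accepts_update(&zone_home, "us-east.example.com.", "us-east"));
    println!("{}", accepts_update(&zone_home, "eu-west.example.com.", "us-east"));
}
```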

The discovery could optionally be done via multicast (Serf also supports that), but since not every network supports multicast just specifying a few initial node IPs/hostnames should be sufficient and then all nodes would share their list of known nodes belonging to that cluster and constantly keep each other up to date. Or were you thinking of something different?

No, I wasn't thinking of anything different from that; that's why I asked. My guess is that we probably end up wanting some set of discovery nodes (a global zone, updated very infrequently) and then discover the rest through maybe a tree of zones describing each region of nodes, or something like that. I agree about the downside of multicast, but for LANs that might be a nice option (though we need to address some issues in the mDNS implementation of trust-dns).

I like the idea of starting with AP as the initial goal, and then potentially offering more advanced configuration options in the future that would make things more consistent.

wkornewald commented on May 22, 2024

What kind of use-case do you have in mind where serializability would be CRITICAL and eventual consistency would be unacceptable (even with optional support to wait for all replicas to update within some timeout)?

That's a great question. This depends entirely on what's being stored. I think it would be ok for the things I'm working on to have devices order some set of DNS endpoints in a consistent manner for themselves. In all the future use cases I'm considering, there would only be one originating source of data for dynamic updates, so it really only needs to be serializable from the data source's perspective.

Yes, that's what I meant. Raft would help with concurrent updates sent to two or more nodes (a single node can always order concurrent requests) affecting the same DNS entry with conflicting values (what kind of valid use-case would that be?). Raft would also help if you have atomic updates that touch multiple entities at the same time (like moving money between accounts), but I don't think we have that case for DNS updates.

Another use-case is when we want to wait for the whole cluster to have written the update before some other process like a Let's Encrypt DNS-01 challenge can continue. This can be sufficiently emulated with eventual consistency by waiting for all nodes to confirm the update within a reasonable timeout (based on cluster size, maybe).
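A rough sketch of that emulation, waiting for every replica to confirm but giving up after a timeout (the confirmation source here is a stand-in closure, not a real API):

```rust
use std::time::{Duration, Instant};

// Poll until all nodes report the update applied, or the deadline passes.
// `confirmed_nodes` stands in for whatever mechanism counts replicas that
// have applied the update.
fn wait_for_propagation(
    total_nodes: usize,
    timeout: Duration,
    mut confirmed_nodes: impl FnMut() -> usize,
) -> bool {
    let deadline = Instant::now() + timeout;
    while Instant::now() < deadline {
        if confirmed_nodes() >= total_nodes {
            return true; // every replica has the update
        }
        std::thread::sleep(Duration::from_millis(100));
    }
    false // timed out; callers decide whether "most nodes" is acceptable
}

fn main() {
    let mut polls: usize = 0;
    let done = wait_for_propagation(3, Duration::from_secs(2), || {
        polls += 1;
        polls.min(3) // pretend one more node confirms on each poll
    });
    println!("propagated to all nodes: {done}");
}
```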

Also, we could slightly improve consistency guarantees by first catching up with the rest of the cluster on node start before dealing with incoming DNS requests (so e.g. Kubernetes/Nomad/etc. would only take that node online once it's up-to-date).

my priority is maximum availability (AP of CAP)

Ok, that's great to know. For AP, gossip would be great. I just think it's important that where conflicts are detected we have some strong notion of how to resolve them. If the data is indeed persistent (say, the IP address of a non-mobile device) then the original data would always effectively be the same, modulo the TTL or potentially other options. I think we want to consider mobile devices as well though, where movement may cause things like IPs to be updated and change more frequently. I hate to use time as the disambiguator, but maybe there's a way we could incorporate serial numbers into the update scheme as a form of Lamport Clock (or maybe even Vector Clocks in some way?) to help with that.

Yeah, of course. With eventual consistency I meant that all nodes reach the same final state even under conflicting updates, no matter in which order the updates arrive.

Maybe a Lamport Clock is good enough. It just wouldn't acceptably order changes which take several seconds (or much longer) to propagate - e.g. due to cluster size or network problems. That could be solved with some sufficiently precise time-based solution (not just blind comparison of system time, of course). I haven't yet read the respective papers, but maybe one of these algorithms can provide the best of both worlds:

Hybrid Logical Clocks (https://cse.buffalo.edu/tech-reports/2014-04.pdf) are used by CockroachDB (mentioned here: https://github.com/cockroachdb/cockroach/blob/master/docs/design.md) and there's even a Rust crate (https://lib.rs/crates/hybrid-clocks). However, I don't know about the current patent situation and if the owners have an aggressive or open policy. HLC and HybridTime both had a patent pending when I first heard about them. Would have to check again to make sure it's fine to use that algorithm, but CockroachDB doesn't seem to be running into problems with its use of HLC.

There's also AugmentedTime: https://pdfs.semanticscholar.org/2f05/8c7bfe3ddce90f9715842b2b3a915b1e0862.pdf

This post compares both hybrid clock approaches a little bit and also links to another HLC post: http://muratbuffalo.blogspot.com/2015/10/analysis-of-bounds-on-hybrid-vector.html

At first I also looked into Bloom Clock, but that seems to require some additional work to sensibly deal with hash collisions (https://news.ycombinator.com/item?id=20095559) and I'm not sure if it has nice time-assisted ordering or if it's primarily a more space-efficient Vector Clock.
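For concreteness, here is a minimal sketch of the HLC send/receive rules from the paper linked above, written as plain Rust rather than the hybrid-clocks crate API:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Hybrid Logical Clock: `l` tracks the max physical time seen, `c` orders
// events that share the same `l`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Hlc {
    l: u64,
    c: u64,
}

fn physical_now() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before 1970")
        .as_millis() as u64
}

impl Hlc {
    fn new() -> Self {
        Hlc { l: 0, c: 0 }
    }

    // Local or send event.
    fn tick(&mut self) -> Hlc {
        let pt = physical_now();
        if pt > self.l {
            self.l = pt;
            self.c = 0;
        } else {
            self.c += 1;
        }
        *self
    }

    // Receive event: merge the remote timestamp.
    fn observe(&mut self, remote: Hlc) -> Hlc {
        let pt = physical_now();
        let new_l = self.l.max(remote.l).max(pt);
        self.c = if new_l == self.l && new_l == remote.l {
            self.c.max(remote.c) + 1
        } else if new_l == self.l {
            self.c + 1
        } else if new_l == remote.l {
            remote.c + 1
        } else {
            0
        };
        self.l = new_l;
        *self
    }
}

fn main() {
    let mut a = Hlc::new();
    let mut b = Hlc::new();
    let t1 = a.tick();      // update originates on node A
    let t2 = b.observe(t1); // node B merges it; t2 orders after t1
    assert!(t2 > t1);
    println!("{t1:?} -> {t2:?}");
}
```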

(i.e. there is no sharding involved)

I've actually thought about sharding as a potential solution to the AP issue. Assuming you're not trying to maintain a single global Zone, it might be possible to offer a higher degree of availability for Zones if those Zones are managed in a regional manner. Meaning, there are leaders chosen for each Zone, and ideally that Zone is only updated within the same region (I'm using region loosely here; it could be a datacenter, or something bigger or smaller). That way the Zone could still receive updates during a network partition, where the network the Zone is in is also the network providing the data (obviously other networks would not receive updates during the partition, but would recover elegantly once the partition is resolved).

That would work as a special case if you do indeed have to introduce Raft and eventual consistency isn't sufficient, but a sharded, multi-leader solution sounds quite complicated and I still hope there won't even be a pressing need for Raft in a DNS cluster. ;)
