
comcast / sirius


A distributed system library for managing application reference data

Home Page: http://comcast.github.io/sirius/

License: Apache License 2.0

Scala 98.34% Shell 1.66%

sirius's Introduction

Sirius


Sirius is a library for distributing and coordinating data updates amongst a cluster of nodes. It handles building an absolute ordering for updates that arrive in the cluster, ensuring that cluster nodes eventually receive all updates, and persisting the updates on each node. These updates are generally used to build in-memory data structures on each node, allowing applications using Sirius to have direct access to native data structures representing up-to-date data. Sirius does not, however, build these data structures itself -- instead, the client application supplies a callback handler, which allows developers using Sirius to build whatever structures are most appropriate for their application.

Said another way: Sirius enables a cluster of nodes to keep developer-controlled in-memory data structures eventually consistent, allowing I/O-free access to shared information.

Next Steps

Questions, Comments, Bugs and More

sirius's People

Contributors

artloder, bryant1410, clinedome-work, comcast-jonm, halofour, iamjarvo, joercampbell, jryan128, jwakemen, kristomasette, mbyron01, michajlo, paulcleary, smuir, snatchev, tbarker9comcast


sirius's Issues

Enhance API to add indication that initialisation of Sirius has finished

If the WAL is very large it can take a very long time for Sirius to read back the contents at initialisation. Thus it is often necessary for the application to have some means of knowing when that has completed before making its services available. In the reference app, and within the limitations of the current API, that requires polling the isOnline() method. Since this is likely to be a relatively common aspect of any Sirius app, it would be more convenient if there were simpler ways to do so, such as:

  • a blocking waitUntilOnline() method, potentially with a timeout value
  • other types of blocking objects that are compatible with general 'wait for event' APIs, such as Selector - that approach might be impractical and/or out of scope
  • a callback method registered with the Sirius implementation
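A minimal sketch of the first suggestion, a blocking waitUntilOnline() with a timeout, built by polling the existing isOnline() method. Note that OnlineCheck below is a hypothetical stand-in for the real com.comcast.xfinity.sirius.api.Sirius interface:

```java
// Hypothetical stand-in for the Sirius interface's isOnline() method.
interface OnlineCheck {
    boolean isOnline();
}

class SiriusStartup {
    /**
     * Poll isOnline() until it returns true or the timeout elapses.
     * Returns true if the instance came online within the deadline.
     */
    static boolean waitUntilOnline(OnlineCheck sirius, long timeoutMillis,
                                   long pollIntervalMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (sirius.isOnline()) {
                return true;
            }
            try {
                Thread.sleep(pollIntervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return sirius.isOnline();
            }
        }
        return sirius.isOnline();
    }
}
```

A callback registered with the Sirius implementation would avoid the polling entirely, but this shape could be implemented at the app level today.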

Subset of follower nodes missing data

We recently detected a situation where several follower nodes were found to be missing data that could be found in the uberstore of one of the Paxos-participating nodes. The affected nodes were all in the same data center, but not all nodes in that data center were affected (7 out of 28 in that data center). Further, there seemed to be two sets of affected nodes, similar in the degree to which they were affected (e.g. nodes A, C, F, G were missing 540 of a particular type of event while B, D, and E were only missing 167 of that event). However, all of the missing events took place around the same time.

Our cluster topology involves three data centers and has three parts to it:

The first part is composed of three Paxos-participating nodes, only one of which generates events that go out to the cluster. The other two nodes are for failover. All three nodes are in the same datacenter.

The second part is composed of what we call repeater nodes. Their responsibility is to distribute updates from the Paxos-participating nodes (in a different datacenter) to the client-facing nodes they share a datacenter with. That is to say, the sirius cluster config for repeater nodes lists only the Paxos-participating nodes, and the sirius cluster config for client facing nodes lists only the repeater nodes. There are three repeater nodes in each datacenter.

The third part is composed of the nodes serving customer traffic.

I was able to obtain a copy of the uberstore directory from one of the Paxos-participating nodes (145) and one of the affected nodes (141). Using the waltool, I determined a sequence range that encompassed the missing events and extracted that same range from each uberstore:

As you can see, some individual events are missing, as well as one large contiguous chunk (546935891-546943916).

Some more information about our setup:

  • Sirius 1.2.6
  • Sirius Config
  • Our ingest patterns tend to be bursty.
  • We rebuild the WAL once a month on average (with each release).

Please let me know if there is anything else you would like to know.

Should we have native Scala external interfaces?

Right now we have made sure that Sirius is usable from a native Java application (and hence, likely, from any JVM-targeted language), which I agree is important.

However, since Sirius is implemented in Scala, it does create an artificial awkwardness if it is used from a native Scala application, as we're essentially missing out on idiomatic Scala-isms in the interface in that case.

Should we have two sets of external interfaces? One targeted at Scala-native users, and one targeted at Java and everything else? Please discuss.

remove deprecated akka.event-handlers property

Somewhere in Sirius akka.event-handlers is used, and should be replaced with akka.loggers. I noticed this when running the Groovy reference application and

[WARN] [02/17/2014 16:42:04.209] [main] [EventStream(akka://sirius-system)] [akka.event-handlers] config is deprecated, use [akka.loggers]

was logged shortly after starting sirius.

Old deletes need to be purged during Compaction

Before 'live' compaction, DELETE events older than 7 days (by default) were compacted out during WalTool compaction.

When Sirius switched to using live compaction, that bit of logic was left out. So if a DELETE event is not followed by a subsequent PUT then it lives in the Uberstore forever.

For our usage, this has led to more than 165,000,000 stale DELETE events in the Uberstore. That represents almost 3x the live PUTs and 15 GB of the 40 GB Uberstore.

Sirius needs to re-add the ability to purge DELETE events that are older than a specified age while doing compaction.
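A minimal sketch of the intended rule, dropping DELETE events older than a configured age during compaction (7 days by default, as in the old WalTool behavior). LogEvent and its string op field are simplified stand-ins, not the real Sirius event types:

```java
// Hypothetical, simplified stand-in for a WAL event.
class LogEvent {
    final String op;      // "PUT" or "DELETE"
    final String key;
    final long timestampMillis;

    LogEvent(String op, String key, long timestampMillis) {
        this.op = op;
        this.key = key;
        this.timestampMillis = timestampMillis;
    }
}

class DeletePurge {
    static final long DEFAULT_MAX_DELETE_AGE_MILLIS = 7L * 24 * 60 * 60 * 1000;

    /** Returns true if the event should survive compaction. */
    static boolean keep(LogEvent e, long nowMillis, long maxAgeMillis) {
        boolean expiredDelete =
            "DELETE".equals(e.op) && nowMillis - e.timestampMillis > maxAgeMillis;
        return !expiredDelete;
    }
}
```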

Richer data model: local out-of-band objects

There are various enhancements to the data model that could be useful in different types of Sirius app and that can probably be implemented without changing the underlying key-value abstraction for the consensus protocol.

One relatively straightforward one is to allow object(s) to be passed across the enqueue/handle interface without them being part of the consensus protocol. This is useful if the back-end requires additional data to efficiently process a (K, V) pair received by the handler, e.g., it might need a handle for the client that submitted a request to the app front-end.

This would save developers from having to stash such objects in a queue in the front-end and then retrieve them in the back-end by matching the key (or something similar). It also provides an alternative way to pass values back from back-end to front-end than returning a value via a Future. Finally, it allows the app front-end to be implemented as fully asynchronous ('fire and forget'), since it no longer needs to check the status of enqueued operations in order to respond to clients.

Obviously only the node that performed the enqueue operation would receive the out-of-band data, but presumably only that node needs the additional data to respond to the client.

Followers sometimes stop following if too far out of date

We are seeing an issue where a following server stops getting updates from the cluster it is following. It seems to happen when the following server gets a few days behind (due to limited connectivity or due to replacing the WAL file with older data).

Once the following server stops following, it will not start following again. The only known fix is to replace the WAL file with a more up-to-date WAL file and restart.

Our working theory is that compaction clears out an entry from the WAL on primary server, and the follower then can't catch up because of that.

I can provide example WAL files that exhibit the issue (from both the primary server and the follower). However, they are quite large.

Allow RequestHandler object to be changed

In some use cases it may be desirable for the application developer to change the RequestHandler object that is used by Sirius. For example, the application may move through multiple phases, each requiring different handler behaviour.

It is possible to do this with an app-level proxy handler, but there may be advantages to supporting it natively in Sirius, especially if it is a common use case. At the minimum there should be a clear statement of the semantics of how RequestHandler methods are called that allows the app developer to safely implement a proxy, e.g., does Sirius guarantee (now and in the future) that callbacks are always made from a single thread and thus are serialised?
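A minimal sketch of the app-level proxy approach, assuming a swap between phases is performed atomically. Handler here is a hypothetical, simplified stand-in for the real RequestHandler interface; only the shape of the idea matters:

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified stand-in for RequestHandler.
interface Handler {
    String handlePut(String key, byte[] body);
}

// Registered with Sirius once; the delegate can then be swapped atomically
// between application phases without re-registering anything.
class SwappableHandler implements Handler {
    private final AtomicReference<Handler> delegate;

    SwappableHandler(Handler initial) {
        this.delegate = new AtomicReference<>(initial);
    }

    void swap(Handler next) {
        delegate.set(next);
    }

    @Override
    public String handlePut(String key, byte[] body) {
        return delegate.get().handlePut(key, body);
    }
}
```

Whether a mid-stream swap is actually safe depends on exactly the callback-threading guarantees the issue asks to have documented.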

fix scaladoc warnings

I am not entirely sure if this is possible or not, but would be nice if we could get rid of them.

Model leader election as simple FSM

Election in the Leader actor is actually quite a simple FSM. This could be modeled very easily with context.become() (or with a real FSM, though that might be overkill). It would reduce the amount of variable state in the Leader somewhat. This change would also make it significantly clearer what the leader does with incoming messages in each case.

Delete purging can remove necessary events from the uberstore

Unfortunately, removing out-of-date deletes can be hazardous in one case: if there are other updates on that same key within the same segment.

To illustrate, imagine a case where a key is used exactly twice: a PUT is applied, then a DELETE is immediately applied afterwards. These occur within the same segment. Since no future updates occur, these will never be compacted away. However, if the DELETE is older than the MAX_DELETE_AGE threshold, it could be purged -- exposing the PUT, which would suddenly be the most recent update.

We should either (1) not remove DELETE operations if there are preceding PUT operations in a segment, or (2) implement internal segment compaction.
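To make option (1) concrete, here is a minimal sketch: when compacting a segment, an expired DELETE is only purgeable if no earlier event in the same segment touches its key. SegEvent is a hypothetical, simplified event type, not the real Sirius representation:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a segment event.
class SegEvent {
    final long seq;
    final boolean isDelete;
    final String key;
    final long timestampMillis;

    SegEvent(long seq, boolean isDelete, String key, long timestampMillis) {
        this.seq = seq;
        this.isDelete = isDelete;
        this.key = key;
        this.timestampMillis = timestampMillis;
    }
}

class SegmentPurge {
    static List<SegEvent> purgeExpiredDeletes(List<SegEvent> segment,
                                              long maxAgeMillis, long nowMillis) {
        List<SegEvent> kept = new ArrayList<>();
        for (int i = 0; i < segment.size(); i++) {
            SegEvent e = segment.get(i);
            boolean expiredDelete =
                e.isDelete && nowMillis - e.timestampMillis > maxAgeMillis;
            boolean keyUsedEarlier = false;
            for (int j = 0; j < i; j++) {
                if (segment.get(j).key.equals(e.key)) {
                    keyUsedEarlier = true;
                    break;
                }
            }
            // Purge an expired DELETE only when nothing earlier in this
            // segment would be exposed by its removal.
            if (!(expiredDelete && !keyUsedEarlier)) {
                kept.add(e);
            }
        }
        return kept;
    }
}
```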

Stale ActorRef eviction is broken

ActorRef eviction is designed in the following way:

  • keep track of when the latest Pong messages and actor resolutions were received
  • remove references from the map that haven't been updated within a certain threshold
  • fire off an attempt to re-resolve those actor refs

The first piece currently isn't functioning as desired. The map is keyed on an Actor's path, and the Pong message is received from the membership actor /user/sirius/membership, while the resolution path and the map in the MembershipAgent are keyed on the supervisor's path /user/sirius. In short, Pong messages aren't resetting the timeout threshold, and all Refs in the MembershipAgent are recreated (by default) every 40 seconds or so.

This is safe, but it should be fixed in order to work as intended.

Prohibit null RequestHandler

The SiriusFactory object should require that a non-null RequestHandler be passed to the createInstance method. That would catch the not-uncommon mistake of passing a null handler, which subsequently causes hard-to-debug errors.
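A minimal sketch of the eager check, so a null handler fails fast with a descriptive message instead of a confusing error later. SiriusFactorySketch and its untyped arguments are hypothetical stand-ins for the real SiriusFactory signature:

```java
import java.util.Objects;

// Hypothetical stand-in for SiriusFactory.createInstance.
class SiriusFactorySketch {
    static String createInstance(Object requestHandler, Object siriusConfig) {
        Objects.requireNonNull(requestHandler, "RequestHandler must not be null");
        Objects.requireNonNull(siriusConfig, "SiriusConfiguration must not be null");
        return "sirius-instance"; // placeholder for the real construction
    }
}
```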

Use ActorSelection instead of ActorRef in Leader

Similar to #98, drop ActorRef resolution in favor of ActorSelection, but this time in the context of the elected leader in the Leader actor. At first, using a resolved reference seems like a good idea in the Leader -- we care whether there is an actor alive at the other end. However, we're also doing our own liveness testing for the elected leader, and doubling up on that just leads to confusion.

Also, while we care whether there is a live remote leader, we don't care which instantiation it is -- and stale ActorRefs with old ids can be problematic.

Slow node eventually DDOS' itself attempting to catch up with others

When a single node in a cluster is slow (in our case, its network interface was operating at 10 Mb/s while its neighbors were at 1 Gb/s), it eventually gets into a state where it falls further and further behind the other nodes in processing updates. As it falls behind, it attempts to 'catch up' with its peers, which further exacerbates its slowness: it effectively DDOSes itself by requesting catch-up data over an already saturated interface. The slow node then causes queues to build on the other nodes until one or more of them suffers a full Java GC, which for our installation (50 GB heap) stops the entire JVM for two minutes. That in turn causes additional queues to fill, and the entire cluster falls apart.

So:

  1. A slow node in a cluster (in our case caused by a 10 Mb/s interface while the others had 1 Gb/s) starts falling significantly behind its peers on data updates.
  2. As the node falls behind, it starts asking peer nodes in the cluster for updates in order to catch up.
  3. This in turn causes the slow node to DDOS itself by flooding its slow interface with catch-up traffic.
  4. That causes queues on the other boxes to start filling with catch-up messages bound for the slow node.
  5. The queues are essentially unbounded, eventually landing the entire cluster in a really bad state.

SiriusResult can only wrap subclasses of RuntimeException

This might be overly restrictive, especially since (for example) IOException is not a subclass of RuntimeException. Is there a reason SiriusResult can't just wrap Exception? I get that RuntimeException is the superclass of unchecked exceptions, but it's not obvious to me if or why that would be relevant.

Support cross-building for Scala 2.12+

Sirius currently doesn't support running with Scala 2.12, and due to some dependencies it doesn't appear that it can easily be cross-built to support Scala 2.12. This is preventing me from updating any of my other dependencies that have moved to supporting Scala 2.12, such as money. I'd like to explore what would be necessary to cross-build Sirius for both Scala 2.11 and 2.12+.

Support HTTP for Sirius following instead of Akka remoting

Sirius relies on Akka remoting in order for nodes to catch up with each other. This has a few drawbacks:

  1. Akka remoting protocol is not stable between versions of Akka
  2. Akka remoting requires bidirectional communication between nodes

I propose adding support into Sirius for HTTP-based following where Sirius can be configured to listen on a given port and accept HTTP requests for log ranges. Following nodes would be configured with the URI through which to request those log ranges, which could be behind a load balancer.

I've already implemented this behavior in my Spring application by disabling Akka-based catchup and tapping into the Akka actor system directly to read and replay log subranges, and it works quite well. The Spring controllers support streaming, which allows the followers to request the entire log from a given sequence.

AskTimeoutExceptions during startup

We sometimes see akka.pattern.AskTimeoutException during startup, and generally a restart fixes the problem or the timeouts will occur several times and then cease.
The timeouts occur while executing:

sirius.enqueuePut(event.getKey().fullKey(), data).get();

As you can see, we enqueuePut, and then immediately block until the future returns. Here is a thread dump I took while waiting for the future.

The main thread is waiting for the future, but there don't seem to be any other threads doing anything sirius related. I had expected to see our main thread waiting on a sirius implementation thread to finish its work, but I don't see any threads like that.

This node's sirius cluster config only contained /user/sirius.

This is happening at startup, but after com.comcast.xfinity.sirius.api.Sirius.isOnline() returns true. Sometime around then, we see this in the logs:

2017-07-14 19:46:24,297 WARN s=localListingInfoWebService-root_out env="ape" [sirius-system-akka.actor.default-dispatcher-12] Sirius(akka://sirius-system): SiriusSupervisor Actor received unrecognized message IsInitializedResponse(true)

Perhaps this warning is relevant to the issue.
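One possible app-level mitigation while the root cause is unclear: bound the blocking get() so a startup-time AskTimeoutException (or a stuck future) becomes a bounded retry rather than an indefinite block. This is a hypothetical sketch; the Supplier stands in for re-invoking sirius.enqueuePut:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

class BoundedEnqueue {
    // Retry the enqueue-and-get with a per-attempt timeout, instead of
    // blocking indefinitely on a single future.
    static <T> T getWithRetry(Supplier<Future<T>> attempt, int retries, long timeoutMillis) {
        Exception lastError = null;
        for (int i = 0; i < retries; i++) {
            try {
                return attempt.get().get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException | ExecutionException e) {
                lastError = e;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                lastError = e;
            }
        }
        throw new IllegalStateException("enqueue did not complete after retries", lastError);
    }
}
```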

Prevent modification of internal Sirius copy of a value byte array

The current implementation of Sirius passes a byte array as the value argument when calling the RequestHandler handlePut method. After the call to handlePut returns, Sirius proceeds to use that array as-is in subsequent operations, including sending it as the value to other Sirius cluster nodes. If the callee mutates the array, those other nodes will see the modified value. This causes very confusing behaviour.

Sirius should prevent modification of the value array from having any impact on subsequent operations. The best way to do so is TBD, but SHOULD (in the RFC sense) be more than just an API-level statement that the caller shouldn't modify the array. Given very limited read-only abstractions (anything other than ByteBuffer?), the best solution is probably to copy the array before passing it. Since the value array is typically quite small this should have negligible impact. [Aside: if one cared about giant values it might make sense to add an API that uses ByteBuffers in order to also potentially leverage direct mapped arrays]
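A minimal sketch of the copy-before-passing fix described above. The Consumer is a hypothetical stand-in for the handlePut callback:

```java
import java.util.Arrays;
import java.util.function.Consumer;

class ValueSafety {
    // Hand the handler a private copy of the value array; the array Sirius
    // keeps for subsequent operations is untouched even if the callee
    // mutates its argument.
    static byte[] callHandlerSafely(byte[] value, Consumer<byte[]> handler) {
        handler.accept(Arrays.copyOf(value, value.length));
        return value;
    }
}
```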

Remove implicit coupling between SiriusConfiguration.HOST and ClusterConfig

It’s currently the case that whatever a given Sirius node identifies as in its own configuration — hostname, ip, localhost, etc — that name must match the way that it’s addressed by other nodes. This comes as a direct result of Akka doing the same. If a node fired up on 127.0.0.1 catches a message sent to localhost, it drops it on the floor. So it’s not surprising that Sirius works this way, since the cluster config consists of akka addresses, but it is unfortunate.

Perhaps the cluster config should change. Instead of akka paths, it could simply contain hostnames, or ip addresses, some resolvable address. From there, based on either a configuration property (use_ip, use_dns, etc), as well as the specified protocol (enable_ssl or no), sirius could turn that into the proper akka path. This conjunction of configuration properties would also be used when setting up remoting on the node in the first place, so that the node would answer on the correct address/hostname/etc.

However it happens, basically: clients shouldn’t have to know how to construct akka paths and it would be really nice if the specification of other nodes in the cluster were less finicky.

PaxosStateBridge does not propagate SiriusResults from the RequestHandler

Currently, when the PaxosStateBridge receives a Paxos decision, it acks the update back to the originating client with a SiriusResult.none() before that update has been persisted or applied to the brain via the RequestHandler. This means there is no way to tell if an exception was thrown while trying to apply it, and there is no way to have a return value from the RequestHandler's enqueuePut or enqueueDelete methods.

Possible solutions:

  • Delay the ack at least until the update has been run through the RequestHandler, and reply with the return value of handlePut/handleDelete
  • Change the RequestHandler interface so that enqueuePut/enqueueDelete return a Future<Unit>, to make it clear there is no information being propagated out

Reduce size of uberstore indices

An anonymous reviewer of an upcoming paper about Sirius suggests:

Design point: you don't need to keep an index of all records, just a sampling of them
(every 10th or 100th or 1000th); you'd use less RAM and once you seeked to the
closest preceding index entry in the log you'd read forward serially.
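A minimal sketch of the reviewer's suggestion: index only every Nth (sequence, offset) pair, and have a lookup seek to the closest preceding indexed entry before scanning forward serially in the log. This is an illustration of the idea, not the uberstore's actual index structure:

```java
import java.util.Map;
import java.util.TreeMap;

class SparseIndex {
    private final TreeMap<Long, Long> entries = new TreeMap<>();
    private final int stride;
    private long count = 0;

    SparseIndex(int stride) {
        this.stride = stride;
    }

    // Record every stride-th entry only; RAM use drops by roughly 1/stride.
    void record(long seq, long offset) {
        if (count % stride == 0) {
            entries.put(seq, offset);
        }
        count++;
    }

    /** Offset of the closest indexed entry at or before seq, or -1 if none. */
    long seekOffset(long seq) {
        Map.Entry<Long, Long> floor = entries.floorEntry(seq);
        return floor == null ? -1 : floor.getValue();
    }
}
```

A reader seeks to seekOffset(seq) and scans forward until it reaches the requested sequence number, trading a short serial read for a much smaller in-memory index.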

Use ActorSelections for remote nodes in Membership

Sirius currently attempts to resolve ActorRefs for each node in the MembershipActor, and store those refs for use by any actor within the system for communicating with remote nodes. These refs aren't necessarily long-lived, and maintaining them is a pain.

Instead of resolving ActorRefs and maintaining them in the MembershipMap, we should instead store ActorSelections. When an actor needs to initiate a session of communication with a remote node, it can resolve a ref from the selection -- or, more likely, simply send a message via the selection and implicitly get the ref for future use when the remote node responds.
