Sidereal Events - Scalable real-time queries over streaming CloudEvents

Sidereal Events is a standalone HTTP server providing reactive queries over streaming CloudEvents. When used in conjunction with Change Data Capture, it can turn any database into a real-time database.

A query in Sidereal Events is a disjunction of conjunctions of key/value terms, encoded in disjunctive normal form, e.g. (a AND b) OR (c AND d AND e) OR ..., or, written out with terms, (key1=value1 AND key2=value2) OR (key3=value3 AND key4=value4) OR .... Internally, every conjunction of a query is indexed by its terms (e.g. key2=value2 is an index entry), inverting the usual practice of indexing data by its fields. This query indexing allows Sidereal Events to scale efficiently to thousands of concurrent queries while still ingesting tens of thousands of events per second.
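To make the idea of indexing queries by their terms more concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions (equality terms only, flat events with scalar values); the `QueryIndex` class and its methods are hypothetical names and do not describe Sidereal Events' actual data structures.

```python
from collections import defaultdict


class QueryIndex:
    """Toy inverted index: maps each key/value term to the conjunctions using it."""

    def __init__(self):
        # (key, value) -> set of (query_id, conjunction_number)
        self._by_term = defaultdict(set)
        # (query_id, conjunction_number) -> number of terms that must all match
        self._term_counts = {}

    def add_query(self, query_id, conjunctions):
        """Register a query given as a list of conjunctions, each a dict of terms."""
        for number, conjunction in enumerate(conjunctions):
            slot = (query_id, number)
            self._term_counts[slot] = len(conjunction)
            for term in conjunction.items():
                self._by_term[term].add(slot)

    def matching_queries(self, event):
        """Return the IDs of queries with at least one fully satisfied conjunction."""
        hits = defaultdict(int)
        # Only terms present in the event are looked up, so the work depends on
        # the event's size rather than on the number of registered queries.
        for term in event.items():
            for slot in self._by_term.get(term, ()):
                hits[slot] += 1
        return {
            query_id
            for (query_id, number), count in hits.items()
            if count == self._term_counts[(query_id, number)]
        }
```

With a query registered as `[{"a": 1, "b": 2}, {"c": 3}]`, an event `{"a": 1, "b": 2}` satisfies the first conjunction and the query is reported once, without scanning every other registered query.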

Sidereal Events is still experimental!

Sidereal Events is still very much a work in progress, and isn't yet at a state where it can be considered "production ready". Some aspects of it might change in a breaking manner in the near future. If you'd like to try Sidereal Events out, please proceed with caution.

Running from source

You'll need a Java Development Kit (JDK) supporting Java 17 or greater. Sidereal Events builds and runs correctly using any of

Sidereal Events is compiled using Gradle, and this repository includes the Gradle Wrapper, so only a suitable JDK needs to be explicitly installed to build this project.

Run the server with the command

./gradlew run --args="serve"

This will start the Sidereal Events Server listening for HTTP connections over TCP port 8232. Additional command-line flags can be added inside the string passed to Gradle's --args option. Run

./gradlew run --args="serve --help"

for a list of these flags, or see the Server Configuration section for more details.

Automated tests can be run with

./gradlew test

Building a runnable JAR

Gradle can build a directly runnable JAR file by running

./gradlew jar

The resulting JAR will be saved to ./sidereal/build/libs/sidereal.jar, and can be run directly using

java -jar ./sidereal/build/libs/sidereal.jar serve

Building a native executable

Native builds are made possible by the Gradle Plugin for GraalVM Native Image Building. As a prerequisite for its use, you will need to

Install a GraalVM JDK, which will also include the Native Build tooling.
Export the location of the installed GraalVM JDK with the environment variable GRAALVM_HOME.

On macOS, if your version of GraalVM covers a specific version of Java for which another JDK is not available, e.g. you have installed GraalVM 21 and do not have any other JDK installed for Java 21, you can set GRAALVM_HOME by using the java_home tool. For example,
```
export GRAALVM_HOME=$(/usr/libexec/java_home -v 21) 
```

Once GraalVM has been installed, and its installation path has been stored in the GRAALVM_HOME environment variable, you can build a native executable with

./gradlew nativeBuild

and the resulting artifact will be saved to ./sidereal/build/native/nativeCompile/sidereal.

Automated tests can be run against a native executable by running

./gradlew nativeTest

Licensing

Sidereal Events is available under the terms of the MIT License as per the LICENSE file. Other licensing information is presented using SPDX License IDs embedded in the source files.

More information regarding the technical practice of maintaining license information is available in docs/development/licenses.md.

Sending events

Events are sent to named "channels" by issuing an HTTP POST request with a JSON body against the path

/channels/CHANNEL-NAME

where CHANNEL-NAME is a percent-encoded string containing at least 1 character. For example, a CHANNEL-NAME of "with/slash" would be encoded as

/channels/with%2fslash
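As a small illustration of the encoding rule above, the following Python sketch builds a channel path with the standard library. The helper name `channel_path` is hypothetical. Note that Python emits uppercase hex digits (`%2F`), which is equivalent to the lowercase `%2f` shown above because percent-encoded hex digits are case-insensitive.

```python
from urllib.parse import quote


def channel_path(channel_name: str) -> str:
    """Build the request path for a channel, percent-encoding reserved characters."""
    if not channel_name:
        raise ValueError("channel names must contain at least 1 character")
    # safe="" makes quote() encode "/" as well, which it leaves alone by default.
    return "/channels/" + quote(channel_name, safe="")
```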

The name meta is used internally to report query registration, and does not receive events from external sources. Attempts to send an event to the meta channel will result in an HTTP 403 response.

The Content-Type of the data sent to this endpoint can be one of

application/json
application/cloudevents+json
application/cloudevents-batch+json

with the contents interpreted according to the HTTP Protocol Binding for CloudEvents. Of note is how Content-Type: application/json implicitly describes the CloudEvent datacontenttype as application/json.

The CloudEvent specification expects each combination of "source" and "id" to be globally unique. Sidereal Events internally keeps track of these "source" and "id" combinations it has received, over a configurable time horizon with a configurable number of remembered events. If a producer tries to send a "source", "id" combination to a channel that has already received this combination, Sidereal Events will report the publication as having been successful, but will not deliver the event to consumers. Within the constraints of the time horizon and maximum remembered "source", "id" combinations, this makes event publishing an idempotent operation.

Note that the same "source", "id" combination can be published to multiple channels. Each channel will send the data for the same "source", "id" combination at least once.
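The behavior described above can be pictured as a bounded, expiring set of seen combinations kept per channel. The Python sketch below is an assumption-laden illustration of that idea, not Sidereal Events' implementation; `IdempotencyTracker` and `should_deliver` are hypothetical names, and the two constructor parameters stand in for the configurable size limit and time horizon.

```python
import time
from collections import OrderedDict


class IdempotencyTracker:
    """Remembers (channel, source, id) combinations for a limited time and count."""

    def __init__(self, max_keys, expiration_seconds, clock=time.monotonic):
        self._max_keys = max_keys
        self._expiration = expiration_seconds
        self._clock = clock
        self._seen = OrderedDict()  # (channel, source, id) -> first seen, oldest first

    def should_deliver(self, channel, source, event_id):
        """Return True the first time a combination is seen within the horizon."""
        now = self._clock()
        # Forget combinations that have aged past the time horizon.
        while self._seen:
            oldest_key, first_seen = next(iter(self._seen.items()))
            if now - first_seen < self._expiration:
                break
            del self._seen[oldest_key]
        key = (channel, source, event_id)
        if key in self._seen:
            return False  # duplicate: acknowledge, but do not deliver again
        self._seen[key] = now
        # Forget the oldest combinations once the size limit is exceeded.
        while len(self._seen) > self._max_keys:
            self._seen.popitem(last=False)
        return True
```

Because the channel is part of the key in this sketch, the same "source", "id" combination is still delivered once on each channel it is published to.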

Receiving events

Consumers receive an event stream by issuing an HTTP GET request against the same path used to send events. The response body of this HTTP GET behaves according to the Server-sent event specification, with the event name being "data".

For example, if an event is sent with

POST /channels/events HTTP/1.1
Content-Type: application/json
Ce-Source: //somewhere
Ce-Id: some-id
Ce-Type: some.type

...

Then the event will be present in a GET to the same path:

GET /channels/events HTTP/1.1
...

HTTP/1.1 200 OK
Connection: keep-alive
transfer-encoding: chunked
Content-Type: text/event-stream; charset=utf-8

event: connect
data: {"timestamp":"...","clientID":"..."}

... After the event is sent
event: data
id: %2F%2Fsomewhere+some-id
data: {"source":"//somewhere","id":"some-id","type":"some.type","data":{...}}

The contents of the Server-sent event are formatted according to the JSON Event Format for CloudEvents.
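For consumers that are not browsers, the stream can be decoded by hand. The following Python sketch parses an already-received text/event-stream body; it is a simplified reading of the Server-sent events format (it ignores the "retry" field and does not carry the last event ID across events), and the function names `parse_sse` and `cloud_events` are hypothetical.

```python
import json


def parse_sse(text):
    """Split a text/event-stream body into a list of {"event", "id", "data"} dicts."""
    events = []
    fields = {"event": "message", "id": None, "data": []}
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    for line in lines + [""]:
        if line == "":
            # A blank line dispatches the event collected so far, if it has data.
            if fields["data"]:
                events.append({
                    "event": fields["event"],
                    "id": fields["id"],
                    "data": "\n".join(fields["data"]),
                })
            fields = {"event": "message", "id": None, "data": []}
            continue
        if line.startswith(":"):
            continue  # comment line
        name, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is dropped
        if name == "data":
            fields["data"].append(value)
        elif name in ("event", "id"):
            fields[name] = value
    return events


def cloud_events(text):
    """Decode the JSON payload of every "data" event in the stream."""
    return [json.loads(e["data"]) for e in parse_sse(text) if e["event"] == "data"]
```

Consecutive `data:` lines are joined with a newline, which JSON treats as insignificant whitespace between tokens, so a payload split across several `data:` lines at token boundaries still decodes as one CloudEvent.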

Filtering received events

Sidereal Events is designed to efficiently support deep-content filtering of its JSON input across thousands of connected clients, with multiple query terms as part of a logical disjunction of conjunctions (e.g. (a AND b AND c) OR (d AND e) OR ...). This filtering is enabled by passing the terms of the filter as an HTTP query string. For example, if a consumer were to connect to a channel using

GET /channels/example?one="one"&two="two"&three=3 HTTP/1.1
...

And the producer were to send the events

POST /channels/example HTTP/1.1
Content-Type: application/cloudevents-batch+json

[
{
    "source": "somewhere",
    "id": "1",
    "type": "com.example.sidereal",
    "specversion": "1.0",
    "data": {
        "one": "one",
        "two": "two",
        "three": 3,
        "four": "four"
    }
}, {
    "source": "somewhere",
    "id": "2",
    "type": "com.example.sidereal",
    "specversion": "1.0",
    "data": {
        "one": "one",
        "two": "two",
        "three": 3,
        "four": "five"
    }
}, {
    "source": "somewhere",
    "id": "3",
    "type": "com.example.sidereal",
    "specversion": "1.0",
    "data": {
        "two": "two",
        "three": 3,
        "four": "five"
    }
}, {
    "source": "somewhere",
    "id": "4",
    "type": "com.example.sidereal",
    "specversion": "1.0",
    "data": {
        "one": "two",
        "two": "two",
        "three": 3
    }
}, {
    "source": "somewhere",
    "id": "5",
    "type": "com.example.sidereal",
    "specversion": "1.0",
    "data": {
        "one": "one",
        "two": "one",
        "three": 3
    }
}, {
    "source": "somewhere",
    "id": "6",
    "type": "com.example.sidereal",
    "specversion": "1.0",
    "data": {
        "one": "one",
        "two": 2,
        "three": 3
    }
}, {
    "source": "somewhere",
    "id": "7",
    "type": "com.example.sidereal",
    "specversion": "1.0",
    "data": {
        "one": "one",
        "two": "two",
        "three": 4
    }
}, {
    "source": "somewhere",
    "id": "8",
    "type": "com.example.sidereal",
    "specversion": "1.0",
    "data": {
        "one": "one",
        "two": "two",
        "three": "three"
    }
}
]

then a client receiving events for the "example" channel would see

GET /channels/example?one="one"&two="two"&three=3 HTTP/1.1
...

HTTP/1.1 200 OK
Connection: keep-alive
transfer-encoding: chunked
Content-Type: text/event-stream; charset=utf-8

event: connect
data: {"timestamp":"...","clientID":"..."}

event: data
id: somewhere+1
data: {"source":"somewhere","id":"1","type":"com.example.sidereal","specversion":"1.0","data":
data: {"one":"one","two":"two","three":3,"four":"four"}}

event: data
id: somewhere+2
data: {"source":"somewhere","id":"2","type":"com.example.sidereal","specversion":"1.0","data":
data: {"one":"one","two":"two","three":3,"four":"five"}}

for the following reasons:

1 would match because data["one"] == "one", data["two"] == "two", data["three"] == 3. The contents, or even presence, of data["four"] has no effect on the given filter.
2 would match because data["one"] == "one", data["two"] == "two", data["three"] == 3. Similar to 1, the contents, or even presence, of data["four"] has no effect.
3 would not match because data["one"] is not present.
4 would not match because data["one"] == "two" when we expected data["one"] == "one".
5 would not match because data["two"] == "one" when we expected data["two"] == "two".
6 would not match because data["two"] == 2 when we expected data["two"] == "two".
7 would not match because data["three"] == 4 when we expected data["three"] == 3.
8 would not match because data["three"] == "three" when we expected data["three"] == 3.
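The reasoning above can be reproduced with a few lines of Python. This is a sketch of equality matching only, assuming that values must agree in JSON type as well as content (so 2 never equals "two") and that integers and floats compare numerically; the names `json_equal` and `matches_all` are hypothetical.

```python
_MISSING = object()  # marks a key that is absent from the event data


def json_equal(expected, actual):
    """Compare two decoded JSON scalars without Python's bool/int coercion."""
    if isinstance(expected, bool) or isinstance(actual, bool):
        return isinstance(expected, bool) and isinstance(actual, bool) and expected == actual
    if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
        return expected == actual
    return type(expected) is type(actual) and expected == actual


def matches_all(terms, data):
    """True when every key in `terms` is present in `data` with an equal value."""
    return all(json_equal(value, data.get(key, _MISSING)) for key, value in terms.items())
```

Calling `matches_all({"one": "one", "two": "two", "three": 3}, data)` on the "data" object of each of the eight events returns True only for events 1 and 2. Extra keys such as "four" are never consulted.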

`OR` in Queries (Disjunctions of Conjunctions)

Disjunctions of conjunctions are made possible by using the now-historical ; separator character. This separator binds less tightly than the & separator character, so terms joined by & are grouped into a conjunction first, and the resulting conjunctions are then combined with ;. For example, if a consumer were to connect to a channel using

GET /channels/example?one="one"&two="two"&three=3;one=1&two=2&three="three" HTTP/1.1
...

this would have similar results to connecting twice with both

GET /channels/example?one="one"&two="two"&three=3 HTTP/1.1
...
GET /channels/example?one=1&two=2&three="three" HTTP/1.1
...

but with the added benefits of

Only requiring one HTTP connection in the server-sent events interface
Only reporting data once, even if multiple conjunctions are matched
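The second benefit can be illustrated with a short Python sketch that counts deliveries. It assumes already-decoded conjunctions represented as dicts and uses plain Python equality for brevity; `matches_any` and `deliveries` are hypothetical helper names, and the two overlapping conjunctions in the usage note are made up for the illustration.

```python
def matches_any(conjunctions, data):
    """A disjunction matches when at least one of its conjunctions matches in full."""
    return any(
        all(key in data and data[key] == value for key, value in conjunction.items())
        for conjunction in conjunctions
    )


def deliveries(subscriptions, data):
    """Count how many times `data` would be sent, one send per matching subscription."""
    return sum(1 for conjunctions in subscriptions if matches_any(conjunctions, data))
```

For data `{"one": "one", "two": "two"}`, two separate subscriptions `[{"one": "one"}]` and `[{"two": "two"}]` produce two deliveries, while a single subscription holding both conjunctions produces one.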

Deep JSON Member Access

By default, if a key does not start with / or ../, it is assumed to be a literal key within the "data" object of the event. For example, a query string of the form

some.key="value"

is interpreted to match

{
    "source": "...",
    "id": "...",
    "type": "...",
    "specversion": "1.0",
    "data": {
        "some.key": "value"
    }
}

Access to keys within JSON documents is made possible by using JSON Pointers. As an example, to match "some", then "key" in

{
    "source": "...",
    "id": "...",
    "type": "com.example.sidereal",
    "specversion": "1.0",
    "data": {
        "some": {
            "key": "value"
        }
    }
}

you would use a query string

?/some/key="value"

Note that ~ in a key path component must be replaced with ~0, and / in a key path component must be replaced with ~1. The replacement of ~ with ~0 must occur before replacing / with ~1, so that the ~ of an encoded ~1 is not itself rewritten, which would turn ~1 into ~01. For a key path of

data["with/slash"]["with~tilde"]

the JSON Pointer encoding would be

?/with~1slash/with~0tilde
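The escaping can be automated. The following Python sketch builds a JSON Pointer from a list of unescaped keys; the function names `escape_component` and `to_pointer` are hypothetical.

```python
def escape_component(component: str) -> str:
    """Escape one key for use as a JSON Pointer path component."""
    # "~" is handled first; otherwise the "~" introduced by "~1" would itself be escaped.
    return component.replace("~", "~0").replace("/", "~1")


def to_pointer(components) -> str:
    """Join unescaped keys into a JSON Pointer such as /a/b."""
    return "".join("/" + escape_component(component) for component in components)
```

`to_pointer(["with/slash", "with~tilde"])` yields `/with~1slash/with~0tilde`, matching the encoding above. The result may still need percent-encoding before it is placed in a URL.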

Arrays can be accessed with non-negative, zero-based integers as the "key" in the reference. For example, the following data

{
    "source": "...",
    "id": "...",
    "type": "com.example.sidereal",
    "specversion": "1.0",
    "data": {
        "some": {
            "key": [
              "first",
              "second",
              "third",
              "fourth"
            ]
        }
    }
}

would match the following query:

?/some/key/2="third"

Array access can also be combined with deeper JSON member access. The following data

{
  "source": "...",
  "id": "...",
  "type": "com.example.sidereal",
  "specversion": "1.0",
  "data": {
    "some": {
      "key": [
        {
          "name": "first",
          "value": 1
        },
        {
          "name": "second",
          "value": 2
        }
      ]
    }
  }
}

would match the following query:

?/some/key/0/name="first"

Keys are matched starting from the "data" key in the resulting CloudEvent by default. As an extension to JSON Pointers, if a key starts with .. and the remainder is a JSON Pointer, the key is matched starting from the object root. As an example, to match the CloudEvent type in

{
    "source": "...",
    "id": "...",
    "type": "com.example.sidereal",
    "specversion": "1.0",
    "data": {
        "some": {
            "key": "value"
        }
    }
}

you would use a query string

?../type="com.example.sidereal"

Query Values

Values are encoded according to their JSON representation. Only null, booleans, numbers, and strings are supported as match values. If a value cannot be decoded as null, a boolean, or a number, and does not start with a filter operator prefix, it is assumed to be a string.
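A minimal Python sketch of this decoding rule follows. It assumes the value has already been percent-decoded and carries no filter operator prefix, and it treats anything that is not a JSON null, boolean, number, or string as a plain string; `decode_value` is a hypothetical name.

```python
import json


def _reject_constant(name):
    # Python's json module accepts NaN and Infinity, which are not valid JSON.
    raise ValueError(name)


def decode_value(text):
    """Decode a query value that carries no operator prefix."""
    try:
        value = json.loads(text, parse_constant=_reject_constant)
    except ValueError:
        return text  # not valid JSON, so the raw text is taken as a string
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    return text  # arrays and objects are not supported as match values
```

Under these assumptions `3` decodes to the number 3 and `"3"` (with quotes) decodes to the string "3", which is why the quotes matter in the query examples.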

Filter Operators

Sidereal Events supports more filters than just field equality. The following additional operators are available, though many come with caveats on how often they can be used in a single query.

Logical Not, with a value prefix of !. This can be used multiple times in a single query. As an example, ?../type=!"com.example.sidereal", or ?..%2Ftype=%21%22com.example.sidereal%22 if using strict percent-encoding.
Array Contains, with a value prefix of [. This can be used multiple times in a single query. As an example, ?../type=["com.example.sidereal", or ?..%2Ftype=%5B%22com.example.sidereal%22 if using strict percent-encoding.
Less Than, with a value prefix of <. This can only be used once in a single query, and precludes the use of Less Than or Equal and Starts With operators. It may be used in conjunction with Greater Than or Equal and Greater Than only if these operators are used with the same key. As an example, ?../type=<"com.example.sidereal", or ?..%2Ftype=%3C%22com.example.sidereal%22 if using strict percent-encoding.
Less Than or Equal, with a value prefix of <=. This can only be used once in a single query, and precludes the use of the Less Than and Starts With operators. It may be used in conjunction with Greater Than or Equal and Greater Than only if these operators are used with the same key. As an example, ?../type=<="com.example.sidereal", or ?..%2Ftype=%3C%3D%22com.example.sidereal%22 if using strict percent-encoding.
Greater Than or Equal, with a value prefix of >=. This can only be used once in a single query, and precludes the use of the Greater Than and Starts With operators. It may be used in conjunction with Less Than and Less Than or Equal only if these operators are used with the same key. As an example, ?../type=>="com.example.sidereal", or ?..%2Ftype=%3E%3D%22com.example.sidereal%22 if using strict percent-encoding.
Greater Than, with a value prefix of >. This can only be used once in a single query, and precludes the use of the Greater Than or Equal and Starts With operators. It may be used in conjunction with the Less Than and Less Than or Equal operators only if these operators are used with the same key. As an example, ?../type=>"com.example.sidereal", or ?..%2Ftype=%3E%22com.example.sidereal%22 if using strict percent-encoding.
Starts With, with a value prefix of ~. This can only be used once in a single query, can only be used with string values, and precludes the use of the Less Than, Less Than or Equal, Greater Than or Equal, and Greater Than operators. As an example, ?../type=~"com.example.sidereal", or ?..%2Ftype=%7E%22com.example.sidereal%22 if using strict percent-encoding.
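The prefixes listed above can be recognized, and the strict percent-encoded forms produced, with a short Python sketch. It only splits a prefix from the rest of the value; it does not enforce the per-query limits described above, and the names `OPERATORS`, `split_operator`, and `encode_term` are hypothetical.

```python
from urllib.parse import quote

# Two-character prefixes come first so that "<=" is not read as "<" followed by "=".
OPERATORS = (
    ("<=", "less than or equal"),
    (">=", "greater than or equal"),
    ("!", "not"),
    ("[", "array contains"),
    ("<", "less than"),
    (">", "greater than"),
    ("~", "starts with"),
)


def split_operator(raw_value):
    """Return (operator name, remaining value text) for a decoded query value."""
    for prefix, name in OPERATORS:
        if raw_value.startswith(prefix):
            return name, raw_value[len(prefix):]
    return "equals", raw_value


def encode_term(key, raw_value):
    """Percent-encode one key=value term for use in a query string."""
    return quote(key, safe="") + "=" + quote(raw_value, safe="")
```

`encode_term("../type", '!"com.example.sidereal"')` produces `..%2Ftype=%21%22com.example.sidereal%22`, as in the Logical Not example. Python's `quote` leaves `~` unencoded because it is an unreserved character, so a Starts With value keeps a literal `~` rather than `%7E`; both forms decode to the same character.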

Server Configuration

Sidereal Events accepts configuration through command-line flags and environment variables.

Flag: --server-port
Environment Variable: SIDEREAL_SERVER_PORT
Type: Integer
Default Value: 8232

Sidereal Events will listen for HTTP connections over this TCP port.

Flag: --source-name
Environment Variable: SIDEREAL_SOURCE_NAME
Type: String
Default Value: //name.djsweet.sidereal

CloudEvents emitted by Sidereal Events will use this string as the "source" metadata.

Flag: --log-level
Environment Variable: SIDEREAL_LOG_LEVEL
Type: One of trace, debug, info, warn, or error
Default Value: info

Sets the minimum logging level. Log levels are defined in a hierarchy, with trace being the lowest and error being the highest. If this is set to info, then all logs at a level of INFO, WARN, and ERROR are generated, but TRACE and DEBUG are ignored.

Flag: --router-threads
Environment Variable: SIDEREAL_ROUTER_THREADS
Type: Integer
Default Value: Number of logical CPU threads reported by the operating system.

Sidereal Events will spawn this many operating system threads to route events to consuming queries.

Flag: --translator-threads
Environment Variable: SIDEREAL_TRANSLATOR_THREADS
Type: Integer
Default Value: Number of logical CPU threads reported by the operating system.

Sidereal Events will spawn this many operating system threads to translate CloudEvents into its internal indexing representation.

Flag: --web-server-threads
Environment Variable: SIDEREAL_WEB_SERVER_THREADS
Type: Integer
Default Value: Twice the number of logical CPU threads reported by the operating system.

Sidereal Events will spawn this many operating system threads to service HTTP requests.

Flag: --max-body-size-bytes
Environment Variable: SIDEREAL_MAX_BODY_SIZE_BYTES
Type: Integer
Default Value: 10,485,760 (10 MiB)

Sidereal Events will reject HTTP bodies with a content length greater than this value, sending an HTTP 413 when the request body is too large according to this value.

Flag: --max-idempotency-keys
Environment Variable: SIDEREAL_MAX_IDEMPOTENCY_KEYS
Type: Integer
Default Value: 1,048,576

Sidereal Events will retain this many "source", "id" combinations in a set before discarding the oldest values. Setting this value too low may cause duplicate publishes of events to become non-idempotent, but setting this value too high will result in excess memory usage.

Flag: --max-json-parsing-recursion
Environment Variable: SIDEREAL_MAX_JSON_PARSING_RECURSION
Type: Integer
Default Value: 64

Sidereal Events will recurse this deep when translating JSON into its internal indexed representation. At nested objects deeper than the configured value, Sidereal Events will use a stack-iterative algorithm that requires heap allocation. This value is chosen to trade off performance with StackOverflowError exceptions. While Sidereal Events dynamically configures itself to avoid StackOverflowErrors in other areas, it is not expected for JSON documents to contain thousands of levels of nesting, and thus it is left as a configurable value.

Flag: --max-outstanding-events-per-router-thread
Environment Variable: SIDEREAL_MAX_OUTSTANDING_EVENTS_PER_ROUTER_THREAD
Type: Integer
Default Value: 131,072

Sidereal Events keeps track of the number of events present "within" the system. An event must be delivered to all interested consumers before it is no longer tracked as being outstanding. If the number of outstanding events exceeds this number multiplied by the number of routing threads, Sidereal Events will respond to producers with an HTTP 429, establishing backpressure within the event routing path. The producers are expected to re-attempt the publication of their events after a brief period of waiting when encountering this HTTP 429.

Flag: --max-query-terms
Environment Variable: SIDEREAL_MAX_QUERY_TERMS
Type: Integer
Default Value: 32

Sidereal Events limits the number of filter terms available to consumers to prevent excessively large query indices. If a client attempts to use more filters than this configured value, Sidereal Events will reply with an HTTP 400.

Flag: --body-timeout-ms
Environment Variable: SIDEREAL_BODY_TIMEOUT_MS
Type: Integer
Default Value: 60,000

After all HTTP headers are received, Sidereal Events will wait up to this many milliseconds to receive the entire request body. If the request body is not fully received within this time, Sidereal Events will respond with an HTTP 408.

Flag: --idempotency-expiration-ms
Environment Variable: SIDEREAL_IDEMPOTENCY_EXPIRATION_MS
Type: Integer
Default Value: 180,000

Sidereal Events will remove its record of a "source", "id" combination from its internal tracking set after this many milliseconds. Any transmission of the same "source", "id" combination after the combination is removed from the tracking set will result in the data being sent to consumers.

Flag: --tcp-idle-timeout-ms
Environment Variable: SIDEREAL_TCP_IDLE_TIMEOUT_MS
Type: Integer
Default Value: 180,000

Sidereal Events will close a TCP connection after this many milliseconds of no activity. This prevents connections dropped without a TCP FIN or TCP RST from consuming resources.
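The HTTP 429 backpressure described under --max-outstanding-events-per-router-thread expects producers to wait briefly and retry. The Python sketch below shows one way a producer might do this, assuming a caller-supplied `send` callable that performs the POST and returns the HTTP status code. The function name, the doubling delay, and the default values are all assumptions made for this illustration.

```python
import time


def publish_with_retry(send, attempts=5, initial_delay=0.1, sleep=time.sleep):
    """Call send() until it returns a status other than 429 or attempts run out."""
    delay = initial_delay
    status = None
    for attempt in range(attempts):
        status = send()
        if status != 429:
            return status
        if attempt < attempts - 1:
            sleep(delay)
            delay *= 2  # back off a little longer after each rejection
    return status
```

Retrying in this way is safe with respect to duplicates as long as the retried event keeps the same "source" and "id", since publication is idempotent within the configured time horizon and key limit.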

Metrics

Metric observability for Sidereal Events is available by requesting GET /metrics. The response body follows Prometheus' Text-based format.

The currently exposed metrics are

sidereal_data_byte_budget
A gauge indicating the maximum size of any key/value pair in a query term. This is unlikely to change during normal execution, but may be lowered automatically if Sidereal Events encounters a StackOverflowError when routing an event to consuming queries.
sidereal_outstanding_events
A gauge indicating the number of events accepted by Sidereal Events, but not yet confirmed delivered to consumers.
sidereal_active_queries
A gauge indicating the number of queries being serviced, labeled per router.
sidereal_event_routing_seconds
A summary, without quantiles, of the time (in seconds) spent routing events to consuming queries, labeled per router.
sidereal_idempotency_key_cache_size
A gauge indicating the number of "source", "id" combinations saved in the tracking set, labeled per router.
sidereal_json_translation_seconds
A summary, without quantiles, of the time (in seconds) spent translating events into Sidereal Events' internal index representation, labeled per translator.
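As an illustration of consuming these metrics without a Prometheus server, the Python sketch below parses the text format and derives a mean routing time from a summary's `_sum` and `_count` series, which is how Prometheus summaries without quantiles are conventionally exposed. It assumes sample lines carry no trailing timestamp; `parse_samples` and `mean_seconds` are hypothetical names, and the exact label names emitted by Sidereal Events are not assumed here.

```python
def parse_samples(text):
    """Parse Prometheus text-format lines into (name, labels_text, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name, _, labels = series.partition("{")
        samples.append((name, labels.rstrip("}"), float(value)))
    return samples


def mean_seconds(samples, metric):
    """Average duration across all label sets of a summary, or None without data."""
    total = sum(value for name, _, value in samples if name == metric + "_sum")
    count = sum(value for name, _, value in samples if name == metric + "_count")
    return total / count if count else None
```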

Support OR in queries

Right now, the only sort of query Thorium supports is a logical conjunction, or a bunch of individual terms AND'd together, like so:

a=1 AND b=2 AND c=3 AND d=4 ...

Thorium can indirectly support OR queries "just" by running a bunch of concurrent conjunction-only queries

a=1 AND b=2 AND c=3 AND d=4 ...
... (OR)
a=4 AND b=3 AND c=2 AND d=1 ...

But there are at least two major problems with this:

If any bit of data matches multiple queries, it's fully duplicated, even though it doesn't need to be duplicated
There's no real ordering guarantee (definitely not with the current Server Sent Events interface, but possibly so with the eventual WebSocket interface) between the queries, so it's difficult to impossible to coordinate results between these queries so that the result is always consistent.

However, if we move the "just a bunch of concurrent conjunctions" practice into a QueryServer, directly, we can inherently eliminate the duplication and sidestep the coordination problem.

If we supported an encoding of OR using disjunctive normal form, we could trivially extract the AND terms into "independent" queries, but this is where the triviality ends and the low difficulty begins. Each of these "independent" queries would have to somehow identify the full disjunction it belongs to, so that when we report a match to clients, we avoid reporting the same data multiple times for one specific query.

Extending the internal query protocol

Right now, we assume one clientID equals one query. This was expedient for the Server Sent Events interface, but won't work for either the upcoming WebSocket interface or for OR queries. But, we can fix this by:

Updating QueryResponderSpec to incorporate a query ID scoped to the already existent client ID, and using this throughout the QueryServer (Can be set entirely by consumers)
Updating every FullQuery typed variable to instead take an Iterable of FullQuery, to represent the disjunction of each conjunction in a FullQuery
Updating ChannelInfo to associate a QueryResponderSpec with an iterable of QueryPath, and using this new iterable to both add and remove the QueryResponderSpec to the underlying QuerySetTree
Adding a query IDs set to ReportData, so that consumers can report to their own clients which queries were affected. The "set" is doing some heavy lifting here: the same query ID can be mapped to multiple queries, and we only want to report data once per query ID. Using a "set" prevents duplicate sends to the same query ID.
Keeping a map of client ID to ReportData instances, which will also include the query IDs explicitly matched
Breaking out calls to trySendDataToResponder so that they are no longer called directly in the response loops
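The role of the query ID set in the steps above can be sketched in Python. The real types live in the project's JVM code and their fields are not shown here, so `ResponderKey` and `Report` below are simplified, hypothetical stand-ins for QueryResponderSpec and ReportData, and `collect_reports` is a made-up name for the grouping step.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ResponderKey:
    """Stand-in for a QueryResponderSpec: a query ID scoped to a client ID."""
    client_id: str
    query_id: str


@dataclass
class Report:
    """Stand-in for ReportData: one payload plus the set of query IDs it matched."""
    data: dict
    query_ids: set = field(default_factory=set)


def collect_reports(matched_responders, data):
    """Group every matched conjunction by client so each client gets `data` once."""
    reports = {}
    for responder in matched_responders:
        report = reports.setdefault(responder.client_id, Report(data))
        # A set absorbs repeats when several conjunctions of one query match.
        report.query_ids.add(responder.query_id)
    return reports
```

Sending would then happen once per entry in the returned map, after the matching loop has finished, rather than once per matched conjunction inside it.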

Encoding the disjunctions

Queries are currently encoded as a query string, and added to the URL of the Server Sent Events interface. We intend to keep this exact encoding for the WebSocket interface as well.

The original query string spec used to allow for the use of ';' in addition to '&' for query separators, but this is no longer the case. So, this used to be an expected sort of query string:

?q=1;a=2;c=3

And hypothetically you could say

?q=1&a=2&c=3;q=2&a=b&c=3;q=3&a=a&c=c ...

Which is to say, a mixture of '&' and ';' characters. This could be used to represent

(q=1 AND a=2 AND c=3) OR (q=2 AND a=b AND c=3) OR (q=3 AND a=a AND c=c)

which is the exact definition of disjunctive normal form.

Because the use of ';' was removed from URL Form Encoding in HTML5, we're definitely going to need to test whether Netty HTTP will allow for such an encoding, at least in terms of the URL passed to it. So long as we can split by ';' and then trigger the existing query string decoding on each resulting segment, this encoding should work.
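The split-then-decode approach can be sketched in Python as follows. This only illustrates the order of operations (split on ';', then on '&', then percent-decode each key and value); it says nothing about how Netty behaves, and `decode_dnf` and `render` are hypothetical names.

```python
from urllib.parse import unquote


def decode_dnf(query_string):
    """Split a query string on ';' into conjunctions, then each on '&' into terms."""
    conjunctions = []
    for segment in query_string.lstrip("?").split(";"):
        terms = []
        for pair in segment.split("&"):
            if not pair:
                continue
            key, _, value = pair.partition("=")
            # Decoding happens after splitting, so an encoded %3B or %26 inside
            # a key or value is never mistaken for a separator.
            terms.append((unquote(key), unquote(value)))
        if terms:
            conjunctions.append(terms)
    return conjunctions


def render(conjunctions):
    """Write the decoded query back out as (a AND b) OR (c AND d)."""
    return " OR ".join(
        "(" + " AND ".join(f"{key}={value}" for key, value in terms) + ")"
        for terms in conjunctions
    )
```

Applied to `?q=1&a=2&c=3;q=2&a=b&c=3;q=3&a=a&c=c`, `render(decode_dnf(...))` gives `(q=1 AND a=2 AND c=3) OR (q=2 AND a=b AND c=3) OR (q=3 AND a=a AND c=c)`.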
