
Sidereal Events - Scalable real-time queries over streaming CloudEvents

Sidereal Events is a standalone HTTP server providing reactive queries over streaming CloudEvents. When used in conjunction with Change Data Capture, it can turn any database into a real-time database.

A query in Sidereal Events is a disjunction of conjunctions, encoded in disjunctive normal form, e.g. (a AND b) OR (c AND d AND e) OR ..., of key/value terms, e.g. (key1=value1 AND key2=value2) OR (key3=value3 AND key4=value4) OR .... Queries are internally indexed by every conjunction, by their terms (e.g. key2=value2 is an index entry), as an inversion to the usual practice of data being indexed by their fields. This query indexing allows Sidereal Events to scale efficiently to thousands of concurrent queries while still ingesting tens of thousands of events per second.
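The inversion described above can be illustrated with a small counting index. This is only a sketch of the general idea, not the data structure Sidereal Events actually uses; the names QueryIndex, add_query, and match are hypothetical, and the sketch handles only top-level scalar equality terms.

```python
from collections import defaultdict


class QueryIndex:
    """Toy inverted index: (key, value) term -> conjunctions containing it."""

    def __init__(self):
        self._by_term = defaultdict(set)  # term -> {(query_id, conjunction_no)}
        self._sizes = {}                  # (query_id, conjunction_no) -> term count

    def add_query(self, query_id, disjunction):
        # `disjunction` is a list of conjunctions; each is a dict of key -> value.
        for number, conjunction in enumerate(disjunction):
            conjunction_id = (query_id, number)
            self._sizes[conjunction_id] = len(conjunction)
            for term in conjunction.items():
                self._by_term[term].add(conjunction_id)

    def match(self, event_data):
        # Count how many terms of each indexed conjunction the event satisfies.
        hits = defaultdict(int)
        for key, value in event_data.items():
            if isinstance(value, (dict, list)):
                continue  # only scalar terms are indexed in this sketch
            # Note: Python treats 1, 1.0 and True as equal dict keys; a real
            # index would have to keep JSON types apart.
            for conjunction_id in self._by_term.get((key, value), ()):
                hits[conjunction_id] += 1
        # A query is reported once if every term of any of its conjunctions hit.
        return {query_id for (query_id, number), count in hits.items()
                if count == self._sizes[(query_id, number)]}
```

The cost of match depends on the number of terms in the event and the conjunctions sharing those terms, rather than on the total number of registered queries.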

Sidereal Events is still experimental!

Sidereal Events is still very much a work in progress, and isn't yet at a state where it can be considered "production ready". Some aspects of it might change in a breaking manner in the near future. If you'd like to try Sidereal Events out, please proceed with caution.

Running from source

You'll need a Java Development Kit (JDK) supporting Java 17 or greater. Sidereal Events builds and runs correctly using any of

Sidereal Events is compiled using Gradle, and this repository includes the Gradle Wrapper, so only a suitable JDK needs to be explicitly installed to build this project.

Run the server with the command

./gradlew run --args="serve"

This will start the Sidereal Events Server listening for HTTP connections over TCP port 8232. Additional command-line flags can be passed in the string argument to Gradle's --args argument. Run

./gradlew run --args="serve --help"

for a list of these flags, or see the Server Configuration section for more details.

Automated tests can be run with

./gradlew test

Building a runnable JAR

Gradle can build directly-runnable JAR Files by running

./gradlew jar

The resulting JAR will be saved to ./sidereal/build/libs/sidereal.jar, and can be run directly using

java -jar ./sidereal/build/libs/sidereal.jar serve

Building a native executable

Native builds are made possible by the Gradle Plugin for GraalVM Native Image Building. As a prerequisite for its use, you will need to

  1. Install a GraalVM JDK, which will also include the Native Build tooling.

  2. Export the location of the installed GraalVM JDK with the environment variable GRAALVM_HOME.

    On macOS, if your version of GraalVM covers a specific version of Java for which another JDK is not available, e.g. you have installed GraalVM 21 and do not have any other JDK installed for Java 21, you can set GRAALVM_HOME by using the java_home tool. For example,

    export GRAALVM_HOME=$(/usr/libexec/java_home -v 21) 

Once GraalVM has been installed, and its installation path has been stored in the GRAALVM_HOME environment variable, you can build a native executable with

./gradlew nativeBuild

and the resulting artifact will be saved to ./sidereal/build/native/nativeCompile/sidereal.

Automated tests can be run against a native executable by running

./gradlew nativeTest

Licensing

Sidereal Events is available under the terms of the MIT License as per the LICENSE file. Other licensing information is presented using SPDX License IDs embedded in the source files.

More information regarding the technical practice of maintaining license information is available in docs/development/licenses.md.

Sending events

Events are sent to named "channels" by issuing an HTTP POST containing JSON against the path

/channels/CHANNEL-NAME

where CHANNEL-NAME is a percent-encoded string containing at least 1 character. For example, a CHANNEL-NAME of "with/slash" would be encoded as

/channels/with%2fslash
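A client might build this path as in the following sketch, which uses Python's standard urllib.parse module. The helper name channel_path is hypothetical.

```python
from urllib.parse import quote


def channel_path(channel_name: str) -> str:
    """Build the request path for a channel, percent-encoding its name."""
    if not channel_name:
        raise ValueError("channel name must contain at least 1 character")
    # safe="" makes quote() encode "/" too, which it leaves alone by default.
    return "/channels/" + quote(channel_name, safe="")


print(channel_path("with/slash"))
```

Python emits the hex digits in upper case (%2F); percent-encoded hex digits are case-insensitive, so this is equivalent to the %2f shown above.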

The name meta is used internally to report query registration, and does not receive events from external sources. Attempts to send an event to the meta channel will result in an HTTP 403 response.

The Content-Type of the data sent to this endpoint can be one of

  • application/json
  • application/cloudevents+json
  • application/cloudevents-batch+json

with the contents interpreted according to the HTTP Protocol Binding for CloudEvents. Of note is how Content-Type: application/json implicitly describes the CloudEvent datacontenttype as application/json.

The CloudEvent specification expects each combination of "source" and "id" to be globally unique. Sidereal Events internally keeps track of these "source" and "id" combinations it has received, over a configurable time horizon with a configurable number of remembered events. If a producer tries to send a "source", "id" combination to a channel that has already received this combination, Sidereal Events will report the publication as having been successful, but will not deliver the event to consumers. Within the constraints of the time horizon and maximum remembered "source", "id" combinations, this makes event publishing an idempotent operation.

Note that the same "source", "id" combination can be published to multiple channels. Each channel will send the data for the same "source", "id" combination at least once.
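The deduplication behavior described above can be pictured with the following sketch. It is an illustration under stated assumptions, not Sidereal Events' implementation: the class IdempotencyTracker and its method are hypothetical, and it assumes combinations are tracked per channel and forgotten oldest-first.

```python
import time
from collections import OrderedDict


class IdempotencyTracker:
    """Remember ("source", "id") combinations per channel, within limits."""

    def __init__(self, max_keys, expiration_ms, clock=time.monotonic):
        self._max_keys = max_keys
        self._expiration_s = expiration_ms / 1000.0
        self._clock = clock
        self._seen = OrderedDict()  # (channel, source, id) -> time first seen

    def should_deliver(self, channel, source, event_id):
        now = self._clock()
        # Drop entries older than the time horizon; the oldest come first.
        while self._seen:
            oldest_key, seen_at = next(iter(self._seen.items()))
            if now - seen_at < self._expiration_s:
                break
            del self._seen[oldest_key]
        key = (channel, source, event_id)
        if key in self._seen:
            return False  # duplicate: report success, but do not deliver
        self._seen[key] = now
        if len(self._seen) > self._max_keys:
            self._seen.popitem(last=False)  # forget the oldest combination
        return True
```

A publish that returns False here would still be acknowledged to the producer as successful; only the delivery to consumers is skipped.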

Receiving events

Consumers receive an event stream by issuing an HTTP GET against the same path used to send events. The response body of this HTTP GET follows the Server-sent events specification, with the event name being "data".

For example, if an event is sent with

POST /channels/events HTTP/1.1
Content-Type: application/json
Ce-Source: //somewhere
Ce-Id: some-id
Ce-Type: some.type

...

Then the event will be present in a GET to the same path:

GET /channels/events HTTP/1.1
...

HTTP/1.1 200 OK
Connection: keep-alive
transfer-encoding: chunked
Content-Type: text/event-stream; charset=utf-8

event: connect
data: {"timestamp":"...","clientID":"..."}

... After the event is sent
event: data
id: %2F%2Fsomewhere+some-id
data: {"source":"//somewhere","id":"some-id","type":"some.type","data":{...}}

The contents of the Server-sent event are formatted according to the JSON Event Format for CloudEvents.
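A consumer that does not use a ready-made Server-sent events client could read the stream roughly as follows. This is a simplified sketch of the stream format (it ignores the "retry" field and last-event-ID handling), and the function names parse_sse and cloud_events are hypothetical.

```python
import json


def parse_sse(lines):
    """Yield (event_name, event_id, data) for each complete event."""
    name, event_id, data = "message", None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:  # a blank line dispatches the event built so far
                yield name, event_id, "\n".join(data)
            name, event_id, data = "message", None, []
        elif line.startswith(":"):
            continue  # comment line
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space is not part of the value
            if field == "event":
                name = value
            elif field == "id":
                event_id = value
            elif field == "data":
                data.append(value)


def cloud_events(lines):
    """Yield the decoded CloudEvent of every event named "data"."""
    for name, _event_id, data in parse_sse(lines):
        if name == "data":
            yield json.loads(data)
```

Multiple data: lines belonging to one event are joined with newlines before decoding, which JSON parsers accept as whitespace between tokens.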

Filtering received events

Sidereal Events is designed to efficiently support deep-content filtering of its JSON input across thousands of connected clients, with multiple query terms as part of a logical disjunction of conjunctions (e.g. (a AND b AND c) OR (d AND e) OR ...). This filtering is enabled by passing the terms of the filter as an HTTP query string. For example, if a consumer were to connect to a channel using

GET /channels/example?one="one"&two="two"&three=3 HTTP/1.1
...

And the producer were to send the events

POST /channels/example HTTP/1.1
Content-Type: application/cloudevents-batch+json

[
{
    "source": "somewhere",
    "id": "1",
    "type": "com.example.sidereal",
    "specversion": "1.0",
    "data": {
        "one": "one",
        "two": "two",
        "three": 3,
        "four": "four"
    }
}, {
    "source": "somewhere",
    "id": "2",
    "type": "com.example.sidereal",
    "specversion": "1.0",
    "data": {
        "one": "one",
        "two": "two",
        "three": 3,
        "four": "five"
    }
}, {
    "source": "somewhere",
    "id": "3",
    "type": "com.example.sidereal",
    "specversion": "1.0",
    "data": {
        "two": "two",
        "three": 3,
        "four": "five"
    }
}, {
    "source": "somewhere",
    "id": "4",
    "type": "com.example.sidereal",
    "specversion": "1.0",
    "data": {
        "one": "two",
        "two": "two",
        "three": 3
    }
}, {
    "source": "somewhere",
    "id": "5",
    "type": "com.example.sidereal",
    "specversion": "1.0",
    "data": {
        "one": "one",
        "two": "one",
        "three": 3
    }
}, {
    "source": "somewhere",
    "id": "6",
    "type": "com.example.sidereal",
    "specversion": "1.0",
    "data": {
        "one": "one",
        "two": 2,
        "three": 3
    }
}, {
    "source": "somewhere",
    "id": "7",
    "type": "com.example.sidereal",
    "specversion": "1.0",
    "data": {
        "one": "one",
        "two": "two",
        "three": 4
    }
}, {
    "source": "somewhere",
    "id": "8",
    "type": "com.example.sidereal",
    "specversion": "1.0",
    "data": {
        "one": "one",
        "two": "two",
        "three": "three"
    }
}
]

then a client receiving events for the "example" channel would see

GET /channels/example?one="one"&two="two"&three=3 HTTP/1.1
...

HTTP/1.1 200 OK
Connection: keep-alive
transfer-encoding: chunked
Content-Type: text/event-stream; charset=utf-8

event: connect
data: {"timestamp":"...","clientID":"..."}

event: data
id: somewhere+1
data: {"source":"somewhere","id":"1","type":"com.example.sidereal","specversion":"1.0","data":
data: {"one":"one","two":"two","three":3,"four":"four"}}

event: data
id: somewhere+2
data: {"source":"somewhere","id":"2","type":"com.example.sidereal","specversion":"1.0","data":
data: {"one":"one","two":"two","three":3,"four":"five"}}

for the following reasons:

  • 1 would match because data["one"] == "one", data["two"] == "two", and data["three"] == 3. Neither the contents nor even the presence of data["four"] has any effect on the given filter.
  • 2 would match because data["one"] == "one", data["two"] == "two", and data["three"] == 3. As with 1, neither the contents nor the presence of data["four"] has any effect.
  • 3 would not match because data["one"] is not present.
  • 4 would not match because data["one"] == "two" when we expected data["one"] == "one".
  • 5 would not match because data["two"] == "one" when we expected data["two"] == "two".
  • 6 would not match because data["two"] == 2 when we expected data["two"] == "two".
  • 7 would not match because data["three"] == 4 when we expected data["three"] == 3.
  • 8 would not match because data["three"] == "three" when we expected data["three"] == 3.
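The reasoning in this list can be reproduced with a few lines of code. The sketch below is a naive per-event check, not how Sidereal Events evaluates filters internally; json_equal and matches_all are hypothetical names.

```python
_MISSING = object()  # marks a key that is absent from the event data


def json_equal(actual, expected):
    """Equality that keeps JSON types apart (Python alone treats True == 1)."""
    if isinstance(actual, bool) or isinstance(expected, bool):
        return type(actual) is type(expected) and actual == expected
    if isinstance(actual, (int, float)) and isinstance(expected, (int, float)):
        return actual == expected
    return type(actual) is type(expected) and actual == expected


def matches_all(data, terms):
    """True when every key/value term of the conjunction holds for `data`."""
    return all(json_equal(data.get(key, _MISSING), value)
               for key, value in terms.items())


TERMS = {"one": "one", "two": "two", "three": 3}
EVENT_DATA = {
    "1": {"one": "one", "two": "two", "three": 3, "four": "four"},
    "2": {"one": "one", "two": "two", "three": 3, "four": "five"},
    "3": {"two": "two", "three": 3, "four": "five"},
    "4": {"one": "two", "two": "two", "three": 3},
    "5": {"one": "one", "two": "one", "three": 3},
    "6": {"one": "one", "two": 2, "three": 3},
    "7": {"one": "one", "two": "two", "three": 4},
    "8": {"one": "one", "two": "two", "three": "three"},
}
matched = [event_id for event_id, data in EVENT_DATA.items()
           if matches_all(data, TERMS)]
print(matched)
```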

OR in Queries (Disjunctions of Conjunctions)

Disjunctions of conjunctions are made possible by using the now-historical ; separator character. This separator binds less tightly than the & separator character: terms joined by & are grouped into conjunctions first, and ; then combines those conjunctions. For example, if a consumer were to connect to a channel using

GET /channels/example?one="one"&two="two"&three=3;one=1&two=2&three="three" HTTP/1.1
...

this would have similar results to connecting twice with both

GET /channels/example?one="one"&two="two"&three=3 HTTP/1.1
...
GET /channels/example?one=1&two=2&three="three" HTTP/1.1
...

but with the added benefits of

  1. Only requiring one HTTP connection in the server-sent events interface
  2. Only reporting data once, even if multiple conjunctions are matched
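The precedence of the two separators can be sketched as a two-level split. This is an illustration only; parse_dnf is a hypothetical helper, and it leaves each value as raw text rather than decoding it.

```python
from urllib.parse import unquote


def parse_dnf(query_string):
    """Split on ';' first (OR), then on '&' (AND), then on the first '='."""
    if query_string.startswith("?"):
        query_string = query_string[1:]
    disjunction = []
    for segment in query_string.split(";"):
        conjunction = {}
        for pair in segment.split("&"):
            if not pair:
                continue
            # Only the first "=" separates key and value, so values such as
            # ">=7" survive intact.
            key, _, value = pair.partition("=")
            conjunction[unquote(key)] = unquote(value)
        if conjunction:
            disjunction.append(conjunction)
    return disjunction
```

Because splitting happens before percent-decoding, a literal ; or & inside a key or value has to be percent-encoded by the client.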

Deep JSON Member Access

By default, if a key does not start with / or ../, it is assumed to be a literal key within the "data" object of the event. For example, a query string of the form

some.key="value"

is interpreted to match

{
    "source": "...",
    "id": "...",
    "type": "...",
    "specversion": "1.0",
    "data": {
        "some.key": "value"
    }
}

Access to keys within JSON documents is made possible by using JSON Pointers. As an example, to match "some", then "key" in

{
    "source": "...",
    "id": "...",
    "type": "com.example.sidereal",
    "specversion": "1.0",
    "data": {
        "some": {
            "key": "value"
        }
    }
}

you would use a query string

?/some/key="value"

Note that ~ in a key path component must be replaced with ~0, and / in a key path component must be replaced with ~1. The replacement of ~ with ~0 must occur before replacing / with ~1; otherwise the ~1 produced for a / would itself be rewritten as ~01. For a key path of

data["with/slash"]["with~tilde"]

the JSON Pointer encoding would be

?/with~1slash/with~0tilde
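The escaping order can be captured in a small helper. The function names below are hypothetical; the escaping rules themselves are those of the JSON Pointer specification.

```python
def escape_token(token):
    # "~" must be escaped first; otherwise the "~1" produced for "/" would
    # itself be rewritten to "~01".
    return token.replace("~", "~0").replace("/", "~1")


def to_json_pointer(path):
    """Join already-split key path components into a JSON Pointer."""
    return "".join("/" + escape_token(part) for part in path)


print(to_json_pointer(["with/slash", "with~tilde"]))
```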

Arrays can be accessed with non-negative, zero-based integers as the "key" in the reference. For example, the following data

{
    "source": "...",
    "id": "...",
    "type": "com.example.sidereal",
    "specversion": "1.0",
    "data": {
        "some": {
            "key": [
              "first",
              "second",
              "third",
              "fourth"
            ]
        }
    }
}

would match the following query:

?/some/key/2="third"

You can mix in deeper JSON access even using array access. The following data

{
  "source": "...",
  "id": "...",
  "type": "com.example.sidereal",
  "specversion": "1.0",
  "data": {
    "some": {
      "key": [
        {
          "name": "first",
          "value": 1
        },
        {
          "name": "second",
          "value": 2
        }
      ]
    }
  }
}

would match the following query:

?/some/key/0/name="first"

Keys are matched starting from the "data" key in the resulting CloudEvent by default. As an extension to JSON Pointers, if a key starts with .. and the remainder is a JSON Pointer, the key is matched starting from the object root. As an example, to match the CloudEvent type in

{
    "source": "...",
    "id": "...",
    "type": "com.example.sidereal",
    "specversion": "1.0",
    "data": {
        "some": {
            "key": "value"
        }
    }
}

you would use a query string

?../type="com.example.sidereal"
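Taken together, the three key forms (literal key, JSON Pointer, and the .. extension) can be resolved as in the following sketch. It illustrates the rules described in this section and is not Sidereal Events' own code; resolve_key and MISSING are hypothetical names.

```python
MISSING = object()  # returned when the key does not exist in the event


def _unescape(token):
    # Decoding reverses the encoding order: "~1" first, then "~0".
    return token.replace("~1", "/").replace("~0", "~")


def _array_index(token, length):
    # JSON Pointer indices are ASCII digits without leading zeros.
    if token.isascii() and token.isdigit() and (token == "0" or token[0] != "0"):
        index = int(token)
        if index < length:
            return index
    return None


def resolve_key(event, key):
    if key.startswith("../"):
        node, pointer = event, key[2:]             # start at the event root
    elif key.startswith("/"):
        node, pointer = event.get("data", MISSING), key
    else:                                          # literal key inside "data"
        data = event.get("data")
        return data.get(key, MISSING) if isinstance(data, dict) else MISSING
    for token in pointer.split("/")[1:]:
        token = _unescape(token)
        if isinstance(node, dict):
            node = node.get(token, MISSING)
        elif isinstance(node, list):
            index = _array_index(token, len(node))
            if index is None:
                return MISSING
            node = node[index]
        else:
            return MISSING
    return node
```

For the events shown above, resolve_key(event, "/some/key/0/name") would yield "first" and resolve_key(event, "../type") would yield "com.example.sidereal".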

Query Values

Values are encoded according to their JSON representation. Only null, booleans, numbers, and strings are supported as match values. If a value cannot be decoded as null, a boolean, or a number, and does not start with a filter operator prefix, it is assumed to be a string.
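The decoding rule can be sketched as "try JSON first, fall back to a string". The helper below is hypothetical and deliberately ignores filter operator prefixes; treating a JSON object or array as a plain string is an assumption of this sketch, not documented behavior.

```python
import json


def _reject_constant(name):
    # Python's json module accepts NaN and Infinity, which are not valid JSON.
    raise ValueError(name)


def decode_value(raw):
    """Decode a query value as null, boolean, number, or string."""
    try:
        value = json.loads(raw, parse_constant=_reject_constant)
    except ValueError:
        return raw  # not valid JSON: the text itself is the string
    if isinstance(value, (dict, list)):
        return raw  # assumption: only scalar values are match values
    return value
```

Under this rule, two="two" and two=two both match the string "two", while three=3 matches the number 3 and three="3" matches the string "3".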

Filter Operators

Sidereal Events supports more filters than just field equality. The following additional operators are available, but many with caveats on the number of operators per query.

  • Logical Not, with a value prefix of !. This can be used multiple times in a single query. As an example, ?../type=!"com.example.sidereal", or ..%2Ftype=%21%22com.example.sidereal%22 if using strict percent-encoding.
  • Array Contains, with a value prefix of [. This can be used multiple times in a single query. As an example, ?../type=["com.example.sidereal", or ?..%2Ftype=%5B%22com.example.sidereal%22 if using strict percent-encoding.
  • Less Than, with a value prefix of <. This can only be used once in a single query, and precludes the use of Less Than or Equal and Starts With operators. It may be used in conjunction with Greater Than or Equal and Greater Than only if these operators are used with the same key. As an example, ?../type=<"com.example.sidereal", or ?..%2Ftype=%3C%22com.example.sidereal%22 if using strict percent-encoding.
  • Less Than or Equal, with a value prefix of <=. This can only be used once in a single query, and precludes the use of the Less Than and Starts With operators. It may be used in conjunction with Greater Than or Equal and Greater Than only if these operators are used with the same key. As an example, ?../type=<="com.example.sidereal", or ?..%2Ftype=%3C%3D%22com.example.sidereal%22 if using strict percent-encoding.
  • Greater Than or Equal, with a value prefix of >=. This can only be used once in a single query, and precludes the use of the Greater Than and Starts With operators. It may be used in conjunction with Less Than and Less Than or Equal only if these operators are used with the same key. As an example, ?../type=>="com.example.sidereal", or ?..%2Ftype=%3E%3D%22com.example.sidereal%22 if using strict percent-encoding.
  • Greater Than, with a value prefix of >. This can only be used once in a single query, and precludes the use of the Greater Than or Equal and Starts With operators. It may be used in conjunction with the Less Than and Less Than or Equal operators only if these operators are used with the same key. As an example, ?../type=>"com.example.sidereal", or ?..%2Ftype=%3E%22com.example.sidereal%22 if using strict percent-encoding.
  • Starts With, with a value prefix of ~. This can only be used once in a single query, can only be used with string values, and precludes the use of the Less Than, Less Than or Equal, Greater Than or Equal, and Greater Than operators. As an example, ?../type=~"com.example.sidereal", or ?..%2Ftype=%7E%22com.example.sidereal%22 if using strict percent-encoding.
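A client library might separate the operator prefix from the value, and produce the strictly percent-encoded form, along these lines. The operator names and both helper functions are hypothetical; only the prefixes come from the list above.

```python
from urllib.parse import quote

# Two-character prefixes come first so "<=" is not read as "<" followed by "=".
OPERATORS = [
    ("<=", "less_than_or_equal"),
    (">=", "greater_than_or_equal"),
    ("!", "not"),
    ("[", "array_contains"),
    ("<", "less_than"),
    (">", "greater_than"),
    ("~", "starts_with"),
]


def split_operator(raw_value):
    """Return (operator_name, remaining_value) for a raw query value."""
    for prefix, name in OPERATORS:
        if raw_value.startswith(prefix):
            return name, raw_value[len(prefix):]
    return "equals", raw_value


def strict_term(key, raw_value):
    """Percent-encode a key and value for use in a query string."""
    return quote(key, safe="") + "=" + quote(raw_value, safe="")


print(strict_term("../type", '<="com.example.sidereal"'))
```

Python's quote leaves ~ unescaped because it is an unreserved URL character, so a Starts With term comes out with a literal ~ rather than %7E; both spellings decode to the same character.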

Server Configuration

Sidereal Events accepts configuration through command-line flags and environment variables.

  • Flag: --server-port
    Environment Variable: SIDEREAL_SERVER_PORT
    Type: Integer
    Default Value: 8232

    Sidereal Events will listen for HTTP connections over this TCP port.

  • Flag: --source-name
    Environment Variable: SIDEREAL_SOURCE_NAME
    Type: String
    Default Value: //name.djsweet.sidereal

    CloudEvents emitted by Sidereal Events will use this string as the "source" metadata.

  • Flag: --log-level
    Environment Variable: SIDEREAL_LOG_LEVEL
    Type: One of trace, debug, info, warn, or error
    Default Value: info

    Sets the minimum logging level. Log levels are defined in a hierarchy, with trace being the lowest and error being the highest. If this is set to info, then all logs at a level of INFO, WARN, and ERROR are generated, but TRACE and DEBUG are ignored.

  • Flag: --router-threads
    Environment Variable: SIDEREAL_ROUTER_THREADS
    Type: Integer
    Default Value: Number of logical CPU threads reported by the operating system.

    Sidereal Events will spawn this many operating system threads to route events to consuming queries.

  • Flag: --translator-threads
    Environment Variable: SIDEREAL_TRANSLATOR_THREADS
    Type: Integer
    Default Value: Number of logical CPU threads reported by the operating system.

    Sidereal Events will spawn this many operating system threads to translate CloudEvents into its internal indexing representation.

  • Flag: --web-server-threads
    Environment Variable: SIDEREAL_WEB_SERVER_THREADS
    Type: Integer
    Default Value: Twice the number of logical CPU threads reported by the operating system.

    Sidereal Events will spawn this many operating system threads to service HTTP requests.

  • Flag: --max-body-size-bytes
    Environment Variable: SIDEREAL_MAX_BODY_SIZE_BYTES
    Type: Integer
    Default Value: 10,485,760 (10 MiB)

    Sidereal Events will reject HTTP bodies with a content length greater than this value, sending an HTTP 413 when the request body is too large according to this value.

  • Flag: --max-idempotency-keys
    Environment Variable: SIDEREAL_MAX_IDEMPOTENCY_KEYS
    Type: Integer
    Default Value: 1,048,576

    Sidereal Events will retain this many "source", "id" combinations in a set before discarding the oldest values. Setting this value too low may cause duplicate publishes of events to become non-idempotent, but setting this value too high will result in excess memory usage.

  • Flag: --max-json-parsing-recursion
    Environment Variable: SIDEREAL_MAX_JSON_PARSING_RECURSION
    Type: Integer
    Default Value: 64

    Sidereal Events will recurse this deep when translating JSON into its internal indexed representation. At nested objects deeper than the configured value, Sidereal Events will use a stack-iterative algorithm that requires heap allocation. This value is chosen to trade off performance with StackOverflowError exceptions. While Sidereal Events dynamically configures itself to avoid StackOverflowErrors in other areas, it is not expected for JSON documents to contain thousands of levels of nesting, and thus it is left as a configurable value.

  • Flag: --max-outstanding-events-per-router-thread
    Environment Variable: SIDEREAL_MAX_OUTSTANDING_EVENTS_PER_ROUTER_THREAD
    Type: Integer
    Default Value: 131,072

    Sidereal Events keeps track of the number of events present "within" the system. An event must be delivered to all interested consumers before it is no longer tracked as being outstanding. If the number of outstanding events exceeds this number multiplied by the number of routing threads, Sidereal Events will respond to producers with an HTTP 429, establishing backpressure within the event routing path. The producers are expected to re-attempt the publication of their events after a brief period of waiting when encountering this HTTP 429.

  • Flag: --max-query-terms
    Environment Variable: SIDEREAL_MAX_QUERY_TERMS
    Type: Integer
    Default Value: 32

    Sidereal Events limits the number of filter terms available to consumers to prevent excessively large query indices. If a client attempts to use more filters than this configured value, Sidereal Events will reply with an HTTP 400.

  • Flag: --body-timeout-ms
    Environment Variable: SIDEREAL_BODY_TIMEOUT_MS
    Type: Integer
    Default Value: 60,000

    After all HTTP headers are received, Sidereal Events will wait up to this many milliseconds to receive an entire request body. If the request body is not fully received within this time, Sidereal Events will respond with an HTTP 408.

  • Flag: --idempotency-expiration-ms
    Environment Variable: SIDEREAL_IDEMPOTENCY_EXPIRATION_MS
    Type: Integer
    Default Value: 180,000

    Sidereal Events will remove its record of a "source", "id" combination from its internal tracking set after this many milliseconds. Any transmission of the same "source", "id" combination after the combination is removed from the tracking set will result in the data being sent to consumers.

  • Flag: --tcp-idle-timeout-ms
    Environment Variable: SIDEREAL_TCP_IDLE_TIMEOUT_MS
    Type: Integer
    Default Value: 180,000

    Sidereal Events will close a TCP connection after this many milliseconds of no activity. This prevents connections dropped without a TCP FIN or TCP RST from consuming resources.
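The backpressure described for --max-outstanding-events-per-router-thread expects producers to retry after an HTTP 429. A producer-side retry loop might look like the sketch below. The function name, the attempt limit, and the exponential delays are all assumptions made for illustration; send stands for any callable that performs the POST and returns the HTTP status code.

```python
import time


def publish_with_retry(send, max_attempts=5, base_delay_s=0.1, sleep=time.sleep):
    """Call send() until it returns a status other than 429."""
    status = None
    for attempt in range(max_attempts):
        status = send()
        if status != 429:
            return status
        if attempt + 1 < max_attempts:
            sleep(base_delay_s * (2 ** attempt))  # wait longer after each 429
    return status  # still 429 after the final attempt
```

Because publishing the same "source", "id" combination is idempotent within the configured limits, retrying a publish this way should not cause duplicate deliveries.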

Metrics

Metric observability for Sidereal Events is available by requesting GET /metrics. The response body follows Prometheus' Text-based format.

The currently exposed metrics are

  • sidereal_data_byte_budget
    A gauge indicating the maximum size of any key/value pair in a query term. This is unlikely to change during normal execution, but may be lowered automatically if Sidereal Events encounters a StackOverflowError when routing an event to consuming queries.
  • sidereal_outstanding_events
    A gauge indicating the number of events accepted by Sidereal Events, but not yet confirmed delivered to consumers.
  • sidereal_active_queries
    A gauge indicating the number of queries being serviced, labeled per router.
  • sidereal_event_routing_seconds
    A summary, without quantiles, of the time (in seconds) spent routing events to consuming queries, labeled per router.
  • sidereal_idempotency_key_cache_size
    A gauge indicating the number of "source", "id" combinations saved in the tracking set, labeled per router.
  • sidereal_json_translation_seconds
    A summary, without quantiles, of the time (in seconds) spent translating events into Sidereal Events' internal index representation, labeled per translator.

sidereal's Issues

Support OR in queries

Right now, the only sort of query Thorium supports is a logical conjunction, or a bunch of individual terms AND'd together, like so:

a=1 AND b=2 AND c=3 AND d=4 ...

Thorium can indirectly support OR queries "just" by running a bunch of concurrent conjunction-only queries

a=1 AND b=2 AND c=3 AND d=4 ...
... (OR)
a=4 AND b=3 AND c=2 AND d=1 ...

But there are at least two major problems with this:

  1. If any bit of data matches multiple queries, it's fully duplicated, even though it doesn't need to be duplicated
  2. There's no real ordering guarantee (definitely not with the current Server Sent Events interface, but possibly so with the eventual WebSocket interface) between the queries, so it's difficult to impossible to coordinate results between these queries so that the result is always consistent.

However, if we move the "just a bunch of concurrent conjunctions" practice into a QueryServer, directly, we can inherently eliminate the duplication and sidestep the coordination problem.

If we somehow supported an encoding of OR using disjunctive normal form, we can trivially extract the AND terms into "independent" queries, but this is where the triviality ends and the low difficulty begins. Each of these "independent" queries will have to somehow identify the full disjunction, so that when we report a match to clients, we avoid reporting the same data multiple times for one specific query.

Extending the internal query protocol

Right now, we assume one clientID equals one query. This was expedient for the Server Sent Events interface, but won't work for either the upcoming WebSocket interface or for OR queries. But, we can fix this by:

  1. Updating QueryResponderSpec to incorporate a query ID scoped to the already existent client ID, and using this throughout the QueryServer (Can be set entirely by consumers)
  2. Updating every FullQuery typed variable to instead take an Iterable of FullQuery, to represent the disjunction of each conjunction in a FullQuery
  3. Updating ChannelInfo to associate a QueryResponderSpec with an iterable of QueryPath, and using this new iterable to both add and remove the QueryResponderSpec to the underlying QuerySetTree
  4. Adding a query IDs set to ReportData, so that consumers can report to their own clients which queries were affected. The "set" is doing some heavy lifting here: the same query ID can be mapped to multiple queries, and we only want to report data once per query ID. Using a "set" prevents duplicate sends to the same query ID.
  5. Keeping a map of client ID to ReportData instances, which will also include the query IDs explicitly matched
  6. Breaking out calls to trySendDataToResponder so that they are no longer called directly in the response loops

Encoding the disjunctions

Queries are currently encoded as a query string, and added to the URL of the Server Sent Events interface. We intend to keep this exact encoding for the WebSocket interface as well.

The original query string spec used to allow for the use of ';' in addition to '&' for query separators, but this is no longer the case. So, this used to be an expected sort of query string:

?q=1;a=2;c=3

And hypothetically you could say

?q=1&a=2&c=3;q=2&a=b&c=3;q=3&a=a&c=c ...

Which is to say, a mixture of '&' and ';' characters. This could be used to represent

(q=1 AND a=2 AND c=3) OR (q=2 AND a=b AND c=3) OR (q=3 AND a=a AND c=c)

which is the exact definition of disjunctive normal form.

Because the use of ';' was removed from URL Form Encoding in HTML5, we're definitely going to need to test whether Netty HTTP will allow for such an encoding, at least in terms of the URL passed to it. We should be fine so long as we can split by ';' and then trigger the existing query string decoding on each resulting segment.

Relative comparison operators shouldn't match different types

If we have the following query:

curl 'http://localhost:8232/channels/test?thing=>7'

And we POST the following data:

curl -X POST -H 'Content-Type: application/cloudevents+json' -d '{"source": "somewhere", "id": "1", "datacontenttype": "application/json", "data": { "thing": 8 }}' http://localhost:8232/channels/test

curl -X POST -H 'Content-Type: application/cloudevents+json' -d '{"source": "somewhere", "id": "2", "datacontenttype": "application/json", "data": { "thing": 6 }}' http://localhost:8232/channels/test

curl -X POST -H 'Content-Type: application/cloudevents+json' -d '{"source": "somewhere", "id": "3", "datacontenttype": "application/json", "data": { "thing": "6" }}' http://localhost:8232/channels/test

We get the following responses:

event: data
id: somewhere+1
data: {"source":"somewhere","id":"1","datacontenttype":"application/json","data":{"thing":8}}

event: data
id: somewhere+3
data: {"source":"somewhere","id":"3","datacontenttype":"application/json","data":{"thing":"6"}}

Internally, this is caused by all strings having a type tag greater than the numeric type tag, so every string compares greater than every number. This doesn't seem like a terribly useful feature, especially since OR is now well supported.

Instead, the output should have just been

event: data
id: somewhere+1
data: {"source":"somewhere","id":"1","datacontenttype":"application/json","data":{"thing":8}}

Which is to say, the operator should have only been considered to match when the types matched.

Thankfully, this is only an issue when the <, <=, >=, > operators are in use. Every other operator will implicitly require the same kind of type (with the exception of !, which will match items of other types, but this is actually a feature in that situation).
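The proposed behavior amounts to checking the type class before comparing. A minimal sketch of that rule, using hypothetical names and plain Python values in place of the internal type tags:

```python
def type_class(value):
    """Classify a decoded JSON value; bool is checked before number on purpose."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    return "null" if value is None else "other"


def greater_than(event_value, query_value):
    """Match only when both values share a comparable type class."""
    kind = type_class(query_value)
    if kind in ("null", "other") or type_class(event_value) != kind:
        return False
    return event_value > query_value


print([greater_than(thing, 7) for thing in (8, 6, "6")])
```

With the three events above, only the event whose "thing" is 8 satisfies thing=>7.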

Handle arrays in JSON Pointer selectors in queries

The JSON Pointer specification details the evaluation of a reference token against an array:

If the currently referenced value is a JSON array, the reference token MUST contain either:

  • characters comprised of digits (see ABNF below; note that leading zeros are not allowed) that represent an unsigned
    base-10 integer value, making the new referenced value the array element with the zero-based index identified by the
    token, or
  • exactly the single character "-", making the new referenced value the (nonexistent) member after the last array element.

Right now, we're not handling arrays at all when attempting to evaluate JSON pointers. Since the specification already has an expectation that arrays are supported, this is technically a bug.

We won't know, at pointer evaluation time, whether or not we're going to get an array, but we will know whether the reference token is a 32-bit integer. What we can do to support arrays in objects is consequently:

  1. Whenever we attempt to read a reference token, we validate whether it is all numeric characters (i.e. all 0-9, no decimals, negatives, or exponents)
  2. If the reference token is all numeric characters, we attempt to parse the characters as a 32-bit signed Int
  3. If the reference token successfully parses as a 32-bit signed Int, we save this Int in a Map<Int, String>, where the value for the key is the reference token itself.

This should all happen internally within the KeyPathReferenceCount so that consumers don't have to implement this themselves.
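The three steps could look roughly like this. The sketch is in Python rather than the project's own language, array_indices is a hypothetical name, and the returned dict stands in for the Map<Int, String> mentioned above.

```python
INT32_MAX = 2**31 - 1


def array_indices(reference_tokens):
    """Map each token that parses as a 32-bit signed Int to the token itself."""
    indices = {}
    for token in reference_tokens:
        # Step 1: only the characters 0-9 qualify (no signs, decimals, exponents).
        if token and all(char in "0123456789" for char in token):
            # Step 2: parse, rejecting values that overflow a 32-bit signed Int.
            value = int(token)
            if value <= INT32_MAX:
                # Step 3: remember the original token for this index.
                indices[value] = token
    return indices


print(array_indices(["some", "key", "2"]))
```

Keeping the original token alongside the parsed index lets the same reference token still be tried as an object key when the value turns out not to be an array.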

This project needs a new name.

There's a Chromium fork called "Thorium" that seems to have... or at least had... users. I'm not going to link to it here, because there's been some degree of controversy over it and its primary author. So,

  1. There's already a thing known as Thorium out in the wild
  2. Lots of people consider it problematic for reasons I'm not going to go into

I'm not taking any stance for or against the author of the product, at all, but the name "Thorium" has to go, and with it probably port 8232 as the default.

Exit, log cleanly when `--server-port` isn't available

Right now, if Thorium can't get a port, it prints a big old stack trace:

Exception in thread "main" java.net.BindException: Address already in use
        at java.base/sun.nio.ch.Net.bind0(Native Method)
        at java.base/sun.nio.ch.Net.bind(Net.java:555)
        at java.base/sun.nio.ch.ServerSocketChannelImpl.netBind(ServerSocketChannelImpl.java:337)
        at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:294)
        at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:141)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:562)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
        at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:600)
        at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:579)
        at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
        at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:260)
        at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
        at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasksFrom(SingleThreadEventExecutor.java:426)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:375)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:557)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:833)

but even worse, doesn't exit cleanly. Instead it just hangs around.

  1. We should exit cleanly on all main thread exceptions, including this one
  2. This exception should be translated into a proper log entry
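Both points could be sketched roughly like this, in plain Java with only the standard library. Everything here is hypothetical: `StartupSketch`, the `Startup` interface, and the exit code values are made up for the example, and it assumes the bind failure surfaces as an exception from the startup call, which may not match how the real server reports it.

```java
import java.net.BindException;
import java.net.ServerSocket;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical startup wrapper; none of these names come from the project. */
final class StartupSketch {
    private static final Logger LOG = Logger.getLogger(StartupSketch.class.getName());

    // Exit codes chosen for this sketch only.
    static final int EXIT_OK = 0;
    static final int EXIT_FAILURE = 1;
    static final int EXIT_PORT_UNAVAILABLE = 2;

    /** Stand-in for whatever actually starts the server. */
    interface Startup {
        void start() throws Exception;
    }

    /** Runs startup and turns any exception into a log entry plus an exit code. */
    static int run(Startup startup, int port) {
        try {
            startup.start();
            return EXIT_OK;
        } catch (BindException e) {
            // One readable line instead of a stack trace.
            LOG.severe("Cannot listen on port " + port + ": " + e.getMessage());
            return EXIT_PORT_UNAVAILABLE;
        } catch (Exception e) {
            // Anything else still gets logged, with its stack trace attached.
            LOG.log(Level.SEVERE, "Startup failed", e);
            return EXIT_FAILURE;
        }
    }

    public static void main(String[] args) {
        int port = 8232;
        // Binding and immediately closing a socket stands in for real server startup.
        int code = run(() -> new ServerSocket(port).close(), port);
        if (code != EXIT_OK) {
            // System.exit stops the JVM even if non-daemon threads are still running.
            System.exit(code);
        }
    }
}
```

The explicit `System.exit` is the part that addresses the hang: a JVM keeps running while any non-daemon thread is alive, so logging the failure alone wouldn't be enough.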

User Guide

Right now, everything about using Thorium is hidden in sections of the README. That's OK as a start, but a proper user guide would be way more useful.

Internal Architecture Documentation

Thorium's architecture isn't entirely obvious, and could use non-code documentation.

query-tree

The query-tree project already has fairly comprehensive Javadocs. A project-specific README with a high-level overview of how the individual files all fit together, justification for implementation details in QueryTree and QPTrie, and usage examples may be sufficient here.

  • Add subproject-specific README with high-level overview, design considerations, and usage examples

thorium

  • Overall theory of operation -- producers, consumers, where QueryTree fits in, where TranslatorVerticle and QueryRouterVerticle factor into the event pipeline
  • Relationship between QueryClientSSEServer and the verticles in QueryServer.kt
    Eventually we'll need one for the inevitable WebSocket implementation outlined in #3
  • Documentation of both why and how in Radix64LowLevelEncoder and Radix64HighLevelEncoder
  • KeyValueSizeLimits.kt and the kvp-byte-budget sub-command need a thorough explanation -- and a justification.
  • Expectations for future features -- why not a configuration file, why not natively support disjunctions (OR), etc.
  • Roadmap of intended future features

WebSocket consumer interface

I expect that real-world users would prefer a WebSocket interface for queries, in addition to the current Server-sent events interface.

TODO: Spec this

Enumerate active channels

Right now, there's no way for anyone to check whether a channel name is currently in use. Producers and consumers have to assume they're using the same channel, and agree on its name through some out-of-band mechanism.

Channels "come into existence" only by way of consumer requests. A producer can publish to a non-existent channel and still receive an HTTP 202.

So, a request to enumerate channels with active consumers shouldn't be too hard -- we already have that information. What we need in addition is a mechanism to keep track of channels that have had publishes within some time horizon.
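The publish-tracking half might look something like the sketch below, in plain Java. `ChannelActivityTracker` and its methods are hypothetical names, and the choice to keep only the latest publish time per channel is an assumption for the example, not a spec.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical tracker for channels that have seen a recent publish. */
final class ChannelActivityTracker {
    // Channel name -> time of the most recent publish.
    private final Map<String, Instant> lastPublish = new ConcurrentHashMap<>();

    /** Records a publish to the channel, keeping the latest time seen. */
    void recordPublish(String channel, Instant when) {
        lastPublish.merge(channel, when, (old, now) -> now.isAfter(old) ? now : old);
    }

    /** Returns the channels published to within the horizon ending at now, sorted by name. */
    Set<String> publishedWithin(Duration horizon, Instant now) {
        Instant cutoff = now.minus(horizon);
        Set<String> active = new TreeSet<>();
        for (Map.Entry<String, Instant> entry : lastPublish.entrySet()) {
            if (!entry.getValue().isBefore(cutoff)) {
                active.add(entry.getKey());
            }
        }
        return active;
    }

    /** Drops entries older than the horizon so the map doesn't grow without bound. */
    void evictOlderThan(Duration horizon, Instant now) {
        Instant cutoff = now.minus(horizon);
        lastPublish.values().removeIf(t -> t.isBefore(cutoff));
    }
}
```

Passing `now` in explicitly, rather than reading the clock inside the tracker, keeps the time horizon easy to test.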

TODO: Spec this out

Administrative API

Operators are going to desire a bunch of features we don't yet support:

  • The ability to enumerate connected clients, and introspect on their connection lifetime and possibly number of events matched, per-channel
  • The ability to terminate any connected client based on identifiers from the above enumeration
  • The ability to terminate all connected clients for a particular channel
  • The ability to purge the idempotency tracking set for any channel
  • The ability to purge the idempotency tracking set for all channels
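To make the first three items a bit more concrete, here's a minimal in-memory sketch in plain Java. `ClientRegistrySketch`, the `Client` record, and every method name are hypothetical; nothing like this exists in the project yet, and the idempotency purges aren't covered.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry of connected clients; not part of the project today. */
final class ClientRegistrySketch {
    /** One connected client. The close callback is run when the client is terminated. */
    record Client(String id, String channel, Runnable close) {}

    private final Map<String, Client> clientsById = new ConcurrentHashMap<>();

    /** Adds a client, replacing any earlier client with the same id. */
    void register(Client client) {
        clientsById.put(client.id(), client);
    }

    /** Enumerates the ids of clients connected to the channel, sorted. */
    List<String> clientIds(String channel) {
        List<String> ids = new ArrayList<>();
        for (Client client : clientsById.values()) {
            if (client.channel().equals(channel)) {
                ids.add(client.id());
            }
        }
        Collections.sort(ids);
        return ids;
    }

    /** Terminates one client by id; returns false if the id is unknown. */
    boolean terminate(String id) {
        Client client = clientsById.remove(id);
        if (client == null) {
            return false;
        }
        client.close().run();
        return true;
    }

    /** Terminates every client on the channel and returns how many were closed. */
    int terminateChannel(String channel) {
        int closed = 0;
        for (String id : clientIds(channel)) {
            if (terminate(id)) {
                closed++;
            }
        }
        return closed;
    }
}
```

Terminating by the same identifiers that enumeration returns keeps the two operations in step, which is what the second item asks for.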

It may eventually also be worth letting operators change certain parameters at runtime, without a restart, but it's unclear how important that is.

Continuous Integration

All builds, tests, and benchmarks are local right now. At least some of these should run in a CI setup, possibly GitHub Actions.

  • Automatic builds for all branches
  • Automatic tests for all branches
  • Figure out if we can run benchmarks in less time, possibly with a 5-10 minute goal

GraalVM Native Image

Delivering an executable as an artifact would be pretty nice.

  1. Get a successful build with native-image -jar ./thorium/build/libs/thorium.jar
  2. Figure out how to use the Gradle plugin
