Eva

What is Eva?

Eva is a distributed database system implementing an entity-attribute-value data model that is time-aware, accumulative, and atomically consistent. Its API is by and large compatible with Datomic's. This software should be considered alpha with respect to quality and stability. Check out the FAQ for more info.

Getting Started

If you are brand new to Eva, we suggest reading through this entire readme to familiarize yourself with Eva as a whole. Afterwards, be sure to check out the Eva tutorial series, which breaks down almost everything you will want to know.

Development

Required Tools

  1. Java Development Kit (JDK) v8
  2. Leiningen Build Tool

Example: Hello World

First we kick off the repl with:

lein repl

Next we create a connection (conn) to an in-memory Eva database. We also need to define the fact (datom) we want to add to Eva. Finally we use the transact call to add the fact into the system.

(def conn (eva/connect {:local true}))
(def datom [:db/add (eva/tempid :db.part/user)
            :db/doc "hello world"])
(deref (eva/transact conn [datom]))

Note: deref can be used interchangeably with the @ symbol.
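So the transaction above could equivalently be submitted as:

@(eva/transact conn [datom])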

Now we can run a query to get this fact out of Eva. We don't query conn directly; rather, we obtain an immutable database value like so:

(def db (eva/db conn))

Next we execute a query that returns all entity ids in the system matching the doc string "hello world".

(eva/q '[:find ?e :where [?e :db/doc "hello world"]] db)

If we want to return the full representation of these entities, we can do that by adding pull to our query.

(eva/q '[:find (pull ?e [*]) :where [?e :db/doc "hello world"]] db)
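Neither query mutates the database; each returns a set of result tuples. Illustratively (entity-ids are shown as placeholders; actual values will vary):

;; the :find ?e query returns a set of entity-id tuples:
;;   #{[<entity-id>]}
;; the pull version returns a full entity map in each tuple instead:
;;   #{[{:db/id <entity-id>, :db/doc "hello world"}]}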

Project Structure

  1. project.clj contains the project build configuration
  2. core/* primary releasable codebase for Eva Transactor and Peer-library
    1. core/src clojure source files
    2. core/java-src java source files
    3. core/test test source files
    4. core/resources non-source files
  3. dev/* codebase used during development, but not released
    1. dev/src clojure source files
    2. dev/java-src java source files
    3. dev/test test source files
    4. dev/resources non-source files
    5. dev/test-resources non-source files used to support integration testing

Development Tasks

Running the Test Suite

lein test

Configuration

Eva exposes a number of configuration properties that can be set via Java system properties. Some specific properties can also be set via environment variables.

The eva.config namespace, linked here, contains descriptions and default values for the config vars.
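For example (the property name below is hypothetical; consult the eva.config namespace for the real names and defaults), a system property can be supplied as a JVM flag or set programmatically before Eva reads it:

;; As a JVM flag when starting the process:
;;   java -Deva.example.property=value ...
;; Or from Clojure, before the relevant Eva code reads the property:
(System/setProperty "eva.example.property" "value")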

About the Eva Data Model

Entity-Attribute-Value (EAV)

EAV data-entities consist of:

  1. a distinct entity-id
  2. one or more attribute-value pairs associated with a single entity-id

EAV data can be represented in the following (equivalent) forms:

  1. as an object or map:
    {:db/id 12345,
     :attribute1 "value1",
     :attribute2 "value2"}
  2. as a list of EAV tuples:
    [
      [12345, :attribute1, "value1"],
      [12345, :attribute2, "value2"]
    ]

Time-Aware

To make the EAV data-model time-aware, we extend the EAV-tuple into an EAVT-tuple containing the transaction-id (T) that introduced the tuple:

[
;;  E      A            V         T
   [12345, :attribute1, "value1", 500],
   [12345, :attribute2, "value2", 500]
]

Accumulative

To make the EAVT data-model accumulative, we extend the EAVT-tuple with a final flag that indicates if the EAV information was added or removed at the transaction-id (T).

[
;;  E      A            V         T    added?
   [12345, :attribute1, "value1", 500, true],
   [12345, :attribute2, "value2", 500, true]
]

Under this model, common data operations (create, update, delete) are represented like this:

  • Create: a single tuple with added? == true
[[12345, :attribute1, "create entity 12345 with field :attribute1 at transaction 500", 500, true]]
  • Delete: a single tuple with added? == false
[[12345, :attribute1, "create entity 12345 with field :attribute1 at transaction 500", 501, false]]
  • Update: a pair of deletion and creation tuples
[
 ;; At transaction 502
 ;;   invalidate the old entry for :attribute2
      [12345, :attribute2, "old-value", 502, false]
 ;;   add a new entry for :attribute2
      [12345, :attribute2, "new-value", 502, true]
]

The complete history of the database is the cumulative list of these tuples.

Atomic Consistency

Data-updates are submitted as transactions that are processed atomically. This means that when you submit a transaction, either all the changes in the transaction are applied, or none of the changes are applied.
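A minimal sketch of this behaviour, assuming an in-memory connection as in the Hello World example, and assuming a malformed command is rejected at transaction time:

;; The second command uses a nonexistent operation keyword, so the
;; transaction as a whole fails and "fact one" is never added:
(try
  @(eva/transact conn [[:db/add (eva/tempid :db.part/user) :db/doc "fact one"]
                       [:db/no-such-op (eva/tempid :db.part/user) :db/doc "fact two"]])
  (catch Exception e
    (println "transaction rejected; no changes were applied")))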

Transactions

Transactions are submitted as a list of data-modification commands.

The simplest data-modification commands (:db/add, :db/retract) correspond to the accumulative tuples described above:

[
  [:db/retract 12345 :attribute2 "old-value"]
  [:db/add 12345 :attribute2 "new-value"]
]

When this transaction is committed it will produce the following tuples in the database history (where <next-tx> is the next transaction-number):

[
  [12345, :attribute2, "old-value", <next-tx>, false]
  [12345, :attribute2, "new-value", <next-tx>, true]
]

Using Object/Map form in transactions

In addition to the command-form, you can also create/update data using the object/map form of an entity:

[
  {:db/id 12345
   :attribute1 "value1"
   :attribute2 "value2"}
]

This form is equivalent to the command-form:

[
  [:db/add 12345 :attribute1 "value1"]
  [:db/add 12345 :attribute2 "value2"]
]
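A new entity can be created in map form by supplying a tempid for :db/id, just as in the Hello World example:

[
  {:db/id (eva/tempid :db.part/user)
   :attribute1 "value1"
   :attribute2 "value2"}
]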

Schemas

Because all stored data reduces to EAVT tuples, schemas are defined per Attribute, rather than per Entity.

Schema definitions are simply entities that have special schema attributes.

Defining the schema for :attribute1:

[
  {:db/id #db/id[:db.part/db]
   :db/ident :attribute1
   :db/doc "Schema definition for attribute1"
   :db/valueType :db.type/string
   :db/cardinality :db.cardinality/one
   :db.install/_attribute :db.part/db}
]

Taking each key-value pair of the example in turn:

  • :db/id #db/id[:db.part/db]: declares a new entity-id in the :db.part/db id-partition
  • :db/ident :attribute1: declares that :attribute1 is an alias for the entity-id
  • :db/doc "Schema definition for attribute1": human-readable string documenting the purpose of :attribute1
  • :db/valueType :db.type/string: declares that only string values are allowed for :attribute1
  • :db/cardinality :db.cardinality/one: declares that an entity may have no more than one :attribute1. This means that for a given entity-id, there will only ever be one current tuple of [<entity-id> :attribute1 <value>]; adding a new tuple with this attribute causes any existing tuple to be removed (see the sketch after this list).
  • :db.install/_attribute :db.part/db: declares that :attribute1 is registered with the database as an installed attribute
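Putting it together, a sketch of the cardinality behaviour described above (assuming conn from earlier and the schema vector above bound to schema):

@(eva/transact conn schema)
@(eva/transact conn [[:db/add (eva/tempid :db.part/user) :attribute1 "first value"]])
;; look up the entity that carries :attribute1
(def eid (ffirst (eva/q '[:find ?e :where [?e :attribute1]] (eva/db conn))))
;; asserting a new value displaces "first value" because of :db.cardinality/one
@(eva/transact conn [[:db/add eid :attribute1 "second value"]])
(eva/q '[:find ?v :where [?e :attribute1 ?v]] (eva/db conn))
;;=> #{["second value"]}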

Components

Running with Docker

The included docker-compose setup can be used to spin up a completely integrated Eva environment.

To spin up the environment, run the following commands:

make gen-docker-no-tests # to build Eva with the latest changes
make run-docker

To shut down the environment, use the following command:

make stop-docker

To open a repl container that can talk to the environment, use:

make repl

Then run the following to set up the repl environment:

(require '[eva.catalog.client.alpha.client :as catalog-client])
(def config (catalog-client/request-flat-config "http://eva-catalog:3000" "workiva" "eva-test-1" "test-db-1"))
(def conn (eva/connect config))

Finally, test that everything is working with an empty transaction:

(deref (eva/transact conn []))

You should see a result similar to this:

{:tempids {}, :tx-data (#datom[4398046511105 15 #inst "2018-06-06T17:35:07.516-00:00" 4398046511105 true]), :db-before #DB[0], :db-after #DB[1]}

Additional Resources

FAQ

Is this project or Workiva in any way affiliated with Cognitect?

No. Eva is its own project we built from the ground up. The API and high-level architecture are largely compatible with Datomic, but the database, up to some EPL code, was entirely built in-house. We have a list of the most notable API differences here.

Should I use Eva instead of Datomic?

If you are looking for an easy system to move to production quickly, almost certainly not. Eva is far less mature and has seen far less time in battle. Datomic Cloud is an amazing (and supported) product that is far easier to stand up and run with confidence. Eva is provided as-is.

What are the key differences between Eva and Datomic?

There are a handful of small API differences in the Peer, whereas the Clients are quite distinct. For example, a Connection in Eva is constructed using a baroque configuration map, not a string. From an operational standpoint, Datomic is far more turn-key. There are also likely some low-level architectural differences between the systems that will cause them to exhibit different run-time characteristics. For example, our indexes are backed by a persistent Bε-tree, whereas Datomic's indexes seem to exhibit properties more like a persistent log-structured merge-tree. For a more detailed list check here.

Why did Workiva build Eva?

Workiva's business model requires fine-grained and scalable multi-tenancy, with a degree of control and flexibility that fulfills our own evolving and unique compliance requirements. Additionally, development on Eva began before many powerful features of Datomic were released, including Datomic Cloud and Datomic Client.

Why is Workiva open sourcing Eva?

The project is nearly feature complete and, we believe, generally technically sound. Workiva has decided to discontinue closed development on Eva, but sees a great deal of potential value in opening the code base to the OSS community. It is our hope that the community finds value in the project as a whole.

What will Workiva's ongoing contributions to Eva be?

Eva will likely continue to be maintained and matured in 10% time and by previous contributors on their personal time.

Maintainers and Contributors

Active Maintainers

Previous Contributors

Listed, in transaction log style, in order of addition to the project:


Eva's Issues

Document best practices for exposing transaction ids

One of the main benefits of using Eva is its point-in-time functionality, and services often want to expose a transaction identifier of some kind to their clients. Document best practices for doing so.

For example:

  • Entity id seems worst to expose to service clients since it's useful for most clients to be able to compare transaction ids to know if something is earlier or later, and entity ids are (probably) not guaranteed to be ever-incrementing.
  • Entity number seems workable since customers can compare them, and the numbers can be directly converted to entity ids for asOf calls. The downside is that it could complicate migrations away from an Eva namespace (e.g., if a new namespace is needed for some reason), but that's a broader concern that also affects exposed entity ids.
  • Separate "public" entity number (number incremented by service via a transaction function and stored on the transaction as an attribute) seems good since customers can compare them and they are not coupled to the current namespace. The downside is an extra query is required to resolve "public" entity number to transaction id/num.

Perhaps there are other considerations or other options? We would probably prefer entity number since it seems to work well in all known use cases and migration seems simple if ever required.
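As an illustrative sketch of the third option (the idents here are hypothetical), the service could install an attribute for the public number and assert it against the transaction entity itself via the :db.part/tx partition; in practice the increment would come from a transaction function rather than a literal:

;; hypothetical schema for a service-managed public transaction number
[{:db/id #db/id[:db.part/db]
  :db/ident :service/public-tx-num
  :db/valueType :db.type/long
  :db/cardinality :db.cardinality/one
  :db.install/_attribute :db.part/db}]

;; within a later transaction, attach the number to the transaction entity
[[:db/add (eva/tempid :db.part/tx) :service/public-tx-num 42]]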

Investigate using GraalVM on static EVA sublibraries

GraalVM is a spiffy new vm/compiler for several different languages, including the JVM, which performs static analysis / optimization at the byte-code level.

https://www.graalvm.org/

Folks have reported being able to produce some pretty interesting results with clojure: https://www.innoq.com/en/blog/native-clojure-and-graalvm/

However, Graal doesn't support dynamic class loading, so some things (e.g., transaction functions) cannot be supported. Still, with a move toward more distinct libraries, we may be able to isolate some static components (the indexes and the state representation in the query engine come to mind) for use with some of Graal's features, for (low-cost?) performance improvements.

Storage-related documentation improvements

  • Brief discussion of storage implementation details, storage backend constraints, etc
  • Ballpark napkin-worthy size limitations on values, quantity limitations on queries, etc.
  • Write up several examples of how to DDoS the backend (and thereby, avoid doing so), from a service design perspective with emphasis on queries and transactions.

Don't send full tx-log-entry from transactor to peer on tx replies

Right now we send the entire transaction log entry to the Peer that submitted a transaction when the transactor performs its directed response. We should remove this part of the transaction process and respond only with a tx-num, as we do for publishes. This would (sometimes substantially, in the case of many-datom transactions) reduce the amount of 'heavy' IO work for which the transactor is responsible, and move the burden onto the consuming Peer.

Since this is an over-the-wire protocol change, we will have to add it in a backwards-compatible manner to migrate off of the current scheme of including the entire log entry.

Per-Trace Metrics

For very complex operations in and/or involving Eva, we sometimes output a lot of individual tracing spans whose value to us is defined almost entirely in aggregate. For example, inside a query, we would love to know how much time total was spent fetching from storage, and perhaps how many individual requests were made; outside of that information, we don't care to see the spans.

Since we can tag tracing spans with data, we'd like to replace some tracing spans (and perhaps some metrics -- exploration required) with per-trace metrics collection.

This will give us two very nice features in service of our quest for more observability:

  1. We can introduce highly granular metadata into our traces, and
  2. We can have smaller traces overall.

Add cache churn metrics

Our caches should have sufficient metrics so we can identify churn (and set up alarms) on both our index node caches and index manager cache.

Runtime context capturing API

In addition to capturing the lexical context (i.e., the place in the code) and associating it with a metric/trace/log, we want to capture runtime information, for example the database id (client).

Such "capturing" API should:

  • let us capture interesting things even if they are not used by some aspects
  • be universal across aspects
  • let specific aspects decide what part of the runtime information they want to report

Add documentation on transactor->peer sync

Eva is a strongly consistent system where there exists a small period of communication synchronization. We should create an artifact documenting and elaborating on this behaviour.

Improve query engine state management

There are some notable inefficiencies in the internal representations for query state. Several naive representations were experimented with, all of which had the potential to dominate execution time far beyond IO costs. However, there exist well-defined solutions that could be implemented to drastically reduce overall runtime of the query engine under many or most circumstances.

In particular, there are two elements of state that need improvement:

  1. The manner in which variable bindings are stored eagerly evaluates all cross products, requiring repeated linear traversals over potentially tens or hundreds of thousands of immutable maps.
  2. The manner in which the query engine tracks whether or not it has already evaluated a particular predicate.

To address (1), we intend to use or build a lazy relational algebra system - https://github.com/Workiva/lazy-tables

To address (2), we intend to explore feasible alternatives and choose one/more to implement.

The expectation is that we could drastically improve the query engine performance and possibly simplify several code paths.

Build a cache internal to pull spec evaluation in order to deduplicate subframe evaluation

Internally, during a pull evaluation, generated subframes may be duplicated an arbitrary number of times. Consider the spec [{:ref [*]}] being evaluated on 10000 distinct entities which all have a :ref to the same single entity. There will be 10000 distinct subframes generated for that entity, 9999 of which could be eliminated and the single output simply repeated in the construction of the pull's final output.

An internal cache keyed on [eid spec] (or perhaps something more granular, like [eid attr]) could greatly reduce the total amount of IO required of pull in many cases.
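A minimal sketch of the idea, assuming a hypothetical eval-subframe function internal to the pull implementation:

;; wraps a subframe evaluator with a per-pull cache keyed on [eid spec]
(defn caching-evaluator [eval-subframe]
  (let [cache (atom {})]
    (fn [eid spec]
      (if-some [hit (get @cache [eid spec])]
        hit
        (let [result (eval-subframe eid spec)]
          (swap! cache assoc [eid spec] result)
          result)))))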

Soften all-negated-clauses requirement

Discussions of adding negation to Datalog generally state that safety must be observed by the following rule:

  • Each variable in the head of the rule must occur in at least one positive antecedent.

However, this is not strictly necessary when there exist required bindings in the consequent of the rule. That is, variables already known to have bindings may be exempted from this requirement. Consider:
:find ?producer ?type
:in $ [?producer ...] [?type ...]
:where (not-join [?producer ?type]
                 [?s :example.schema/producer ?producer]
                 [?s :example.schema/type ?type])

Documentation on exception handling for transaction functions

What happens when a transaction function fails? What happens when you throw an exception in a transaction function? How is this reported? How should you deal with it? What are the best practices? Using ex-info to get data out of failed transactions. Etc.
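For instance, the kind of pattern such documentation might cover (a sketch; :myapp/check-balance is a hypothetical installed transaction function, and the exact exception wrapping on the peer may differ):

;; a transaction function can signal failure with ex-info;
;; db is passed as the first argument to installed transaction functions
(defn check-balance [db account amount]
  (if (neg? amount)
    (throw (ex-info "amount must be non-negative"
                    {:account account :amount amount}))
    [[:db/add account :account/balance amount]]))

;; on the peer, deref of the failed transaction throws; the attached
;; data can be recovered from the cause chain
(try
  @(eva/transact conn [[:myapp/check-balance 12345 -10]])
  (catch Exception e
    (ex-data (or (.getCause e) e))))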

Documentation on how to embed a data structure into Eva

What does it even mean to embed a data structure in Eva? Why would I want to do that? How should I go about doing it? What are common pitfalls, gotchas, etc.? Show an example, perhaps two different linked list implementations that make different tradeoffs.
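As one hypothetical starting point (the idents below are illustrative only), a singly linked list might reduce to cons-cell entities: a value attribute plus a ref to the next cell:

[{:db/id #db/id[:db.part/db]
  :db/ident :list/value
  :db/valueType :db.type/string
  :db/cardinality :db.cardinality/one
  :db.install/_attribute :db.part/db}
 {:db/id #db/id[:db.part/db]
  :db/ident :list/next
  :db/valueType :db.type/ref
  :db/cardinality :db.cardinality/one
  :db.install/_attribute :db.part/db}]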

Documentation on designing schema for the future

  • Brief discussion of the built-in attributes and their contracts, especially the core meta-attributes.
  • Present a variety of ways to paint yourself into a corner.
  • How to design your schema in a manner friendly to future schema modifications. Etc.

Query engine future work

There exists a ton of recent (and not so recent) literature on powerful ways to build, optimize, and utilize database query engines. This issue is a place to collect references to this literature.

Interesting set of work on query optimization/execution, with a particular focus on worst-case-optimal joins:

Haxl / Qaxl style query rewriting:

Higher order logic systems, built on top of datalog systems designed to act as a powerful basis and tools for building correct asynchronous / distributed systems:

Tracing individual transaction functions

We're able to get good aggregate statistics by covering the transaction process as a whole; however, the open nature of transactions implies multimodality with respect to execution times.

We should consider building traces/spans isolated to the invocation of each class of transaction function, and/or utilizing tagging to give us an axis along which we can capture and partition our statistics, so that we can analyze individual modes.

Build more robust readiness / liveness probes

On kubernetes we have the option of specifying new liveness and readiness probes. Our current http status endpoint is largely independent of our actual system state.

We should design and build more robust readiness and liveness probes that are more representative of actual system state.

Schema design: branching trees & queries

How schema relationships can be visualized as a graph. How branches affect queries. How to design schema and/or queries to deal with this. "Traversing up a tree" in a query. "Traversing down the tree" in a query.

Pull's contract is potentially unintuitive with reverse+component attrs

(def conn (connect {:local true}))
(def schema [{
    :db/id #db/id[:db.part/db]
    :db/ident :example.schema/exampleRef
    :db/valueType :db.type/ref
    :db/isComponent true
    :db/cardinality :db.cardinality/one
    :db.install/_attribute :db.part/db
    :db/doc "Example"}])
@(transact conn schema)

(datoms (db conn) :vaet 0 :example.schema/exampleRef) ;=> (#datom[8796093023233 1024 0 4398046511106 true] #datom[8796093023234 1024 0 4398046511106 true])

(pull (db conn) [{:example.schema/exampleRef [:db/id]}] 0) ;=> :example.schema{:_exampleRef #:db{:id 8796093023233}}

That final pull should contain a collection of two :db/ids, since two entities reference entity 0.

Timeout the transactor on long running transactions

This involves a few moving pieces:

  1. The transactor should have a default timeout for killing long-running transactions so that we can protect liveness of the logical instance in the event of unexpectedly long transactions.
  2. To support this cleanly, we should also provide a way, from the Peer / Client APIs, to override this behavior in situations where consumers do want to support very large transactions.

Odd behaviour on required variables for `or`

(q '[:find ?te ?a ?v ?c
     :in $ ?e [[?a ?v]]
     :where   [?te :example.schema/entity ?e]
     [?te :example.schema/attribute ?a]
     (or (and [(= :all ?v)]
              [(ground :all) ?v])
         (and [(not= :all ?v)]
               [?te :example.schema/ref-value ?v]))
     [?te :example.schema/logical-clock ?c]]
   [[1 :example.schema/entity :e]
    [1 :example.schema/attribute 3]
    [1 :example.schema/ref-value :vee]
    [1 :example.schema/logical-clock :clock]]
   :e [[3 :all]])

Results in:

IllegalArgumentException No implementation of method: :consequent of protocol: #'eva.query.datalog.protocols/Rule found for class: nil  clojure.core/-cache-protocol-fn

However, the following query is valid

(q '[:find ?te ?a ?v ?c
     :in $ ?e [[?a ?v]]
     :where   [?te :example.schema/entity ?e]
     [?te :example.schema/attribute ?a]
     (or (and [(= :all ?v)]
              [(ground :all) ?v])
         (and [(not= :all ?v)]
              [(some? ?te)] ;; difference
              [?te :example.schema/ref-value ?v]))
     [?te :example.schema/logical-clock ?c]]
   [[1 :example.schema/entity :e]
    [1 :example.schema/attribute 3]
    [1 :example.schema/ref-value :vee]
    [1 :example.schema/logical-clock :clock]]
   :e [[3 :all]])

Timothy Dean: That works because some? requires ?te to be bound (it's a function). Apparently this is convincing the compiler to make ?te a required binding for the rule generated by the or-clause. It probably shouldn't do that. Once ?te is a required binding, then it's no longer necessary to occur in every body of the rule.

Add specific exception for variable mismatches with `or`

When an or in a query has different sets of variables, you currently get fairly opaque exceptions raised from the query engine. We should special case this since it keeps coming up and provide a specific error message during query compilation.
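For reference, a minimal query of the shape that currently produces the opaque failure: the two legs of the or bind different variable sets (?x in one, ?y in the other):

(eva/q '[:find ?e
         :where (or [?e :example/a ?x]
                    [?e :example/b ?y])]
       db)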
