vmware / differential-datalog


License: MIT License

Haskell 24.65% Makefile 0.06% Rust 27.00% Shell 1.43% Python 2.53% C 2.88% Java 37.86% Yacc 0.91% HTML 0.15% Dockerfile 0.05% Go 1.42% Nix 0.07% JavaScript 0.01% CSS 0.03% TypeScript 0.88% Vim Script 0.06%

programming-language datalog ddlog rust incremental

differential-datalog's Introduction

Differential Datalog (DDlog)

DDlog is a programming language for incremental computation. It is well suited for writing programs that continuously update their output in response to input changes. With DDlog, the programmer does not need to worry about writing incremental algorithms. Instead they specify the desired input-output mapping in a declarative manner, using a dialect of Datalog. The DDlog compiler then synthesizes an efficient incremental implementation. DDlog is based on Frank McSherry's excellent differential dataflow library.

DDlog has the following key properties:

Relational: A DDlog program transforms a set of input relations (or tables) into a set of output relations. It is thus well suited for applications that operate on relational data, ranging from real-time analytics to cloud management systems and static program analysis tools.
Dataflow-oriented: At runtime, a DDlog program accepts a stream of updates to input relations. Each update inserts, deletes, or modifies a subset of input records. DDlog responds to an input update by outputting an update to its output relations.
Incremental: DDlog processes input updates by performing the minimum amount of work necessary to compute changes to output relations. This has significant performance benefits for many queries.
Bottom-up: DDlog starts from a set of input facts and computes all possible derived facts by following user-defined rules, in a bottom-up fashion. In contrast, top-down engines are optimized to answer individual user queries without computing all possible facts ahead of time. For example, given a Datalog program that computes pairs of connected vertices in a graph, a bottom-up engine maintains the set of all such pairs. A top-down engine, on the other hand, is triggered by a user query to determine whether a pair of vertices is connected and handles the query by searching for a derivation chain back to ground facts. The bottom-up approach is preferable in applications where all derived facts must be computed ahead of time and in applications where the cost of initial computation is amortized across a large number of queries.
In-memory: DDlog stores and processes data in memory. In a typical use case, a DDlog program is used in conjunction with a persistent database, with database records being fed to DDlog as ground facts and the derived facts computed by DDlog being written back to the database.

At the moment, DDlog can only operate on databases that fit entirely in the memory of a single machine. We are working on a distributed version of DDlog that will be able to partition its state and computation across multiple machines.
Typed: In its classical textbook form, Datalog is more of a mathematical formalism than a practical tool for programmers. In particular, pure Datalog does not have concepts like types, arithmetic, strings, or functions. To facilitate writing safe, clear, and concise code, DDlog extends pure Datalog with:
1. A powerful type system, including Booleans, unlimited precision integers, bitvectors, floating point numbers, strings, tuples, tagged unions, vectors, sets, and maps. All of these types can be stored in DDlog relations and manipulated by DDlog rules. Thus, with DDlog one can perform relational operations, such as joins, directly over structured data, without having to flatten it first (as is often done in SQL databases).
2. Standard integer, bitvector, and floating point arithmetic.
3. A simple procedural language that allows expressing many computations natively in DDlog without resorting to external functions.
4. String operations, including string concatenation and interpolation.
5. Syntactic sugar for writing imperative-style code using for/let/assignments.
Integrated: While DDlog programs can be run interactively via a command line interface, DDlog's primary use case is integration with other applications that require deductive database functionality. A DDlog program is compiled into a Rust library that can be linked against a Rust, C/C++, Java, or Go program (bindings for other languages can be easily added). This enables good performance but somewhat limits flexibility, as changes to the relational schema or rules require recompilation.
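
The bottom-up strategy described under "Bottom-up" can be illustrated with a minimal Rust sketch. This is not how DDlog is implemented (DDlog compiles rules into a differential dataflow program and updates results incrementally); it only shows naive fixpoint evaluation of the connected-vertices example. The function name `connected_pairs` and the `Edge`/`Path` rule names in the comments are hypothetical.

```rust
use std::collections::BTreeSet;

/// Computes all pairs (a, b) such that b is reachable from a, by applying
/// two Datalog-style rules until no new facts can be derived:
///   Path(a, b) :- Edge(a, b).
///   Path(a, c) :- Path(a, b), Edge(b, c).
fn connected_pairs(edges: &[(u32, u32)]) -> BTreeSet<(u32, u32)> {
    // Rule 1: every edge (ground fact) is a path.
    let mut paths: BTreeSet<(u32, u32)> = edges.iter().copied().collect();
    loop {
        // Rule 2: extend every known path by one edge.
        let mut derived = Vec::new();
        for &(a, b) in &paths {
            for &(b2, c) in edges {
                if b == b2 && !paths.contains(&(a, c)) {
                    derived.push((a, c));
                }
            }
        }
        if derived.is_empty() {
            return paths; // fixpoint: all derivable facts are present
        }
        paths.extend(derived);
    }
}
```

A top-down engine would instead start from a single query such as "is 4 reachable from 1?" and search backwards for a derivation. The sketch recomputes from scratch on every call, which is exactly the cost that an incremental engine avoids.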

Documentation

Follow the tutorial for a step-by-step introduction to DDlog.
DDlog language reference.
DDlog command reference for writing and testing your own Datalog programs.
How to use DDlog from Java.
How to use DDlog from C.
How to use DDlog from Go and Go API documentation.
How to use DDlog from Rust (by example)
Tutorial on profiling DDlog programs
DDlog overview paper, Datalog 2.0 workshop, 2019.

Installation

Installing DDlog from a binary release

To install a precompiled version of DDlog, download the latest binary release, extract it from the archive, add ddlog/bin to your $PATH, and set $DDLOG_HOME to point to the ddlog directory. You will also need to install the Rust toolchain (see instructions below).

If you're using OS X, you will need to override the binary's security settings through these instructions. Otherwise, when you first run the DDlog compiler (by calling ddlog), you will get the following warning dialog:

"ddlog" cannot be opened because the developer cannot be verified.
macOS cannot verify that this app is free from malware.

You are now ready to start coding in DDlog.

Compiling DDlog from sources

Installing dependencies manually

Haskell stack:

wget -qO- https://get.haskellstack.org/ | sh

Rust toolchain v1.52.1 or later:
```
curl https://sh.rustup.rs -sSf | sh
. $HOME/.cargo/env
rustup component add rustfmt
rustup component add clippy
```
Note: The rustup script adds the path to the Rust toolchain binaries (typically, $HOME/.cargo/bin) to ~/.profile, so that it takes effect at the next login. To configure your current shell, run source $HOME/.cargo/env.
JDK, e.g.:
```
apt install default-jdk
```
Google FlatBuffers library. Download and build FlatBuffers release 1.11.0 from GitHub. Make sure that the flatc tool is in your $PATH. Additionally, make sure that the FlatBuffers Java classes are in your $CLASSPATH:
```
./tools/install-flatbuf.sh
cd flatbuffers
export CLASSPATH=`pwd`"/java":$CLASSPATH
export PATH=`pwd`:$PATH
cd ..
```
Static versions of the following libraries: libpthread.a, libc.a, libm.a, librt.a, libutil.a, libdl.a, libgmp.a, and libstdc++.a can be installed from distro-specific packages. On Ubuntu:
```
apt install libc6-dev libgmp-dev
```
On Fedora:
```
dnf install glibc-static gmp-static libstdc++-static
```

Building

To build the software once you've installed the dependencies using one of the above methods, clone this repository and set the $DDLOG_HOME variable to point to the root of the repository. Run

stack build

anywhere inside the repository to build the DDlog compiler. To install DDlog binaries in Haskell stack's default binary directory:

stack install

To install to a different location:

stack install --local-bin-path <custom_path>

To test basic DDlog functionality:

stack test --ta '-p path'

Note: this takes a few minutes.

You are now ready to start coding in DDlog.

vim syntax highlighting

The easiest way to enable differential datalog syntax highlighting for .dl files in Vim is by creating a symlink from <ddlog-folder>/tools/vim/syntax/dl.vim into ~/.vim/syntax/.

If you are using a plugin manager you may be able to directly consume the file from the upstream repository as well. In the case of Vundle, for example, configuration could look as follows:

call vundle#begin('~/.config/nvim/bundle')
...
Plugin 'vmware/differential-datalog', {'rtp': 'tools/vim'} <---- relevant line
...
call vundle#end()

Debugging with GHCi

To run the test suite with the GHCi debugger:

stack ghci --ghci-options -isrc --ghci-options -itest differential-datalog:differential-datalog-test

and type do main in the command prompt.

Building with profiling info enabled

stack clean

followed by

stack build --profile

stack test --profile


differential-datalog's Issues

syntax to select tuple fields by number

x == (x,y).0
y == (x,y).1
z == (x,y,z).2
...

Should strings be allowed to contain newlines?

E.g., start on a line and end on another one.
Also, if that is allowed, can the newline be escaped?

Statically evaluate constant expressions

E.g., parts of Compile.hs assume this

String interpolation

There should be a way to perform string interpolation.
https://en.wikipedia.org/wiki/String_interpolation
I suggest the C# syntax:
$"The value between braces ${expression + other} is treated like an expression"

foreign keys for more compact FTL

Ben's version of FTL allows traversing cross-table pointers, e.g.:

let O = lspip.lsp.dhcpv6_options.option_args

To support this without using nested for loops, we'd need to add some notion of foreign keys and possibly other SQL constraints to the language

Complete the language reference document

Add examples
Add semantic constraints
Ref and & syntax
Standard library reference

`return` statement

to structure complex control flow

DDlog lints

Detect unused variables in rules and functions
Detect unused intermediate relations (i.e., relations that are neither output nor used to compute other relations).
Grouping by complex variables or variables wrapped in Ref<> is inefficient. Ideally, one should group by one or several small identifiers.
???

Tutorial section on FTL

Generate a meaningful error message when type unification fails

An error message that looks like this is generated whenever type unification fails:

Expression parameterized(string, int) has unknown type in CtxTop

Optimizations

string interning
use arrangements for relations that are used as the first atom in multiple rules (a special case of common subexpression elimination)
use Box or Arc type for Value to save memory
avoid redundant cloning in the generated code
use per-key distinct when possible
extract "by-self" arrangement from distinct
32-bit weights
use unreachable() instead of panic!() in Compile.hs (see TODO in Compile.hs)

https://github.com/frankmcsherry/differential-dataflow/issues/113

Test and document parsec's string parser

We rely on parsec's standard parser for strings, which supports unicode and escaping.

TODO:

check and document its exact functionality
can we handle all OVSDB strings?

Syntax improvements

Rename int -> bigint
rename ground relation -> input relation
add return statement

Add a built-in method to convert a type to a string

Also, there should be a way for users to specify the conversion to strings for user-defined types.
This will probably require string concatenation at the very least.

extern function syntax

use extern keyword to make it explicit that a function is defined outside of Datalog

fix goldenVsFiles

goldenVsFiles behaves oddly when more than one file has changed: it first complains about the first file only. Deleting the corresponding .expect file causes it to report that a new golden file has been generated (without an error). Finally, the third run reports a mismatch in the second golden file.

Allow the use of constants in patterns

Fix string interpolation syntax

Replace {} with ${}.

Mechanism to track provenance of output records

Support for namespaces

Seems important for very large Datalog programs

Test calling DDlog from Java

It should be possible to load a compiled DDlog program and execute transactions from Java.

This requires a couple of preparatory steps:

DDlog currently generates static libraries. We have to change the crate type to build .so instead.
DDlog still does not have a complete C API.

Background:

The generated library file is written to the target/release directory, e.g., tests/datalog_tests/path/target/release/libpath.a.
Associated header file is tests/datalog_tests/path/ddlog.h

Escape sequences in interpolated strings

One should be able to escape symbols such as {, }, $, | in an interpolated string.

Installation procedure

Command to install differential datalog binary.

Change integer literal syntax

Use conventional C syntax instead of Verilog syntax for integer literals

Tuple types with one element

Currently the compiler translates them to the element type.
Is this a good idea?

indentation for statements

statements are printed without any indentation

Cannot have antijoins with wildcards

Also, the error message is not very good.

error: file.dl:211:234-211:235: Argument _method must be specified in this context
 
RBasic_MethodLookup(_simplename, _descriptor, _type, _method) :- RDirectSuperclass(_type, _supertype), RBasic_MethodLookup(_simplename, _descriptor, _supertype, _method), not RBasic_MethodImplemented(_simplename, _descriptor, _type, _).

Support underscores in integer literals

0x123456_abcd_abcd_abcd. Allow this in:

command language
DDlog

@blp
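
As a rough illustration of the requested behavior, the sketch below accepts a hexadecimal literal with underscores used as digit separators. It is written in Rust for illustration only and does not reflect DDlog's actual parser; the function name `parse_hex_literal` and the choice of `u128` as the result type are assumptions.

```rust
/// Parses a hexadecimal literal such as "0x123456_abcd", treating
/// underscores as digit separators. Returns None if the "0x" prefix is
/// missing, no digits remain, a digit is invalid, or the value overflows u128.
fn parse_hex_literal(lit: &str) -> Option<u128> {
    let digits = lit.strip_prefix("0x")?;
    // Drop the separators before handing the digits to the standard parser.
    let cleaned: String = digits.chars().filter(|&c| c != '_').collect();
    if cleaned.is_empty() || !cleaned.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    u128::from_str_radix(&cleaned, 16).ok()
}
```

This version places no restriction on where underscores appear (for example, `0x_ff_` is accepted); a stricter grammar could reject leading, trailing, or repeated separators.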

Type inference using unification

Implement a proper type inference algorithm based on unification. The current ad hoc implementation is hacky and fragile. No idea why I did not do the right thing straight away...
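
For reference, the core of a unification-based approach can be sketched in a few dozen lines. The Rust code below is a generic textbook-style unifier over a toy type language, not the DDlog compiler's implementation (which is written in Haskell); the `Type` enum and the `resolve`, `occurs`, and `unify` names are all hypothetical.

```rust
use std::collections::HashMap;

/// A toy type language: type variables plus a few constructors.
#[derive(Clone, Debug, PartialEq)]
enum Type {
    Var(u32), // unknown type to be inferred
    Int,
    Str,
    Tuple(Vec<Type>),
}

/// Replaces every bound variable in `t` by its binding, recursively.
fn resolve(t: &Type, subst: &HashMap<u32, Type>) -> Type {
    match t {
        Type::Var(v) => match subst.get(v) {
            Some(bound) => resolve(bound, subst),
            None => t.clone(),
        },
        Type::Tuple(ts) => Type::Tuple(ts.iter().map(|t| resolve(t, subst)).collect()),
        _ => t.clone(),
    }
}

/// Occurs check: does variable `v` appear anywhere inside `t`?
fn occurs(v: u32, t: &Type) -> bool {
    match t {
        Type::Var(w) => *w == v,
        Type::Tuple(ts) => ts.iter().any(|t| occurs(v, t)),
        _ => false,
    }
}

/// Makes `a` and `b` equal by extending `subst`, or reports why it cannot.
fn unify(a: &Type, b: &Type, subst: &mut HashMap<u32, Type>) -> Result<(), String> {
    let (a, b) = (resolve(a, subst), resolve(b, subst));
    match (&a, &b) {
        (Type::Var(x), Type::Var(y)) if x == y => Ok(()),
        (Type::Var(x), t) | (t, Type::Var(x)) => {
            if occurs(*x, t) {
                return Err(format!("infinite type when unifying {:?} with {:?}", a, b));
            }
            subst.insert(*x, t.clone());
            Ok(())
        }
        (Type::Int, Type::Int) | (Type::Str, Type::Str) => Ok(()),
        (Type::Tuple(xs), Type::Tuple(ys)) if xs.len() == ys.len() => {
            for (x, y) in xs.iter().zip(ys) {
                unify(x, y, subst)?;
            }
            Ok(())
        }
        _ => Err(format!("cannot unify {:?} with {:?}", a, b)),
    }
}
```

Unifying `Tuple([Var(0), Str])` with `Tuple([Int, Var(1)])` binds variable 0 to `Int` and variable 1 to `Str`. Because every failure names the two types that clashed, this style of algorithm also lends itself to more informative error messages than a generic "unknown type" report.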

TODOs

Basic Datalog (Leonid)

FTL (Mihai)

Generate Datalog schema from JSON
Generate input and output adapters
Adapters for external C functions invoked from FTL
Test workload

Future improvements

Debugging facility
CLI interface to ddlog
RPC interface to ddlog

Incomplete implementation of assignments in Compile.hs

We currently don't have a way to compile arbitrary l-values to Rust, e.g.,

Constructor{var x, var y} = z

is ok, but

Constructor{x, y} = z

where x and y are previously introduced variables is illegal.

Either implement the logic to support this translation (e.g., by assigning to intermediate variables) or modify validation logic to disallow such programs

Automatic string conversion for type variables

At the moment, the compiler refuses to convert type variables to strings.

This can be fixed by:

Inferring that a function expects a type variable to be printable
In the generated Rust function, add the Display trait for this function
Whenever the function is called, check that its concrete arguments satisfy the trait
Implement Display trait for types that have a user-defined 2String method

OVN integration TODOs

TODO list for DDlog/OVN integration:

mpsc interface to ValMap

ValMap is a data structure that stores the content of output tables. It is updated by each change callback invoked by the differential inspect() operator. ValMap is protected by a mutex, potentially introducing contention in workloads where outputs are frequently updated. One possible solution is to maintain ValMap in a separate thread and use mpsc queue to communicate with this thread from workers. We must be careful though to flush the queue before transaction_commit() returns. Another (even better?) option is to keep a ValMap per worker and only merge them when the client requests a copy of an output table.
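
The first option (a dedicated thread fed through an mpsc queue) could look roughly like the sketch below. It is only an assumption-laden illustration: the real ValMap layout is not shown here, so a `BTreeMap` keyed by (table, record) strings stands in for it, and the names `Msg`, `spawn_valmap`, and `flush` are hypothetical. The `flush` call models the requirement that the queue be drained before `transaction_commit()` returns.

```rust
use std::collections::BTreeMap;
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Stand-in for ValMap: (table, record) -> weight.
type Snapshot = BTreeMap<(String, String), i64>;

/// Messages accepted by the thread that owns the map.
enum Msg {
    /// Change the weight of `record` in `table` by `delta` (+1 insert, -1 delete).
    Update { table: String, record: String, delta: i64 },
    /// Reply with a copy of the map once all earlier messages are applied.
    Flush(Sender<Snapshot>),
}

/// Spawns the owner thread and returns a sender that workers can clone.
fn spawn_valmap() -> Sender<Msg> {
    let (tx, rx) = channel::<Msg>();
    thread::spawn(move || {
        let mut map = Snapshot::new();
        // The loop ends when every sender has been dropped.
        for msg in rx {
            match msg {
                Msg::Update { table, record, delta } => {
                    let key = (table, record);
                    let weight = map.entry(key.clone()).or_insert(0);
                    *weight += delta;
                    let now_zero = *weight == 0;
                    if now_zero {
                        map.remove(&key); // record no longer present
                    }
                }
                Msg::Flush(reply) => {
                    let _ = reply.send(map.clone());
                }
            }
        }
    });
    tx
}

/// Blocks until all messages queued ahead of this request have been applied,
/// then returns a copy of the map. A commit would call this before returning.
fn flush(tx: &Sender<Msg>) -> Snapshot {
    let (reply_tx, reply_rx) = channel();
    tx.send(Msg::Flush(reply_tx)).expect("valmap thread stopped");
    reply_rx.recv().expect("valmap thread stopped")
}
```

No mutex is needed because only the owner thread touches the map; the cost moves to message passing and to copying the map on each flush. The per-worker alternative mentioned above would avoid the shared queue entirely and pay the merge cost only when a client asks for an output table.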

Implement bit slice logic in Compile.hs

This is a little tricky, as we use different representations of a bit vector (BitUint or uXX) depending on its width. Bit slicing may involve conversion between these representations.
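
For the machine-word case only, the slicing logic can be sketched as follows. This is an illustration under stated assumptions, not the code that Compile.hs generates: it assumes a slice of the form x[hi:lo] with inclusive bounds, covers only values that fit in a `u64`, and leaves out the BitUint (arbitrary-width) representation. The function names are hypothetical.

```rust
/// Extracts bits hi..=lo of `x` (bit 0 is the least significant bit).
/// Panics if the bounds are out of order or exceed the 64-bit width.
fn bit_slice(x: u64, hi: u32, lo: u32) -> u64 {
    assert!(lo <= hi && hi < 64, "invalid slice bounds");
    let width = hi - lo + 1;
    let shifted = x >> lo;
    if width == 64 {
        shifted // a 64-bit mask cannot be built with the shift below
    } else {
        shifted & ((1u64 << width) - 1)
    }
}

/// Example of a slice that changes representation: an 8-bit slice of a
/// 64-bit value is returned as u8. The cast is lossless because the
/// result has already been masked to 8 bits.
fn byte_slice(x: u64, lo: u32) -> u8 {
    bit_slice(x, lo + 7, lo) as u8
}
```

With these definitions, `bit_slice(0xABCD, 15, 8)` yields `0xAB` and `byte_slice(0xABCD, 0)` yields `0xCD`. The full-width special case is needed because shifting a `u64` by 64 bits overflows in Rust.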

Get rid of `let` syntax

We use let in FTL, and var in the rest of the language. There is no reason for these two forms of variable declaration.

No symbol table?

Currently one can have a type that depends on an undefined identifier

typedef t = x

Speed up compilation, reduce disk footprint of multiple datalog programs

Currently every test in tests/datalog_tests downloads and compiles its own Rust dependencies, as well as makes its own copy of differential_dataflow, taking a couple of gigabytes of space per test. There must be a way to share common parts across all tests.

Recursive type definitions

This is allowed by the parser

typedef t = (t, t)

differential-dataflow optimizations

use distinct_total instead of distinct where possible.
don't convert external collections to Variables in the nested scope (variables use a lot of memory)

Use P4 syntax for bitvector literals

https://p4.org/p4-spec/docs/P4-16-v1.0.0-spec.html#sec-literals

support for multi-file specs

The most important use case is combining an auto-generated schema (e.g., from OVSDB) with hand-written code. We may be able to get away with using CPP or m4, but must make sure that line numbers in error messages refer to the original input files.

Allow different-width arguments of << and >>

... so that we can write x << 5 vs x << 32'd5

Allow access to enum fields

The Rust backend currently does not allow access to fields of types with multiple constructors (i.e., types compiled to enums), even if all constructors of a type have the same field

Rules that express all constants are not allowed

RHeapAllocation_Type(_heap, _type) :- var _heap = "<<main method array>>", var _type = "java.lang.String[]".

Language does not support disjunction

Set up CI

Should we set up travis CI?

differential-dataflow doesn't compile for compiler tests

The stack test log is attached (test.log). This might be related to #84; it might be fixed by tracking down which git versions of timely and abomonation worked and recording them in rust/differential_datalog/Cargo.toml.

add support for constants

At the moment, constants can be simulated by 0-arity functions. The main downside is the need to use parentheses every time a constant is referenced. Also, function names start with lower-case letters, whereas it is customary to capitalize constant names.

The Rust code generated for split does not compile

The unmerged tutorial has an example which does not seem to generate correct Rust code.
This is the dl code:

extern function split(s: string, sep: string): set<string>
function split_ip_list(x: string): set<string> =
   split(x, " ")

The generated code does not compile:

fn split_ip_list(x: &String) -> set<String> {
    split(x, (&r###" "###.to_string()))
}

since there is no set type in Rust.
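
For comparison, here is a sketch of code that does compile when `set<string>` is mapped to a concrete Rust collection. Using `BTreeSet` is an assumption made for this example, not necessarily the type the DDlog runtime uses, and the body of `split` is a hypothetical Rust-side implementation of the extern function.

```rust
use std::collections::BTreeSet;

/// Hypothetical implementation of the extern function `split`:
/// breaks `s` at every occurrence of `sep` and collects the pieces.
fn split(s: &String, sep: &String) -> BTreeSet<String> {
    s.split(sep.as_str()).map(|part| part.to_string()).collect()
}

/// Shape the generated function could take once the DDlog type
/// set<string> is translated to an existing Rust type.
fn split_ip_list(x: &String) -> BTreeSet<String> {
    split(x, &" ".to_string())
}
```

Because the result is a set, duplicate addresses in the input collapse into a single element.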