knowsys / nemo
A fast in-memory rule engine.
Home Page: https://knowsys.github.io/nemo-doc/
License: Apache License 2.0
We need a data structure to represent a Trie. It should store the underlying data as a vector of IntervalColumn. Furthermore, some interface for Trie iterators together with a basic implementation is required. A trie iterator should allow you to vertically traverse a Trie and should be able to return a ColumnScan object which can be used for the OrderedMergeJoin.
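As a rough illustration, the structure and iterator interface might take the following shape. This is a sketch under assumptions: only IntervalColumn, ColumnScan, and the OrderedMergeJoin are names from the issue; the methods and signatures are made up.

// Sketch only; names and method signatures are assumptions.
pub struct IntervalColumn { /* values plus interval boundaries */ }

pub trait ColumnScan: Iterator<Item = u64> {}

/// A Trie stores one IntervalColumn per layer; the intervals of
/// layer i+1 group the children of each value in layer i.
pub struct Trie {
    columns: Vec<IntervalColumn>,
}

/// Vertical traversal over a Trie.
pub trait TrieScan {
    /// Descend into the children of the current value.
    fn down(&mut self);
    /// Return to the parent layer.
    fn up(&mut self);
    /// A scan over the current layer's interval, usable as an
    /// input to the OrderedMergeJoin.
    fn current_scan(&mut self) -> Option<&mut dyn ColumnScan>;
}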
Suppose you want to evaluate the rule
a(x, y), b(y, z), c(z) -> d(x, y)
Currently, this would compute the temporary table tmp(x, y, z) and then project it down to d(x, y).
This is inefficient because the result does not depend on every valid z-value: it suffices to know that at least one such value exists and then move on to the next (x, y) pair.
To illustrate, consider a database with a(1, 2), b(2, 5), b(2, 6), c(5), c(6). We'd obtain tmp(1, 2, 5) and tmp(1, 2, 6), even though we only need to save d(1, 2).
The methods pos and narrow that are required for RangedColumnScans are currently not implemented for RleColumnScans.
These methods currently just panic; they should be made usable.
Implement difference
Implement project
Implement rename
The GenericColumnIterator currently iterates from the start to the end of a column. It should support the specification of different bounds and merely initialise these bounds to 0 and len() as a default choice.
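A minimal sketch of such a bounded iterator (the type and field names are assumptions, not the actual GenericColumnIterator API):

// Sketch only: an iterator over a slice restricted to [lower, upper).
pub struct BoundedColumnIterator<'a, T> {
    data: &'a [T],
    pos: usize,
    upper: usize,
}

impl<'a, T: Copy> BoundedColumnIterator<'a, T> {
    /// By default the bounds are 0 and data.len(), matching the
    /// current behaviour of iterating over the whole column.
    pub fn new(data: &'a [T]) -> Self {
        Self::with_bounds(data, 0, data.len())
    }

    /// Restrict iteration to the half-open interval [lower, upper).
    pub fn with_bounds(data: &'a [T], lower: usize, upper: usize) -> Self {
        Self { data, pos: lower, upper }
    }
}

impl<'a, T: Copy> Iterator for BoundedColumnIterator<'a, T> {
    type Item = T;

    fn next(&mut self) -> Option<T> {
        if self.pos < self.upper {
            let value = self.data[self.pos];
            self.pos += 1;
            Some(value)
        } else {
            None
        }
    }
}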
The seek method of GenericColumnIterator currently sets the upper bound to the length of the column instead of the end of the interval.
Currently, we use tagged enums, like RangedColumnScanT, to abstract over the data type used internally in the object. However, using these objects is often cumbersome, since they have to be "unpacked" into the corresponding typed enum object through match statements like
match target_schema.get_type(layer_index) {
    DataTypeName::U64 => {
        for scan in &trie_scans {
            if let RangedColumnScanT::U64(scan_enum) =
                scan.get_scan(layer_index).unwrap()
            {
                column_scans.push(scan_enum);
            } else {
                panic!("type should match here")
            }
        }
        ...
    },
    DataTypeName::Float => {...},
    DataTypeName::Double => {...},
}
This leads to code repetition and generally makes things harder to read/understand.
For the purpose of reordering a table, it was discussed to store floating point numbers as unsigned integers, giving you a two-dimensional array of one type. Since this would invalidate the ordering with respect to the floating point value anyway, we might want to give up on storing data in Column<Float>, and instead store everything in Column<u64>. This would allow us to replace the tagged enums with a "tagged struct", i.e.
struct ColumnScanT {
    datatype: DataTypeName, // "type" is a reserved word in Rust
    scan: ColumnScanEnum<u64>,
}
Here, no unpacking is necessary, so the different arms of the match statement can be unified.
For operations which require the actual type (like addition, certain aggregates, ...) one would have to cast them back. But this should not incur any runtime costs. You do, however, lose the ability to quickly calculate conditions like "greater than 1.0f", since the data would not be sorted according to their floating point value. Another thing to consider is whether differently sized integer types should be used (probably u32 and u64). If yes, then tagged enums might still be needed, leaving the code as complex as before (but perhaps still reducing the number of types to consider). Otherwise one may waste storage space. A third option would be to decide this at compile time, instead of for each individual table at run time.
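As a quick illustration of the round-trip cast involved (standard library only; this is an illustrative sketch, not code from the repository), note how the last assertion demonstrates the ordering caveat:

fn main() {
    // Store a float inside a u64 column by reinterpreting its bits.
    let original: f32 = 1.5;
    let stored: u64 = original.to_bits() as u64;

    // Casting back recovers the exact value at no runtime cost.
    let recovered = f32::from_bits(stored as u32);
    assert_eq!(recovered, original);

    // Caveat: the integer order of the bit patterns does not match
    // the floating point order, e.g. for negative numbers.
    let neg = (-1.0f32).to_bits() as u64;
    let pos = 1.0f32.to_bits() as u64;
    assert!(neg > pos); // -1.0 sorts after 1.0 as an integer
}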
Technical Debt from #62.
We have a hacky binary for testing with hardcoded input files. We should at least give it a proper CLI where the input file can be specified via an argument or where the input is just read from STDIN. Further arguments might be of use (later).
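A minimal sketch of such a CLI using only the standard library (file argument with STDIN fallback; the printed message is just a placeholder):

use std::env;
use std::fs;
use std::io::{self, Read};

fn main() -> io::Result<()> {
    // Read from the file given as the first argument, or from STDIN
    // if no argument is provided.
    let input = match env::args().nth(1) {
        Some(path) => fs::read_to_string(path)?,
        None => {
            let mut buffer = String::new();
            io::stdin().read_to_string(&mut buffer)?;
            buffer
        }
    };
    // ... hand `input` to the parser / reasoner here ...
    println!("read {} bytes of input", input.len());
    Ok(())
}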
The AdaptiveColumnBuilder is currently implemented to decide the use of an RleColumn vs. a VectorColumn based on the data it receives.
The current solution may not be suitable when further column types shall be supported.
This issue provides a place for discussion on refactoring possibilities. Some discussion also already happened in #8.
Once a decision is made, the implementation can happen in the scope of this issue.
Maybe it is worthwhile to postpone the discussion until another column type is actually to be supported by the AdaptiveColumnBuilder. In this sense, the issue acts as a mere reminder.
Parsing of rules provided as a text file
Implement join
Implement union
We need to discuss the planning and handling of Errors.
On a global library scale, we have crate::error::Error now.
Introducing an error-enum for every module/column-type feels a bit too much.
This happened during the PR, which got resolved, but in general one needs to think about errors and handling of them.
The initial issue started on the discussion about error-handling in one of the columns (RLE), therefore the following concepts are using this module/structure.
Possible places for error types that were discussed (using the column module as the example):
- in the module itself (e.g. crate::physical::columns::ColumnError)
- in an error submodule per module (e.g. crate::physical::columns::error::ColumnError) - this might be a bit too much boilerplate though
- in the top-level error module crate::error (e.g. crate::error::ColumnError)
- in a per-layer error module, either as one enum per layer (e.g. crate::physical::error::PhysicalError) or as one enum per concern (e.g. crate::physical::error::ColumnError)
- as variants of the single global enum (crate::error::Error)
In addition, in every case a variant in crate::error::Error should be in place too.
I think we are at a point where this decision needs to be made for the whole library, to avoid many different island solutions.
The result of this discussion should be documented somewhere too (either a wiki page, a sticky discussion thread, or just inside a CONTRIBUTING.md file).
Originally posted by @ellmau in #8 (comment)
Due to changes in the code, the discussion turned obsolete in the PR, but it is still a question to be answered sooner rather than later.
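For illustration, here is a hedged sketch of one of the options above: a module-level ColumnError plus a wrapping variant in crate::error::Error. The variant names are assumptions, and both enums are shown in one file for brevity.

use std::fmt;

// Sketch of the "error enum per module" option, plus the variant
// in the global error enum that the discussion asks for.
#[derive(Debug)]
pub enum ColumnError {
    /// Example variant; the real variants depend on the module.
    InvalidRleElement,
}

impl fmt::Display for ColumnError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ColumnError::InvalidRleElement => write!(f, "invalid RLE element"),
        }
    }
}

#[derive(Debug)]
pub enum Error {
    /// The global error wraps each module-level error in a variant.
    Column(ColumnError),
}

impl From<ColumnError> for Error {
    fn from(e: ColumnError) -> Self {
        Error::Column(e)
    }
}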
Technical Debt from #62.
Support for constants and duplicate variables is currently implemented using filters in a very hacky way in the execution engine.
I did not look into the code in detail yet so I cannot make an informed proposal on how to improve this.
Maybe @aannleax has ideas on this already?
For our custom types Float and Double, we have to delegate quite a lot of trait implementations to the underlying f32 and f64 types, respectively. I played around with delegate and ambassador a bit, but I did not achieve satisfying results here.
This is not a pressing issue at all but it would be nice to have if we can find a nice way to shorten those delegations.
Feel free to experiment here :)
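One direction to experiment with is a small macro_rules! helper. Below is a hedged sketch assuming Float is a newtype around f32; the real Float is also constructed via a checked new() (e.g. guarding against NaN), which is omitted here.

use std::ops::Add;

/// Assumed newtype wrapper, as in the issue (checks omitted).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Float(f32);

// Delegate a binary operator trait to the wrapped primitive.
macro_rules! delegate_binop {
    ($wrapper:ident, $trait:ident, $method:ident) => {
        impl $trait for $wrapper {
            type Output = Self;
            fn $method(self, rhs: Self) -> Self {
                $wrapper(self.0.$method(rhs.0))
            }
        }
    };
}

delegate_binop!(Float, Add, add);

fn main() {
    assert_eq!(Float(1.0) + Float(2.0), Float(3.0));
}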
Technical Debt from #62.
There is no one-to-one correspondence between the terms in model (obtained from parsing) and the internal datatypes (U64, Float, Double). There is currently only an incomplete mapping from terms to those datatypes.
What should we do here? I think this requires some more conceptual discussion.
To have a clearer separation of the logical and physical layer we want to put the physical layer into its own crate which is then used by the logical layer.
The only problem that needs solving here is code that is currently used by both layers. This is:
meta/timing.rs
meta/logging.rs
The logging dependency could be solved by pulling the log statements out of this global file to the places where they are used. The original reason for having them in a separate file was to reduce clutter in the functions which do the work. The redesign has to keep this in mind. One way of doing this could be to use Display implementations to avoid verbose setup code.
This should wait until #70 is done, since we currently have other dependencies relating to the type system.
As of now, all TrieScanEnum objects resulting from database operations only represent a partial trie, i.e. a trie where there may exist paths not of maximal depth. It would be nice to have trie iterators which look ahead to determine whether the current value is actually "real".
This is the meta-issue for the untanglement milestone; it holds the results of the discussion on 2022-11-23.
The physical layer should be self-sustaining (i.e., it could potentially be a stand-alone crate):
- names (Strings) are always set by the logical layer
- the available datatypes are u32, u64, f32, f64, "u32 with dictionary", and "u64 with dictionary"
- two columns can be combined when, e.g., they are u32 and u64, or (iii) they are "u32 with dictionary" and "u64 with dictionary" and they have the same dictionary
The logical layer has:
Technical Debt from #62.
We use pretty much the same logic to implement the Display trait of a trie and to dump it to a CSV file. We should unify this a bit more and extract the common parts.
minimal example:
let mut builder: AdaptiveColumnBuilder<u32> = AdaptiveColumnBuilder::new();
let column = builder.finalize(); // panic (division by 0)
An empty column should be returned by the builder.
I'm not sure whether this should be checked at the ColumnBuilder level or the RleColumn level.
Tries should implement the Display trait for easier debugging.
Consider a Trie that has [1, 2] in the upper level with [2, 3] and [3, 6] as their children, respectively. In turn, the lowest level consists of [7, 8], [8], [9], [9].
The output may look as follows:
1 2 7
    8
  3 8
2 3 9
  6 9
The following features are still missing:
Data in index-based structures shall be sorted. As a logical step before physically sorting all the data, a Permutator is favourable.
A permutator is mentioned as one method in Issue #7
The prefix handling code in the rule parser is currently incomplete. While prefixes are taken into account during parsing, there is, e.g., currently no way to obtain the original IRIs from a PrefixedName.
Depends on #58 being merged.
A further Column implementation that stores data using incremental run-length encoding should be created. The data structure should store a list of records that each contain (1) a start data value v (of the Column's type T), (2) a length l, and (3) an increment i. Such a block encodes a sequence of l values, starting with v, with each further value being i larger (i can be 0 for sequences of repeating values). This implementation is only needed for integer types (currently only u64 and usize, but further types might be added later, e.g., to reduce memory use).
The AdaptiveColumnBuilder should be extended to use this encoding of columns whenever there is a significant memory reduction in doing so (this is determined heuristically, by starting to build RLE columns and tracking their size; if the expected size of the plain encoding is less than some factor [2? 1.5?] of the size of the RLE encoding, then the data is transcoded to a plain vector and processing is continued with the vector; there might be other ways to guess what is best).
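A minimal sketch of such a record and its decoding (the field names are assumptions; only the (v, l, i) semantics come from the issue):

// Sketch only: one record of the incremental run-length encoding.
struct RleElement {
    value: u64,     // start value v
    length: usize,  // number of encoded values l
    increment: u64, // step i between consecutive values
}

impl RleElement {
    /// The k-th value of the block is v + k * i.
    fn get(&self, k: usize) -> u64 {
        assert!(k < self.length);
        self.value + k as u64 * self.increment
    }
}

fn main() {
    // Encodes the sequence [7, 7, 7] (increment 0)...
    let constant = RleElement { value: 7, length: 3, increment: 0 };
    assert_eq!(constant.get(2), 7);
    // ...and [10, 12, 14, 16] (increment 2).
    let rising = RleElement { value: 10, length: 4, increment: 2 };
    assert_eq!(rising.get(3), 16);
}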
Due to the traits defined for the generic data type T, one cannot instantiate an AdaptiveColumnBuilder<u64>, AdaptiveColumnBuilder<Float>, or AdaptiveColumnBuilder<Double>.
It should be possible to instantiate the ColumnBuilder with any of the defined "basic" datatypes.
There has been a conversation about the choice of the datatype of the Dictionary indices. It should be discussed in due time.
Apart from this discussion on usize/u64 (which we should probably decide on in the group) and the small changes above, this looks good, though.
Thank you for the thoughtful review, I did all the changes except the usize vs u32/u64 issue.
I see your concern. It is possible to refactor the Dictionary trait to be either u64 or u32.
My thoughts have been to utilize the native number representation of the machine.
usize is not about being the native representation; its sole point is to cover the whole range of allowable pointer values. Even i386 has 64-bit registers, so u64 could also be considered "native" there.
It is still something to consider when using indices, as indices are always usize. I wanted to avoid unnecessary casting back and forth when I designed the Dictionary trait.
Originally posted by @ellmau in #68 (comment)
RuleParser should have a finalize method that moves out the parsed Data, Program, and Dictionary.
Depends on #58 being merged.
Encountered the following error while running cargo +nightly test:
---- physical::dictionary::prefixed_string_dictionary::test::iri stdout ----
thread 'physical::dictionary::prefixed_string_dictionary::test::iri' panicked at 'assertion failed: `(left == right)`
left: `0`,
right: `4`', src/physical/dictionary/prefixed_string_dictionary.rs:440:13
failures:
physical::dictionary::prefixed_string_dictionary::test::iri
Define two traits for defined behaviour of operations:
Currently, the parser requires one large String of everything to parse. Switch to nom's incremental parsing, which should mostly amount to changing a few imports and possibly handling some Incomplete errors, and have a wrapper that can chunk large inputs.
Depends on #58 being merged.
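Regarding the incremental parsing above, a small illustration of how nom's streaming combinators report Incomplete (nom 7 API; the parser shown is a stand-in, not the actual rule grammar):

use nom::bytes::streaming::tag;
use nom::IResult;

// Stand-in parser: recognises the keyword "@prefix".
fn prefix_keyword(input: &str) -> IResult<&str, &str> {
    tag("@prefix")(input)
}

fn main() {
    // A complete chunk parses as usual.
    assert!(prefix_keyword("@prefix wd: <...>").is_ok());

    // A chunk that ends mid-token yields Incomplete instead of an
    // error, signalling that more input should be supplied.
    match prefix_keyword("@pre") {
        Err(nom::Err::Incomplete(_)) => { /* fetch the next chunk */ }
        other => panic!("unexpected result: {:?}", other),
    }
}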
Calculate a dependency graph from a given rule set.
After importing a CSV file, it is sometimes necessary to sort the data before it is given to the column builders.
To keep the relations within the data, which is already represented in column-like vectors, one needs to implement a Permutator that sorts all of the vectors in the same order.
In contrast to the usual implementation from crates.io (Permutator), ordering that spans multiple slices shall be supported too, i.e. sorting by the values of additional columns if the values in one column are identical.
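A minimal sketch of the idea (standard library only: compute a sort permutation lexicographically across several key columns, then apply it to every column; all names are placeholders):

// Sketch only: compute a sort permutation over multiple key columns
// and apply it to every column of the table.
fn sort_permutation(keys: &[&[u64]]) -> Vec<usize> {
    let len = keys[0].len();
    let mut perm: Vec<usize> = (0..len).collect();
    // Compare rows lexicographically across all key columns.
    perm.sort_by(|&a, &b| {
        keys.iter()
            .map(|col| col[a].cmp(&col[b]))
            .find(|ord| ord.is_ne())
            .unwrap_or(std::cmp::Ordering::Equal)
    });
    perm
}

fn apply_permutation(perm: &[usize], column: &[u64]) -> Vec<u64> {
    perm.iter().map(|&i| column[i]).collect()
}

fn main() {
    let first = [1, 2, 1];
    let second = [3, 1, 2];
    // Ties in `first` are broken by `second`.
    let perm = sort_permutation(&[&first, &second]);
    assert_eq!(perm, vec![2, 0, 1]);
    assert_eq!(apply_permutation(&perm, &first), vec![1, 1, 2]);
    assert_eq!(apply_permutation(&perm, &second), vec![2, 3, 1]);
}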
Tries (and anything that implements the Table trait) should be constructable from a list of columns or a list of rows, which essentially represent a table.
For example, consider the following table:
1 3 8
1 2 7
2 3 9
1 2 8
2 6 9
This may be given as input row-wise: [[1, 3, 8], [1, 2, 7], ...] or column-wise: [[1, 1, 2, 1, 2], [3, 2, 3, 2, 6], ...].
The resulting Trie would look like this:
1 2 7
    8
  3 8
2 3 9
  6 9
The current project-and-reorder operation is very inefficient and accounts for the majority of the runtime in the benchmarks performed so far.
Basically, this operation has to be reimplemented from the ground up, as the current approach did not turn out to be very good.
The semi-naive evaluation can be improved in several dimensions, improving its speed and memory consumption.
There are a few possibilities for optimizing the computation. Some of them may interact with each other. This list might be incomplete.
Stage2 should be able to perform Datalog reasoning using the semi-naive evaluation technique.
The seek() implementation in GenericColumnScan should support binary search. Right now, it only uses iteration. Binary search should be used while the interval of values to scan is not small (say of size >10; this should be defined as a constant somewhere). Small intervals (say below 10 elements) should always be scanned.
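A hedged sketch of such a hybrid seek over a sorted slice (the threshold value and names are placeholders; the real GenericColumnScan tracks more state):

/// Below this interval size, plain linear scanning wins.
const BINARY_SEARCH_THRESHOLD: usize = 10;

/// Return the index of the first element >= target inside
/// data[lower..upper], assuming that slice is sorted.
fn seek(data: &[u64], mut lower: usize, mut upper: usize, target: u64) -> usize {
    // Narrow the interval by binary search while it is large...
    while upper - lower > BINARY_SEARCH_THRESHOLD {
        let mid = lower + (upper - lower) / 2;
        if data[mid] < target {
            lower = mid + 1;
        } else {
            upper = mid;
        }
    }
    // ...then finish with a linear scan over the small remainder.
    while lower < upper && data[lower] < target {
        lower += 1;
    }
    lower
}

fn main() {
    let data = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23];
    assert_eq!(seek(&data, 0, data.len(), 8), 4); // first value >= 8 is 9
}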
@aannleax pointed out some straightforward improvements to the performance of the RleColumn.
The main improvement possibilities lie in the get method of the RleColumn as well as in the seek method of the RleColumnScan.
Both methods are implemented rather naively at the moment.
Namely, the following improvements should be done:
Instead of storing a vector of structs, the RleColumn can store 3 vectors which, in comparison to the RleElement, hold:
- the start value of each RleElement,
- the end index of each RleElement plus one in the context of the RleColumn (in other words, the accumulated length of all elements before and including the current one),
- the increment of each RleElement.
For example, the list [1, 2, 3, 7, 9, 11] can be encoded with three vectors as follows: [1, 7], [3, 6], [1, 2].
The get method should perform a binary search on the end indices. (This is straightforward after the above adjustment.)
The seek method should perform a binary search on the values. This requires the assumption that the data encoded in the RleColumn is sorted, which is probably a valid assumption. Note that this adjustment could be done without the other two.
The assumption of sorted data for the last task may be subject to discussion and may deserve its own issue.
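For the first two adjustments, a minimal sketch of the three-vector layout and the binary search in get (field names follow the example above; everything else is an assumption):

// Sketch: the three-vector layout proposed above.
struct RleColumn {
    values: Vec<u64>,        // start value of each element
    end_indices: Vec<usize>, // accumulated lengths (end index + 1)
    increments: Vec<u64>,    // increment of each element
}

impl RleColumn {
    fn get(&self, index: usize) -> u64 {
        // Binary search for the first block whose end index exceeds `index`.
        let block = self.end_indices.partition_point(|&end| end <= index);
        let block_start = if block == 0 { 0 } else { self.end_indices[block - 1] };
        let offset = (index - block_start) as u64;
        self.values[block] + offset * self.increments[block]
    }
}

fn main() {
    // Encodes [1, 2, 3, 7, 9, 11], as in the example above.
    let col = RleColumn {
        values: vec![1, 7],
        end_indices: vec![3, 6],
        increments: vec![1, 2],
    };
    assert_eq!(col.get(2), 3);  // block 0: 1 + 2 * 1
    assert_eq!(col.get(5), 11); // block 1: 7 + 2 * 2
}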
Some benchmarks for the operations refer to files like test-files/bench-data/xe.csv that are not in the repository.
@aannleax could you add those files?
The quickcheck test uncovered that in certain cases, the RLE column returns Inf or -Inf in the get function instead of a large floating point number.
One example for this is:
let col = RleColumn::new(vec![Float::new(-3.4028235e38).unwrap(), Float::new(0.0).unwrap(), Float::new(3.4028235e38).unwrap()]);
assert_eq!(col.get(2), Float::new(3.4028235e38).unwrap()); // fails since col.get(2) returns Infinity
I think the cause of this problem is that the column is stored as RleColumn { values: [Float(-3.4028235e38)], end_indices: [3], increments: [Increment(Float(3.4028235e38))] } and get(2) just multiplies the increment by 2, which probably already ends up as Inf. Adding Inf to the start value then results in Inf as well.
In general, for signed datatypes, the issue is that increments like MAX - MIN overflow the value space of the underlying datatype.
Implement Select
The adaptive column builder starts building an RleColumn and uses a heuristic to decide whether to keep it or transform it into a VectorColumn at some point.
Currently, the adaptive column builder uses two constants for this heuristic: COLUMN_IMPL_DECISION_THRESHOLD, which determines after how many RleElements a decision is made, and TARGET_MIN_LENGTH_FOR_RLE_ELEMENTS, which requires a minimum average length of the RleElements to pick an RleColumn over a VectorColumn.
Instead of using constants, we should allow these values to be passed when creating the AdaptiveColumnBuilder. We should provide Default implementations for both arguments with the current constant values. Furthermore, COLUMN_IMPL_DECISION_THRESHOLD should support a special value for making the decision only when finalize is called.
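A hedged sketch of such a configuration (the default numbers are placeholders, not the repository's actual constants; the special value is modelled as an Option instead of a sentinel):

/// Sketch of a configuration object for the adaptive column builder.
pub struct ColumnImplHeuristic {
    /// Number of RleElements after which the decision is made;
    /// None means: decide only when finalize is called.
    pub decision_threshold: Option<usize>,
    /// Minimum average RleElement length required to keep the RleColumn.
    pub target_min_length: usize,
}

impl Default for ColumnImplHeuristic {
    fn default() -> Self {
        Self {
            decision_threshold: Some(1000), // placeholder value
            target_min_length: 3,           // placeholder value
        }
    }
}

fn main() {
    // Default heuristic, and one that postpones the decision to finalize.
    let _default = ColumnImplHeuristic::default();
    let _lazy = ColumnImplHeuristic {
        decision_threshold: None,
        ..Default::default()
    };
}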
After calling narrow on the RLE column, the seek operation should be restricted to the given interval. Right now this does not work as expected, since the search might end up in an RLE block with a negative increment, which is not supported.
Define how to access and manipulate data, based on the physical table representation.
Before the recent restructuring, the RLE scan was not used in the reasoning process, but now it is.
Apparently, the RLE scan is much slower than the generic column scan that was used before.
We should at least get the performance back to the original level.
Technical Debt from #62.
When reading a CSV file, we pass an optional datatype for each column (U64, Float, Double). If none of these datatypes is specified, we treat the contents of the column as strings and internally represent them by u64 (with a dictionary).
Is this fallback solution what we want, or should we rather introduce String as a datatype? It currently feels a little bit like a dirty hack to me. Let's discuss this!