1. Introduction
Datalog [1], the canonical language for reasoning over relational databases and ground fact stores, is a declarative language for evaluating sets of possibly recursive, restricted Horn clauses while remaining not Turing complete. Evaluating a program entails computing the implicit consequences of a fact store, yielding new facts.
Materialization, or the physical storage of a program’s consequences, eliminates the need for reasoning during query answering. Maintaining this computation is essential for modern datalog use cases, as it relates to the broader problem of incremental view maintenance.
While the semi-naive evaluation method [1] efficiently handles additions, deletions are often less efficient, as retracting a fact may naively imply deleting all data derived from it. The delete-rederive (DRED) [2] method addresses this issue by computing the adjustment to the materialization through two generated datalog programs: the first computes all possible deletions, and the second determines which of the overdeleted facts have alternative derivations. The difference between these two sets is the set of facts that must actually be deleted.
Using two distinct algorithms for additions and deletions results in different performance characteristics, potentially causing severe biases. For example, when a large portion of the ground facts is deleted, say more than ten percent, which is admittedly not a very realistic workload, delete-rederive can be significantly more expensive than recomputing the materialization from scratch: computing all overdeletions and alternative derivations may take longer than rematerialization itself, since even a small portion of the ground facts can have a high impact on the number of inferred facts.
A promising way to tackle incremental maintenance in a more uniform manner is to use differential dataflow, a programming model that efficiently processes and maintains large-scale, possibly recursive dataflow computations. Central to it are fine-grained change tracking with partially ordered timestamps and the processing of differences between collections of data rather than the collections themselves. This approach enables efficient updates in response to changes in the underlying data [3].
In the context of datalog, differential dataflow (DD) presents an opportunity to address the performance challenges arising from handling additions and deletions differently. In contrast to traditional methods, such as semi-naive evaluation for additions and delete-rederive for deletions, differential dataflow provides a unified and efficient approach to incremental view maintenance.
The utilization of partially ordered timestamps and arrangements allows DD to precisely identify affected parts of the computation and to recompute only the necessary components. This leads to more efficient handling of incremental updates in datalog evaluation, as the system can focus on affected sub-computations rather than re-evaluating the entire program. Furthermore, there is also first-class support for both automatic parallelism and distributed computing, contributing to enhanced performance and scalability.
Distinct algorithms for additions and deletions in data processing can introduce severe biases and challenges, particularly in scenarios involving dynamic datasets. Differential dataflow offers a solution by treating additions and deletions as integral parts of the data evolution process, ensuring temporal consistency, enabling incremental computation, and facilitating a more accurate and comprehensive analysis of changing data.
DDLog [4] is a novel attempt at building a datalog engine that utilizes DD. Similarly to the high-profile reasoner Soufflé [5], it is a compiler: a datalog program becomes an executable program in a low-level language, C++ in Soufflé's case and Rust for DDLog. The rationale for the language choice is that DD's canonical implementation is a heavily optimized map-reduce-like framework written in Rust.
Notably, because DDLog is a compiler, it is not suited to situations where the program is expected to be dynamic, with rules being added or removed, or where new programs must be evaluated at run time, restricting its use to scenarios where such drawbacks are acceptable.
There has been no study evaluating the isolated benefit of DD to datalog evaluation. Therefore, the suitability of DD in this context remains unclear, emphasizing the importance of further research on its potential benefits and limitations in incremental view maintenance.
Contributions. In this work, we directly address the posited research question by developing a datalog interpreter utilizing DD. We then compare our implementation with other prototypical datalog interpreters, created from scratch, which share as many components as is reasonable, in order to isolate the effect of DD on both runtime performance and memory efficiency. This allows us to more accurately and empirically assess how DD performs against more traditional approaches.
Unlike DDLog, which compiles a datalog program into its evaluation as a fixed DD program, our approach involves writing a single DD program capable of evaluating any datalog program. This eliminates the need for compilation and provides the additional benefit of incremental maintenance for both rule removals and additions.
Structure of the paper.
Background. A brief recapitulation of the general background, with datalog, its evaluation methods, and the delete-rederive method being formally introduced.
Differential Evaluation. DD and the translation of datalog evaluation to a dataflow are showcased and explained.
System. The developed interpreters are described alongside all optimizations and benchmark-relevant information.
Evaluation. An empirical evaluation of all reasoners over multiple different programs and datasets is undertaken.
2. Related Works
DD Applications and Related Projects. There are two relevant DD projects that are worth mentioning. One of them is Graspan, a parallel graph processing system that uses DD for the efficient incremental computation of static program analyses over large codebases.
Graspan models the program analysis problem as a reachability problem on a graph, where nodes represent program elements and edges represent the relationships between these elements. It leverages DD to incrementally update the analysis results in response to changes in the input graph, which can be due to code modifications or updates to the analysis rules. Graspan has demonstrated its ability to scale to large codebases and provide low-latency updates for various static analyses, including points-to analysis, control-flow analysis, and dataflow analysis.
Another project of interest is DBSP [6], a recent development that started from the need for a more concise theoretical definition of DD. All of DBSP's operators are based on DD's; however, its computational model is less powerful, as it does not allow updates to past values in a stream and assumes that inputs arrive in time order. DBSP can express both incremental and non-incremental computations, the latter not being possible in DD.
Datalog engines. There are two kinds of datalog engines. The first encompasses those that compile a datalog program, usually to a systems-level programming language, and the second comprises interpreters, which are able to evaluate any datalog program.
Soufflé is a prominent example of a datalog compiler that translates datalog programs into high-performance C++ code. It incorporates several optimization techniques, such as parallel execution with highly specialized data structures [7] and nearly optimal join ordering [8]. Notably, its development has been an unparalleled source of articles on the engineering of reasoners.
DDLog, as previously mentioned, compiles datalog to DD, achieving efficient differential data updates for datalog programs. It demonstrates the applicability of DD in the context of declarative logic programming and incremental view maintenance.
The majority of recently developed reasoners have been interpreters, further split into distributed and shared-memory systems. Among the shared-memory ones, the most notable are RDFox [9], a highly specialized and performant reasoner tailored to semantic web needs; RecStep [10], which builds on top of a highly efficient relational engine; and DCDatalog [11], which builds upon the query optimizer DeALS [12] and extends, to general positive programs, a work establishing how some linear datalog programs can be evaluated in a lock-free manner.
One of the most high-profile datalog papers of interest has been BigDatalog [13], which originally used the query optimizer DeALS and was built on top of the very popular Spark [14] distribution framework. Soon after came Cog, a prototypical implementation [15] over Flink [16], a distribution framework that supports streaming. Flink, unlike Spark, supports iteration, so implementing reasoning did not require extending the core of the underlying framework. The most successful attempt at creating a distributed implementation has been Nexus [17], which is also built on Flink and makes use of its most advanced feature, incremental stream processing.
3. Background
Datalog [1] is a declarative programming language. A program $P$ is a set of rules $r$, with each $r$ being a restriction of a tuple-generating dependency:

$$B_1(\vec{x}_1) \wedge \dots \wedge B_k(\vec{x}_k) \rightarrow H(\vec{x}_{k+1}),$$

with $k$ and $j$ as finite integers, each $\vec{x}_i = (x_1, \dots, x_j)$ a tuple of terms, and each $B_i$ and $H$ a predicate. A term can belong either to a set of variables or to a set of constants. The set of all $B_i$ is called the body, and $H$ the head.
A rule $r$ is said to be datalog if no predicate is negated and all variables in the head appear somewhere in the body, so that no existential variables can occur. Conversely, a datalog program is one in which all rules are datalog.
Example 1 shows a simple, valid recursive program. The first rule denotes that, for all x and y, if x is in an Edge relation with y, then x is in a TC relation with y; the second, that for all x, y, and z, if x is in a TC relation with y and y is in a TC relation with z, then x is in a TC relation with z.
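Written out, the two rules just described are the following (a rendering consistent with the description of Example 1):

$$\begin{aligned}
TC(x, y) &\leftarrow Edge(x, y).\\
TC(x, z) &\leftarrow TC(x, y),\ TC(y, z).
\end{aligned}$$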
The programs denote implications over a store of ground facts. This store is called the extensional database, the EDB, and the result of evaluating a program over some EDB is the IDB, the intensional database.
Let $I_0 = \text{EDB}$, and let $P$ be a program. We define the immediate consequence of $P$ over $I_i$ as all facts that are either in $I_i$ or stem from the result of applying the rules in $P$ to $I_i$. The immediate consequence operator $T_P$ maps $I_i$ to the union of $I_i$ and its immediate consequence, $I_{i+1} = T_P(I_i)$. The IDB, at the moment of the $i$-th application of $T_P$, is the difference between the union of all $I_j$ with $j \le i$ and the EDB, therefore consisting only of the inferred facts.
It is trivial to see that $T_P$ is monotone, and given that both the EDB and $P$ are finite sets and that $I_0 = \text{EDB}$ at the start, at some point $T_P(I_n) = I_n$, since there will be no new facts to infer. This point is the least fixed point of $T_P$ [1]. Computing the least fixed point as described, by recursively applying the immediate consequence operator, is called naive evaluation, which is not often used in practice since, in every iteration, it not only infers new facts but also recomputes all previously inferred ones.
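As a reference point for the later sections, the following is a minimal Rust sketch of naive evaluation, under the simplifying assumption that one round of rule application is available as a black-box function (the name `apply_rules` and the fact representation are hypothetical):

```rust
use std::collections::HashSet;

// A ground fact: relation name plus constant terms (a simplification).
type Fact = (String, Vec<String>);

// Naive evaluation: repeatedly apply the immediate consequence operator until
// no new facts are inferred, i.e., until the least fixed point is reached.
// `apply_rules` stands in for one round of applying every rule of P to I.
fn naive_eval(
    edb: &HashSet<Fact>,
    apply_rules: impl Fn(&HashSet<Fact>) -> HashSet<Fact>,
) -> HashSet<Fact> {
    let mut interpretation: HashSet<Fact> = edb.clone(); // I_0 = EDB
    loop {
        // T_P(I) = I ∪ {facts inferred from I}; note that every round
        // recomputes all previously inferred facts as well.
        let consequences = apply_rules(&interpretation);
        let before = interpretation.len();
        interpretation.extend(consequences);
        if interpretation.len() == before {
            return interpretation; // fixed point: T_P(I) = I
        }
    }
}
```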
3.1. Semi-Naive Evaluation
The semi-naive evaluation algorithm [1] is a widely used technique for improving naive evaluation; it directly addresses, but does not entirely solve, its major inefficiency, the redundant recomputation of previously inferred facts. Given a datalog program $P$ and an EDB, the algorithm iteratively computes the IDB in the same manner as naive evaluation, with the addition of maintaining a set of new delta facts, $\Delta$, generated in each iteration.
Given a program $P$ with rules $r_1, \dots, r_n$, with bodies $B_1, \dots, B_n$ and heads $H_1, \dots, H_n$, the delta program generates one new rule for each IDB relation in each rule body $B_i$, in order to represent that only facts that have been recently inferred are to be taken into account in subsequent iterations.
Example 2. Semi-naive Evaluation Delta Program
With Example 1 as the baseline, Example 2 is its resulting delta program. While semi-naive evaluation indeed reduces the number of redundantly inferred facts, it is particularly efficient for a certain class of simple datalog programs that are common in practice, namely linear programs, those in which each rule has at most one IDB relation in its body, thereby generating only one delta rule per rule instead of multiple, as in the example.
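For Example 1, the second rule yields two delta rules, one per TC atom in its body, while the first rule, having no IDB atom in its body, contributes only in the first iteration; a rendering consistent with the description of Example 2, with $\Delta TC$ holding the facts inferred in the previous iteration, is:

$$\begin{aligned}
\Delta TC(x, z) &\leftarrow \Delta TC(x, y),\ TC(y, z).\\
\Delta TC(x, z) &\leftarrow TC(x, y),\ \Delta TC(y, z).
\end{aligned}$$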
In spite of being asymptotically better than naive evaluation, there are substantial implementation challenges that must be addressed to ensure that the overhead does not outweigh the possible performance gains, since the algorithm requires multiple indexes, one per delta relation, and efficient set operations to keep track of the most recently inferred facts. This is of utmost importance when using semi-naive evaluation as a method to incrementally handle additions to the EDB.
It often occurs that a materialization needs to be adjusted, either to additions or to retractions of ground facts. Both semi-naive and naive evaluation are iterative; thus, additions can be dealt with by simply restarting their computations, with the former taking the entire set of added facts as the initial set of delta facts instead of the empty set. The major goal of continuing the computation is for it to be more efficient than restarting the materialization altogether.
3.2. Delete-Rederive
While both aforementioned evaluation methods provide mechanisms to incrementally adjust materialization to new ground facts, neither supports the retraction of ground facts, a problem that is significantly more involved, since a single fact might have multiple possible derivations.
The most used method is the bottom-up delete-rederive (DRED) algorithm [2], which relies on evaluating two new programs: one that computes all possible deletions that could stem from the deletion of the facts being retracted, and another that attempts to find alternative derivations for the overdeleted facts.
Given a program $P$ with rules $r_1, \dots, r_n$, with bodies $B_1, \dots, B_n$ and heads $H_1, \dots, H_n$, the overdeletion program generates one new $-$rule for each atom in each rule body $B_i$, in order to represent that, if such a fact were to be deleted, then $H_i$ would no longer necessarily hold.
Example 3. DRED Overdeletion Program
In Example 3, negative predicates represent overdeletion targets for Example 1. For instance, if Edge(2, 3) is being deleted, then TC(2, 3) will be deleted, alongside any other inferred fact that depends on it. Given that the overdeletion program is a regular datalog program, it can be efficiently evaluated with semi-naive evaluation or any other evaluation algorithm.
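For Example 1, the overdeletion rules take the shape below, where the $-$ relations hold deleted and overdeleted facts (a rendering consistent with the construction just described):

$$\begin{aligned}
-TC(x, y) &\leftarrow -Edge(x, y).\\
-TC(x, z) &\leftarrow -TC(x, y),\ TC(y, z).\\
-TC(x, z) &\leftarrow TC(x, y),\ -TC(y, z).
\end{aligned}$$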
The next step is to compute alternative derivations of the deleted facts, since some overdeleted facts might still hold. The alternative derivation program generates one new $+$rule for each rule in $P$, with one extra body atom per rule, the $-$ predicate of the rule's head, representing an overdeleted fact. The $+$ program requires the overdeleted facts to have already been removed.
Example 4. DRED Alternative Derivation Program
The output relations from Example 4 represent the data that has to be put back into the materialization for Example 1. The rationale for alternative derivations is that, for instance, if TC(3, 4) was overdeleted because one of the TC facts it was derived from through the second rule was deleted, but Edge(3, 4) itself was not deleted, then by the first rule there is an alternative derivation for TC(3, 4).
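Correspondingly, the alternative derivation rules for Example 1 take the shape below, with $+TC$ collecting rederived facts and one extra body atom per rule, the $-$ predicate of its head (again a rendering consistent with the construction described above):

$$\begin{aligned}
+TC(x, y) &\leftarrow -TC(x, y),\ Edge(x, y).\\
+TC(x, z) &\leftarrow -TC(x, z),\ TC(x, y),\ TC(y, z).
\end{aligned}$$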
As can be seen, computing the maintenance of the materialization implies evaluating a program bigger than the one that produced the materialization itself. However, because it is evaluated with semi-naive evaluation, the asymptotic complexity remains the same. Nonetheless, in practice, deletion is often much slower than addition, as can be seen trivially in the worst possible scenario, in which all facts are deleted: whereas rematerializing the empty store would be free, DRED would incur an expensive fact-by-fact deletion operation.
3.3. Substitution-Based Evaluation
The most impactful aspect of all of the introduced evaluation mechanisms is the implementation of the immediate consequence operator itself. The two most high-profile methods are either directly evaluating the rules or rewriting them into some other, imperative formalism, such as relational algebra, and executing that.
The substitution-based [1] method is the simplest example of the former. A substitution $\sigma$ is a homomorphism $v \mapsto c$, such that $v$ is a variable and $c$ is a constant. Given a non-ground fact, such as $T(?x, b)$, applying the substitution $\{?x \mapsto a\}$ to it will yield the ground fact $T(a, b)$.
Let $r$ be a datalog rule of the form $h \leftarrow b_1, \dots, b_k$, where $h$ is the head atom and $b_1, \dots, b_k$ are the body atoms. Let $F$ be the set of ground facts for the input relations.
The substitution-based method computes the immediate consequence of the rule $r$ as follows: define the initial set of substitutions as $S_0 = \{\sigma_\emptyset\}$, where $\sigma_\emptyset$ is the empty substitution. Then, for each body atom $b_i$, find the ground facts in $F$ that match it, extending every substitution in $S_{i-1}$ accordingly to obtain $S_i$; after the last body atom, each substitution in $S_k$ is applied to $h$.
Algorithm 1 is the formal specification of the substitution-based method. There is a noteworthy performance issue that arises from its interaction with DRED. During the alternative derivation phase, the new program has one more body atom. This can be prohibitively more expensive to evaluate than the original program, since one extra body atom implies one extra iteration, which could generate a polynomial number of substitutions due to the cartesian-product nature of each step.
Algorithm 1: Substitution-based Immediate Consequence.
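The following is a minimal Rust sketch of the substitution-based immediate consequence of a single rule, as just described; the data model and helper implementations are illustrative simplifications rather than the system's exact code:

```rust
use std::collections::BTreeMap;

// Illustrative data model, simplified relative to the layout in Section 5.
#[derive(Clone, Debug)]
enum Term { Var(String), Const(String) }

#[derive(Clone, Debug)]
struct Atom { symbol: String, terms: Vec<Term> }

// A substitution maps variable names to constants.
type Substitution = BTreeMap<String, String>;

// Apply a substitution to an atom, replacing mapped variables by constants.
fn apply(sub: &Substitution, atom: &Atom) -> Atom {
    Atom {
        symbol: atom.symbol.clone(),
        terms: atom.terms.iter().map(|t| match t {
            Term::Var(v) => sub.get(v)
                .map(|c| Term::Const(c.clone()))
                .unwrap_or_else(|| t.clone()),
            c => c.clone(),
        }).collect(),
    }
}

// Unify a (possibly partially ground) atom against a ground fact, extending
// `sub` with new bindings; returns None on any constant mismatch.
fn unify(atom: &Atom, fact: &Atom, sub: &Substitution) -> Option<Substitution> {
    if atom.symbol != fact.symbol || atom.terms.len() != fact.terms.len() {
        return None;
    }
    let mut extended = sub.clone();
    for (t, f) in atom.terms.iter().zip(&fact.terms) {
        match (t, f) {
            (Term::Const(a), Term::Const(b)) if a == b => {}
            (Term::Const(_), Term::Const(_)) => return None,
            (Term::Var(v), Term::Const(b)) => match extended.get(v) {
                Some(bound) if bound != b => return None,
                _ => { extended.insert(v.clone(), b.clone()); }
            },
            _ => return None, // facts are assumed to be ground
        }
    }
    Some(extended)
}

// Substitution-based immediate consequence of one rule (head <- body) over a
// set of ground facts: start from the empty substitution and expand it one
// body atom at a time, as in Algorithm 1.
fn immediate_consequence(head: &Atom, body: &[Atom], facts: &[Atom]) -> Vec<Atom> {
    let mut subs = vec![Substitution::new()];
    for body_atom in body {
        let mut next = Vec::new();
        for sub in &subs {
            let fresh = apply(sub, body_atom); // the "fresh atom" for this substitution
            for fact in facts.iter().filter(|f| f.symbol == body_atom.symbol) {
                if let Some(extended) = unify(&fresh, fact, sub) {
                    next.push(extended);
                }
            }
        }
        subs = next;
    }
    subs.iter().map(|sub| apply(sub, head)).collect()
}
```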
3.4. Relational Algebra Rewriting Method
The de facto datalog evaluation method, which virtually all recent reasoners [5,10,11,13,15,17] abide by, is to rewrite datalog rules into relational algebra, a well-known technique whose efficiency benefits from the extensive industrial and academic research poured into data processing frameworks that handle very large amounts of data, and from the techniques that have arisen from them.
Relational algebra [18] explicitly denotes operations over relations, sets of tuples with fixed arity. It is the most popular database formalism in existence, with virtually every major database system adhering to the relational model and using SQL as a declarative syntax [19,20].
DD either implements, or makes it trivial to implement, all relational algebra operators relevant to datalog, therefore providing convenient tools to manually specify the evaluation of a datalog program as a dataflow. It nonetheless does not directly make writing an interpreter more convenient, only a compiler.
4. Differential Evaluation
Differential dataflow is a computational framework that generalizes incremental processing to times that are possibly partially ordered and specifically operates over generalized multisets.
Let $C$ be a multiset, referred to as a collection, with $C_t$ being its value at a partially ordered time $t$ and the multiplicity of each record being drawn from a monoid. We establish that the difference of some collection $C$ at time $t$, named $\delta C_t$, is defined as:

$$\delta C_t = C_t - \sum_{s < t} \delta C_s.$$

It also, therefore, holds that the value of $C_t$ can be reconstructed by the following equivalence:

$$C_t = \sum_{s \le t} \delta C_s.$$

We utilize plain multiset semantics with signed integers as multiplicities.
Let $A$ and $B$ be collections, and let $\mathit{Op}$ be some operator that maps a collection to another collection (or to itself). Assuming $B$ to be the output of $\mathit{Op}$ applied over $A$, computations in DD maintain

$$B_t = \mathit{Op}(A_t) \quad \text{by computing} \quad \delta B_t = \mathit{Op}\Big(\sum_{s \le t} \delta A_s\Big) - \sum_{s < t} \delta B_s,$$

with the work performed being proportional to the size of $\delta A_t$ and not to that of $A_t$. Stateful operators, such as the relational join, require more involved differentiation steps.
A core premise of the canonical DD implementation lies in cleverly and efficiently maintaining the collections and their differences, specifically in the context of iterative dataflows, due to $t$ being partially ordered.
Let us assume that a datalog program is being evaluated and that five fact updates, labeled $u_1, \dots, u_5$, arrive. In regular semi-naive evaluation, even though rule application might happen in parallel, $u_{i+1}$ will only be evaluated after $u_i$'s evaluation has finished, and the data used to compute each update will always consist of all extensional and intensional (previously inferred) facts.

In contrast, program evaluation could be written as a DD dataflow with a (partially ordered) product-order timestamp $(t, a)$, with $t$ being the time of arrival of the update and $a$ keeping track of the iteration. Product order is defined as:

$$(t_1, a_1) \le (t_2, a_2) \iff t_1 \le t_2 \wedge a_1 \le a_2.$$

If we treat $u_1, \dots, u_5$ as differences with the respective timestamps listed in Table 1, it is noticeable that some of these timestamps are incomparable: neither difference is visible from the other. This, in turn, has an important consequence in differential dataflow: the computations at incomparable timestamps happen independently of each other, meaning they may be computed in parallel.
Within the context of datalog, the aforementioned evaluation semantics provide a full alternative to the way incremental datalog evaluation is currently performed. Most specifically, the practical advantage of differential dataflow is that, instead of using semi-naive evaluation and DRED, one can describe the evaluation process as a dataflow and have both additions and retractions handled in the same way, with efficient parallelism and symmetric handling of updates.
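To make this concrete, the following is a minimal sketch (assuming the `timely` and `differential-dataflow` crates) of Example 1's transitive closure written directly as a DD dataflow; once the dataflow is built, additions and retractions are both just differences fed to the same input:

```rust
use differential_dataflow::input::Input;
use differential_dataflow::operators::{Iterate, Join, Threshold};

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        let mut edges = worker.dataflow::<u32, _, _>(|scope| {
            let (handle, edges) = scope.new_collection::<(u32, u32), isize>();

            // TC(x, y) <- Edge(x, y).
            // TC(x, z) <- TC(x, y), TC(y, z).
            let tc = edges.iterate(|tc| {
                let edges = edges.enter(&tc.scope());
                tc.map(|(x, y)| (y, x))   // key the left TC atom by y
                    .join(&tc)             // TC(x, y) joined with TC(y, z) on y
                    .map(|(_y, (x, z))| (x, z))
                    .concat(&edges)        // plus the base rule
                    .distinct()            // set semantics
            });
            tc.inspect(|x| println!("tc change: {:?}", x));

            handle
        });

        // Ground facts are fed as differences; retractions are simply
        // differences with negative multiplicity (edges.remove(..)).
        edges.insert((1, 2));
        edges.insert((2, 3));
        edges.advance_to(1);
        edges.flush();
    })
    .expect("timely computation failed");
}
```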
Differential Substitution-Based Method
We now present a translation of Algorithm 1 to DD, emulating the sequential iteration over each rule's body with relational joins; notably, all relational algebra operators are available through a map-reduce-like API.
Figure 1 depicts the substitution-based method as a dataflow. Superscripts denote points of the dataflow that require further explanation. Furthermore, for clarity, we establish the shape of the data and the meaning of the Var suffix, which both facts and substitutions eventually take up. A variable is used to express recursive or iterative computations: it allows one to define iterative operations and data dependencies in the dataflow graph, enabling the system to track and propagate changes across iterations efficiently with product timestamps. Each node represents an operation, such as join_map, which joins indexed collections and then applies a mapping function to the join output, or flat_map, which takes a function that outputs an iterable, applies it over a collection, and flattens each element's output into a single collection.
We also note that this is a summarized description, where certain trivial or too implementation-specific parts have been omitted. The substitutions input is the stream of empty substitutions, indexed per rule identifier, which is pre-populated with one empty substitution per rule; we assume that rules have a unique identifier. Facts is the relation-indexed stream of facts, and Rules is the stream of rules, with two indexes, created with the operations marked with superscripts 1 and 2.
- 1.
The first rule index indexes rules first by their identifier and then by each of their body atoms, enumerated sequentially, imposing the same order of evaluation as the original algorithm.
- 2.
The second rule index indexes rules by identifier and body size, which is necessary to ensure that only substitutions that have been exhaustively expanded are considered for application to the rule head.
- 3.
In the first join, the function applied is one that applies substitutions to the input atoms, therefore either creating new atoms with fewer variables as terms or reproducing the very same ones. This is equivalent to the setup necessary for step 1 of Algorithm 1 to occur, making use of index 1.
- 4.
The next join creates new substitutions based on the newly minted atoms. All current substitutions attempt to expand further, with the successful ones being emitted from the join.
- 5.
This is the last step of the algorithm, where all final substitutions are applied to the head of each rule, via index 2, to then create new ground facts.
With the dataflow specified, in the next section we elaborate on the commonalities and differences with the other implementations.
5. System
In this section, we provide a technical overview of the implemented reasoners and what is shared between them, alongside a novel indexing technique for the substitution-based method, which, at the cost of increased memory usage, can significantly decrease the number of times the most frequent operation, substitution extension, is performed.
The reasoner that uses the substitution-based method without DD is named Chibi; differential is the one that does. Both reasoners share the implementation of the three core elements: unification, substitution application, and checking that a fact is ground. All of the aforementioned operations are trivial, and none requires more than ten or so lines of code. Unification is a computationally cheap operation: given an atom and a ground fact, the output is a new substitution that maps the variables of the former to the constants of the latter. The others are self-descriptive, with substitution application merely substituting an atom's variables by their mapped constants, and checking whether a fact is ground amounting to ensuring that none of its terms are variables.
Chibi, differential, and relational all share the same memory layout for the core elements of datalog and for storage. In Rust terms, all referenced data structures are standard library implementations unless stated otherwise. Furthermore, a step of rule application is always performed in parallel. A sketch of these types in Rust is given after the list below.
Constant: an enumeration of boolean, 64-bit integer, or string typed values
Variable: an 8-bit integer, hence imposing a bound on the number of variables that a rule can have
Term: an enumeration of constant and variable
Atom: a struct with a vector of terms and a symbol that can be either a 64-bit integer or a string
Rule: a struct with an atom representing the head and a vector of atoms as the body
Storage: a hash map of hash sets, with keys representing relation names, or ids, and the respective hash sets containing vectors of typed terms, the ground facts
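A sketch of how these core types could look in Rust, following the layout described above (names and details are illustrative, not the exact implementation):

```rust
use std::collections::{HashMap, HashSet};

// Typed constant values: boolean, 64-bit integer, or string.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum Constant {
    Bool(bool),
    Int(i64),
    Str(String),
}

// Variables are 8-bit integers, bounding the number of variables per rule.
type Variable = u8;

// A term is either a constant or a variable.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum Term {
    Constant(Constant),
    Variable(Variable),
}

// The relation symbol can be either a 64-bit integer id or a string.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum Symbol {
    Id(u64),
    Name(String),
}

// An atom: a relation symbol and a vector of terms.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Atom {
    symbol: Symbol,
    terms: Vec<Term>,
}

// A rule: a head atom and a vector of body atoms.
#[derive(Clone, Debug)]
struct Rule {
    head: Atom,
    body: Vec<Atom>,
}

// Storage: relation symbol -> set of ground facts (vectors of constants).
type Storage = HashMap<Symbol, HashSet<Vec<Constant>>>;
```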
The relational reasoner has one extra data structure, a B-tree index, which is used for sort-merge joins. Relational relies on naively translating datalog rules into relational algebra without any further optimizations, aside from inserting all data that is to be joined into its index right before actually doing so. All relational operations and their evaluators were implemented from scratch. The point of this reasoner is to evaluate how performant the popular relational algebra evaluation approach can be in isolation, compared to the often forgotten substitution-based method.
Rule application until the least fixpoint is reached is performed with semi-naive evaluation [21], via a program transformation. DRED is implemented as described in [2], in two steps, with both the overdeletion and the alternative derivation program also being executed with semi-naive evaluation. Both Chibi and relational use the same function for this, with differential evidently using neither semi-naive evaluation nor DRED, given that it has its own iteration mechanism, heavily inspired by semi-naive evaluation, which already handles retractions.
Demand-Driven Multiple-Column-Based Indexing
The substitution method has a potentially very large performance cost, exemplified in the specific scenario of DRED, which could render it unusable in practice. As introduced, substitutions are incrementally expanded and built anew by iterating over every single body atom.
In the second step of DRED, an alternative derivation program is created. This program has one extra body atom, representing overdeletions of the head's relation. This implies that this step could be prohibitively more expensive to evaluate than the original program itself, due to the cartesian nature of the unification step, which implies iterating over the knowledge base once for every atom. This inefficiency can be demonstrated with the following example, in which the rule $+R(?x, ?z) \leftarrow -R(?x, ?z), T(?x, ?y), T(?y, ?z)$ could be seen as the alternative derivation step of the rule $R(x, z) \leftarrow T(x, y), T(y, z)$, with $-R$ representing the overdeletion estimation from the previous step.
Let the fact store contain $-R(a, c)$, $-R(b, d)$, $T(a, b)$, $T(b, c)$, and $T(c, d)$; Algorithm 1 will then have three iterations:
- 1.
- (a)
Current body atom: $-R(?x, ?z)$; current substitutions: $\{\sigma_\emptyset\}$
- (b)
Fresh atoms-Applying all substitutions to $-R(?x, ?z)$ yields $-R(?x, ?z)$
- (c)
Substitution extension:
- i.
unification: $-R(?x, ?z) \cup -R(a, c) = \{?x \mapsto a, ?z \mapsto c\}$
- ii.
unification: $-R(?x, ?z) \cup -R(b, d) = \{?x \mapsto b, ?z \mapsto d\}$
- 2.
- (a)
Current body atom: $T(?x, ?y)$; current substitutions: $\{?x \mapsto a, ?z \mapsto c\}$, $\{?x \mapsto b, ?z \mapsto d\}$
- (b)
Fresh atoms-Applying all substitutions to $T(?x, ?y)$ yields $T(a, ?y)$, $T(b, ?y)$
- (c)
Substitution extension:
- i.
unification: $T(a, ?y) \cup T(a, b) = \{?x \mapsto a, ?y \mapsto b, ?z \mapsto c\}$
- ii.
unification: $T(a, ?y) \cup T(b, c) =$ none
- iii.
unification: $T(a, ?y) \cup T(c, d) =$ none
- iv.
unification: $T(b, ?y) \cup T(a, b) =$ none
- v.
unification: $T(b, ?y) \cup T(b, c) = \{?x \mapsto b, ?y \mapsto c, ?z \mapsto d\}$
- vi.
unification: $T(b, ?y) \cup T(c, d) =$ none
- 3.
- (a)
Current body atom: $T(?y, ?z)$; current substitutions: $\{?x \mapsto a, ?y \mapsto b, ?z \mapsto c\}$, $\{?x \mapsto b, ?y \mapsto c, ?z \mapsto d\}$
- (b)
Fresh atoms-Applying all substitutions to $T(?y, ?z)$ yields $T(b, c)$, $T(c, d)$
- (c)
Substitution extension:
- i.
unification: $T(b, c) \cup T(a, b) =$ none
- ii.
unification: $T(b, c) \cup T(b, c) = \{?x \mapsto a, ?y \mapsto b, ?z \mapsto c\}$
- iii.
unification: $T(b, c) \cup T(c, d) =$ none
- iv.
unification: $T(c, d) \cup T(a, b) =$ none
- v.
unification: $T(c, d) \cup T(b, c) =$ none
- vi.
unification: $T(c, d) \cup T(c, d) = \{?x \mapsto b, ?y \mapsto c, ?z \mapsto d\}$
With the final substitutions being $\{?x \mapsto a, ?y \mapsto b, ?z \mapsto c\}$ and $\{?x \mapsto b, ?y \mapsto c, ?z \mapsto d\}$, two atoms are therefore inferred: $+R(a, c)$ and $+R(b, d)$. The major source of inefficiency is the calls to unification that yield no new substitution, whose number could grow quadratically with each additional body atom. The solution to this issue is straightforward: avoid the cartesian product. We devise a novel indexing technique specifically tailored to be portable to DD.
Returning to the example, it is trivial to see that wasteful unification attempts can be prevented by joining on bindings. If $T(a, ?y)$ is the left-hand side of a unification, and $T(a, b)$, $T(b, c)$, and $T(c, d)$ are the candidates, no candidate that does not already match all constants in $T(a, ?y)$ can produce a substitution extension.
We name our approach Demand-driven Multiple-column-based Indexing because indexes are built on demand to address the need for indices when joining substitutions, which can be over multiple constants and therefore span multiple columns, in each iteration. For each rule, we determine the column combinations that will be used in such a join and maintain one globally shared index for each unique column combination. First, we demonstrate the technique over the same example and then provide a new version of Algorithm 1.
- 1.
- (a)
Current body atom: $-R(?x, ?z)$; current substitutions: $\{\sigma_\emptyset\}$
- (b)
Fresh atoms-Applying all substitutions to $-R(?x, ?z)$ yields $-R(?x, ?z)$
- (c)
Index 1-Index all fresh atoms with the positions of their constant terms as keys: $[\,] \mapsto \{-R(?x, ?z)\}$
- (d)
Index 2-Index $-R$ based on all distinct values of the column keys of index 1: $(\,) \mapsto \{-R(a, c), -R(b, d)\}$
- (e)
Join Index 1 with Index 2:
- i.
$(-R(?x, ?z), -R(a, c))$
- ii.
$(-R(?x, ?z), -R(b, d))$
- (f)
Attempt to unify:
- i.
unification: $-R(?x, ?z) \cup -R(a, c) = \{?x \mapsto a, ?z \mapsto c\}$
- ii.
unification: $-R(?x, ?z) \cup -R(b, d) = \{?x \mapsto b, ?z \mapsto d\}$
- 2.
- (a)
Current body atom: $T(?x, ?y)$; current substitutions: $\{?x \mapsto a, ?z \mapsto c\}$, $\{?x \mapsto b, ?z \mapsto d\}$
- (b)
Fresh atoms-Applying all substitutions to $T(?x, ?y)$ yields $T(a, ?y)$, $T(b, ?y)$
- (c)
Index 1-Index all fresh atoms with the positions of their constant terms as keys: $[0] \mapsto \{T(a, ?y), T(b, ?y)\}$
- (d)
Index 2-Index $T$ based on all distinct values of the column keys of index 1: $(a) \mapsto \{T(a, b)\}$, $(b) \mapsto \{T(b, c)\}$
- (e)
Join Index 1 with Index 2:
- i.
$(T(a, ?y), T(a, b))$
- ii.
$(T(b, ?y), T(b, c))$
- (f)
Attempt to unify:
- i.
unification: $T(a, ?y) \cup T(a, b) = \{?x \mapsto a, ?y \mapsto b, ?z \mapsto c\}$
- ii.
unification: $T(b, ?y) \cup T(b, c) = \{?x \mapsto b, ?y \mapsto c, ?z \mapsto d\}$
- 3.
- (a)
Current body atom: $T(?y, ?z)$; current substitutions: $\{?x \mapsto a, ?y \mapsto b, ?z \mapsto c\}$, $\{?x \mapsto b, ?y \mapsto c, ?z \mapsto d\}$
- (b)
Fresh atoms-Applying all substitutions to $T(?y, ?z)$ yields $T(b, c)$, $T(c, d)$
- (c)
Index 1-Index all fresh atoms with the positions of their constant terms as keys: $[0, 1] \mapsto \{T(b, c), T(c, d)\}$
- (d)
Index 2-Index $T$ based on all distinct values of the column keys of index 1: $(b, c) \mapsto \{T(b, c)\}$, $(c, d) \mapsto \{T(c, d)\}$
- (e)
Join Index 1 with Index 2:
- i.
$(T(b, c), T(b, c))$
- ii.
$(T(c, d), T(c, d))$
- (f)
Attempt to unify:
- i.
unification: $T(b, c) \cup T(b, c) = \{?x \mapsto a, ?y \mapsto b, ?z \mapsto c\}$
- ii.
unification: $T(c, d) \cup T(c, d) = \{?x \mapsto b, ?y \mapsto c, ?z \mapsto d\}$
From this new example, it can be seen that the indexing scheme is relatively simple, relying on creating new indices that allow unification to never occur wastefully. We now structure it as Algorithm 2.
Let $\mathit{pos}$ be a function mapping an atom to the array of integer positions of the constants within the atom's terms, and let $\mathit{proj}$ be another function, which maps an array of integer positions and an atom to the subset $C$ of the atom's terms $c$ at those positions.
The algorithm relies on two main indexes (a code sketch follows the list):
- 1.
An index from each column combination $\mathit{pos}(a)$ to $F_{\mathit{pos}(a)}$, where $F_{\mathit{pos}(a)}$ is the subset of $F$ such that all its atoms $a$ share the same $\mathit{pos}(a)$ value.
- 2.
A nested index from each column combination $\mathit{pos}(a)$ and each projection $\mathit{proj}(\mathit{pos}(a), a)$ to $F_{\mathit{proj}}$, where $F_{\mathit{proj}}$ is the subset of $F$ such that all its atoms $a$ share the same $\mathit{proj}(\mathit{pos}(a), a)$.
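A minimal Rust sketch of these two indexes, consistent with the worked example above (types, names, and the on-demand construction are illustrative rather than the exact implementation):

```rust
use std::collections::HashMap;

// Illustrative types: a column combination is the sorted positions of the
// constant terms of a fresh atom; a fact is a vector of constants.
type Symbol = String;
type Constant = String;
type ColumnCombination = Vec<usize>;
type Fact = Vec<Constant>;
// A partially ground atom: constants at some positions, variables (None) elsewhere.
type PartiallyGroundAtom = Vec<Option<Constant>>;

// Index 1: (relation symbol, constant positions) -> fresh atoms sharing them.
type Index1 = HashMap<(Symbol, ColumnCombination), Vec<PartiallyGroundAtom>>;
// Index 2: (relation symbol, constant positions) -> (constants at those
// positions -> facts whose projection equals those constants).
type Index2 = HashMap<(Symbol, ColumnCombination), HashMap<Vec<Constant>, Vec<Fact>>>;

// Build index 2 on demand, only for the column combinations requested by
// index 1, so fresh atoms only ever meet facts that agree on all constants.
fn build_index2(
    facts: &HashMap<Symbol, Vec<Fact>>,
    requested: &[(Symbol, ColumnCombination)],
) -> Index2 {
    let mut index = Index2::new();
    for (symbol, columns) in requested {
        let entry = index.entry((symbol.clone(), columns.clone())).or_default();
        for fact in facts.get(symbol).into_iter().flatten() {
            // Project the fact onto the requested columns; facts sharing the
            // same projection land in the same bucket, avoiding the cartesian
            // product of the unindexed algorithm.
            let key: Vec<Constant> = columns.iter().map(|&c| fact[c].clone()).collect();
            entry.entry(key).or_default().push(fact.clone());
        }
    }
    index
}
```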
Algorithm 2: Substitution-based Immediate Consequence with Demand-driven Multiple-column-based Indexing.
All indexing steps are linear in time and data complexity, save for index two, which has a worst-case data complexity of $O(|F| \cdot 2^{|a|})$, with $2^{|a|}$ representing the powerset of the number of terms of an atom $a$, and $F$ containing only atoms of that shape. The product with the powerset arises because indexing maps all unique combinations of constant terms of fresh atoms, which, in the worst case, can be exponential in the arity.
Figure 2 displays the DD version of Algorithm 2, which mostly remains the same, save for new operations happening in the phase before iteration. We now clarify the points of interest in the new dataflow. There are no differences in the steps inside the iteration, aside from joins being keyed by the vector of constant positions and the relation symbol instead of only the relation symbol.
- 1.
The first
map operator remains the same, indexing rules by their identifier and body size, used to ensure that only fully expanded substitutions will be applied to rule heads. The same as superscript 2 in
Figure 1.
- 2.
The unique column combinations of the input ruleset are computed by this operator.
- 3.
This step joins the rule identifiers with the unique column combinations. This is only used at the very last join during iteration, to ensure that the output fact is indexed by the correct column combination.
- 4.
Equivalent to superscript 1 in
Figure 1.
- 5.
With superscript 2, the input fact stream can be immediately indexed by the necessary constant position combinations. This is performed by a join on relation symbol, which will index each fact by all column combinations.
- 6.
Facts.var, unlike in Algorithm 1's dataflow, where it was only indexed by relation, is now indexed by each unique column combination.
This dataflow is possibly much more efficient. An arrangement in DD is a pre-computed, indexed representation of a collection that allows for efficient querying and manipulation of the data. These arrangements play a crucial role in the performance of joins. By carefully choosing which arrangements to create and maintain, it is possible to keep joins efficient without unnecessarily wasting memory.
Most specifically, arrangements dictate the level of join efficiency. The fact that the join operator indexes the data by a more fine-grained key than the relation symbol, such as the relation symbol together with the positions occupied by constant values, allows it to be much more restrictive than a cartesian product.
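As an illustration, a minimal sketch (assuming the `timely` and `differential-dataflow` crates; key types and names are illustrative, not the system's exact ones) of arranging a fact collection once by such a fine-grained key and reusing the arrangement in a join:

```rust
use differential_dataflow::input::Input;
use differential_dataflow::operators::arrange::ArrangeByKey;
use differential_dataflow::operators::JoinCore;

// Key: (relation symbol, positions of constant terms, constants at those positions).
type Key = (String, Vec<usize>, Vec<String>);

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        let (mut facts, mut atoms) = worker.dataflow::<u32, _, _>(|scope| {
            let (facts_in, facts) = scope.new_collection::<(Key, Vec<String>), isize>();
            let (atoms_in, atoms) = scope.new_collection::<(Key, u32), isize>();

            // Arrange the fact stream once by the fine-grained key; every join
            // that uses this key reuses the same in-memory arrangement.
            let facts_by_key = facts.arrange_by_key();

            // Partially ground atoms (tagged with a rule id) only meet facts
            // that already agree on every constant position: no cartesian product.
            atoms
                .arrange_by_key()
                .join_core(&facts_by_key, |_key, rule_id, fact| {
                    Some((*rule_id, fact.clone()))
                })
                .inspect(|x| println!("unifiable pair: {:?}", x));

            (facts_in, atoms_in)
        });

        // T(a, b), indexed by its first column holding the constant "a".
        facts.insert((("T".into(), vec![0], vec!["a".into()]), vec!["a".into(), "b".into()]));
        // A fresh atom T(a, ?y) from rule 1, sharing the same key.
        atoms.insert((("T".into(), vec![0], vec!["a".into()]), 1));
        facts.advance_to(1); facts.flush();
        atoms.advance_to(1); atoms.flush();
    })
    .expect("timely computation failed");
}
```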
6. Evaluation
Three thorough experiments were conducted to showcase the relative performance, scalability, and memory usage of all reasoners, with a twofold intent: to evaluate the performance characteristics of DD in isolation from virtually all other elements, and to establish whether general algorithmic improvements, such as the demand-driven indexing scheme, are portable to DD.
Setup. The experiments were run on a Google Cloud-provisioned x86 machine of type e2-standard-16, with 16 Intel Skylake cores and 64 gigabytes of RAM. Each benchmark measurement was taken 70 times, with the 20 measurements of highest variance removed and the rest averaged. All datasets, datalog programs, and reasoner implementations are available online [22].
Datasets. In Table 2, all datasets and program names, or acronyms, are shown. There are two areas of interest. The semantic web has very specific use cases for datalog and is the leading source of research on extending the datalog mathematical formalism and on improving decades-old algorithms such as DRED, for example with the backward-forward algorithm [23]. Seeking ways to introduce tuple-generating dependencies into programs while keeping evaluation tractable has been one of the most active research directions, with highly influential papers establishing new families of datalog languages [24] and thoroughly exploring their complexity classes alongside even further extensions [25,26,27]. These advancements have been somewhat tested in practice, albeit with no full reference implementation having been specified; the most comprehensive and recent one is closed-source [28]. The leading datalog engine in general is also closed-source [9] and is tailored specifically to the semantic web.
The second area of interest is purely mathematical synthetic graph benchmarks, which allow for generating infinitely scalable specific graph structures. All datasets, however, including LUBM [29], are synthetic, with the difference being that there are multiple specific programs for RhoDFS.
LUBM is a classic inference benchmark dataset for both the RhoDFS and OWL2RL rulesets. The data are divided into two parts: the TBox, the terminological box, which holds an ontology able to describe universities, and the ABox, the assertional box, which asserts facts about universities using the terminology in the TBox. The RhoDFS ruleset, depicted in Program A1, is relatively simple yet challenging in that there is only a single relation, which is mutually recursive in every single rule. RhoDFS-s, Program A2, is an improved version of RhoDFS that creates new relations for every constant combination in the original program, avoiding every body atom implying a scan of the full dataset and mimicking relational selection. The last ruleset, OWL2RL, has over 100 rules and is by far the most complex, representing the lower bound of OWL2RL implications specific to the LUBM TBox. More information on converting description logic entailments to datalog can be found in [30].
RMAT1k is a graph generated by the rmat profile of the GT [31] graph generator, used to benchmark various other reasoners [10,13]. The dataset is a graph with ten times as many edges as vertices, following an inverse power-law distribution.
RAND1k is also a graph, generated with the rand profile of GT. The dataset comprises a graph with one thousand nodes, each pair having a 0.01 probability of being connected. In spite of having a small number of nodes, it is incredibly dense, with the output of the transitive closure program having almost a hundred times more edges than the initial graph.
6.1. Runtime Comparison
Table 3 presents the main benchmark, in which three measurements, Mat, +, and −, are recorded for every batch size. All measurements are in seconds. If the batch size is 75%, then Mat is the time taken to materialize 75% of the data using regular semi-naive evaluation, + is how long the incremental materialization of the remaining 25% of the data, also using semi-naive evaluation, took, and − is how long DRED took to delete the 25% that had been added. This provides a comprehensive and thorough overview of the performance of DRED and semi-naive evaluation compared to differential dataflow, which offers an alternative to both.
Notably, the selection of facts in + and − can dramatically influence the performance of both DRED and DD. However, conducting extensive performance estimations by running the algorithms on numerous random subsets of the data is impractical due to the extensive duration required to run the entire benchmark, coupled with the factorial number of possible permutations. Thus, we chose random subsets containing 50%, 25%, 10%, 1%, and 0.1% of the original data as update sizes.
We discuss the table over each dataset and its respective programs. First, for LUBM under the rdfs program, all differential reasoners exhibit a clear trend of decreasing update computation times as the batch size increases, with the indexed differential variant performing much better in general, up until updates get very small, possibly indicating that, at this level, indexing starts to have too big of an overhead. In the case of all other reasoners, the trend is very different, with update times not decreasing, curiously save for chibi, which is orders of magnitude slower than all other reasoners. This is unsurprising due to the very strong degree of recursiveness of the program, showcasing that neither DRED nor semi-naive evaluation provides significant speedups over rematerialization, with the best result among them being updates and deletions that, in spite of being constant, are up to 40% faster.
All reasoners perform significantly better on rdfs-s, indicating the importance of the program. Chibi's pathological performance issue is entirely gone with the new program, and its performance discrepancy with the other reasoners is almost eliminated, save for deletions, which remain several times slower than rematerialization.
In both the RAND-1k and RMAT-1k datasets, all differential reasoners consume at least twice as much memory as all other reasoners while performing similarly for initial materialization runtime. This posits an interesting counterpoint to the dominance in both memory usage and runtime shown with more complex programs. The reason for this discrepancy is that the TC program has a very large number of iterations, therefore causing significantly greater flux in the dataflow, and since each iteration implies new differences being stored, memory usage can grow at a fast pace.
In the most complex program, owl2rl, both chibi and the indexed differential reasoner are not able to finish materialization, with the former taking more than 1000 s and the latter exceeding 64 gigabytes of RAM. The plain differential reasoner performs in the same manner as in the previous programs, with decreasing update times and symmetry between additions and deletions. Both the indexed chibi and rel exhibit decreasing deletion reasoning times in aggressive cliffs, with a small decrease for additions.
The transitive closure program is simple and linear, therefore being embarrassingly simple to incrementalize. For the RAND-1k dataset, the differential reasoners once again perform in the same manner, with incremental behavior scaling linearly with the size of the data. The same behavior is shown by all other reasoners, with the caveat that DRED only starts to be competitive once the update size is less than 10% of the original data. For RMAT-1k, reasoning times are much longer, showcasing a significantly more complex dataset, with all non-differential reasoners struggling to provide proportional update times save for update sizes of less than 1%.
In sum, diff and its indexed variant performed predictably irrespective of the dataset and program being run, always being faster and having proportionally decreasing reasoning times for updates while remaining symmetric. All other reasoners did not show the expected incremental behavior, for either semi-naive evaluation or DRED, unless the update size was small, which is not necessarily a hindrance in practice since rarely, if ever, will a system receive an update that is bigger than 10% of the original size of the data.
6.2. Peak Memory Usage Comparison
The results of the previous subsection cannot be seen in an entirely positive light without consideration of memory usage. DD relies on multiple in-memory indexes to keep track of all changes and, as was seen, it entirely failed one benchmark due to running out of memory; thus, in this section, we analyze the results of measuring peak memory usage over the previous experiments.
Table 4 presents the peak memory usage for each of the methods and programs across different datasets. Memory usage is presented in megabytes. LUBM1 occupies 20 megabytes of disk space, while RAND-1k and RMAT-1k occupy 100 kilobytes each.
For LUBM1, with the rdfs and rdfs-s programs, all reasoners performed comparably with respect to memory usage; however, as seen in the previous table, there are major differences in runtime performance between them, the most extreme example being chibi, which is over 1000× slower than its indexed variant while using almost 50% more memory. Interestingly, diff performed significantly better for the owl2rl program, consuming 100 times less memory than chibi and rel. It is likely that this is due to the aforementioned aggressive compaction mechanism of the in-memory LSM trees. Notably, the indexed version of diff ran out of memory (OOM) for this program, indicating possible limitations of the indexing method for handling complex queries over large datasets; this is conversely not true for the indexed non-DD reasoner, making it an issue of the DD implementation in itself.
While there are major differences in runtime among all reasoners, with some being orders of magnitude faster, the same cannot be said about memory usage: save for a very large program, there are no clear winners, implying that the memory requirements of DD in itself are not greater than those of regular reasoners, save for highly iterative dataflows, and remain proportional to the computation. The starkest example of this is the owl2rl program, which, in spite of containing over a hundred rules, does not output much more data than rdfs/rdfs-s.
7. Conclusions
In this article, we introduced a novel datalog reasoner with two different algorithms, whose core value proposition is the use of the promising but relatively obscure DD model of computation, and evaluated it against two other reference implementations that share as many components as reasonable. The obscurity of the differential dataflow model could stem from its specialization, research-oriented nature, learning curve, limited community support, niche use cases, documentation gaps, and relatively recent emergence in the field of data processing. We also described an indexing method that significantly speeds up an often-overlooked way of implementing reasoning, the substitution method, and that was shown to solve many pathological performance issues in the benchmarks at very little cost in extra memory.

In all experiments, the DD-based reasoners bested their non-differential counterparts, showing unparalleled scalability over increasing update sizes alongside virtually no performance difference between additions and retractions, while remaining competitive in memory usage. The proposed interpreter nonetheless has limitations related to handling deletions, performance under particular scenarios, scalability, and complexity, which need to be considered when evaluating its suitability for specific applications and use cases; moreover, the comparison presented here is restricted to our own prototypes rather than other tools, a gap we are actively addressing in follow-up work.

There are multiple ways in which this work could be expanded in the future, such as porting it to support negation and more expressive variants of datalog and, most importantly, making it distributed, which DD provides out of the box. Future work also includes a more thorough evaluation of the high-level point that monotonic aggregation is more powerful than merely ascending the powerset lattice.