Article

Modeling the Data Provenance of Relational Databases Supporting Full-Featured SQL and Procedural Languages

1 School of Software Engineering, South China University of Technology, Guangzhou 510006, China
2 China Electronic Product Reliability and Environmental Testing Research Institute, Guangzhou 510610, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(1), 64; https://doi.org/10.3390/app13010064
Submission received: 18 October 2022 / Revised: 11 December 2022 / Accepted: 16 December 2022 / Published: 21 December 2022

Abstract: Data provenance is information about where data come from (provenance data) and how they are transformed (provenance transformation). It is widely used to evaluate data quality, trace errors, audit data, and understand the references among data. Current studies on data provenance in relational database management systems (RDBMS) still have limitations in supporting full-featured SQL or procedural languages. With these challenges in mind, we present a formal definition of provenance data and provenance transformation for relational data. We then propose a solution to support data provenance in relational databases, including provenance graphs and provenance routes. Our method not only solves the complicated problem of modeling provenance in a DBMS but is also capable of covering the procedural language extensions of SQL. We also present ProvPg, a PostgreSQL-based prototype database system supporting data provenance at multiple granularities. ProvPg implements the extraction, calculation, query, and visualization of provenance. We run the TPC-H benchmark on ProvPg and PostgreSQL, respectively. Experimental results show that ProvPg supports data provenance with little extra computational overhead for the execution engine, indicating that our model is applicable to lineage tracing applications.

1. Introduction

Most data in applications are generated by a series of computations and transformations on native data, making it difficult to trace their source. For example, SwissProt and Wikipedia are curated databases whose contents are derived or edited from other sources, including other curated databases. However, applications such as scientific computing [1], data warehousing [2], grid computing [3], curated databases [4], data flow management [5], and copyright management [6] are very concerned with data provenance, i.e., the origin, lineage, or source of data, which is essential for assessing data quality and for understanding the lineage of and dependencies between data.
In relational databases, data are represented, in terms of granularity, as relational tables, fields, tuples, or attribute values. The creation and transformation of data depend on operations in the database management process, including the execution of SQL or stored procedural languages and the import of external data. Adding the following functions to a database management system (DBMS) helps improve the usability of data provenance support in the DBMS: (i) extracting provenance information (including provenance data and provenance transformation) during the data management process; (ii) supporting data provenance information at different levels of granularity; and (iii) supporting full-featured SQL, stored procedures, and even other scripting languages for data processing in real-life application scenarios. All of these functions play a crucial role in data interpretation, data debugging, and troubleshooting, and thus efficiently enhance the data analysis capabilities of the database system.
However, existing research on relational data provenance is insufficient for supporting provenance in a DBMS. The main challenges are: (i) Few studies support relational data provenance at multiple granularities simultaneously. Current provenance systems capture provenance information either at coarse granularity (schema-based) or at fine granularity (tuple-based or value-based), and the coexistence of multi-granularity provenance in the same system lacks efficient models and methods. For example, many workflow systems supporting data provenance attach provenance information to coarse-grained files [7,8,9,10], while other customized systems attach provenance information to fine-grained attribute values [11,12,13]. (ii) Tuple-level and value-level data provenance attract more research than data transformation [14,15]. Despite their significant application value, systems based on these studies have limitations in many respects, for example, the difficulty of comprehensively understanding data provenance relations (especially the data transformation process). (iii) Few database systems provide support for data provenance, the exceptions being DBNotes [14], pSQL [16], ORCHESTRA [17], and ProvSQL [18]. Furthermore, a large percentage of studies are based on lineage [19,20], why-provenance, where-provenance [21], how-provenance [22], and other theories. For the convenience of theoretical research, they are confined to simple SQL, whereas practical application scenarios involve much more complex SQL. Glavic [23,24] proposed a provenance representation model of perm-provenance and perm-transformation on the basis of lineage, which can support complex SQL queries. However, the model relies significantly on SQL syntax structure, making the transformation representation too redundant and complicated for understanding the dependencies between data or for extension to languages other than SQL.
To address the above challenges, we present a provenance model that fully supports full-featured SQL and procedural query language. Our model describes data provenance information at multiple levels of relation tables, fields, tuples, and values, especially the provenance transformation process.
The main contributions of this paper are summarized as follows:
  • A novel method and model of data provenance are presented. The model describes data provenance information at multiple levels, including relational table schema, attribute fields, tuples, and values. Our model investigates provenance transformation, which has been neglected in previous research.
  • Data provenance supporting full-featured SQL and procedural languages is implemented. We combine basic operations, such as the projection and filter to represent provenance transformation, making it possible to support full-featured SQL or even procedural languages.
The rest of this paper is organized as follows. The next section introduces related research on data provenance techniques, Section 3 discusses the semantic representation of the relational data provenance, Section 4 describes the relational data provenance model, and Section 5 concludes the paper.

2. Related Research

Wang et al. [25] first proposed the concept of data provenance and applied it to heterogeneous databases. Earlier research focused mainly on fine-grained (tuple, value) provenance. Annotation propagation [26,27] and inverse operations were the typical methods used to track data provenance. Woodruff et al. [28] obtained provenance data by inverting the original query to solve the problems of data volume expansion and computational complexity. Later, Cui et al. [19,20] proposed the lineage approach, which utilized reverse queries to compute tuple-level provenance data and gave the first standardized definition of tuple-level data provenance in relational databases. Based on lineage, Buneman et al. [21] presented tuple-, field-, and value-level provenance representations, called why-provenance and where-provenance, to track data provenance from different perspectives. Green et al. [22] introduced a "provenance semiring" structure of provenance representation, called how-provenance. How-provenance canonically explains the generation of tuples and plays a vital role in uncertain databases and in computing data credibility. Glavic [23,24] proposed perm-provenance on the basis of lineage, which can adapt to complex SQL queries; in the meantime, perm-transformation was proposed, with a relational expression tree structure, to describe and compare provenance transformations. The utilization of a rewrite mechanism in provenance queries allows the query engine to compute provenance data and query results simultaneously, with optimal utilization of the computation engine and optimized calculation efficiency. Lineage, why-, where-, and how-provenance, together with extensions and optimizations of previous research [17,29], form the theoretical foundation for subsequent provenance models and applications.
The typical applications of data provenance are: (1) Orchestra [30] is a collaborative data-sharing system based on how-provenance in a P2P environment, which takes advantage of the characteristics of provenance transformation to synchronize data on different ends. (2) DBNotes [14], a system based on where-provenance, tracked provenance information by attaching zero or more notes to every value in a relation and propagating the attached notes to the result of queries. (3) Based on how-provenance, the Trio system [15] quantitatively calculates the credibility of data in the uncertainty database by combining and distilling data provenance and uncertainty theory. (4) SPIDER [31] proposes a method of “Route” as a description of the relationship between source and target data with schema mapping, tracking the transformation process of the target data, which is very useful for understanding and exploring schema mappings.
In recent years, most research has built on previous studies and applied them to business contexts. Specific research directions include data provenance for distributed systems [32,33,34,35,36] and the compaction of data provenance information with full-featured query languages [37,38,39,40,41,42]. Studies on full-featured query languages are intimately related to our study, and we highlight them below. ProvenanceCurious [43] is a tool for inferring data provenance in scripting languages; it presents a provenance graph representing a view of data dependencies, which helps with debugging. This tool is implemented only with Python as the scripting language, without further details or a formal definition of the transformation process. Inspired by the ideas of program slicing and abstract interpretation, Müller et al. [44] proposed an approach that translates the query language into an abstract Turing-complete language (similar to Python), which can compute and track fine-grained (tuple, value) provenance data based on why- and where-provenance. This method embraces any SQL construct, such as (correlated) subqueries, recursive common table expressions, and built-in and user-defined functions, which is valuable in real-world database scenarios. Dietrich et al. [45] developed observational debuggers for SQL. These studies show that data lineage no longer relies solely on SQL, Datalog, or simple streaming processes, and that support for procedural languages with control structures is an urgent requirement.

3. Semantic Expression of Relational Data Provenance

Relational data provenance refers to the source of relational schemas or relational data (tables, fields, tuples, values) and all operations involved in changing these data during database management. This section first gives a canonical definition of provenance data based on where-prov and extends it to the tuple level, called TupWhere-Prov. Then, we define provenance explaining transformation (PET) to introduce provenance transformation in SQL and procedural languages.

3.1. Contribution Semantics

From the query point of view, relational schemas and relational data are the results of SQL execution in a DBMS, and all inputs are provenance data of the output. Such overly coarse-grained provenance has little significance and requires a large amount of provenance data storage. A more rational approach is therefore to separate, from the input data, the data that actually contribute to the output and treat them as the provenance data. Current provenance research divides provenance by contribution semantics, which, from a high-level perspective, fall into three categories: input contribution semantics, copy contribution semantics, and influence contribution semantics [23]. In input contribution semantics, all inputs are provenance data. In influence contribution semantics, only the inputs that affect the existence of the outputs are provenance data, and missing any of these data items leads to unrecoverable provenance. In copy contribution semantics, only those data items (partial or complete) copied to the output are provenance data. For example, some items are used only for filtering data and are not copied to the output; these items are provenance data under influence contribution semantics but not under copy contribution semantics.
The investigation of the provenance transformation process is the evaluation of the contribution semantics of provenance data. This is difficult in data provenance systems as SQL queries and relational algebra operations are transformation units in the transformation process. Input contribution semantics ignore the internal details of transformations, and thus the process is a black box. Unlike input contribution semantics, copy and influence contribution semantics analyze the details of the modifications, and influence semantics contain richer details than copy semantics. However, the influence semantics are complicated and cumbersome for understanding the relationship between data in a complex query language.
Previous studies use how-prov and perm-transformations to express provenance transformation. However, how-prov's K-relation expression contains little information about the provenance transformation, and the relational algebraic expression of perm-transformations, a tree structure that relies significantly on SQL syntax, is complicated and cumbersome for complex queries. Therefore, this paper proposes provenance explaining transformation (PET) to investigate the transformation process in more detail. PET is built from several basic operations (such as projection and join), providing the ability to describe transformations in SQL and stored procedures. To eliminate the influence of complex language features, such as those of SQL, and to explain the provenance transformation process concisely and clearly, we choose where-prov, a kind of copy contribution semantics, as the theoretical basis of the provenance data and define PET on top of it.

3.2. Terminology and Symbols

In order to facilitate the definition of methods and models, we introduce symbols and the relational algebra used in the subsequent sections.
Let $\varpi$ be a collection of field names (or attribute names) $\{A_1, \ldots, A_n\}$, let $U$ and $V$ be finite subsets of $\varpi$, and let $D_i$ $(i = 1, 2, \ldots, n)$ be a finite domain of attribute $A_i$. A record (or tuple) $t, t_1, t_2, t', u, \ldots$ is a function $U \to \{D_1, \ldots, D_n\}$, written as $(A_1{:}D_1, \ldots, A_n{:}D_n)$. Then $t \cdot A$ is the value of field $A$, and $t[U]$ restricts the tuple $t$ to the field set $U$.
A relation or table $r : U$ is a finite set of tuples over $U$. Let $\mathcal{R}$ be a finite collection of relation names $\{R_1, \ldots, R_n\}$. A schema $R$ is a mapping $(R_1{:}U_1, \ldots, R_n{:}U_n)$ from $\mathcal{R}$ to finite subsets of $\varpi$. A database (or instance) $I : (R_1{:}U_1, \ldots, R_n{:}U_n)$ is a function mapping each $R_i{:}U_i \in R$ to a relation $r_i$ over $U_i$. For a relation table $r_i$ and attribute sets $A, B, C, \ldots \subseteq U_i$, $t[A]$ is the value of tuple $t$ $(t \in r_i)$ on field set $A$, and $r_i[A] = \{t[A] \mid t \in r_i\}$.
We also define tuple locations as tuples tagged with relation names, written as $(R, t)$. Similarly, field locations are written as a triple $(R, t, A)$. If the context is clear, the triple $(R, t, A)$ is simplified to $(R, t[A])$. Moreover, $t = u \cdot v$ denotes the concatenation of tuples $u$ and $v$, the result being tuple $t$.
Constant data involved in SQL, including constants and constant tuples, such as VALUES (101, 'abc'), or data in external text files, are uniformly replaced by the notation $\Box$ to minimize the impact of tedious constant data on the discussion. For instance, $A + func(B, \Box)$ is a simplification of "A + func(B, 1) − 2".
We use the expression $Q(I)$ to denote a query over a database instance $I$; for any output tuple $t$, $t \in Q(I)$. $R(I)$ is a relational table in the instance $I$. In particular, a query is written simply as $Q$ when the instance $I$ is clear from context. In a query $Q = \langle q, \mathbf{R} \rangle$, $\mathbf{R} = \{R_1, \ldots, R_n\}$, $q$ is the language structure corresponding to SQL, a stored procedural language, or another data processing script language, and the $R_i$ are the relational table names involved in the query $Q$. In particular, $R_i(I)$ references only the relational table $R_i$.
We use the following notation for (monotone) relational algebra queries:
$$Q = \{t\} \mid \sigma_\theta(Q) \mid \pi_U(Q) \mid Q_1 \bowtie Q_2 \mid Q_1 \cup Q_2 \mid \alpha_{A,B}(Q)$$
Here, $\{t\}$ is a set of constant tuples, and the remaining symbols denote selection, projection, join (or natural join), union, and the group and aggregation operation, respectively. The group and aggregation operation is represented as $\alpha_{A,B}(Q) = g \cdot aggr(G[B]^*)$, where $g = e(A)$ $(A \in G)$, $e$ is an expression parameterized by the grouping field $A$, $g$ is the set of output tuples, and $aggr(G[B]^*)$ denotes the aggregation function $aggr$ taking as input the values of the tuples in group $G$ over the aggregation field $B$. The final output tuples are the concatenation of the outputs of the calculation expression and the aggregation operation.
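To make this notation concrete, the following minimal sketch (in Python, which we use for all illustrative code below; the class names are ours and not part of the formalism) models the monotone relational algebra as an expression tree, with a Scan node added for base relations:

from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass(frozen=True)
class Query:
    """Base class for monotone relational algebra queries Q."""

@dataclass(frozen=True)
class Scan(Query):            # a base relation R(I)
    name: str

@dataclass(frozen=True)
class Select(Query):          # sigma_theta(Q): filter by predicate theta
    pred: Callable[[dict], bool]
    child: Query

@dataclass(frozen=True)
class Project(Query):         # pi_U(Q): keep the fields in U
    fields: Tuple[str, ...]
    child: Query

@dataclass(frozen=True)
class Join(Query):            # Q1 natural-join Q2
    left: Query
    right: Query

@dataclass(frozen=True)
class Union(Query):           # Q1 union Q2
    left: Query
    right: Query

@dataclass(frozen=True)
class Aggregate(Query):       # alpha_{A,B}(Q): group by A, aggregate B
    group_fields: Tuple[str, ...]
    agg_field: str
    agg: Callable[[list], object]
    child: Query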
To illustrate the semantics of our method with examples, we consider the suppliers–parts–projects database, reproduced for convenience in Figure 1, where $s_i$, $j_i$, $p_i$, and $h_i$ identify the tuples.

3.3. Representation of Provenance Data

Where-prov is a copy contribution semantics describing the relationship between inputs and outputs by defining propagation rules for attribute values under algebraic operations. To support full-featured SQL and procedural language queries, we extend the annotation propagation rules of [46] to the group aggregation operation and formally define where-provenance on the selection, projection, join, union, and group aggregation operations (see Definition 1).
Hereafter, we consider annotated relations, with annotations propagating between locations in the database. A location, denoted by the triple $(R, t, A)$, carries a set of zero or more annotations.
  • Selection. For $t' \in \sigma_\theta(R)$, an annotation on $(R, t, A)$ propagates to $(\sigma_\theta(R), t', A)$ if $t' = t$.
  • Projection. For $t' \in \Pi_U(R)$, an annotation on $(R, t, A)$ propagates to $(\Pi_U(R), t', A)$, where $A \in U$ and $t' = t[U]$.
  • Join. For $t \in R_1 \bowtie R_2$, annotations on $(R_1, t_1, A)$ and/or $(R_2, t_2, A)$ propagate to $(R_1 \bowtie R_2, t, A)$, where $t[U_1] = t_1$ and $t[U_2] = t_2$, and $U_1$ and $U_2$ are the attributes of $R_1$ and $R_2$, respectively.
  • Union. For $t \in R_1 \cup R_2$, annotations on $(R_1, t_1, A)$ and/or $(R_2, t_2, A)$ propagate to $(R_1 \cup R_2, t, A)$, where $t = t_1$ or $t = t_2$.
  • Group Aggregation. For $t \in \alpha_{B,C}(R)$, annotations on $(R, t_1, B)$ or $(R, t_2, C)$ propagate to $(\alpha_{B,C}(R), t, B \cup C)$ if $t[B] = e(t_1[B])$ (or $t[C] = aggr(t_2[C])$), where $e$ and $aggr$ denote a calculation expression and an aggregation function, respectively.
The above rules indicate that annotation propagation is based on copy behavior. If the context is clear, the triple $(R, t, A)$ can be simplified to $(R, t[A])$.
Definition 1 
(Where-prov). Consider a database instance $I$, a query $Q$ on $I$, and $t \in Q(I)$. The where-provenance of the value of $t$ on field $A$ according to $Q$ and $I$, denoted $Where(Q, t[A])$, is defined as follows:
  • Direct copy. $Where(\{t\}, t[A]) = \{t[A]\}$
  • Selection. $Where(\sigma_\theta(Q), t[A]) = Where(Q, t[A])$
  • Projection. $Where(\Pi_B(Q), t[A]) = \bigcup Where(Q, u[B'])$, $u \in Q$, $t[A] = e(u[B'])$, $B' \subseteq B$
  • Join. $Where(Q_1 \bowtie Q_2, t[A]) = \begin{cases} Where(Q_1, u[B]), & u \in Q_1(I),\ A = B \in Q_1 \\ Where(Q_2, v[C]), & v \in Q_2(I),\ A = C \in Q_2 \end{cases}$
  • Union. $Where(Q_1 \cup Q_2, t[A]) = \begin{cases} Where(Q_1, u[A]), & u \in Q_1(I) \setminus Q_2(I) \\ Where(Q_2, v[A]), & v \in Q_2(I) \setminus Q_1(I) \\ Where(Q_1, u[A]) \cup Where(Q_2, v[A]), & u = v \in Q_1(I) \cap Q_2(I) \end{cases}$
  • Group Aggregation.
    $Where(\alpha_{B,C}(Q), t[A]) = \begin{cases} Where(Q, u[B']), & B' \subseteq B,\ t[B'] = e(u[B']) \\ Where(Q, u[C']), & C' \subseteq C,\ t[C'] = aggr(G[C']^*),\ u \in G \subseteq Q \end{cases}$
Definition 1 shows that the lineage of values propagates and is output under these rules, making a copy-semantics contribution to the final output value.
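Definition 1 can be read operationally as annotation propagation: evaluate the query while every value carries the set of source locations it was copied from, so that $Where(Q, t[A])$ can be read directly off the result. Below is a minimal sketch over the expression tree above, restricted to bare field copies (expressions $e$ and group aggregation are omitted, and union uses bag semantics); eval_where and the location encoding (relation, row index, field) are our own simplifications:

def eval_where(q, db):
    """Evaluate q over db ({relation: [dict]}); each output field maps to
    (value, locations), where locations is Where(q, t[A]) as a set of
    (relation, row_index, field) source locations."""
    if isinstance(q, Scan):        # direct copy: a base value is its own provenance
        return [{f: (v, frozenset({(q.name, i, f)})) for f, v in row.items()}
                for i, row in enumerate(db[q.name])]
    if isinstance(q, Select):      # selection: annotations pass through untouched
        return [t for t in eval_where(q.child, db)
                if q.pred({f: v for f, (v, _) in t.items()})]
    if isinstance(q, Project):     # projection of bare fields keeps their annotations
        return [{f: t[f] for f in q.fields} for t in eval_where(q.child, db)]
    if isinstance(q, Join):        # natural join: shared fields merge both sides' annotations
        out = []
        for u in eval_where(q.left, db):
            for v in eval_where(q.right, db):
                shared = u.keys() & v.keys()
                if all(u[f][0] == v[f][0] for f in shared):
                    t = {**v, **u}
                    for f in shared:
                        t[f] = (u[f][0], u[f][1] | v[f][1])
                    out.append(t)
        return out
    if isinstance(q, Union):       # bag-semantics union (Definition 1 merges duplicates)
        return eval_where(q.left, db) + eval_where(q.right, db)
    raise NotImplementedError(type(q).__name__)

# Example 1 analogue: Q1 as a natural join on the shared field "city".
db = {"Suppliers": [{"Sno": "S1", "city": "London"}],
      "Projects":  [{"Jno": "J4", "city": "London"}]}
q1 = Project(("Sno", "Jno"), Join(Scan("Suppliers"), Scan("Projects")))
u1 = eval_where(q1, db)[0]
# u1["Sno"][1] == {("Suppliers", 0, "Sno")}; u1["Jno"][1] == {("Projects", 0, "Jno")}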
Example 1. 
Consider a query finding all identifier numbers of suppliers and projects located in the same city. As shown in Figure 2, $Q_1$ is a typical SQL query, and the corresponding relational algebra is $\Pi_{Sno, Jno}(Suppliers\ S \bowtie_{S.city = P.city} Projects\ P)$.
According to Definition 1, $u_1(S1, J4)$ has provenance $Where(Q_1, u_1[Sno]) = \{s_1[Sno]\}$ and $Where(Q_1, u_1[Jno]) = \{j_4[Jno]\}$.
Based on where-prov, it is not difficult to extend to tuple granularity. Intuitively, if a complete or partial copy of tuple $u$ is used to compute the output tuple $t$, then $u$ is provenance data of $t$; in other words, when $u[B]$ is provenance data of $t[A]$, then $u$ is provenance data of $t$.
Definition 2 (TupWhere). 
Consider a tuple $t$; its provenance $TupWhere(Q, t)$ is defined as $TupWhere(Q, t) = \{u \mid u[B] \in Where(Q, t[A]),\ A \in Q,\ B \in u\}$.
According to Definition 2, the provenance of $u_1(S1, J4)$, $u_2(S3, J4)$, and $u_4(S5, J3)$ in the result of $Q_1$ can be represented as $TupWhere(Q_1, u_1) = \{s_1, j_4\}$, $TupWhere(Q_1, u_2) = \{s_3, j_4\}$, and $TupWhere(Q_1, u_4) = \{s_5, j_3\}$, respectively.
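With the annotated evaluation above, Definition 2 becomes a one-line projection from value locations to source tuple locations (again a sketch using our own names):

def tup_where(annotated_tuple):
    """TupWhere: every source tuple that had any value copied into t."""
    return {(rel, idx)
            for (_value, locs) in annotated_tuple.values()
            for (rel, idx, _field) in locs}

# tup_where(u1) == {("Suppliers", 0), ("Projects", 0)},
# mirroring TupWhere(Q1, u1) = {s1, j4}.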
Compared with value-level where-prov, a schema is at a higher level of granularity. Similarly, if tuples of relation table $R_1$ contribute to tuples of relation table $R_2$, we can intuitively infer that the schema $U$ of $R_1$ contributes to the schema $T$ of $R_2$ at the schema level. Consequently, standardized definitions of field-level and schema-level provenance data can be obtained (Definitions 3 and 4).
Definition 3 (FProv). 
The provenance data of field $A$: $FProv(A) = \{B \mid u[B] \in Where(Q_i, t[A])\}$, $i = 1, 2, \ldots, n$, where $Q_i$ is any query containing $A$ in its list of target fields.
Definition 3 states that all fields contributing to the value of some tuple on field $A$ constitute the field-level provenance of field $A$. For example, $FProv(Sno) = \{Suppliers.Sno\}$.
Definition 4 (SProv). 
The provenance data of schema $T$: $SProv(T) = \{R_i \mid u \in TupWhere(Q_i, t),\ u \in R_i(I),\ t \in T\}$, $i = 1, 2, \ldots, n$, where $Q_i$ is any query inserting its results into table $T$.
Under the copy contribution semantics, if tuples of relation table $R_i$ contribute to tuples of relation table $T$, i.e., $u \in TupWhere(Q, t)$ for some $t \in T$ and $u \in R_i(I)$, then $R_i$ contributes to $T$ at the schema level.

3.4. Representation of Provenance Transformation

In a closed relational database management system that supports data provenance, the process of generating relational data is called provenance transformation. Provenance transformation emerges at different levels of granularity, such as tuples, fields, and values. For relational data of a given granularity, the provenance transformations form a transformation expression explaining how the data were created. To illustrate data provenance accurately, this paper introduces provenance explaining transformation (PET), a method based on transformation expressions. Notably, PET can describe provenance transformation in queries and applies specifically to the provenance transformation of tuples and values.

3.4.1. Tuple and Value Provenance Transformation

Tuples are the operating objects when computing the provenance transformation of tuples. Modern DBMSs provide a variety of powerful function operations for complex queries; essentially, such operations can be converted to operations on two-dimensional tables. Our purpose is to eliminate the influence of the sophisticated features of query languages and, more importantly, to support data provenance for any SQL-like or procedural language with a control structure. Hence, we utilize and standardize five basic operations to construct transformation expressions. We call them meta-transformations: the null transformation (ONull), projection transformation (OProj), group aggregation transformation (OAggr), join transformation (OJoin), and union transformation (OUnion). It is worth noting that these transformations are essentially relational algebraic operations, so SQL-like languages are easy to express.
Definition 5. 
Copy-based meta-transformations:
  • Null transformation $ONull$: $ONull(Q) = Q$.
  • Projection transformation $OProj$: $OProj(Q) = Q'$, $Q'(I) = \{t \mid t = u[e(A)],\ u \in Q(I),\ A \in Q\}$.
  • Group aggregation transformation $OAggr$: $OAggr(Q) = Q'$, $Q'(I) = \{t \mid t = g \cdot aggr(G[C]^*),\ G \subseteq Q(I),\ u \in G,\ g = e(u[B])\}$.
  • Join transformation $OJoin$: $OJoin(Q_1, Q_2) = Q'$, $Q'(I) = \{t \mid t = u[A] \cdot v[B],\ u \in Q_1(I),\ v \in Q_2(I),\ A \in Q_1,\ B \in Q_2\}$.
  • Union transformation $OUnion$: $OUnion(Q_1, Q_2) = Q'$, $Q'(I) = Q_1(I) \cup Q_2(I)$.
In the above meta-transformations, the null transformation (covering the direct copy in Definition 1, filter operations, and other database operations that do not involve a copy transformation) does not modify the input tuples, so the output equals the input. The projection transformation retains tuples on the specified fields. The group aggregation transformation divides input tuples into groups by specific attribute values, aggregates each group over the specified attributes, and outputs the results, where each input group corresponds to one output tuple. The join transformation discovers matched tuples meeting the join condition, and the union transformation merges two input sets into one output. Both join and union are binary operations and extend to multivariate operations. The tuple provenance transformation PET combines these basic operations to express the provenance transformation process of complex SQL operations in the database. Using the above meta-transformations, we define PET as follows.
Definition 6. 
$PET(Q, t)$, the provenance transformation of tuple $t$:
  • $PET(\{t\}, t) = expr\langle R \rangle$, where $\{t\} \subseteq R$
  • $PET(ONull(Q), t) = expr\langle PET(Q, t) \rangle$
  • $PET(OProj(Q), t) = expr\langle OProj(PET(Q, u)) \rangle$
  • $PET(OAggr(Q), t) = expr\langle OAggr(PET(Q, u)) \rangle$
  • $PET(OJoin(Q_1, Q_2), t) = expr\langle OJoin(PET(Q_1, u), PET(Q_2, v)) \rangle$
  • $PET(OUnion(Q_1, Q_2), t) = \begin{cases} expr\langle PET(Q_1, t) \rangle, & t \in Q_1(I) \setminus Q_2(I) \\ expr\langle PET(Q_2, t) \rangle, & t \in Q_2(I) \setminus Q_1(I) \\ expr\langle PET(Q_1, t) \rangle \text{ or } expr\langle PET(Q_2, t) \rangle, & \text{otherwise} \end{cases}$
  • $PET(OAggr(OUnion(Q_1, Q_2)), t) = expr\langle OAggr(OUnion(PET(Q_1, u), PET(Q_2, v))) \rangle$
In Definition 6, $expr\langle tf \rangle$ is an expression for the transformation rather than the transformation operation $tf$ itself. Take $Q_1$ as an example: for $u \in Q_1$, the provenance transformation of tuple $u$ can be expressed as $OJoin(Suppliers, Projects)$, namely $PET(Q_1, u) = OJoin(Suppliers, Projects)$.
It is worth pointing out that, among the above transformations, only $OAggr$ can aggregate multiple tuples into one tuple. In that case, an output tuple may simultaneously derive from both $Q_1$ and $Q_2$. Therefore, in the process of transforming queries, the output of $OUnion$ must be used as the input of $OAggr$; that is, $OUnion$ must appear as a subexpression within an $OAggr$ expression. Otherwise, $OUnion$ has no meaning in the transformation expression, and we can replace it with $expr\langle PET(Q_1, t) \rangle$, $expr\langle PET(Q_2, t) \rangle$, or both.
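Definition 6 can be realized as a single traversal of the query tree that emits a transformation expression: filters collapse to their input (ONull), and OUnion is kept only below an OAggr. A sketch over the expression tree above (the string encoding, the under_aggr flag, and the '|' rendering of parallel alternatives are our own devices):

def pet(q, under_aggr=False):
    """Build the PET expression of a query's output tuples as a string."""
    if isinstance(q, Scan):
        return q.name                             # expr<R>
    if isinstance(q, Select):                     # ONull: filtering copies nothing
        return pet(q.child, under_aggr)
    if isinstance(q, Project):
        return f"OProj({pet(q.child, under_aggr)})"
    if isinstance(q, Aggregate):
        return f"OAggr({pet(q.child, True)})"     # everything below may feed OUnion
    if isinstance(q, Join):
        return f"OJoin({pet(q.left, under_aggr)}, {pet(q.right, under_aggr)})"
    if isinstance(q, Union):
        if under_aggr:                            # OUnion only meaningful under OAggr
            return f"OUnion({pet(q.left, True)}, {pet(q.right, True)})"
        return f"{pet(q.left)} | {pet(q.right)}"  # parallel per-tuple alternatives
    raise NotImplementedError(type(q).__name__)

# pet(q1) == "OProj(OJoin(Suppliers, Projects))" for the query of Example 1.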
A value can be regarded as a tuple with only one field, so value-level provenance transformation can be derived from tuple-level provenance transformation. We refer to this derivative version as VPET (Value PET). Like PET, a VPET expression is formed from meta-transformations; in the context of VPET, the granularity of the data unit is a value rather than a tuple, which is exactly the difference between VPET and PET.
Definition 7. 
$VPET(Q, t[A])$, the value provenance transformation of tuple $t$ on field $A$:
  • $VPET(\{t\}, t[A]) = expr\langle A \rangle$, $A \in R$
  • $VPET(ONull(Q), t[A]) = expr\langle VPET(Q, t[A]) \rangle$
  • Given $t[A] = e(u[B])$, $u \in Q(I)$, we have
    $VPET(OProj(Q), t[A]) = \begin{cases} expr\langle VPET(Q, u[B]) \rangle, & \text{if } expr\langle e(u[B]) \rangle = expr\langle B \rangle \\ expr\langle OProj(VPET(Q, u[B])) \rangle, & \text{otherwise} \end{cases}$
  • Given that $B$ and $C$ are the group and aggregation fields, $B' \subseteq B$, $C' \subseteq C$, $u \in G \subseteq Q(I)$, $g = e(u[B'])$, we have
    $VPET(OAggr(Q), t[A]) = \begin{cases} expr\langle OAggr(VPET(Q, u[C'])) \rangle, & \text{if } t[A] = aggr(G[C']^*) \\ expr\langle VPET(Q, u[B']) \rangle, & \text{if } t[A] = e(u[B']),\ expr\langle u[B'] \rangle = expr\langle B' \rangle \\ expr\langle OProj(VPET(Q, u[B'])) \rangle, & t[A] = e(u[B']),\ \text{otherwise} \end{cases}$
  • Given that $B$ and $C$ are the group and aggregation fields, $B' \subseteq B$, $C' \subseteq C$, $s \in G \subseteq Q_1(I) \cup Q_2(I)$, $g = e(s[B'])$, we have
    $VPET(OAggr(OUnion(Q_1, Q_2)), t[A]) = \begin{cases} expr\langle OAggr(VPET(OUnion(Q_1, Q_2), s[C'])) \rangle, & \text{if } t[A] = aggr(G[C']^*) \\ expr\langle VPET(Q_1, s[B']) \rangle, & \text{if } t[A] = e(s[B']),\ s \in Q_1(I),\ expr\langle e(s[B']) \rangle = expr\langle B' \rangle \\ expr\langle OProj(VPET(Q_i, s[B'])) \rangle, & t[A] = e(s[B']),\ s \in Q_i(I),\ i = 1, 2 \end{cases}$
  • $VPET(OJoin(Q_1, Q_2), t[A]) = \begin{cases} expr\langle VPET(Q_1, u[B]) \rangle, & t[A] = u[B] \\ expr\langle VPET(Q_2, v[C]) \rangle, & t[A] = v[C] \end{cases}$
In Definition 7, for the group and aggregation operation $OAggr$, if $t[A]$ originates only from the group fields, we can infer that the aggregation function contributes nothing to the result; in other words, the transformation of the value is not the result of the $OAggr$ operation.
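The key difference of VPET from PET is the projection case: a bare copy of a field leaves the child expression unchanged, while a computed value is wrapped in OProj. A minimal sketch of just this decision (names are ours):

def vpet_proj(child_expr, is_bare_copy):
    """VPET projection case of Definition 7: drop OProj for bare copies."""
    return child_expr if is_bare_copy else f"OProj({child_expr})"

# For SELECT Jno, quantity * price ... :
#   vpet_proj("Shipment.Jno", True)  -> "Shipment.Jno"
#   vpet_proj("OJoin(Shipment.quantity, Parts.price)", False)
#                                    -> "OProj(OJoin(Shipment.quantity, Parts.price))"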
Example 2. 
Consider a query $Q_2$ computing the total price after discount for each project (assuming the excess parts enjoy 20% off when the number of parts exceeds 200).
As shown in Figure 3, the tuple-level where-prov of tuple $k_1$ is $\{h_1, h_2, p_1, p_2\}$. Suppose $Q_2'$ is the recursive CTE subquery of $Q_2$. The provenance transformations of all output tuples of $Q_2'$ are $OProj(Shipment)$ and $OProj(OProj(Shipment))^*$. According to Definitions 6 and 7, the provenance transformation of $k_2$ is $PET(Q_2, k_2) = OJoin(OProj(Shipment), OAggr(OProj(OJoin(OProj(Parts), OUnion(OProj(Shipment), OProj(OProj(Shipment))^*)))))$. The value provenance transformations of $k_2$ on fields $Jno$ and $sm$ are $VPET(Q_2, J2) = OProj(Shipment.Jno)$ and $VPET(Q_2, 8139.6) = OAggr(OProj(OJoin(OUnion(Shipment.quantity, OProj(Shipment.quantity)), Parts.price)))$.

3.4.2. Representing Provenance Transformation in SQL with Meta-Transformation

We revised PET to support full-featured SQL, and this section presents how PET represents SQL. For the sake of discussion, assume that the main body of a query is ASPJ-CTE-Win-Sub SQL, with selection and filter, projection, join, group aggregation, CTE (common table expression), window functions, and set operations. PET also supports minor query-language features (such as Sort, Limit, and Distinct). For each SQL feature, the representation based on meta-transformations is described below; a code sketch follows the list. To illustrate, we use $subExpr_i$, $i = 1, 2, \ldots, n$, to represent transformation expressions that function as subexpressions of the operations discussed below.
  • Ordinary import. An ordinary import is treated as a query on constant relation tables and expressed as $ONull(\Box)$.
  • Selection and filter. The filtering operation does not perform any transformation on tuples, so the corresponding transformation expression is $ONull(subExpr)$.
  • Group aggregation, projection, and union are expressed as $OAggr$, $OProj$, and $OUnion$, respectively.
  • Joins (both input sets contribute to the output) include inner joins, left/right outer joins, and full outer joins. The Cartesian product can be treated as an inner join whose join condition always holds. The join transformation is expressed as:
    Inner join: $OJoin(subExpr_1, subExpr_2)$;
    Left outer join: $OJoin(subExpr_1, subExpr_2)$ or $OJoin(subExpr_1, \Box)$;
    Right outer join: $OJoin(subExpr_1, subExpr_2)$ or $OJoin(\Box, subExpr_2)$;
    Full outer join: $OJoin(subExpr_1, subExpr_2)$, $OJoin(subExpr_1, \Box)$, or $OJoin(\Box, subExpr_2)$.
  • Join (only one of the two input sets contributes to the output). In this case, the other input set plays the role of a filter. Assume that only the left input set contributes; then the join transformation is expressed as $subExpr_1$ or $\Box$ (a right outer join or full outer join can make the left input set a constant relational table).
  • Window function. The window function is an extension of group aggregation in SQL and can be equivalently represented by group aggregation and join. Therefore, the window function is equivalent to $OJoin(subExpr, OAggr(subExpr))$.
  • Except operation (Except, $Q_1 - Q_2$). The output tuples come only from $Q_1$, so the except operation is expressed as $subExpr_1$.
  • Intersect operation (Intersect, $Q_1 \cap Q_2$). The output tuples come from both $Q_1$ and $Q_2$. In this case, there are two parallel provenance transformations, denoted $subExpr_1$ and $subExpr_2$.
  • CTE (recursive and non-recursive). A non-recursive CTE can be regarded as a nested subquery. A recursive CTE is divided, in its language structure, into a base query and a recursive query; the input of the recursive query may be the output of the base query or the output of the previous recursion. The output of the CTE can thus be regarded as the union of the base query, the recursive query with the base query as input, and the recursive query with the output of the previous recursion as input. Therefore, a recursive CTE is expressed as $baseExpr$, $recursiveExpr(baseExpr)$, or $recursiveExpr(baseExpr)^*$ (the asterisk indicates using the output of the last recursion as the input), or as $OUnion(baseExpr, recursiveExpr(baseExpr), recursiveExpr(baseExpr)^*)$, where the latter form requires that the current expression works as a subexpression of an aggregation operation.
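The mapping above can be summarized as a dispatch over simplified SQL feature kinds; the following sketch (the feature names and the symbolic rendering of the recursive CTE are our own simplifications) returns the meta-transformation expression, given the already-translated child expressions:

def sql_feature_to_pet(kind, sub):
    """Map a (simplified) SQL feature to its meta-transformation expression.
    `sub` holds the already-translated child expressions (subExpr_i)."""
    if kind in ("filter", "selection"):            # ONull: pass through
        return sub[0]
    if kind == "projection":
        return f"OProj({sub[0]})"
    if kind == "group_aggregate":
        return f"OAggr({sub[0]})"
    if kind in ("inner_join", "left_outer_join",
                "right_outer_join", "full_outer_join"):
        return f"OJoin({sub[0]}, {sub[1]})"        # outer joins may also pair a
                                                   # side with the constant relation
    if kind == "window_function":                  # join input with its own aggregation
        return f"OJoin({sub[0]}, OAggr({sub[0]}))"
    if kind == "except":
        return sub[0]                              # only Q1 contributes
    if kind == "intersect":
        return f"{sub[0]} | {sub[1]}"              # two parallel transformations
    if kind == "recursive_cte":
        base, rec = sub
        return f"OUnion({base}, {rec}({base}), {rec}({base})*)"
    raise ValueError(kind)

# sql_feature_to_pet("window_function", ["OProj(Shipment)"])
#   -> "OJoin(OProj(Shipment), OAggr(OProj(Shipment)))"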
Example 3. 
Suppose we have the following query Q 3 calculating the revenue for suppliers with a window function. The query Q 3 and its output are shown in Figure 4.
The first output column derives from the table $Shipment$, and the second column is an aggregation of each supplier's revenue on each project. From the provenance transformation definitions above, it is easy to see that the PET representation for each tuple can be written as $OJoin(OProj(Shipment), OAggr(OProj(OJoin(Parts, Shipment))))$. The value provenance transformations of $q_1$ on fields $Sno$ and $revenue$ are $OProj(Shipment.Sno)$ and $OAggr(OProj(OJoin(Parts.price, Shipment.quantity)))$, respectively.

3.4.3. Representing Provenance Transformation in Stored Procedures with Meta-Transformation

Database stored procedures are similar to SQL-like languages or other scripting languages, such as Python and Perl. In this paper, we focus only on SQL-like procedural languages, such as T-SQL for SQL Server, PL/SQL for Oracle, and PL/pgSQL for PostgreSQL. A stored procedure covers two kinds of flow structures: data flow (assignment, SQL execution) and control flow (conditional control, loop control). Only the data flow structure carries the copy transformation relation of the data, while the control flow structure controls which transformation branch is taken. Here, we introduce how to use the basic operations to represent each language feature.
  • Constant. In a procedural language, when a statement block does not reference any relational table in the database, the data output by this program can be regarded as coming from a constant relational table. Therefore, any query referencing a constant is treated as an ordinary import and expressed as $ONull(\Box)$.
  • Assignment. The assignment operation assigns the calculation result of an expression to a variable. If the values of an output tuple on some fields come from the variable, then a projection operation projects the input variable onto the output tuple on these fields. Suppose that $t[A] = e(u[B])$, $u \in Q(I)$, $B \in Q$; when $A$ covers all the fields of $t$, the transformation expression is $OProj(expr_Q)$. Otherwise, tuple $t$ must contain values on some other field set $A'$, and those values are the results of other projection expressions. If $t[A'] = e'(v[C])$, $v \in Q'(I)$, $C \in Q'$, then the transformation expression is $OProj(OJoin(expr_Q, expr_{Q'}))$.
  • SQL or dynamic SQL in stored procedures is equivalent to a nested subquery, and we convert it into a subexpression.
  • Conditional control statement block. A conditional control structure has multiple switch sub-statement blocks that cannot execute simultaneously, so a tuple can only come from one of the sub-statement blocks. Similar to a union operation, the transformation expressions of a conditional control structure are $OUnion(subExpr_1, subExpr_2, \ldots, subExpr_n)$ or $subExpr_i$, $i = 1, 2, \ldots, n$; $OUnion$ is requisite only when the present transformation is a subexpression of $OAggr$. Therefore, a conditional control operation is equivalent to a union operation.
  • Loop control. Assume $t$ is a variable in the body of the loop statement; before entering the loop statement, $PET(Q, t) = expr_1$, and the transformation expression of $t$ in the loop body is $expr_2$. After the execution of the loop statement, there are four cases: (a) if the loop condition is not met at the beginning, the loop body does not execute, and $PET(Q, t) = expr_1$; (b) the loop body executes exactly once, and $PET(Q, t) = expr_2$; (c) the loop body executes multiple times, but $expr_2$ does not reference the variable $t$; the expression $t = expr_2$ is then non-recursive, the value of $t$ is always the result of the last iteration, and $PET(Q, t) = expr_2$; and (d) the loop body executes multiple times, and $expr_2$ is recursive; then $PET(Q, t) = expr_2^*$, which shows that loop control is equivalent to a recursive CTE.
Example 4. 
Suppose function $fn(x, y)$ returns the project number and the discount amount, where the inputs $x$ and $y$ represent the minimum quantity for the discount and the discount rate, respectively. The body of this stored procedure is as follows:
03 DECLARE
04 t DISCOUNT; -- DISCOUNT is a custom data type, namely the return type (Jno, amount)
05 r Shipment%rowtype; -- Declare r to be the same type as the tuple of the table Shipment
06 a NUMERIC;
07 p NUMERIC;
08 BEGIN
09 FOR r IN SELECT * FROM Shipment LOOP
10 SELECT price INTO p FROM Parts WHERE Pno = r.Pno;
11 IF r.quantity > x THEN
12 a := x + (r.quantity − x) * y;
13 SELECT r.Jno, a*p INTO t;
14 ELSE SELECT r.Jno, r.quantity*p INTO t;
15 END IF;
16 RETURN NEXT t;
17 END LOOP;
18 RETURN;
19 END
Assuming the loop in line 9 runs multiple times and the condition in line 11 is satisfied, the transformation expression for lines 9 and 10 is $PET(Q, p) = OProj(OJoin(Shipment, Parts))$. Similarly, the transformation representations of lines 12–14 are $PET(Q, a) = OProj(Shipment)$ for line 12, $PET(Q, t) = OJoin(OProj(Shipment), PET(Q, a), PET(Q, p))$ for line 13, and $PET(Q, t) = OJoin(OProj(Shipment), PET(Q, p))$ for line 14.

4. Representing Data Provenance in DBMS

Where-prov and PET can represent provenance data and transformation in a database. In this section, we propose a relational provenance representation model based on where-prov and PET to comprehensively interpret data provenance from multiple levels (relational schemas, attribute fields, tuples, values), especially the provenance transformation process.

4.1. Schema-Level Provenance Representation

Schema-level data provenance is coarser and better suited to applications that do not focus on details. In a transformation-intensive database environment, relational table instances are the results of relation interactions, which reflect the dependencies between tables at the schema level. Since schema-level provenance data have no formal definition, this paper introduces a representation model for schema-level provenance.
Schema-level provenance is a directed graph $G = \langle V_R, V_O, E \rangle$, where $V_R$ is the set of relation table nodes, $V_O$ is the set of transformation operation nodes, and $E$ is the set of directed edges representing the dependencies among the nodes in $V_O$ and between $V_R$ and $V_O$. Specifically, we have:
Definition 8. 
(1) Tuple provenance graph (TupPGraph): $TupPGraph(Q_i, I, t) = \langle V_{tr}, V_{to}, E_t \rangle$. (2) Schema provenance graph of $T$ (SPGraph): $SPGraph(I, T) = \langle V_r, V_o, E \rangle$, where $V_r = \bigcup V_{tr}$, $V_o = \bigcup V_{to}$, $E = \bigcup E_t$ over all $t \in T$ with $TupPGraph(Q_i, I, t) = \langle V_{tr}, V_{to}, E_t \rangle$. (3) Schema provenance graph of database $I$ (SPGraph): $SPGraph(I) = \langle V_{Ir}, V_{Io}, E_I \rangle$, where $V_{Ir} = \bigcup V_r$, $V_{Io} = \bigcup V_o$, $E_I = \bigcup E$ over all $T$.
From the above definition, the schema provenance graph not only includes the provenance transformation of each tuple but also explains the transformation process between the target relation schema and source relation schema. Similarly, the overall schema provenance graph SPGraph(I) merges all the schema provenance graphs of the database and can explain the specific dependencies between different relation schemas.
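At the set level, Definition 8 amounts to a plain component-wise union of tuple provenance graphs. A sketch (each graph is a triple of relation node, operation node, and edge sets; the names are ours):

def merge_graphs(tup_pgraphs):
    """SPGraph(I, T): component-wise union of the TupPGraphs of T's tuples.
    Each graph is (relation_nodes, operation_nodes, edges)."""
    v_r, v_o, e = set(), set(), set()
    for vtr, vto, et in tup_pgraphs:
        v_r |= vtr
        v_o |= vto
        e |= et
    return v_r, v_o, e

The "Option" nodes discussed below are then introduced wherever this union would conflate distinct lower branches of a single operation node.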
Every output tuple has a tuple provenance graph, and multiple output tuples may have the same tuple provenance graph for a query Q, indicating that output tuples can share a tuple provenance graph. Furthermore, two tuple provenance graphs in the same query Q may have a common subgraph, indicating that merging tuple provenance graphs is possible. Therefore, we keep one tuple provenance graph for multiple output tuples and combine different tuple provenance graphs to obtain the schema provenance graph. During the merging process, some nodes may lose information, such as the number of operations or the operation objects of the lower nodes. To avoid this, we introduce an "Option" node without transformation meaning, which indicates multiple possible branches at that node.
For example, the tuple provenance graph of the transformation expression $PET(Q_i, t)$, $t \in T$, can be expressed as $G\langle R, O, \eta \rangle$, $R = \{R_1, R_2, \ldots, R_m\}$, $O = \{Op_1, Op_2, \ldots, Op_n\}$, where $Op_j$ is the $j$-th basic transformation ($Op_1$ is the top-level operation), and $\eta$ is a set containing the relations among operations and the relations between operations and referenced relation tables.
Figure 5 shows the possible tuple provenance graphs of the output tuples of table $QTY$ in query $Q_2$ of Example 2, which are output by the subquery $Q_2'$. Figure 5a–c displays the tuple provenance subgraphs of tuples $k_1$ and $k_4$, tuple $k_3$, and tuples $k_2$ and $k_5$, respectively. Notably, the tuple provenance graph in Figure 5c is actually the SPGraph of the subquery $Q_2'$.
According to the discussion in Example 2, we have $PET(Q_2, k) = OJoin(OProj(Shipment), OAggr(OProj(OJoin(OProj(Parts), PET(Q_2', k)))))$. It is not difficult to infer that the remaining part of the SPGraph for $Q_2$ is clear and can be merged with the SPGraph for $Q_2'$ to obtain the SPGraph for $Q_2$, as shown in Figure 6.

4.2. A Compact Representation of TupPGraph

As a subgraph of the schema provenance graph SPGraph, each tuple provenance graph corresponds to one kind of schema transformation relation. To clarify the relationship between TupPGraph and SPGraph and to reflect the TupPGraph of each tuple within the SPGraph (i.e., every possibility among the schema transformation relations), we put forward the concept of the tuple provenance route (TupPRoute) to represent the provenance of tuples.
Definition 9. 
Tuple provenance route (TupPRoute): $TupPRoute(Q_i, I, t) = \langle SPGraph(I, T), BP \rangle$, where $BP = \{b_{rid} \mid rid = 1, \ldots, MAX\_RID\}$, $b_{rid} = 0$ if $V_{rid} \notin TupPGraph$, and $b_{rid} = 1$ if $V_{rid} \in TupPGraph$.
From the description of the provenance graph, a TupPGraph is an expression tree with several branches, and each node in the SPGraph has a semantic parent and multiple child nodes in the tree structure. In Definition 9, $BP$ is a bitmap array for the provenance route, and $V_{rid}$ is the vertex identified by $rid$. All nodes in an SPGraph (both relational table nodes and transformation nodes) carry an identifier $rid$ whose maximum value is MAX_RID. The $rid$ bit in $BP$ is set to 1 if the corresponding node belongs to the TupPGraph, and to 0 otherwise. Although $rid$ is theoretically unique, the bitmap may become too large if each node is assigned its own identifier bit. Therefore, this paper uses a more compact way of ascertaining a route: preorder-traverse the schema provenance graph and allocate the same $rid$ to nodes on the same branch, which keeps the value of MAX_RID small (Algorithm 1).
Algorithm 1. Assigning route identifiers (rids) to SPGraph nodes.
Alg. generateRids(SPGraph):
begin
    root := SPGraph.rootNode
    generateRidRecursive(root, 0)
end
Alg. generateRidRecursive(n, currRid):
begin
    if n.rid has not been assigned then
        n.rid := currRid
    end if
    foreach cn in n.childNodes
        if count(n.childNodes) = 1 then
            generateRidRecursive(cn, currRid)      -- an only child stays on the same branch
        else
            generateRidRecursive(cn, currRid + 1)  -- a branching point starts a new rid
        end if
    end foreach
end
Figure 7 shows the bitmaps corresponding to the three provenance routes in the query $Q_2$ (the base query, the recursive query with the base query as input, and the recursive query with the last recursion as input): "101001", "101111", and "110000", respectively. Combining the bitmap array with the schema provenance graph, we can deduce the transformation path of tuple $k_3$ in the provenance graph (the blue path in the figure).
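Decoding a provenance route then reduces to a bit test against the rid labels produced by Algorithm 1. A sketch (node objects carrying a rid attribute are assumed):

def route_nodes(spgraph_nodes, bp):
    """Keep exactly the SPGraph nodes whose branch bit is set in BP."""
    return [n for n in spgraph_nodes if bp[n.rid] == "1"]

# e.g. bp = "101111" (Figure 7) selects the nodes on the route of the
# recursive query that takes the base query as its input.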

4.3. Field and Value Level Provenance Representation

Value provenance transformation (VPET) derives from the provenance transformation of tuples. In a way similar to the schema provenance graph and the tuple provenance route, we utilize the field provenance graph (PGraph of Field, FPGraph) and the value provenance route (VPRoute) to represent field- and value-level provenance. For $VPET(Q_i, I, t[A])$:
Definition 10. 
(1) PGraph of Value (VPGraph), the value provenance graph: $VPGraph(Q_i, I, t[A]) = \langle V_{vr}, V_{vo}, E_v \rangle$. (2) PGraph of Field (FPGraph), the provenance graph of field $A$: $FPGraph(I, A) = \langle V_r, V_o, E \rangle$, where $V_r = \bigcup V_{vr}$, $V_o = \bigcup V_{vo}$, $E = \bigcup E_v$ over all $t \in T$ with $VPGraph(Q_i, I, t[A]) = \langle V_{vr}, V_{vo}, E_v \rangle$. (3) PRoute of Value (VPRoute), the value provenance route: $VPRoute(Q_i, I, t[A]) = \langle FPGraph(I, A), BP \rangle$, $BP = \{b_{rid} \mid rid = 1, \ldots, MAX\_RID\}$, $b_{rid} = 0$ if $V_{rid} \notin VPGraph(Q_i, I, t[A])$, and $b_{rid} = 1$ if $V_{rid} \in VPGraph(Q_i, I, t[A])$.
Looking back at Example 2 in Figure 3, the VPET representation of each record on field $Jno$ is $OProj(Shipment.Jno)$, and the provenance graph of field $Jno$ is easy to represent (Figure 8a). For field $sm$, the provenance transformation representations corresponding to the tuples are $OAggr(OJoin(OUnion(Shipment.quantity, OProj(Shipment.quantity)), Parts.price))$ and $OAggr(OJoin(Shipment.quantity, Parts.price))$, as shown in Figure 8b.

4.4. Implementation of Data Provenance in PostgreSQL

We implemented the provenance graph and provenance route model in PostgreSQL and provide a client to query and visualize the provenance information. We refer to this improved system as ProvPg; its graphical client is ProvPg-GUI.
The primary function modules added in ProvPg include the extraction and query of provenance information. The provenance information extraction is implemented by the provenance extraction subsystem. In addition, ProvPg provides a series of query interfaces in the client to support provenance information query and visualization.
ProvPg (Figure 9) is built using the Hook mechanism of PostgreSQL to add extended functions, with the source code of the Executor modified. For the plan tree of a query, the processing flow is as follows: (a) before executing the query, a schema provenance graph (SPGraph) is extracted and produced with the help of the structural information of the plan tree; (b) during the execution of the plan tree, each time a tuple is output, the corresponding provenance route (TupPRoute) is extracted in the form of a bitmap array, and the bitmap array is inserted into the output tuple as an additional field value. Finally, the client receives two result sets: the output tuple set (with the provenance routes) and a provenance graph in JSON format. The graphical client ProvPg-GUI receives the SQL input by the user, displays the returned query results, and visualizes the provenance graph and provenance route.
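As a purely hypothetical illustration of how a client might combine the two result sets (the text does not specify the wire format, so the JSON shape, the route column, and all names below are our own):

import json

# Result set 1: output tuples, each extended with a route bitmap column.
tuples = [("S1", "J4", "101"), ("S3", "J4", "101")]

# Result set 2: the schema provenance graph as JSON.
graph = json.loads("""{"nodes": [{"rid": 0, "label": "OJoin"},
                                 {"rid": 1, "label": "Suppliers"},
                                 {"rid": 2, "label": "Projects"}],
                       "edges": [[0, 1], [0, 2]]}""")

for *values, bp in tuples:
    on_route = [n["label"] for n in graph["nodes"] if bp[n["rid"]] == "1"]
    print(values, "->", on_route)   # the nodes on this tuple's provenance route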
To evaluate the impact of ProvPg on PostgreSQL's query engine, we conduct the TPC Benchmark™ H [47] (TPC-H) test on ProvPg and PostgreSQL and analyze the computational overhead that ProvPg imposes on PostgreSQL. Our TPC-H instance contains eight relational tables, and we test 22 customized SQL queries on this instance (standard SQL-92). Although the 22 queries lack some features, such as CTEs, and do not involve stored procedures, the performance of the query engine depends mainly on the underlying physical operations, such as tuple scanning, join, and sort, and the 22 queries cover the primary calculation processes of the query engine. Moreover, the most significant load of a stored procedure in PostgreSQL lies in the execution of its query plans. Therefore, the TPC-H criterion is sufficient for comparing ProvPg with the original PostgreSQL.
We construct three datasets of 1G, 4G, and 10G for the query tests. The specific running environment is: (a) CPU: 64-bit Intel(R) Core(TM) i5-4460, 3.20 GHz; (b) memory: 12G; (c) operating system: Ubuntu 12.04; (d) kernel: Linux 3.5.0; (e) PostgreSQL: version 9.5.0. Table S1 shows the query execution times of PostgreSQL and ProvPg on these datasets, and Figure S1 shows the ratios of overhead imposed by the data provenance module. The experimental results indicate the following:
The overall extra cost is relatively small and within acceptable limits. The experiments show that, except for Q2, Q10, Q13, and Q18, the extra cost resulting from the provenance module for each SQL is less than 5%, and the average overhead for three kinds of load sizes is all below 3%. Taking the benefits from data provenance into consideration, we believe that the additional cost is acceptable.
There is no positive correlation between the extra cost and the load size. For the different-sized datasets, the extra cost of ProvPg over PostgreSQL is stable overall and does not fluctuate dramatically with the data scale. Therefore, we consider the extra cost of a query to be predictable and estimable.
We also analyze the correlation between the extra cost ratio and the execution strategy of the database engine. The Pearson correlation coefficient of R1 and R2 demonstrates that the extra cost ratio is related to the materialization ratio when we increase the load (Tables S1 and S2) to a dataset scale of 10G or more. The correlation is more significant when we exclude Q8 and Q9, indicating that materialization clearly prolongs the running time of the extended system. Furthermore, the underlying operations of the database system also affect the increase in overhead. For example, as Q8 and Q9 mainly perform hash operations, the overhead caused by the provenance module is low even though their materialization ratios are high. In contrast, Q18 needs to execute a sort operation, and its overhead ratio is high even though it has a small materialization ratio at the 1G and 4G scales.

5. Conclusions and Future Work

The lack of support for full-featured SQL and procedural query languages in relational databases and, in particular, the absence of consideration for provenance transformation have seriously limited the application of data provenance. This paper proposes a provenance model and explains its application in real scenarios to resolve these limitations. First, we presented a PET-based provenance representation method built on where-prov and further introduced the notions of the provenance graph and the provenance route. We then described the provenance transformation process at the levels of relational tables, fields, tuples, and values, which is helpful for understanding and analyzing the dependence and transformation relationships between data. Furthermore, the provenance model is significant for implementing fully automated debugging and error elimination systems or tools.
We also implemented our model and extended PostgreSQL into ProvPg, with new features such as provenance information query and visualization. The implementation of ProvPg verifies the feasibility of applying our model and method and provides new implementation ideas and solutions for capturing provenance information in full-featured SQL and procedural languages. We are confident that our provenance model can be applied to specific business scenarios and support targeted optimizations in subsequent research, such as data debugging and error elimination.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app13010064/s1. Figure S1: The ratios of load caused by adding data provenance to PostgreSQL; in one particular test environment, databases of three sizes (1G, 4G, 10G) and 22 SQL queries are used for query testing. Table S1: Runtime comparison of PostgreSQL and ProvPg on datasets with 1G, 4G, and 10G load (milliseconds); PG and PPG abbreviate PostgreSQL and ProvPg. Table S2: Correlation of the overhead ratio with materialized operations on datasets with 1G, 4G, and 10G load; R1: the materialization ratio; R2: the extra cost ratio; ρR1,R2: the Pearson correlation coefficient of R1 and R2.

Author Contributions

Conceptualization, D.T. and Y.L.; methodology, D.T., Y.L. and R.Z.; software, R.Z. and Y.L.; validation, D.T., Y.L. and R.Z.; formal analysis, D.T., Y.L. and R.Z.; investigation, D.T. and Y.L.; resources, D.T.; data curation, R.Z.; writing—original draft preparation, D.T. and R.Z.; writing—review and editing, D.T. and R.Z.; visualization, R.Z. and Y.L.; supervision, P.Z.; project administration, T.Z., D.T. and P.Z.; funding acquisition, D.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Laboratory of Industrial Equipment Quality Big Data, MIIT, grant number 2021-IEQBD-04.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Al-Mamun, A.; Yan, F.; Zhao, D.F. SciChain: Blockchain-Enabled Lightweight and Efficient Data Provenance for Reproducible Scientific Computing. In Proceedings of the 2021 IEEE 37th International Conference on Data Engineering (ICDE 2021), Chania, Greece, 19–22 April 2021; pp. 1853–1858.
2. Chen, J.H.; Zhong, N.; Feng, J.H. Developing a Provenance Warehouse for the Systematic Brain. Int. J. Inf. Technol. Decis. Mak. 2017, 16, 1581–1609.
3. Benabdelkader, A.; Santcroos, M.; Madougou, S.; Kampen, A.H.C.V.; Olabarriaga, S.D. A Provenance Approach to Trace Scientific Experiments on a Grid Infrastructure. In Proceedings of the 2011 IEEE Seventh International Conference on eScience, Stockholm, Sweden, 5–8 December 2011; pp. 134–141.
4. Buneman, P. Curated databases. In Data Integration in the Life Sciences; Ludascher, B., Raschid, L., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; Volume 3615, p. 2.
5. Butt, A.S.; Fitch, P. A provenance model for control-flow driven scientific workflows. Data Knowl. Eng. 2021, 131–132, 101877.
6. Jin, X.N.; Lee, Y.Y.; Fiscus, J.; Guan, H.Y.; Yates, A.N.; Delgado, A.; Zhou, D.F. MFC-Prov: Media forensics challenge image provenance evaluation and data analysis on large-scale datasets. Neurocomputing 2022, 470, 76–88.
7. Barga, R.S.; Digiampietri, L.A. Automatic capture and efficient storage of e-Science experiment provenance. Concurr. Comput. Pract. Exp. 2008, 20, 419–429.
8. Chacko, A.M.; Fairooz, M.; Kumar, S.D.M. Provenance-Aware NoSQL Databases. In Proceedings of Security in Computing and Communications (SSCC 2016), Jaipur, India, 21–24 September 2016; pp. 152–160.
9. Camisetty, A.; Chandurkar, C.; Sun, M.; Koop, D. Enhancing Web-based Analytics Applications through Provenance. IEEE Trans. Vis. Comput. Graph. 2019, 25, 131–141.
10. Freire, J.; Silva, C.T.; Callahan, S.P.; Santos, E.; Scheidegger, C.E.; Vo, H.T. Managing rapidly evolving scientific workflows. In Provenance and Annotation of Data; Moreau, L., Foster, I., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; Volume 4145, pp. 10–18.
11. Sun, Q.; Liu, Y.; Tian, W.J.; Guo, Y.K. CF-PROV: A Content-Rich and Fine-Grained Scientific Workflow Provenance Model. IEEE Access 2019, 7, 30002–30016.
12. Buneman, P.; Cheney, J.; Vansummeren, S. On the Expressiveness of Implicit Provenance in Query and Update Languages. ACM Trans. Database Syst. 2008, 33, 1–47.
13. Bose, R.; Frew, J. Composing lineage metadata with XML for custom satellite-derived data products. In Proceedings of the 16th International Conference on Scientific and Statistical Database Management, Santorini, Greece, 21–23 June 2004; pp. 275–284.
14. Chiticariu, L.; Tan, W.-C.; Vijayvargiya, G. DBNotes: A post-it system for relational databases based on provenance. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, Baltimore, MD, USA, 14–16 June 2005; pp. 942–944.
15. Widom, J. Trio: A system for integrated management of data, accuracy, and lineage. In Proceedings of the 2005 CIDR Conference, Asilomar, CA, USA, 4–7 January 2005; Stanford InfoLab: Stanford, CA, USA, 2005.
16. Bhagwat, D.; Chiticariu, L.; Tan, W.C.; Vijayvargiya, G. An annotation management system for relational databases. VLDB J. 2005, 14, 373–396.
17. Karvounarakis, G.; Ives, Z.G.; Tannen, V. Querying data provenance. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, IN, USA, 6–10 June 2010; pp. 951–962.
18. Senellart, P.; Jachiet, L.; Maniu, S.; Ramusat, Y. ProvSQL: Provenance and Probability Management in PostgreSQL. Proc. VLDB Endow. 2018, 11, 2034–2037.
19. Cui, Y.; Widom, J. Lineage tracing for general data warehouse transformations. VLDB J. 2003, 12, 41–58.
20. Cui, Y.; Widom, J. Storing auxiliary data for efficient maintenance and lineage tracing of complex views. In Proceedings of the 2nd International Workshop on Design and Management of Data Warehouses (DMDW), Stockholm, Sweden, 5–6 June 2000.
21. Buneman, P.; Khanna, S.; Wang-Chiew, T. Why and where: A characterization of data provenance. In Proceedings of the International Conference on Database Theory, London, UK, 4–6 January 2001; pp. 316–330.
22. Green, T.J.; Karvounarakis, G.; Tannen, V. Provenance semirings. In Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Beijing, China, 11–13 June 2007; pp. 31–40.
23. Glavic, B. Perm: Efficient Provenance Support for Relational Databases. Ph.D. Thesis, University of Zurich, Zurich, Switzerland, January 2010.
24. Glavic, B.; Alonso, G. Perm: Processing provenance and data on the same data model through query rewriting. In Proceedings of the 2009 IEEE 25th International Conference on Data Engineering, Shanghai, China, 29 March–2 April 2009; pp. 174–185.
25. Wang, Y.R.; Madnick, S.E. A polygen model for heterogeneous database systems: The source tagging perspective. In Proceedings of the 16th International Conference on Very Large Data Bases, Brisbane, Australia, 13–16 August 1990.
26. Hachem, N.I.; Qiu, K.; Gennert, M.; Ward, M. Managing derived data in the Gaea scientific DBMS. In Proceedings of the 19th International Conference on Very Large Data Bases, Dublin, Ireland, 24–27 August 1993.
27. Faloutsos, C.; Jagadish, H.; Sidiropoulos, N.D. Recovering information from summary data. In Proceedings of the 23rd International Conference on Very Large Data Bases, Athens, Greece, 26–29 August 1997; pp. 36–45.
28. Woodruff, A.; Stonebraker, M. Supporting fine-grained data lineage in a database visualization environment. In Proceedings of the 13th International Conference on Data Engineering, Birmingham, UK, 7–11 April 1997; pp. 91–102.
29. Amsterdamer, Y.; Deutch, D.; Tannen, V. Provenance for aggregate queries. In Proceedings of the Thirtieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Athens, Greece, 12–16 June 2011; pp. 153–164.
30. Ives, Z.G.; Green, T.J.; Karvounarakis, G.; Taylor, N.E.; Tannen, V.; Talukdar, P.P.; Jacob, M.; Pereira, F. The ORCHESTRA collaborative data sharing system. ACM SIGMOD Rec. 2008, 37, 26–32.
31. Chiticariu, L.; Tan, W.-C. Debugging schema mappings with routes. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, 12–15 September 2006; pp. 79–90.
32. Alvaro, P.; Rosen, J.; Hellerstein, J.M. Lineage-driven fault injection. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Australia, 31 May–4 June 2015; pp. 331–346.
33. Zhou, W.; Mapara, S.; Ren, Y.; Li, Y.; Haeberlen, A.; Ives, Z.; Loo, B.T.; Sherr, M. Distributed time-aware provenance. Proc. VLDB Endow. 2012, 6, 49–60.
34. Interlandi, M.; Shah, K.; Tetali, S.D.; Gulzar, M.A.; Yoo, S.; Kim, M.; Millstein, T.; Condie, T. Titian: Data provenance support in Spark. Proc. VLDB Endow. 2015, 9, 216–227.
35. Zhou, W.; Fei, Q.; Sun, S.; Tao, T.; Haeberlen, A.; Ives, Z.; Loo, B.T.; Sherr, M. NetTrails: A declarative platform for maintaining and querying provenance in distributed systems. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, Athens, Greece, 12–16 June 2011; pp. 1323–1326.
36. Ikeda, R.; Park, H.; Widom, J. Provenance for generalized map and reduce workflows. In Proceedings of the Fifth Biennial Conference on Innovative Data Systems Research, CIDR 2011, Asilomar, CA, USA, 9–12 January 2011.
37. Ainy, E.; Bourhis, P.; Davidson, S.B.; Deutch, D.; Milo, T. Approximated summarization of data provenance. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, Melbourne, Australia, 19–23 October 2015; pp. 483–492.
38. Ainy, E.; Bourhis, P.; Davidson, S.B.; Deutch, D.; Milo, T. PROX: Approximated Summarization of Data Provenance. In Advances in Database Technology: Proceedings of the International Conference on Extending Database Technology, Bordeaux, France, 15–18 March 2016; OpenProceedings.org: Konstanz, Germany, 2016; p. 620.
39. Deutch, D.; Gilad, A.; Moskovitch, Y. selP: Selective tracking and presentation of data provenance. In Proceedings of the 2015 IEEE 31st International Conference on Data Engineering, Seoul, Korea, 13–17 April 2015; pp. 1484–1487.
40. Deutch, D.; Gilad, A.; Moskovitch, Y. Selective provenance for Datalog programs using top-k queries. Proc. VLDB Endow. 2015, 8, 1394–1405.
41. Deutch, D.; Milo, T.; Roy, S.; Tannen, V. Circuits for Datalog Provenance. In Proceedings of the ICDT, Athens, Greece, 24–28 March 2014; pp. 201–212.
42. Deutch, D.; Frost, N.; Gilad, A. NLProv: Natural language provenance. Proc. VLDB Endow. 2016, 9, 1537–1540.
43. Huq, M.R.; Apers, P.M.; Wombacher, A. ProvenanceCurious: A tool to infer data provenance from scripts. In Proceedings of the 16th International Conference on Extending Database Technology, Genoa, Italy, 18–22 March 2013; pp. 765–768.
44. Müller, T.; Grust, T. Provenance for SQL through abstract interpretation: Value-less, but worthwhile. Proc. VLDB Endow. 2015, 8, 1872–1875.
45. Dietrich, B.; Müller, T.; Grust, T. The Best Bang for Your Bu(ck)g. In Proceedings of the EDBT, Bordeaux, France, 15–18 March 2016; pp. 674–675.
46. Buneman, P.; Khanna, S.; Tan, W.-C. On propagation of deletions and annotations through views. In Proceedings of the Twenty-First ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Madison, WI, USA, 3–5 June 2002.
47. Transaction Processing Performance Council (TPC). TPC Benchmark H (Decision Support) Standard Specification, Revision 2.18.0. 2018. Available online: https://www.tpc.org (accessed on 19 October 2022).
Figure 1. A sample database. Suppliers, parts, and projects are uniquely identified by supplier number (Sno), part number (Pno), and project number (Jno), respectively. The shipment relation stores the number of parts a supplier has supplied for a project, where the primary key is the combination of {Sno, Pno, Jno}.
Figure 2. The query Q1 and its output.
Figure 3. The query Q2 and its output tagged with tuple-level where-prov.
Figure 4. The query Q3 and its output.
Figure 5. (a) Tuple provenance subgraphs of the tuples k1 and k4. (b) Tuple provenance subgraph of the tuple k3. (c) Tuple provenance subgraphs of the tuples k2 and k5. Nodes contain actual parameter information. The relational table node generally contains metadata, such as field names, and the operation node includes the expression and a reference to the input values. “$.k” refers to the kth field of the tuple, and in multivariate operations, “$i” and “$i.k” refer to the ith tuple and the kth field of that tuple, respectively. Hereafter, we use the same notation for provenance graphs.
Figure 6. SPGraph of the query Q2.
Figure 7. SPGraph and TupPRoute of the query Q2.
Figure 8. The field provenance graphs of the query Q2. (a) and (b) correspond to FPGraph(Q2.Jno) and FPGraph(Q2.sm), respectively.
Figure 9. The architecture of ProvPg.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
